An Open-source Similar-name Finder Dallan Quass
What's the problem?
People can't spell unusual names Maybe a piece of mail addressed to Solverg Quast? Solverg Quast 5934 Phoenix Ave. Shoreview, MN 55126 Johnston Bros. 1256 Bristol St. Mapleton, MN 55126 Should be:  Solveig Quass
People use nicknames John Johnny Jack
Transcribers make typos Jhon
Most of our ancestors didn't know how to read or write  signature
What does it matter?
How do you find records? Johnny Snith John Smith
How do you match people? John Smith Johnny Smithe
Not a new problem
Lots of solutions: Soundex, NYSIIS, Double Metaphone, Refined Soundex, Daitch-Mokotoff, Caverphone, Levenshtein, Jaro-Winkler, Monge-Elkan, Needleman-Wunsch, Smith-Waterman
No Bullseye
Why is this so hard?
How similar are two names? We’re neighbors John Jonny Joe I don’t know those guys
First approach: Coders (Soundex, NYSIIS, Double Metaphone, Refined Soundex, Daitch-Mokotoff, Caverphone)
First approach: Coders Jim John Jane Johan Johannes
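To see why a coder groups the names on this slide, here is a minimal sketch of American Soundex. It is written from the standard published rules, not taken from the talk's code, and the function name `soundex` is my own.

```python
# Sketch of American Soundex from the standard rules; not the talk's code.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex code for a name ("" if no letters)."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    encoded = first.upper()
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters that share a code
        code = CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        prev = code  # a vowel resets prev, so the same code may repeat after it
    return (encoded + "000")[:4]


for n in ["Jim", "John", "Jane", "Johan", "Johannes"]:
    print(n, soundex(n))
```

Jim, John, Jane and Johan all collapse to the same code (J500), while Johannes gets J520: the coder is fast to index but treats Jim and Jane as matches for John.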
Second approach: Distance functions (Levenshtein, Jaro-Winkler, Monge-Elkan, Needleman-Wunsch, Smith-Waterman)
Second approach: Distance functions Jim John Jane Johan Johannes Better results, but doesn't scale well
Can we do better?
Warning: Machine learning ahead!
Thank you Ancestry!
A closer look at Levenshtein Jon John Bohn -1 -1
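The point of this slide can be checked with a plain Levenshtein implementation. The sketch below is the textbook dynamic-programming version; the function name is my own.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance: each insert, delete or substitute costs 1."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(levenshtein("Jon", "John"))   # one inserted letter
print(levenshtein("John", "Bohn"))  # one substituted letter
```

Both pairs come out at distance 1, even though Jon and John are far more likely to be the same name than John and Bohn. Unit costs cannot tell the two edits apart.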
Maximize your expectations Jon John Bohn high cost low cost Weighted Edit Distance
Learn to classify
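A weighted edit distance gives each edit its own cost. The sketch below shows the mechanics only: the cost values are invented for illustration (the slide implies the real costs are learned from data), and all names in the code are hypothetical.

```python
# Hypothetical costs, invented for illustration; not learned values.
INDEL_COST = {"h": 0.2}          # dropping or adding a silent-ish "h" is cheap
SUB_COST = {("i", "y"): 0.3}     # an assumed cheap vowel swap
DEFAULT_COST = 1.0


def indel(ch: str) -> float:
    return INDEL_COST.get(ch, DEFAULT_COST)


def sub(a: str, b: str) -> float:
    if a == b:
        return 0.0
    return SUB_COST.get((a, b), SUB_COST.get((b, a), DEFAULT_COST))


def weighted_edit_distance(a: str, b: str) -> float:
    """Edit distance where each insert, delete and substitute has its own cost."""
    a, b = a.lower(), b.lower()
    prev = [0.0]
    for cb in b:
        prev.append(prev[-1] + indel(cb))  # cost of building b's prefix from ""
    for ca in a:
        curr = [prev[0] + indel(ca)]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + indel(ca),        # delete ca
                            curr[j - 1] + indel(cb),    # insert cb
                            prev[j - 1] + sub(ca, cb))) # substitute or match
        prev = curr
    return prev[-1]


print(weighted_edit_distance("Jon", "John"))   # low cost: insert "h"
print(weighted_edit_distance("John", "Bohn"))  # high cost: J -> B
```

With these assumed costs, Jon/John scores well below John/Bohn, which is the distinction the unit-cost version misses.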
Wait, isn't this just another distance function? Distance functions don't scale, right?
Right
Back to the basics x  f(x) -5  -1 -3  4.5 0  7 2  3.5 4  2
Long tail
Long tail 200,000 Surnames  70,000 Given names ≤   1/5,000,000 names
Long tail Use distance function with table here Use coder here
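The split on this slide (a precomputed table for common names, a coder for the long tail of rare ones) could be wired together roughly as follows. `NameFinder`, `toy_coder` and the sample data are hypothetical, and the toy coder is only a stand-in for Soundex or any other real coder.

```python
def toy_coder(name: str) -> str:
    """Toy stand-in for a real coder: first letter plus the remaining
    consonants, with adjacent repeats collapsed."""
    name = name.lower()
    tail = [c for c in name[1:] if c not in "aeiouyhw"]
    collapsed = [c for i, c in enumerate(tail) if i == 0 or c != tail[i - 1]]
    return name[:1] + "".join(collapsed)


class NameFinder:
    def __init__(self, table, rare_names, coder):
        # table: common name -> precomputed similar names
        self.table = {k.lower(): sorted(v) for k, v in table.items()}
        self.coder = coder
        # index rare names by code so the fallback is a dictionary lookup
        self.by_code = {}
        for n in rare_names:
            self.by_code.setdefault(coder(n), set()).add(n.lower())

    def similar(self, name: str):
        key = name.lower()
        if key in self.table:            # common name: use the table
            return self.table[key]
        bucket = self.by_code.get(self.coder(key), set())
        return sorted(bucket - {key})    # rare name: fall back to the coder


finder = NameFinder(
    table={"dallan": ["dalen", "talan"]},
    rare_names=["Quass", "Quas", "Quast"],
    coder=toy_coder,
)
print(finder.similar("Dallan"))
print(finder.similar("Quass"))
```

Both paths are constant-time lookups at query time, which is what lets the approach scale where a pairwise distance computation would not.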
Result: Table initialized by a function Dallan:  Dalana Daleen Dalen Dalin … Talan Tallon Ryan:  Aaran Aran Arrin … Rian Riana ...
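One way to read "table initialized by a function" is: run the expensive similarity function once, offline, over the list of common names, and store the results. The sketch below uses plain Levenshtein with an arbitrary threshold as a stand-in for the talk's learned function; `build_table` and the threshold of 2 are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance (stand-in for the learned similarity function)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def build_table(names, distance=levenshtein, max_distance=2):
    """Compare every pair once and keep, per name, the names within max_distance."""
    names = sorted({n.lower() for n in names})
    table = {n: [] for n in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if distance(a, b) <= max_distance:
                table[a].append(b)
                table[b].append(a)
    return {n: sorted(v) for n, v in table.items()}


names = ["Dallan", "Dalana", "Daleen", "Dalen", "Dalin", "Talan", "Tallon",
         "Ryan", "Aaran", "Aran", "Arrin", "Rian", "Riana"]
table = build_table(names)
print("dallan:", table["dallan"])
print("ryan:", table["ryan"])
```

The quadratic pairwise pass happens once at build time; after that, finding the variants of a name is a single dictionary lookup.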
A nice thing about tables... Dallan:  Dalana Daleen Dalen Dalin … Talan Tallon Ryan:  Aaran Aran Arrin … Rian Riana ...
Add to the table Nicknames BehindTheName.com The New American Dictionary of Baby Names by Leslie Dunking and William Gosling A Dictionary of Surnames by Patrick Hanks and Flavia Hodges WeRelate community
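Because the result is a table rather than a formula, curated pairs can simply be added to it, including nicknames that no spelling-based function would find. A minimal sketch follows; `add_variants` is a hypothetical helper, and storing each pair in both directions is my assumption, not something the slides state.

```python
def add_variants(table, name, variants):
    """Record each variant under name, and name under each variant."""
    name = name.lower()
    for v in variants:
        v = v.lower()
        if v == name:
            continue
        table.setdefault(name, set()).add(v)
        table.setdefault(v, set()).add(name)
    return table


# Start from function-generated entries, then add curated nicknames.
table = {"john": {"jon", "johan"}}
add_variants(table, "John", ["Johnny", "Jack"])
print(sorted(table["john"]))
print(sorted(table["jack"]))
```

Jack is three edits away from John under Levenshtein, so a distance threshold tight enough to exclude Bohn would also exclude it; a table entry sidesteps that.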
Thank you BehindTheName.com! Fascinating  Family Trees for given names
Result (given names): Soundex precision 97, recall 65; our approach precision 97, recall 74; 28% decrease in false negatives
Result (surnames): Soundex precision 89, recall 68; our approach precision 89, recall 77; 28% decrease in false negatives
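For readers less familiar with the metrics, the sketch below shows how precision, recall and a "decrease in false negatives" relate. It reads the surname figures on this slide as recall 68 for Soundex and 77 for the table approach, which is my interpretation of the flattened slide text; the function names and the example counts are hypothetical.

```python
def precision_recall(tp: int, fp: int, fn: int):
    """precision = share of returned matches that are right;
    recall = share of true matches that were returned."""
    return tp / (tp + fp), tp / (tp + fn)


def false_negative_reduction(recall_before_pct: int, recall_after_pct: int) -> float:
    """Relative drop in missed matches when recall rises (inputs in percent)."""
    fn_before = 100 - recall_before_pct
    fn_after = 100 - recall_after_pct
    return (fn_before - fn_after) / fn_before


# Hypothetical counts: 3 correct matches, 1 wrong match, 1 missed match.
print(precision_recall(tp=3, fp=1, fn=1))

# Surname recall as read from the slide: 68 -> 77.
print(false_negative_reduction(68, 77))
```

Under that reading, missed matches fall from 32 per 100 to 23 per 100, a relative drop of about 28%, which agrees with the figure quoted on the slide.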
Who is using it?
WeRelate.org
Continuous improvement
Community oversight
How do I use it?
Roadmap
Future work
Future work Remove “chaff” variants from common names
Conclusion Images appearing on these slides are copyrighted by the contributors to http://commons.wikimedia.org and are used under license
 
