2. About Me
Gil Irizarry - Director of Engineering for name
and identity resolution technology
Basis Technology - leading provider of software
solutions for extracting meaningful intelligence
from multilingual text and digital devices
3. Names
What's in a name? that which we call a rose
By any other name would smell as sweet;
So Romeo would, were he not Romeo call'd,
Retain that dear perfection which he owes
Without that title.
Romeo and Juliet, Act 2, Scene 2
6. Lies We Believe About Names
People have exactly one canonical full name.
People have exactly one full name which they go by.
People have, at this point in time, exactly one canonical full name.
People have, at this point in time, one full name which they go by.
People have exactly N names, for any value of N.
People’s names fit within a certain defined amount of space.
People’s names do not change.
People’s names change, but only at a certain enumerated set of events.
People’s names are written in ASCII.
People’s names are written in any single character set.
People’s names are all mapped in Unicode code points.
People’s names are case sensitive.
People’s names are case insensitive.
People’s names sometimes have prefixes or suffixes, but you can safely ignore those.
People’s names do not contain numbers.
People’s names are not written in ALL CAPS.
People’s names are not written in all lower case letters.
People’s names have an order to them. Picking any ordering scheme will automatically result in consistent
ordering among all systems, as long as both use the same ordering scheme for the same name.
People’s first names and last names are, by necessity, different.
...and more…
https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/
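Several of the falsehoods above concern character sets, Unicode, and case. As a minimal sketch of why these matter in code, the Python below shows two strings that render identically but differ at the code point level, and compares them after Unicode normalization and case folding. The example name "José" and the helper `same_name_text` are illustrative assumptions, not taken from the list.

```python
import unicodedata

def same_name_text(a: str, b: str) -> bool:
    """Compare two name strings after NFC normalization and case folding."""
    def norm(s: str) -> str:
        return unicodedata.normalize("NFC", s).casefold()
    return norm(a) == norm(b)

# Illustrative example name (not from the slides).
composed = "Jos\u00e9"      # 'é' stored as a single code point
decomposed = "Jose\u0301"   # 'e' followed by a combining acute accent

print(composed == decomposed)                # raw comparison of code points
print(same_name_text(composed, decomposed))  # comparison after normalization
print(same_name_text("SMITH", "Smith"))      # case folding only
```

Normalization and case folding only address encoding and case differences; they do nothing for the other falsehoods on the list, such as names that change over time or people with several names.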
7. A True Story
Mícheál MacDonncha
vs.
Ardmhéara Micheál MacDonncha
(vs. Micheál Mac Donncha)
https://www.irishpost.com/news/misspelling-irish-lord-mayors-name-leads-mishap-trip-israel-153193
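The spellings on this slide differ only in accent placement and spacing. One possible way to make such variants collide, sketched here under the assumption that a lossy comparison key is acceptable for finding candidates, is to strip combining marks, whitespace, and case. The helper `fold_name` is hypothetical and is not the method used in the story.

```python
import unicodedata

def fold_name(name: str) -> str:
    """Build a lossy key: drop combining marks, whitespace, and case."""
    decomposed = unicodedata.normalize("NFD", name)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return "".join(no_marks.split()).casefold()

variants = ["Mícheál MacDonncha", "Micheál MacDonncha", "Micheál Mac Donncha"]
keys = {fold_name(v) for v in variants}
print(keys)  # a single shared key means the variants would be treated as candidates
```

Accents carry meaning in many languages, so a key like this is best used only to find candidates, with the original spelling kept for display. It also does not handle the variant on the slide where an extra leading token precedes the name.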
8. Imagine That You Need To Enter Your Name
https://www.basistech.com/case-study/nyu-names-search/
9. Imagine we have a backend datastore of identities...
● This identity directory is stored in Elasticsearch. (Why
Elasticsearch? Because it has fuzzy search capability)
● To access data, you have to enter a name
● The terminal asks for Last Name, First Name (and the
system stores your name as Smith, John Armstrong)
● However, the user enters:
○ John Smith
○ John A. Smith
○ John Armstrong Smith
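The three entries above differ from the stored record in token order, in a missing middle name, and in an initial standing in for a full name. A minimal sketch of order-insensitive matching with initials follows. It is independent of Elasticsearch, and the helpers `tokens`, `token_matches`, and `could_match` are hypothetical names introduced for illustration. It assumes a comma means "Last, First Middle", the format the terminal asks for.

```python
def tokens(name: str) -> list:
    """Split a name into lowercase tokens, given name first.

    Assumes a comma means 'Last, First Middle'.
    """
    if "," in name:
        last, _, rest = name.partition(",")
        name = f"{rest} {last}"
    return [t.strip(".").lower() for t in name.split()]

def token_matches(a: str, b: str) -> bool:
    """Tokens match if equal, or if one is a one-letter initial of the other."""
    if a == b:
        return True
    short, long_ = sorted((a, b), key=len)
    return len(short) == 1 and long_.startswith(short)

def could_match(query: str, stored: str) -> bool:
    """Every query token must match a distinct stored token, in any order."""
    remaining = tokens(stored)
    for q in tokens(query):
        hit = next((s for s in remaining if token_matches(q, s)), None)
        if hit is None:
            return False
        remaining.remove(hit)
    return True

stored = "Smith, John Armstrong"
for entered in ["John Smith", "John A. Smith", "John Armstrong Smith"]:
    print(entered, could_match(entered, stored))
```

The greedy token assignment keeps the sketch short; a fuller matcher would score alternative alignments rather than take the first hit.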
11. Now imagine a user calls into a call center...
● User calls into a call center
● An operator hears the user’s
name and transcribes it
● This is prone to errors
● The operator enters
○ Jon Smyth
○ John ArmstrongSmith
○ John Armstrong-Smith
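Transcription errors like these are often handled with edit distance, which is also the basis of Elasticsearch's fuzzy matching. The sketch below is a plain Levenshtein implementation plus a hypothetical `squash` helper that removes spaces and hyphens before comparing. It illustrates the idea only and is not a description of any particular product's algorithm.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using two rows of the dynamic-programming table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

def squash(name: str) -> str:
    """Keep letters only, lowercased, so spacing and hyphenation vanish."""
    return "".join(ch for ch in name if ch.isalpha()).lower()

pairs = [
    ("Jon Smyth", "John Smith"),
    ("John ArmstrongSmith", "John Armstrong Smith"),
    ("John Armstrong-Smith", "John Armstrong Smith"),
]
for heard, intended in pairs:
    print(heard, "->", intended, edit_distance(squash(heard), squash(intended)))
```

Edit distance treats all character changes alike. Because the operator is transcribing what was heard, a phonetic comparison may suit this case better; that is outside this sketch.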
13. Suppose Other Phenomena Are Entered
● Nicknames
○ Johnny Smith
● Gender mistakes
○ Joan Smith
○ Joanie Smith
● Differences in relative name frequencies
○ Shaun Smith
● Initials, especially for organization names
○ IBM vs. International Business Machines
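Nicknames and initials cannot be recovered by character-level comparison alone; they need some form of lookup or rule. A minimal sketch, assuming a tiny hand-written nickname table (`NICKNAMES` is hypothetical and far from complete) and a simple first-letter rule for organization acronyms:

```python
# Hypothetical, deliberately tiny nickname table for illustration only.
NICKNAMES = {"johnny": "john", "jon": "john", "joanie": "joan"}

def canonical_given_name(given: str) -> str:
    """Map a nickname to its canonical form; unknown names pass through."""
    key = given.lower()
    return NICKNAMES.get(key, key)

def is_acronym_of(short: str, full: str) -> bool:
    """True if `short` equals the first letters of the words in `full`."""
    initials = "".join(word[0] for word in full.split())
    return short.replace(".", "").upper() == initials.upper()

print(canonical_given_name("Johnny"))   # nickname resolved through the table
print(canonical_given_name("Joanie"))   # resolves to a different given name
print(is_acronym_of("IBM", "International Business Machines"))
```

Note that "Joan" and "John" stay distinct here: deciding whether "Joan Smith" is a gender mistake for "John Smith" or a different person needs more evidence than the name string provides.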
19. A Challenge Fulfilled
문재인 = 文在寅
문재인: Moon Jae-in in Hangul (Hangul: the Korean alphabet)
文在寅: Moon Jae-in in Hanja (Hanja: Chinese characters with Korean pronunciation)
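The Hangul and Hanja spellings of Moon Jae-in share no characters, so no amount of character-level fuzzy matching can relate them. The sketch below makes that concrete using Python's standard `unicodedata` module; the helper `script_hints` is a hypothetical name, and the script labels are a coarse approximation derived from Unicode character names.

```python
import unicodedata

def script_hints(text: str) -> set:
    """Return coarse script labels inferred from Unicode character names."""
    hints = set()
    for ch in text:
        name = unicodedata.name(ch, "")
        if name.startswith("HANGUL"):
            hints.add("Hangul")
        elif name.startswith("CJK UNIFIED IDEOGRAPH"):
            hints.add("Han")
    return hints

hangul, hanja = "문재인", "文在寅"
print(set(hangul) & set(hanja))                     # characters in common
print(script_hints(hangul), script_hints(hanja))    # script of each spelling
```

Knowing the script is only a first step. Linking the two spellings requires knowledge of how the Hanja characters are read in Korean, which a string comparison cannot supply.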
20. Conclusions
1. Matching text strings is straightforward; matching
names is not.
2. Text strings composed of different characters
may have the same social or cultural meaning.
21. Conclusions
3. The situation gets more complex when combining a
name with other information, such as date of birth
or address. These types of data also have multiple
formats and are prone to transcription errors.
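To illustrate the point about multiple formats, here is a minimal sketch for dates of birth. It tries a short, hypothetical list of formats (`FORMATS` is an assumption, not a complete inventory) and returns every date the text could denote, so that ambiguity is surfaced rather than silently resolved.

```python
from datetime import date, datetime

# Hypothetical list of accepted formats, for illustration only.
FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y"]

def possible_dates(text: str) -> set:
    """Return every calendar date the text could denote under FORMATS."""
    found = set()
    for fmt in FORMATS:
        try:
            found.add(datetime.strptime(text.strip(), fmt).date())
        except ValueError:
            pass  # the text does not fit this format
    return found

print(possible_dates("1980-02-01"))   # one reading
print(possible_dates("01/02/1980"))   # day-first and month-first both parse
print(possible_dates("13/02/1980"))   # only the day-first reading is valid
```

Two records could then be treated as compatible when their sets of possible dates overlap, with the final decision left to a scoring step that also weighs the name match.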