SlideShare una empresa de Scribd logo
1 de 37
The Data Bath
Rich Dudley
http://rjdudley.com
rich@rjdudley.com
@rj_dudley
For those about to rock…
I totally ate the first ones!
(from Weight Watchers mobile app)
To err is a conundrum
• University of Arkansas Little Rock
Center for Advanced Research in
Entity Resolution and
Information Quality (ERIQ)
• http://pewee.ws/6q
The Talburt Book
Why so important today?
• 1993: Name and mailing address
• 2013: Facebook Username, Twitter Id, Email, multiple phone #,
multiple mailing addresses
• EMS renaming/numbering streets
• If the service is free, you’re the product!
Barriers to BI Adoption
“Research: 2012 BI and Information Management”, Information Week, http://pewee.ws/6y
Talburt’s Five Activities of ER
• ERA1: Entity reference extraction
• ERA2: Entity reference preparation
• ERA3: Entity reference resolution
• ERA4: Entity identity management
• ERA5: Entity association analysis
Rich’s Five (Simplified) Activities of ER
• ERA1: Get your data in one place
• ERA2: Information quality - scrub your data
• ERA3: Entity resolution - match and link your data
• ERA4: Identity resolution - making the unknown, known
• ERA5: How are identities related to one another?
ERA2: Information Quality
“The degree of completeness, accuracy, timeliness, believability, consistency,
accessibility and other aspects of reference data can affect the operation of ER
processes and produce better or worse outcomes.”
Garbage In, Garbage Out
Information Quality
• Data quality
• Master data management
• Data governance
• What’s sauce for the goose may not be sauce for the gander
• IQ management is not just the responsibility of the IT department
• Success depends on demonstrating business value and having
support at the top
• Encoding
• UTF-8 to ASCII
• Conversion
• varchar to nvarchar
• Standardization
• “123 Main Street” to “123 Main
ST.”
• Correction
• Confirming ZIP codes and area
codes
• Bucketing
• Age 25-24 = Cohort A
• Age 35-44 = Cohort B
• Bursting
• “123 Main Street” into “123”,
“Main”, “ST.”
• Validation (are the data rational)
• Patient’s body temp was not 350F
• Enhancement (adding additional
information)
• Geocoding addresses
Scrub-a-dub-dub
Common Data Quality Issues
“Master Data and Data Quality Management in SQL Server 2012”, Mark Gschwind, http://pewee.ws/6z
ERA3: Entity Resolution
“Entity resolution is the process of probabilistically identifying some real thing
based on a set of possibly ambiguous clues”
Baseline
• An entity is a real world object (person, place or thing)
• Resolution is deciding if two references point to the same or different
entities
• An identity is a known entity
• Identity attributes are a set of defining characteristics that when
taken together distinguish one entity from another
• Cars – same make, model, color, VIN differentiates
What Distinguishes Identities
Rich Dudley
… or is it?September 28, 1967
DNA is the ultimate
identity attribute…
Demo: R-Swoosh
Using direct matching
R-Swoosh
ID FirstName LastName Address City State Zip Source
1 Hope Solo 123 W. Main St. Springfield RI 12345 FB
2 Han Solo 1600 Pennsylvania Ave Springfield RI 12345 LI
3 Jenny Jani 867 E Rt 5309 Springfield UT 76543 Raffle
4 H Solo 123 West Main Street Springfield RI 12345 Local
5 Jenny Jani 867 East Rte 5309 Springfield UT 76543 FB
6 Arthur Dent 42 Vogon Bypass Springfield IL 60543 FB
7 Jenni Jani 867 East Rte 5309 Springfield UT 76543 Local
ID FirstName LastName Address City State Zip Source
1
D
ER(R)
R-Swoosh: Match Functions
• Is M(x,y) == true?
• M(x.FirstName === y.FirstName)
• M(x.LastName === y.LastName)
• …
1 Hope Solo 123 W. Main St. Springfield RI 12345 FB
1
X
Y
1 Hope Solo 123 W. Main St. Springfield RI 12345 FB
1X
Y
R-Swoosh: Lather, Rinse, Repeat
1 Hope Solo 123 W. Main St. Springfield RI 12345 FB
2 Han Solo 1600 Pennsylvania Ave Springfield RI 12345 LIX
Y
1 Hope Solo 123 W. Main St. Springfield RI 12345 FB
2 Han Solo 1600 Pennsylvania Ave Springfield RI 12345 LI
3 Jenny Jani 867 E Rt 5309 Springfield UT 76543 RaffleX
Y
1 Hope Solo 123 W. Main St. Springfield RI 12345 FB
2 Han Solo 1600 Pennsylvania Ave Springfield RI 12345 LI
3 Jenny Jani 867 E Rt 5309 Springfield UT 76543 Raffle
4 H Solo 123 West Main Street Springfield RI 12345 LocalX
Y
Determining equivalence
• Direct matching
• Transitive equivalence
• Association analysis
• Asserted equivalence
Direct Matching
• Exact match (deterministic matching)
• John Doe != JOHN DOE
• Success depends on ERA2
• Numerical difference
• How close can the numbers be? Server time differences
• Approximate syntactic match
• Approximate String Matching (ASM)
• SimMetrics library
• Approximate semantic
• Synonym tables
• “Towing and Recovery” = “Wrecker service”
• Derived match codes
• Hash codes
• Soundex
• Blocking
Demo: SimMetrics Library
18 algorithms
Java, C# and SQL Server
Results are p values
…Lloyds… …Loyds…
Demo: Derived Match Codes
Soundex
• Replace consonants with digits as follows (but do not change the first letter):
• b, f, p, v => 1
• c, g, j, k, q, s, x, z => 2
• d, t => 3
• l => 4
• m, n => 5
• r => 6
• Collapse adjacent identical digits into a single digit of that value.
• Remove all non-digits after the first letter.
• Return the starting letter and the first three remaining digits. If needed, append zeroes to make it a
letter and three digits.
Blocking Strategy
• Blocking = partial match codes, using small blocks of the overall data
• Real life example
• ZIP code = 16001
• Previous house owner = Kathryn “Kathy” Davis
• Current: Kathleen “Kathy” Dudley
• Blocking Code = KathD16001
• Result: After 10 years, we still get all kinds of solicitations for Kathryn Davis
• Photo of blocked junk mail address label?
Transitive Equivalence
• Relies on intermediate entities, so several passes may be necessary
1. Mary Smith, 123 Main St, 555-1234
2. Mary Smith, 456 Elm St, 555-1234
3. Mary Smith, 456 Elm St, null
• 1 != 3, but 1 == 2 and 2 == 3 :: 1 == 3
Association Analysis
• Considering multiple relationships and making multiple decisions at
the same time
Name Address
Mary Smith 123 Oak St
Mary Smith 456 Elm St
John Smith 123 Oak St
John Smith 456 Elm St
Asserted Equivalence
• Knowledge-based linking, “we already know these are the same”
• Cassius Clay, 1960 Gold Medal, Boxing (Light Heavyweight)
• Muhammad Ali, 1964 World Champion, Boxing (Heavyweight)
• MS Carriers, purchased by Swift Transportation
• We don’t want to recalculate historical data, but we need a link to the
current identity
Putting it all together
• Show match functions of different natures
• Bursted + standardized address
• Blocked name+zip – SimMetrics?
1 Hope Solo 123 West Main St Springfield RI 12345 HopeSolo12345
4 H Solo 123 West Main St Springfield RI 12345 HSolo12345X
Y
Summary
• Information quality and entity resolution go hand-in-hand
• Both are growing in importance
• Identity attributes and level of resolution vary from need to need
• R-Swoosh + direct matching techniques are fairly universal
References
• Entity Resolution and Information Quality, John R. Talburt, Amazon
link: http://pewee.ws/6q
• How to solve common Data Quality Problems using Data Quality
Services (Part 1), Paras Doshi, http://pewee.ws/6t
• “Research: 2012 BI and Information Management”, Douglas
Henschen, Information Week, http://pewee.ws/6y
• “Master Data and Data Quality Management in SQL Server 2012”,
Mark Gschwind, http://pewee.ws/6z
SimMetrics References
• Java and C# (some ported to F#)
• http://sourceforge.net/projects/simmetrics/files/
• Installing into SQL Server
• http://anastasiosyal.com/POST/2009/01/11/18.ASPX
• Originator's current page:
• http://www.aktors.org/technologies/simmetrics/
• Archives of original project pages:
• http://web.archive.org/web/20081230184321/http://www.dcs.shef.ac.uk/~s
am/simmetrics.html
• http://web.archive.org/web/20081224234350/http://www.dcs.shef.ac.uk/~s
am/stringmetrics.html
The data bath

Más contenido relacionado

Similar a The data bath

Clients in control: building demand-driven systems with Om Next
Clients in control: building demand-driven systems with Om NextClients in control: building demand-driven systems with Om Next
Clients in control: building demand-driven systems with Om NextAntónio Monteiro
 
Data Quality for AML
Data Quality for AMLData Quality for AML
Data Quality for AMLPrecisely
 
2014 Family Search Developer Conference Place 2.0 API
2014 Family Search Developer Conference Place 2.0 API2014 Family Search Developer Conference Place 2.0 API
2014 Family Search Developer Conference Place 2.0 APIdshellman
 
Internet of things and their requirements.
Internet of things and their requirements.Internet of things and their requirements.
Internet of things and their requirements.Klearchos Klearchou
 
The Wisconsin NACO Funnel
The Wisconsin NACO FunnelThe Wisconsin NACO Funnel
The Wisconsin NACO FunnelWiLS
 
Neo4j Data Science Presentation
Neo4j Data Science PresentationNeo4j Data Science Presentation
Neo4j Data Science PresentationMax De Marzi
 
Geocoding Best Practices for Financial Services – Revealing Actionable Insigh...
Geocoding Best Practices for Financial Services – Revealing Actionable Insigh...Geocoding Best Practices for Financial Services – Revealing Actionable Insigh...
Geocoding Best Practices for Financial Services – Revealing Actionable Insigh...Precisely
 

Similar a The data bath (9)

RDF and SPARQL
RDF and SPARQLRDF and SPARQL
RDF and SPARQL
 
VRA 2012, Cataloging Case Studies, Metadata Magic
VRA 2012, Cataloging Case Studies, Metadata MagicVRA 2012, Cataloging Case Studies, Metadata Magic
VRA 2012, Cataloging Case Studies, Metadata Magic
 
Clients in control: building demand-driven systems with Om Next
Clients in control: building demand-driven systems with Om NextClients in control: building demand-driven systems with Om Next
Clients in control: building demand-driven systems with Om Next
 
Data Quality for AML
Data Quality for AMLData Quality for AML
Data Quality for AML
 
2014 Family Search Developer Conference Place 2.0 API
2014 Family Search Developer Conference Place 2.0 API2014 Family Search Developer Conference Place 2.0 API
2014 Family Search Developer Conference Place 2.0 API
 
Internet of things and their requirements.
Internet of things and their requirements.Internet of things and their requirements.
Internet of things and their requirements.
 
The Wisconsin NACO Funnel
The Wisconsin NACO FunnelThe Wisconsin NACO Funnel
The Wisconsin NACO Funnel
 
Neo4j Data Science Presentation
Neo4j Data Science PresentationNeo4j Data Science Presentation
Neo4j Data Science Presentation
 
Geocoding Best Practices for Financial Services – Revealing Actionable Insigh...
Geocoding Best Practices for Financial Services – Revealing Actionable Insigh...Geocoding Best Practices for Financial Services – Revealing Actionable Insigh...
Geocoding Best Practices for Financial Services – Revealing Actionable Insigh...
 

Último

Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKJago de Vreede
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 

Último (20)

Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 

The data bath

  • 1. The Data Bath Rich Dudley http://rjdudley.com rich@rjdudley.com @rj_dudley
  • 2.
  • 3. For those about to rock…
  • 4. I totally ate the first ones! (from Weight Watchers mobile app)
  • 5. To err is a conundrum
  • 6. • University of Arkansas Little Rock Center for Advanced Research in Entity Resolution and Information Quality (ERIQ) • http://pewee.ws/6q The Talburt Book
  • 7. Why so important today? • 1993: Name and mailing address • 2013: Facebook Username, Twitter Id, Email, multiple phone #, multiple mailing addresses • EMS renaming/numbering streets • If the service is free, you’re the product!
  • 8. Barriers to BI Adoption “Research: 2012 BI and Information Management”, Information Week, http://pewee.ws/6y
  • 9. Talburt’s Five Activities of ER • ERA1: Entity reference extraction • ERA2: Entity reference preparation • ERA3: Entity reference resolution • ERA4: Entity identity management • ERA5: Entity association analysis
  • 10. Rich’s Five (Simplified) Activities of ER • ERA1: Get your data in one place • ERA2: Information quality - scrub your data • ERA3: Entity resolution - match and link your data • ERA4: Identity resolution - making the unknown, known • ERA5: How are identities related to one another?
  • 11. ERA2: Information Quality “The degree of completeness, accuracy, timeliness, believability, consistency, accessibility and other aspects of reference data can affect the operation of ER processes and produce better or worse outcomes.”
  • 13. Information Quality • Data quality • Master data management • Data governance • What’s sauce for the goose may not be sauce for the gander • IQ management is not just the responsibility of the IT department • Success depends on demonstrating business value and having support at the top
  • 14. • Encoding • UTF-8 to ASCII • Conversion • varchar to nvarchar • Standardization • “123 Main Street” to “123 Main ST.” • Correction • Confirming ZIP codes and area codes • Bucketing • Age 25-24 = Cohort A • Age 35-44 = Cohort B • Bursting • “123 Main Street” into “123”, “Main”, “ST.” • Validation (are the data rational) • Patient’s body temp was not 350F • Enhancement (adding additional information) • Geocoding addresses Scrub-a-dub-dub
  • 15. Common Data Quality Issues “Master Data and Data Quality Management in SQL Server 2012”, Mark Gschwind, http://pewee.ws/6z
  • 16. ERA3: Entity Resolution “Entity resolution is the process of probabilistically identifying some real thing based on a set of possibly ambiguous clues”
  • 17. Baseline • An entity is a real world object (person, place or thing) • Resolution is deciding if two references point to the same or different entities • An identity is a known entity • Identity attributes are a set of defining characteristics that when taken together distinguish one entity from another • Cars – same make, model, color, VIN differentiates
  • 18. What Distinguishes Identities Rich Dudley … or is it?September 28, 1967 DNA is the ultimate identity attribute…
  • 20. R-Swoosh ID FirstName LastName Address City State Zip Source 1 Hope Solo 123 W. Main St. Springfield RI 12345 FB 2 Han Solo 1600 Pennsylvania Ave Springfield RI 12345 LI 3 Jenny Jani 867 E Rt 5309 Springfield UT 76543 Raffle 4 H Solo 123 West Main Street Springfield RI 12345 Local 5 Jenny Jani 867 East Rte 5309 Springfield UT 76543 FB 6 Arthur Dent 42 Vogon Bypass Springfield IL 60543 FB 7 Jenni Jani 867 East Rte 5309 Springfield UT 76543 Local ID FirstName LastName Address City State Zip Source 1 D ER(R)
  • 21. R-Swoosh: Match Functions • Is M(x,y) == true? • M(x.FirstName === y.FirstName) • M(x.LastName === y.LastName) • … 1 Hope Solo 123 W. Main St. Springfield RI 12345 FB 1 X Y 1 Hope Solo 123 W. Main St. Springfield RI 12345 FB 1X Y
  • 22. R-Swoosh: Lather, Rinse, Repeat 1 Hope Solo 123 W. Main St. Springfield RI 12345 FB 2 Han Solo 1600 Pennsylvania Ave Springfield RI 12345 LIX Y 1 Hope Solo 123 W. Main St. Springfield RI 12345 FB 2 Han Solo 1600 Pennsylvania Ave Springfield RI 12345 LI 3 Jenny Jani 867 E Rt 5309 Springfield UT 76543 RaffleX Y 1 Hope Solo 123 W. Main St. Springfield RI 12345 FB 2 Han Solo 1600 Pennsylvania Ave Springfield RI 12345 LI 3 Jenny Jani 867 E Rt 5309 Springfield UT 76543 Raffle 4 H Solo 123 West Main Street Springfield RI 12345 LocalX Y
  • 23. Determining equivalence • Direct matching • Transitive equivalence • Association analysis • Asserted equivalence
  • 24. Direct Matching • Exact match (deterministic matching) • John Doe != JOHN DOE • Success depends on ERA2 • Numerical difference • How close can the numbers be? Server time differences • Approximate syntactic match • Approximate String Matching (ASM) • SimMetrics library • Approximate semantic • Synonym tables • “Towing and Recovery” = “Wrecker service” • Derived match codes • Hash codes • Soundex • Blocking
  • 25. Demo: SimMetrics Library 18 algorithms Java, C# and SQL Server Results are p values
  • 28. Soundex • Replace consonants with digits as follows (but do not change the first letter): • b, f, p, v => 1 • c, g, j, k, q, s, x, z => 2 • d, t => 3 • l => 4 • m, n => 5 • r => 6 • Collapse adjacent identical digits into a single digit of that value. • Remove all non-digits after the first letter. • Return the starting letter and the first three remaining digits. If needed, append zeroes to make it a letter and three digits.
  • 29. Blocking Strategy • Blocking = partial match codes, using small blocks of the overall data • Real life example • ZIP code = 16001 • Previous house owner = Kathryn “Kathy” Davis • Current: Kathleen “Kathy” Dudley • Blocking Code = KathD16001 • Result: After 10 years, we still get all kinds of solicitations for Kathryn Davis • Photo of blocked junk mail address label?
  • 30. Transitive Equivalence • Relies on intermediate entities, so several passes may be necessary 1. Mary Smith, 123 Main St, 555-1234 2. Mary Smith, 456 Elm St, 555-1234 3. Mary Smith, 456 Elm St, null • 1 != 3, but 1 == 2 and 2 == 3 :: 1 == 3
  • 31. Association Analysis • Considering multiple relationships and making multiple decisions at the same time Name Address Mary Smith 123 Oak St Mary Smith 456 Elm St John Smith 123 Oak St John Smith 456 Elm St
  • 32. Asserted Equivalence • Knowledge-based linking, “we already know these are the same” • Cassius Clay, 1960 Gold Medal, Boxing (Light Heavyweight) • Muhammad Ali, 1964 World Champion, Boxing (Heavyweight) • MS Carriers, purchased by Swift Transportation • We don’t want to recalculate historical data, but we need a link to the current identity
  • 33. Putting it all together • Show match functions of different natures • Bursted + standardized address • Blocked name+zip – SimMetrics? 1 Hope Solo 123 West Main St Springfield RI 12345 HopeSolo12345 4 H Solo 123 West Main St Springfield RI 12345 HSolo12345X Y
  • 34. Summary • Information quality and entity resolution go hand-in-hand • Both are growing in importance • Identity attributes and level of resolution vary from need to need • R-Swoosh + direct matching techniques are fairly universal
  • 35. References • Entity Resolution and Information Quality, John R. Talburt, Amazon link: http://pewee.ws/6q • How to solve common Data Quality Problems using Data Quality Services (Part 1), Paras Doshi, http://pewee.ws/6t • “Research: 2012 BI and Information Management”, Douglas Henschen, Information Week, http://pewee.ws/6y • “Master Data and Data Quality Management in SQL Server 2012”, Mark Gschwind, http://pewee.ws/6z
  • 36. SimMetrics References • Java and C# (some ported to F#) • http://sourceforge.net/projects/simmetrics/files/ • Installing into SQL Server • http://anastasiosyal.com/POST/2009/01/11/18.ASPX • Originator's current page: • http://www.aktors.org/technologies/simmetrics/ • Archives of original project pages: • http://web.archive.org/web/20081230184321/http://www.dcs.shef.ac.uk/~s am/simmetrics.html • http://web.archive.org/web/20081224234350/http://www.dcs.shef.ac.uk/~s am/stringmetrics.html

Notas del editor

  1. This is a good example of false positives vs false negatives. In general, false negatives are harder to detect and resolve. In this false positive, the affected person raised a fuss and had all kinds of documentation so show otherwise. The false negative was unaware of anything, and probably wouldn’t have spoken up if he was.ER processes are designed to avoid false positives, even if that means false negatives occur. What’s the greater dilemma in your enterprise—false positive, or false negative?
  2. You can get a Ph.D. in ERIQ, so we’re obviously going to pull this off in an hour!
  3. Information quality, entity resolution and identity resolution are the techniques we clean up and prevent messes like we’ve seen here.Bands and candy are fun to laugh about, but your customers have different expectations. They expect you to know them as if you’re the corner florist and they’re a regular customer. They want this to happen often without giving you their real identity and telling you very little information about themselves. Customers may create multiple accounts of their own, and a single household may have many accounts with you. Knowledge is power, and understanding the relationships—where there are any—between the different accounts in your system can drive business decisions.ER is not always about identifying a specific person. Instead of trying to determine an exact identity, merely placing an unknown person into correct demographic buckets can help you anticipate their wants and needs, and serve them well focused product recommendations and ads.20 years ago, people were known by their real names and a physical address. Today, people may be known by several screen names and primarily contacted online.There is, however, something completely replacing a physical address and consolidating all the virtual identities—cell phones.In some circumstances, ER is used to connect virtual and real worlds and answer questions like:Do our mailers influence customers on our websites?Are certain demographics over- or under-represented in our site or application populations?The data about you is so valuable there’s a maxim in online marketing: if the service is free, you’re the product!
  4. Knowledge is power, and enterprises are adopting and building knowledge systems. There is another chart in this same survey showing some of the greatest barriers to implementing an information management system is dirty or unreliable data.
  5. Entity resolution is the name given to both the overall process as well as a specific step of the process.ERA1: Or, have all data accessible from a central location. This is the “Map” in “MapReduce”.ERA3 and ERA4: We can link entities without knowing who they are. Identity resolution is where we link unknown entities to known identities.In ERA3 it’s important to note you are linking data into a single view. Typically you don’t destroy older data, but very often you extract the finished goods to a data mart and use that going forward.ERA5: “Household” analysis, where you aren’t reducing to a single entity, but instead are finding relationships between several different entities. This could be linking family members together, or some other demographic relationship.We’re going to focus on ERA2 and ERA3 today.
  6. Information Quality and Entity Resolution are complementary ideas.
  7. Information Quality is a general term for an activity which includes data quality, management of master data, and ongoing data policies.One reason why I’m not giving hard and fast rules which define quality of data is because what’s right for one enterprise may not work in another. The techniques and theory are the same, but the exact rules will differ.Master data management – critical business entity types should be synchronized across the entire enterprise.It’s important for enterprises to realize that data quality is not just an IT function. Data governance is a function for everyone in the enterprise. Before undertaking an information quality effort, you should develop an idea of how to measure success, and how to demonstrate the value over time. You’re not doing this just because you feel like it.
  8. These are some of what you’ll do when cleaning up or standardizing information. Some of these you can accomplish in ERA1.Bucketing also carrier route sort for direct marketing
  9. This matrix shows several common types of data quality issues, and how the data looks after being scrubbed. By the end of this phase of work, your data should be readyI did not make this chart, it’s from the presentation cited on the slide.
  10. Differences come from a number of causes—misunderstandings, innocent mistakes, being rushed, average typos.
  11. Soundex is a phonetic algorithm, used to try and detect homophones. Soundex is built into SQL Server, but it has some serious issues.
  12. This is higher level analysis