SlideShare una empresa de Scribd logo
1 de 13
Learning Blocking Schemes for Record Linkage Matthew Michelson & Craig A. Knoblock University of Southern California / Information Sciences Institute Person Validation and Entity Resolution Conference May 23rd, 2011, Washington D.C.
Record Linkage Census Data A.I. Researchers First Name Last Name Phone Zip Matt Michelson 555-5555 12345 Jane Jones 555-1111 12345 Joe Smith 555-0011 12345 First Name Last Name Phone Zip Matthew Michelson 555-5555 12345 Jim Jones 555-1111 12345 Joe Smeth 555-0011 12345 match match
Blocking – Generating Candidates (token, last name) AND (1 st  letter, first name) = block-key   First Name Last Name Matt Michelson Jane Jones First Name Last Name Matthew Michelson Jim Jones First Name Last Name Zip Matt Michelson 12345 Matt Michelson 12345 Matt Michelson 12345 First Name Last Name Zip Matthew Michelson 12345 Jim Jones 12345 Joe Smeth 12345 (token, zip) . . . . . .
Blocking - Intuition Census Data Zip  =  ‘ 12345 ’ 1 Block of  12345  Zips     Compare to the block First Name Last Name Zip Matt Michelson 12345 Jane Jones 12345 Joe Smith 12345
Blocking Effectiveness Reduction Ratio  ( RR ) = 1 – ||C|| / (||S|| *|| T||) S,T are data sets; C is the set of candidates Pairs Completeness  ( PC ) = S m  / N m S m  = # true matches in candidates, N m  = # true matches between S and T (token, last name) AND (1 st  letter, first name)   RR  = 1 – 2/9  ≈ 0.78 PC  = 1 / 2  =  0.50 Examples: (token, zip) RR  =  1 – 9/9 = 0.0 PC  = 2 / 2 = 1.0
How to choose methods and attributes? ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Learning Schemes – Intuition ,[object Object],[object Object],[object Object],[object Object],[object Object]
Example to clear things up! Space of training examples = Not match = Match Rule 1 :- ( zip|token ) & ( first|token ) Rule 2 :- ( last|1 st  Letter ) & ( first|1 st  Letter ) Final Rule :- [(zip|token) & (first|token)] UNION [(last|1 st  Letter) & (first|1 st  letter)]
SCA: propositional rules ,[object Object],[object Object],[object Object],SEQUENTIAL-COVERING( class, attributes, examples, threshold) LearnedRules ← {} Rule ← LEARN-ONE-RULE(class, attributes, examples) While examples left to cover, do LearnedRules ← LearnedRules  U Rule Examples  ← Examples – {Examples covered by Rule} Rule ← LEARN-ONE-RULE(class, attributes, examples) If Rule contains any previously learned rules, remove them Return LearnedRules
SCA: propositional rules ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Learn-One-Rule ,[object Object],[object Object],[object Object],[object Object],[object Object],(token, zip) (token, last name)  (1 st  letter, last name)  (token, first name)  … (1st letter, last name)  (token, first name)  …
Experiments HFM  = ({token, make}  ∩ {token, year}  ∩ {token, trim} ) U  ({1 st  letter, make} ∩ {1 st  letter, year} ∩ {1 st  letter, trim})  U ({synonym, trim}) BSL  = ({token, model}  ∩ {token, year}  ∩ {token, trim} ) U  ({token, model} ∩ {token, year} ∩ {synonym, trim})  Restaurants RR PC Marlin 55.35 100.00 BSL 99.26 98.16 BSL (10%) 99.57 93.48 Cars RR PC HFM 47.92 99.97 BSL 99.86 99.92 BSL (10%) 99.87 99.88 Census RR PC Best 5 Winkler 99.52 99.16 Adaptive Filtering 99.9 92.7 BSL 98.12 99.85 BSL (10%) 99.50 99.13
Conclusions and Future Work ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

Más contenido relacionado

Destacado

Destacado (9)

Confidential Computing - Analysing Data Without Seeing Data
Confidential Computing - Analysing Data Without Seeing DataConfidential Computing - Analysing Data Without Seeing Data
Confidential Computing - Analysing Data Without Seeing Data
 
The Eurogene Project
The Eurogene ProjectThe Eurogene Project
The Eurogene Project
 
Big Data at Ancestry.com
Big Data at Ancestry.comBig Data at Ancestry.com
Big Data at Ancestry.com
 
Error Tolerant Record Matching PVERConf_May2011
Error Tolerant Record Matching PVERConf_May2011Error Tolerant Record Matching PVERConf_May2011
Error Tolerant Record Matching PVERConf_May2011
 
Record linkage, a real use case with spark ml - Paris Spark meetup Dec 2015
Record linkage, a real use case with spark ml  - Paris Spark meetup Dec 2015Record linkage, a real use case with spark ml  - Paris Spark meetup Dec 2015
Record linkage, a real use case with spark ml - Paris Spark meetup Dec 2015
 
Claudia medina: Linking Health Records for Population Health Research in Brazil.
Claudia medina: Linking Health Records for Population Health Research in Brazil.Claudia medina: Linking Health Records for Population Health Research in Brazil.
Claudia medina: Linking Health Records for Population Health Research in Brazil.
 
2015 06-27 use-r2015_dendextend_tal_galili_01
2015 06-27 use-r2015_dendextend_tal_galili_012015 06-27 use-r2015_dendextend_tal_galili_01
2015 06-27 use-r2015_dendextend_tal_galili_01
 
Indexing Techniques for Scalable Record Linkage and Deduplication
Indexing Techniques for Scalable Record Linkage and DeduplicationIndexing Techniques for Scalable Record Linkage and Deduplication
Indexing Techniques for Scalable Record Linkage and Deduplication
 
A Primer on Entity Resolution
A Primer on Entity ResolutionA Primer on Entity Resolution
A Primer on Entity Resolution
 

Último

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Último (20)

A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 

Learning Blocking Schemes for Record Linkage_PVERConf_May2011

  • 1. Learning Blocking Schemes for Record Linkage Matthew Michelson & Craig A. Knoblock University of Southern California / Information Sciences Institute Person Validation and Entity Resolution Conference May 23rd, 2011, Washington D.C.
  • 2. Record Linkage Census Data A.I. Researchers First Name Last Name Phone Zip Matt Michelson 555-5555 12345 Jane Jones 555-1111 12345 Joe Smith 555-0011 12345 First Name Last Name Phone Zip Matthew Michelson 555-5555 12345 Jim Jones 555-1111 12345 Joe Smeth 555-0011 12345 match match
  • 3. Blocking – Generating Candidates (token, last name) AND (1 st letter, first name) = block-key First Name Last Name Matt Michelson Jane Jones First Name Last Name Matthew Michelson Jim Jones First Name Last Name Zip Matt Michelson 12345 Matt Michelson 12345 Matt Michelson 12345 First Name Last Name Zip Matthew Michelson 12345 Jim Jones 12345 Joe Smeth 12345 (token, zip) . . . . . .
  • 4. Blocking - Intuition Census Data Zip = ‘ 12345 ’ 1 Block of 12345 Zips  Compare to the block First Name Last Name Zip Matt Michelson 12345 Jane Jones 12345 Joe Smith 12345
  • 5. Blocking Effectiveness Reduction Ratio ( RR ) = 1 – ||C|| / (||S|| *|| T||) S,T are data sets; C is the set of candidates Pairs Completeness ( PC ) = S m / N m S m = # true matches in candidates, N m = # true matches between S and T (token, last name) AND (1 st letter, first name) RR = 1 – 2/9 ≈ 0.78 PC = 1 / 2 = 0.50 Examples: (token, zip) RR = 1 – 9/9 = 0.0 PC = 2 / 2 = 1.0
  • 6.
  • 7.
  • 8. Example to clear things up! Space of training examples = Not match = Match Rule 1 :- ( zip|token ) & ( first|token ) Rule 2 :- ( last|1 st Letter ) & ( first|1 st Letter ) Final Rule :- [(zip|token) & (first|token)] UNION [(last|1 st Letter) & (first|1 st letter)]
  • 9.
  • 10.
  • 11.
  • 12. Experiments HFM = ({token, make} ∩ {token, year} ∩ {token, trim} ) U ({1 st letter, make} ∩ {1 st letter, year} ∩ {1 st letter, trim}) U ({synonym, trim}) BSL = ({token, model} ∩ {token, year} ∩ {token, trim} ) U ({token, model} ∩ {token, year} ∩ {synonym, trim}) Restaurants RR PC Marlin 55.35 100.00 BSL 99.26 98.16 BSL (10%) 99.57 93.48 Cars RR PC HFM 47.92 99.97 BSL 99.86 99.92 BSL (10%) 99.87 99.88 Census RR PC Best 5 Winkler 99.52 99.16 Adaptive Filtering 99.9 92.7 BSL 98.12 99.85 BSL (10%) 99.50 99.13
  • 13.

Notas del editor

  1. Beam search – adds backtracking. Search s.t. keep maximizing RR, and cover min. PC
  2. 1% of RR and PC mean different things. EG in Marlin dataset, 50% test ~50 matches, so 2% diff. With 90% data, our drop in PC represents ~ 6.5 matches on avg. FOR RR cars, going from RR: 98.346 to RR: 98.337 increases by 402 cands. Red box = not stat. sig with 2 tail t-test, alpha=0.05