On Client and Transaction
Identification and Matching
                  Problems

                    Veljko Pejović
                           veljkoveljko@gmail.com


    Coauthors: Emil Varga, Marko Stanković
Presentation Outline
 Introduction
 Data Input and Identification Problems
 Known Solutions
 Damerau Edit Distance Algorithm Modifications
 Algorithm Application, Weight Factors
 Determination
 Algorithm Regionalization
 Evaluation
 Conclusion and Future Work Guidelines
Introduction

 Problems
   Data Input Problems
     Unnecessary Repetition of a Character (Jaack)
     Character Permutation (Jakc)
     Character Omission (Jck)
     Initials, Abbreviations etc. (J.W.)
   Identification Problems
     Attribute Comparison
     Weight Factor Determination for each Attribute Pair
     Mind the Correlations
Identification Problems

 Identification Criteria
   Similarity of corresponding fields brings us
   closer to entity identification
 Identification Threshold
   Similarity probability above which we have
   identified the client – higher threshold
 Similarity Threshold
   Similarity probability above which we can claim
   similarity of two entities – lower threshold
Known Solutions

 LCS Approach
   Finds the longest common subsequence of two
   strings
   Example: for 'GCTAT' and 'CGATTA', one longest
   common subsequence is 'GTT'
 Ratcliff/Obershelp Algorithm
   Returns similarity percentage of two strings
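
A minimal Python sketch of both known approaches, added for illustration. The function names are assumptions, not part of the original slides. `lcs_length` is the textbook dynamic-programming LCS; `similarity_ratio` uses the standard library's `difflib.SequenceMatcher`, whose matching is similar to (not identical with) Ratcliff/Obershelp.

```python
from difflib import SequenceMatcher


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)          # row for the previous character of a
    for ch_a in a:
        curr = [0]                     # column 0: empty prefix of b
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity_ratio(a: str, b: str) -> float:
    """Similarity in [0, 1] from difflib (Ratcliff/Obershelp-like)."""
    return SequenceMatcher(None, a, b).ratio()


print(lcs_length("GCTAT", "CGATTA"))   # length of 'GTT'
print(similarity_ratio("Jack", "Jakc"))
```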
Known Solutions

 Edit Distance Approach
   Edit Distance – Difference between two
   strings observed through operations
   necessary for bringing them into the same
   state
   Every operation has its cost
 Algorithms
   Levenshtein – 3 basic operations
   Damerau Edit Distance algorithm
     Additional operation – character transposition
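
A short Python sketch of the edit distance idea, assuming unit cost for every operation. It implements the restricted Damerau variant (often called optimal string alignment), where only adjacent characters may be swapped; the function name and the `transpositions` flag are illustrative choices.

```python
def edit_distance(a: str, b: str, transpositions: bool = True) -> int:
    """Levenshtein distance; with transpositions=True, the restricted
    Damerau variant that also allows swapping two adjacent characters."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                    # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j                    # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]
```

With the input errors from the Introduction slide, "Jaack", "Jakc" and "Jck" are each one operation away from "Jack" when transpositions are allowed; plain Levenshtein needs two operations for "Jakc".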
Damerau Edit Distance Algorithm
           Modifications
Changes will be made in order to adjust the
algorithm to the given problem

Solving the Following Key Problems
  Unnecessary Repetition of a Character
     Lower cost of insertion operation
  Initials usage
     Comparison of starting letters only
  Separator omission
     Separators will be ignored
  Abbreviation usage
     Abbreviation Dictionary (data mining)
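
One way the four modifications could be combined, as a hedged Python sketch rather than the authors' implementation. The separator set, the `repeat_cost` of 0.5, the dictionary entry and all function names are assumptions made for illustration.

```python
SEPARATORS = set(" .,-_/'")
# Hypothetical dictionary entry, for illustration only.
ABBREVIATIONS = {"bgd": "beograd"}


def normalize(token: str) -> str:
    """Lower-case, drop separators, then expand a known abbreviation."""
    bare = "".join(ch for ch in token.lower() if ch not in SEPARATORS)
    return ABBREVIATIONS.get(bare, bare)


def step_cost(s: str, k: int, repeat_cost: float) -> float:
    """Cost of inserting or deleting the k-th character (1-based) of s;
    cheaper when it repeats the character just before it."""
    repeated = k >= 2 and s[k - 1] == s[k - 2]
    return repeat_cost if repeated else 1.0


def modified_distance(a: str, b: str, repeat_cost: float = 0.5) -> float:
    a, b = normalize(a), normalize(b)
    # Initials: compare starting letters only.
    if a and b and (len(a) == 1 or len(b) == 1):
        return 0.0 if a[0] == b[0] else 1.0
    d = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        d[i][0] = d[i - 1][0] + step_cost(a, i, repeat_cost)
    for j in range(1, len(b) + 1):
        d[0][j] = d[0][j - 1] + step_cost(b, j, repeat_cost)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            substitution = 0.0 if a[i - 1] == b[j - 1] else 1.0
            d[i][j] = min(
                d[i - 1][j] + step_cost(a, i, repeat_cost),  # deletion
                d[i][j - 1] + step_cost(b, j, repeat_cost),  # insertion
                d[i - 1][j - 1] + substitution,
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1.0)  # transposition
    return d[len(a)][len(b)]
```

Under these assumptions a doubled character ("Jaack") costs 0.5 instead of 1, "J.W." and "JW" are identical once separators are dropped, and "Z." matches "Zoran" on the first letter alone.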
Algorithm Application, Weight Factors
           Determination
Table Clients:                    Table Transactions:
   Name                              Name
   Surname                           Surname
   Personal ID Number                Personal ID Number
   City                              City
   Street                            Street
   Apt. No.                          Apt. No.
   Zip Code                          Zip Code
   Date of Birth                     Date of Birth
   Client ID (primary key)           Transaction ID (primary key)
                                     Internal Transaction Number
                                     Type of Transaction
                                     Amount
                                     Account No.
Algorithm Application, Weight Factors
           Determination
Table Result:
  Client ID
  Transaction ID
  Probability for Name
  Probability for Surname
  Probability for Personal ID Number
  Probability for City
  Probability for Street
  Probability for Apt. No.
  Probability for Zip Code
  Probability for Date of Birth
  Total Probability
  Result
Algorithm Application, Weight Factors
           Determination


Comparison of corresponding attributes in
two tables (Clients and Transactions)
Each calculated similarity probability is
stored in table Result

Iteratively for every pair of attributes
Algorithm Application, Weight Factors
           Determination
Weight factors should
be well determined

The leaves represent
probability for
similarity of two
attributes
[-100%, 100%]

The branches
represent weight
factors [0, 1]
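
The tree can be read as a recursive weighted combination. The sketch below is an assumption about how leaves and branches combine: the slide gives only the value ranges, so the normalization by the sum of sibling weights, the `Node` type and the function name are illustrative.

```python
from typing import List, Tuple, Union

# A node is either a leaf similarity in [-1, 1] (i.e. -100% .. 100%)
# or a list of (branch weight in [0, 1], child node) pairs.
Node = Union[float, List[Tuple[float, "Node"]]]


def evaluate(node: Node) -> float:
    """Weighted average of the children, applied recursively."""
    if isinstance(node, (int, float)):
        if not -1.0 <= node <= 1.0:
            raise ValueError("leaf similarity must lie in [-1, 1]")
        return float(node)
    for weight, _ in node:
        if not 0.0 <= weight <= 1.0:
            raise ValueError("branch weight must lie in [0, 1]")
    total_weight = sum(weight for weight, _ in node)
    if total_weight == 0:
        return 0.0  # assumption: a node with no weighted children is neutral
    weighted = sum(weight * evaluate(child) for weight, child in node)
    return weighted / total_weight


# Two equally weighted attributes, one matching and one contradicting.
print(evaluate([(0.5, 1.0), (0.5, -1.0)]))
```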
Algorithm Application, Weight Factors
               Determination
   Certain attributes
   correlate

   Data redundancy
         Dictionary Table

   Total probability
   calculation:
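   \[
   r = \begin{cases}
     \mathit{pid}, & \mathit{pid} > I \\
     \mathit{nad}, & \mathit{nad} > I \\
     \mathit{pid} \cdot q + \mathit{nad} \cdot (1 - q), & \mathit{pid} > 0 \land \mathit{nad} > 0 \\
     0, & \text{otherwise}
   \end{cases}
   \qquad
   q = \begin{cases}
     \dfrac{\mathit{pid}}{\mathit{nad}}, & \mathit{pid} > \mathit{nad} \\[1ex]
     \dfrac{\mathit{nad}}{\mathit{pid}}, & \mathit{nad} > \mathit{pid}
   \end{cases}
   \]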
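
A literal Python transcription of the piecewise definition above, offered as a sketch only. The slide does not define pid, nad or I, so the code treats them as plain numbers; evaluating the cases top to bottom and using q = 1 when pid equals nad (a case the slide leaves open) are both assumptions.

```python
def total_probability(pid: float, nad: float, threshold_i: float) -> float:
    """Piecewise r from the slide; threshold_i stands for the symbol I."""
    if pid > threshold_i:
        return pid
    if nad > threshold_i:
        return nad
    if pid > 0 and nad > 0:
        # Assumption: pid == nad falls into the first branch, giving q = 1.
        q = pid / nad if pid >= nad else nad / pid
        return pid * q + nad * (1 - q)
    return 0.0
```

As transcribed, q is never below 1, so the third branch is not a convex combination of pid and nad; the sketch does not clamp its result.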
Algorithm Application, Weight Factors
           Determination
Thresholds:

   Identification
   threshold ~ 94 %

   Similarity threshold
   ~ 54 %

Results above the
Similarity threshold
will be stored in table
Result
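
The two thresholds can be applied as a simple three-way decision. In this sketch the default values come from the slide (about 94% and 54%); the labels, the function name and the strict "greater than" comparison are assumptions.

```python
IDENTIFICATION_THRESHOLD = 0.94  # approximate value from the slide
SIMILARITY_THRESHOLD = 0.54      # approximate value from the slide


def classify(total_probability: float,
             identification: float = IDENTIFICATION_THRESHOLD,
             similarity: float = SIMILARITY_THRESHOLD) -> str:
    """Map a total probability to a decision label."""
    if total_probability > identification:
        return "identified"   # client identified
    if total_probability > similarity:
        return "similar"      # kept in table Result for review
    return "no match"         # not stored


def should_store(total_probability: float) -> bool:
    """Only results above the similarity threshold go to table Result."""
    return classify(total_probability) != "no match"
```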
Algorithm Regionalization

Common names/surnames
  The more common a name pair is, the less influence it has
  on total similarity. Adjustable weight factors
Characteristic suffixes, infixes and prefixes (-ić,
-Van-, Mc-)
  These will be ignored during the matching phase
Different alphabets
  Alphabet “Leveling” –
  ћирилица, ćirilica, cirilica…
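
Alphabet leveling could look like the following Python sketch: Serbian Cyrillic is mapped to Latin, then diacritics are stripped, so all three spellings above reduce to the same key. The mapping table, the choice of "dj" for "đ" and the function name are assumptions, not taken from the slides.

```python
import unicodedata

# Serbian Cyrillic to Latin, lower case only (input is lower-cased first).
CYRILLIC_TO_LATIN = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}


def level(text: str) -> str:
    """Reduce Cyrillic, Latin with diacritics, and plain Latin to one form."""
    latin = "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in text.lower())
    latin = latin.replace("đ", "dj")  # assumption: đ has no decomposition
    decomposed = unicodedata.normalize("NFD", latin)
    # Drop combining marks such as the acute accent of ć or the caron of š.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(level("ћирилица"), level("ćirilica"), level("cirilica"))
```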
Evaluation

 Competitive solution
   Based on simple LCS algorithm
 Test vectors, Example
   “Z. Mihajlović, Sremska 33, Bgf, 11000”
   “Zoran Mihailović, Sremska 33, Beograd 11000”
 Result evaluation
Conclusion And Future Work Guidelines

Main strong points of the proposed
solution:
  Based on a well-developed and well-examined
  algorithm
  Adjusted to one particular problem
  Dynamic reliability improvement
  Flexibility
  Regionalization
Conclusion And Future Work Guidelines

Possible Improvements
  Automatic database update after the identification
  process
  Coding an address to “Address code”
  Mapping the standard key settings on different keyboard
  layouts
  Dynamic value change of identification and similarity
  threshold – adjust to the users’ expectations
  The system should be verified in a “real world” environment
Thank You!




      - Comments And Questions, Please!
