This document discusses problems with identifying clients and matching transactions, and proposes solutions using modified Damerau-Levenshtein and edit distance algorithms. It presents an approach involving comparing attributes across client and transaction tables, determining attribute similarity probabilities and weights, and calculating a total probability for matching. Thresholds would be set at 94% for identification and 54% for similarity. The algorithm would be tailored for specific problems and regions. Evaluation and future improvements are also discussed.
On Client and Transaction Identification and Matching Problems
1. On Client and Transaction Identification and Matching Problems
Veljko Pejović
veljkoveljko@gmail.com
Coauthors: Emil Varga, Marko Stanković
2. Presentation Outline
Introduction
Data Input and Identification Problems
Known Solutions
Damerau Edit Distance Algorithm Modifications
Algorithm Application, Weight Factors Determination
Algorithm Regionalization
Evaluation
Conclusion and Future Work Guidelines
3. Introduction
Problems
Data Input Problems
Unnecessary Repetition of a Character (Jaack)
Character Permutation (Jakc)
Character Omission (Jck)
Initials, Abbreviations etc. (J.W.)
Identification Problems
Attribute Comparison
Weight Factor Determination for each Attribute Pair
Mind the Correlations
4. Identification Problems
Identification Criteria
Similarity of corresponding fields brings us closer to entity identification
Identification Threshold
Similarity probability above which the client is considered identified (the higher threshold)
Similarity Threshold
Similarity probability above which two entities can be claimed to be similar (the lower threshold)
5. Known Solutions
LCS Approach
Finds the longest common subsequence of two strings
Example: for 'GCTAT' and 'CGATTA', a longest common subsequence is 'GTT'
Ratcliff/Obershelp Algorithm
Returns the similarity of two strings as a percentage
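Both known approaches can be tried out in a few lines. The sketch below is illustrative only: `lcs_length` is a plain dynamic-programming helper written for this example, and Python's `difflib.SequenceMatcher` is used as a stand-in for Ratcliff/Obershelp (its documentation describes it as a somewhat fancier relative of that algorithm, so the scores may differ from a textbook implementation).

```python
from difflib import SequenceMatcher


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    # prev[j] is the LCS length of the already processed prefix of a and b[:j].
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def gestalt_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] from difflib's gestalt-style matcher."""
    return SequenceMatcher(None, a, b).ratio()


print(lcs_length("GCTAT", "CGATTA"))  # 3, for example 'GTT'
print(gestalt_similarity("Jack", "Jakc"))
```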
6. Known Solutions
Edit Distance Approach
Edit distance: the difference between two strings, measured by the operations needed to transform one into the other
Every operation has its cost
Algorithms
Levenshtein: 3 basic operations (insertion, deletion, substitution)
Damerau Edit Distance algorithm
Additional operation: transposition of two adjacent characters
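As a point of reference, here is a minimal sketch of the unmodified Damerau edit distance with unit costs. It uses the restricted "optimal string alignment" variant, which is an assumption of this example; the slides do not say which variant the authors started from. The function name `damerau_distance` is hypothetical.

```python
def damerau_distance(a: str, b: str) -> int:
    """Edit distance with insertion, deletion, substitution and
    adjacent transposition, each at cost 1 (optimal string alignment)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


# The three data input problems from the introduction each cost one edit.
print(damerau_distance("Jack", "Jaack"))  # 1 (repeated character)
print(damerau_distance("Jack", "Jakc"))   # 1 (permutation)
print(damerau_distance("Jack", "Jck"))    # 1 (omission)
```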
7. Damerau Edit Distance Algorithm Modifications
The algorithm is modified to fit the given problem
Solving the Following Key Problems
Unnecessary Repetition of a Character
Lower cost of insertion operation
Initials usage
Comparison of starting letters only
Separator omission
Separators will be ignored
Abbreviation usage
Abbreviation Dictionary (data mining)
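The four modifications listed above could be sketched roughly as follows. Everything concrete here is an assumption made for illustration: the cost value `REPEAT_INSERT_COST`, the separator set, the contents of `ABBREVIATIONS`, and all function names are hypothetical and do not come from the slides.

```python
import re

# Hypothetical abbreviation dictionary; the slides suggest mining it from data.
ABBREVIATIONS = {"bgd": "beograd"}

# Assumed cost for an extra character that repeats its predecessor.
REPEAT_INSERT_COST = 0.5


def normalize(text: str) -> str:
    """Lowercase, expand known abbreviations, and drop separators."""
    tokens = re.split(r"[\s.,\-_/]+", text.lower())
    return "".join(ABBREVIATIONS.get(t, t) for t in tokens if t)


def initials_agree(a: str, b: str) -> bool:
    """True when one side is an initial and the starting letters match."""
    a, b = a.strip(". ").lower(), b.strip(". ").lower()
    if not a or not b:
        return False
    return (len(a) == 1 or len(b) == 1) and a[0] == b[0]


def modified_distance(a: str, b: str) -> float:
    """Damerau-style distance where a repeated character is cheaper."""

    def extra(s: str, k: int) -> float:
        # Cost of an unmatched character s[k]; cheaper if it repeats s[k-1].
        return REPEAT_INSERT_COST if k > 0 and s[k] == s[k - 1] else 1.0

    d = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        d[i][0] = d[i - 1][0] + extra(a, i - 1)
    for j in range(1, len(b) + 1):
        d[0][j] = d[0][j - 1] + extra(b, j - 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] else 1.0
            d[i][j] = min(
                d[i - 1][j] + extra(a, i - 1),  # extra character in a
                d[i][j - 1] + extra(b, j - 1),  # extra character in b
                d[i - 1][j - 1] + sub,          # match or substitution
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1.0)  # transposition
    return d[-1][-1]


print(modified_distance(normalize("Jaack"), normalize("Jack")))  # 0.5
print(initials_agree("Z.", "Zoran"))                             # True
```

Treating the cheaper operation symmetrically (an extra repeated character in either string) is a design choice of this sketch, so that the distance does not depend on argument order.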
8. Algorithm Application, Weight Factors Determination
Table Clients:               Table Transactions:
Name                         Name
Surname                      Surname
Personal ID Number           Personal ID Number
City                         City
Street                       Street
Apt. No.                     Apt. No.
Zip Code                     Zip Code
Date of Birth                Date of Birth
Client ID (primary key)      Transaction ID (primary key)
                             Internal Transaction Number
                             Type of Transaction
                             Amount
                             Account No.
9. Algorithm Application, Weight Factors Determination
Table Result:
Client ID
Transaction ID
Probability for Name
Probability for Surname
Probability for Personal ID Number
Probability for City
Probability for Street
Probability for Apt. No.
Probability for Zip Code
Probability for Date of Birth
Total Probability
Result
10. Algorithm Application, Weight Factors Determination
Corresponding attributes in the two tables (Clients and Transactions) are compared
Each calculated similarity probability is stored in table Result
This is repeated iteratively for every pair of attributes
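The comparison loop described above might look like the following sketch. The column names in `ATTRIBUTES`, the dictionary-based row format, and the `p_` prefix are hypothetical. The `similarity` function is only a placeholder built on `difflib` and returns values in [0, 1]; it is not the modified Damerau measure from the slides.

```python
from difflib import SequenceMatcher

# Hypothetical column names for the attributes shared by both tables.
ATTRIBUTES = [
    "name", "surname", "personal_id", "city",
    "street", "apt_no", "zip_code", "date_of_birth",
]


def similarity(a: str, b: str) -> float:
    """Placeholder similarity in [0, 1]; missing values score 0."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def compare_tables(clients: list, transactions: list) -> list:
    """Build one Result row per (client, transaction) pair."""
    results = []
    for client in clients:
        for tx in transactions:
            row = {
                "client_id": client["client_id"],
                "transaction_id": tx["transaction_id"],
            }
            # One similarity probability per pair of corresponding attributes.
            for attr in ATTRIBUTES:
                row[f"p_{attr}"] = similarity(
                    client.get(attr, ""), tx.get(attr, "")
                )
            results.append(row)
    return results


clients = [{"client_id": 1, "name": "Zoran", "surname": "Mihajlović"}]
transactions = [{"transaction_id": 7, "name": "zoran", "surname": "Mihailović"}]
for row in compare_tables(clients, transactions):
    print(row["client_id"], row["transaction_id"], row["p_name"], row["p_surname"])
```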
11. Algorithm Application, Weight Factors Determination
Weight factors must be chosen carefully
The leaves represent the probability of similarity of two attributes, in [-100%, 100%]
The branches represent weight factors, in [0, 1]
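One simple way to read the leaves-and-branches picture is as a weighted average, sketched below. This is an assumption: the slides do not give the shape of the tree or the combination rule, so a single-level tree with normalized weights is used here, and the weight values are made up for illustration.

```python
from typing import Mapping


def weighted_total(leaves: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Weighted average of leaf similarities.

    leaves  : attribute -> similarity in [-1.0, 1.0]
    weights : attribute -> weight factor in [0.0, 1.0]
    """
    total_weight = sum(weights[attr] for attr in leaves)
    if total_weight == 0:
        return 0.0
    return sum(leaves[attr] * weights[attr] for attr in leaves) / total_weight


# Illustrative values only.
leaves = {"name": 1.0, "city": -1.0}
weights = {"name": 0.75, "city": 0.25}
print(weighted_total(leaves, weights))  # 0.5
```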
12. Algorithm Application, Weight Factors Determination
Certain attributes
correlate
Data redundancy
Dictionary Table
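Total probability calculation:
$$
r =
\begin{cases}
\mathit{pid}, & \mathit{pid} > I \\
\mathit{nad}, & \mathit{nad} > I \\
\mathit{pid} \cdot q + \mathit{nad} \cdot (1 - q), & \mathit{pid} > 0 \land \mathit{nad} > 0 \\
0, & \text{otherwise}
\end{cases}
\qquad
q =
\begin{cases}
\dfrac{\mathit{pid}}{\mathit{nad}}, & \mathit{pid} > \mathit{nad} \\[1ex]
\dfrac{\mathit{nad}}{\mathit{pid}}, & \mathit{nad} > \mathit{pid}
\end{cases}
$$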
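The piecewise definition above can be transcribed literally into code, as in the sketch below. Two things are assumptions of this example: that I denotes the identification threshold, and that q = 1 when pid equals nad (a case the slide does not cover). Note that q as transcribed is never below 1, so the blended value is not guaranteed to stay within the range of its inputs; the function name `total_probability` is hypothetical.

```python
def total_probability(pid: float, nad: float, i_threshold: float) -> float:
    """Literal transcription of the piecewise definition of r."""
    if pid > i_threshold:
        return pid
    if nad > i_threshold:
        return nad
    if pid > 0 and nad > 0:
        if pid > nad:
            q = pid / nad
        elif nad > pid:
            q = nad / pid
        else:
            q = 1.0  # assumption: the slide leaves pid == nad undefined
        return pid * q + nad * (1 - q)
    return 0.0


print(total_probability(0.95, 0.20, 0.94))  # 0.95, first case
print(total_probability(-0.20, 0.50, 0.94))  # 0.0, fallback case
```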
13. Algorithm Application, Weight Factors Determination
Thresholds:
Identification threshold ~ 94%
Similarity threshold ~ 54%
Results above the Similarity threshold will be stored in table Result
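Applying the two thresholds is a small step, sketched below. The threshold constants use the approximate values from the slide; the labels `"identified"`, `"similar"` and `"no match"`, the `total_probability` key, and the function names are hypothetical.

```python
IDENTIFICATION_THRESHOLD = 0.94  # approximate value from the slide
SIMILARITY_THRESHOLD = 0.54      # approximate value from the slide


def classify(total_probability: float) -> str:
    """Label a client/transaction pair by its total probability."""
    if total_probability > IDENTIFICATION_THRESHOLD:
        return "identified"
    if total_probability > SIMILARITY_THRESHOLD:
        return "similar"
    return "no match"


def rows_to_store(rows: list) -> list:
    """Keep only rows above the similarity threshold for table Result."""
    return [r for r in rows if r["total_probability"] > SIMILARITY_THRESHOLD]


print(classify(0.97))  # identified
print(classify(0.70))  # similar
print(classify(0.30))  # no match
```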
14. Algorithm Regionalization
Common names/surnames
The more common a name pair is, the less influence it has on total similarity; weight factors are adjustable
Characteristic suffixes, infixes and prefixes (-ić, -Van-, Mc-)
These will be ignored during the matching phase
Different alphabets
Alphabet “Leveling” –
ћирилица, ćirilica, cirilica…
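Alphabet leveling could be sketched as a two-step normalization: transliterate Serbian Cyrillic to Latin, then strip diacritics. The function name `level_alphabet` and the choice to map đ to "dj" are assumptions of this example, not something the slides specify.

```python
import unicodedata

# Serbian Cyrillic to Latin, lowercase letters only.
CYRILLIC_TO_LATIN = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}


def level_alphabet(text: str) -> str:
    """Reduce Cyrillic and accented Latin spellings to plain ASCII Latin."""
    latin = "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in text.lower())
    # đ has no Unicode decomposition, so it is replaced explicitly (assumed "dj").
    latin = latin.replace("đ", "dj")
    decomposed = unicodedata.normalize("NFD", latin)
    # Drop the combining marks left over from decomposition (ć -> c, š -> s).
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


for spelling in ("ћирилица", "ćirilica", "cirilica"):
    print(level_alphabet(spelling))  # cirilica each time
```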
15. Evaluation
Competing solution
Based on a simple LCS algorithm
Test vectors, Example
“Z. Mihajlović, Sremska 33, Bgf, 11000”
“Zoran Mihailović, Sremska 33, Beograd 11000”
Result evaluation
16. Conclusion And Future Work Guidelines
Main strong points of the proposed solution:
Based on a well-developed and well-studied algorithm
Adjusted to one particular problem
Dynamic reliability improvement
Flexibility
Regionalization
17. Conclusion And Future Work Guidelines
Possible Improvements
Automatic database update after the identification process
Coding an address into an “address code”
Mapping standard key positions across different keyboard layouts
Dynamic adjustment of the identification and similarity thresholds to match users’ expectations
The system should be verified in a “real world” environment
19. On Client and Transaction Identification and Matching Problems
- Comments And Questions, Please!