On Client and Transaction Identification and Matching Problems

Veljko Pejović
veljkoveljko@gmail.com

Coauthors: Emil Varga, Marko Stanković

Presentation Outline
Introduction
Data Input and Identification Problems
Known Solutions
Damerau Edit Distance Algorithm Modifications
Algorithm Application, Weight Factors Determination
Algorithm Regionalization
Evaluation
Conclusion and Future Work Guidelines

Introduction

Problems
Data Input Problems
Unnecessary Repetition of a Character (Jaack)
Character Permutation (Jakc)
Character Omission (Jck)
Initials, Abbreviations etc. (J.W.)
Identification Problems
Attribute Comparison
Weight Factor Determination for each Attribute Pair
Mind the Correlations

Identification Problems

Identification Criteria
Similarity of corresponding fields brings us closer to entity identification
Identification Threshold
Similarity probability above which the client is considered identified (the higher threshold)
Similarity Threshold
Similarity probability above which two entities can be claimed to be similar (the lower threshold)

Known Solutions

LCS Approach
Finds the longest common subsequence of two strings
Example: for 'GCTAT' and 'CGATTA', a longest common subsequence is 'GTT'
Ratcliff/Obershelp Algorithm
Returns the similarity of two strings as a percentage
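
Both known solutions can be illustrated with a short Python sketch. The function names are hypothetical, and the similarity percentage uses the standard library's difflib.SequenceMatcher, whose documentation describes it as closely related to (not identical with) the Ratcliff/Obershelp "gestalt pattern matching" algorithm.

```python
from difflib import SequenceMatcher


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    # prev[j] is the LCS length of the already processed prefix of a and b[:j].
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity_percent(a: str, b: str) -> float:
    """Similarity in percent, based on difflib's matching-blocks ratio."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


print(lcs_length("GCTAT", "CGATTA"))  # 3, for example 'GTT'
print(similarity_percent("Jack", "Jakc"))
```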

Known Solutions

Edit Distance Approach
Edit Distance: the difference between two strings, measured by the operations needed to transform one into the other
Every operation has its cost
Algorithms
Levenshtein: 3 basic operations (insertion, deletion, substitution)
Damerau Edit Distance algorithm
Additional operation: transposition of adjacent characters
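
As an illustration, here is a minimal Python sketch of the edit distance with the additional transposition operation. It is the restricted "optimal string alignment" variant with every operation costing 1; the function name is hypothetical, and the unit costs are an assumption, since the slides only state that every operation has a cost.

```python
def damerau_distance(a: str, b: str) -> int:
    """Edit distance with insertion, deletion, substitution and
    transposition of two adjacent characters (unit costs)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition: the last two characters are swapped.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(damerau_distance("Jack", "Jakc"))   # 1 (one transposition)
print(damerau_distance("Jack", "Jck"))    # 1 (one omitted character)
print(damerau_distance("Jack", "Jaack"))  # 1 (one repeated character)
```

With plain Levenshtein distance the permutation 'Jakc' would cost 2 (two substitutions), which is why the transposition operation matters for typing errors.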

Damerau Edit Distance Algorithm Modifications
The algorithm is modified to fit the given problem

Solving the Following Key Problems
Unnecessary Repetition of a Character
Lower cost of insertion operation
Initials usage
Comparison of starting letters only
Separator omission
Separators will be ignored
Abbreviation usage
Abbreviation Dictionary (data mining)
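
Three of the modifications listed above can be sketched in Python as follows. This is not the authors' implementation: the separator set, the reduced cost of 0.25 for a repeated character, and all function names are assumptions chosen for illustration. The abbreviation dictionary is not covered.

```python
SEPARATORS = set(" .,-/'")  # assumed set of separator characters


def strip_separators(text: str) -> str:
    """Lower-case the text and drop separator characters."""
    return "".join(ch for ch in text.lower() if ch not in SEPARATORS)


def is_initial_match(short: str, full: str) -> bool:
    """True if `short` is a single initial (e.g. 'J.') that matches
    the starting letter of `full`."""
    core = strip_separators(short)
    return len(core) == 1 and strip_separators(full).startswith(core)


def modified_distance(a: str, b: str, repeat_cost: float = 0.25) -> float:
    """Damerau-style distance that ignores separators and charges less
    for inserting or deleting a repeated character."""
    a, b = strip_separators(a), strip_separators(b)

    def gap_cost(s: str, k: int) -> float:
        # Inserting or deleting s[k] is cheap when it repeats s[k - 1].
        return repeat_cost if k > 0 and s[k] == s[k - 1] else 1.0

    d = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        d[i][0] = d[i - 1][0] + gap_cost(a, i - 1)
    for j in range(1, len(b) + 1):
        d[0][j] = d[0][j - 1] + gap_cost(b, j - 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            substitution = 0.0 if a[i - 1] == b[j - 1] else 1.0
            d[i][j] = min(
                d[i - 1][j] + gap_cost(a, i - 1),  # deletion
                d[i][j - 1] + gap_cost(b, j - 1),  # insertion
                d[i - 1][j - 1] + substitution,    # substitution or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1.0)  # transposition
    return d[-1][-1]


print(modified_distance("Jack", "Jaack"))  # 0.25: repeated 'a' is cheap
print(modified_distance("J.W.", "JW"))     # 0.0: separators are ignored
print(is_initial_match("J.", "Jack"))      # True
```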

Algorithm Application, Weight Factors Determination
Table Clients:                  Table Transactions:
Name                            Name
Surname                         Surname
Personal ID Number              Personal ID Number
City                            City
Street                          Street
Apt. No.                        Apt. No.
Zip Code                        Zip Code
Date of Birth                   Date of Birth
Client ID (primary key)         Transaction ID (primary key)
                                Internal Transaction Number
                                Type of Transaction
                                Amount
                                Account No.

Algorithm Application, Weight Factors Determination
Table Result:
Client ID
Transaction ID
Probability for Name
Probability for Surname
Probability for Personal ID Number
Probability for City
Probability for Street
Probability for Apt. No.
Probability for Zip Code
Probability for Date of Birth
Total Probability
Result

Algorithm Application, Weight Factors Determination

Corresponding attributes in the two tables (Clients and Transactions) are compared
Each calculated similarity probability is stored in table Result
The comparison is repeated iteratively for every pair of attributes

Algorithm Application, Weight Factors Determination
Weight factors must be determined carefully

The leaves represent the similarity probability of two attributes, in the range [-100%, 100%]

The branches represent weight factors, in the range [0, 1]
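
The slides do not state how the weighted leaves are combined into one value. One plausible reading, given here only as an assumption, is a weighted average of the per-attribute similarities. The function name and the example weights below are hypothetical.

```python
from typing import Mapping


def combine(similarities: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Weighted average of per-attribute similarities.

    similarities: attribute -> value in [-1.0, 1.0] (i.e. -100% to 100%)
    weights:      attribute -> weight factor in [0.0, 1.0]
    """
    total_weight = sum(weights[name] for name in similarities)
    if total_weight == 0:
        return 0.0  # no attribute carries any weight
    weighted = sum(similarities[name] * weights[name] for name in similarities)
    return weighted / total_weight


# Hypothetical values, for illustration only.
similarities = {"name": 1.0, "surname": 0.5, "city": -1.0}
weights = {"name": 0.5, "surname": 1.0, "city": 0.5}
print(combine(similarities, weights))  # 0.25
```

A negative leaf value lets a clearly mismatching attribute pull the total down instead of merely contributing nothing.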

Algorithm Application, Weight Factors Determination
Certain attributes
correlate

Data redundancy
Dictionary Table

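Total probability calculation:
\[
r =
\begin{cases}
pid, & pid > I \\
nad, & nad > I \\
pid \cdot q + nad \cdot (1 - q), & pid > 0 \land nad > 0 \\
0, & \text{otherwise}
\end{cases}
\qquad
q =
\begin{cases}
\dfrac{pid}{nad}, & pid > nad \\[6pt]
\dfrac{nad}{pid}, & nad > pid
\end{cases}
\]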

Algorithm Application, Weight Factors Determination
Thresholds:

Identification threshold ~ 94 %

Similarity threshold ~ 54 %

Results above the Similarity threshold will be stored in table Result
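
The two thresholds can be applied to a total probability as in the following Python sketch. The threshold values are the approximate ones quoted above; the function names and the result labels are hypothetical.

```python
IDENTIFICATION_THRESHOLD = 0.94  # approximate value from the slides
SIMILARITY_THRESHOLD = 0.54      # approximate value from the slides


def classify(total_probability: float) -> str:
    """Map a total probability in [0, 1] to a (hypothetical) result label."""
    if total_probability > IDENTIFICATION_THRESHOLD:
        return "identified"
    if total_probability > SIMILARITY_THRESHOLD:
        return "similar"
    return "no match"


def should_store(total_probability: float) -> bool:
    """Only results above the similarity threshold go into table Result."""
    return total_probability > SIMILARITY_THRESHOLD


print(classify(0.97), classify(0.70), classify(0.30))
```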

Algorithm Regionalization

Common names/surnames
The more common a name pair is, the less influence it has on total similarity. Adjustable weight factors
Characteristic suffixes, infixes and prefixes (-ić, -Van-, Mc-)
These will be ignored during the matching phase
Different alphabets
Alphabet “Leveling”: ћирилица, ćirilica, cirilica…
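
Alphabet leveling can be sketched in Python as below: Serbian Cyrillic is transliterated to Latin, and diacritics are then stripped, so that all three spellings above collapse to the same string. The function name is hypothetical, and rendering 'đ' as 'dj' is an assumption (it is one common ASCII convention, not something the slides specify).

```python
import unicodedata

# Serbian Cyrillic to Latin, lower-case letters only.
CYRILLIC_TO_LATIN = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}


def level(text: str) -> str:
    """Reduce Cyrillic, Latin-with-diacritics and plain ASCII spellings
    to one lower-case ASCII form."""
    lowered = text.lower()
    latin = "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in lowered)
    # 'đ' has no Unicode decomposition, so it is replaced explicitly (assumed 'dj').
    latin = latin.replace("đ", "dj")
    decomposed = unicodedata.normalize("NFD", latin)
    # Drop the combining marks left over after decomposition.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(level("ћирилица"), level("ćirilica"), level("cirilica"))
```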

Evaluation

Competing solution
Based on a simple LCS algorithm
Test vectors, Example
“Z. Mihajlović, Sremska 33, Bgf, 11000”
“Zoran Mihailović, Sremska 33, Beograd 11000”
Result evaluation

Conclusion And Future Work Guidelines

Main strong points of the proposed solution:
Based on a well-developed and well-studied algorithm
Adjusted to one particular problem
Dynamic reliability improvement
Flexibility
Regionalization

Conclusion And Future Work Guidelines

Possible Improvements
Automatic database update after the identification process
Encoding an address as an “Address code”
Mapping the standard key settings on different keyboard layouts
Dynamic adjustment of the identification and similarity thresholds to match the users’ expectations
The system should be verified in a “real world” setting

Thank You!

- Comments And Questions, Please!
