Online Record Linkage
1. Efficient Techniques for Online
Record Linkage
Guided by,
Mrs. K. Sujatha, B.E., M.Tech., (Ph.D.)
Presented by,
D. Angelin Chitra,
A. Jainambu Sariba,
S. Sathiya Priya,
K. Sridevi.
3. Introduction
Databases frequently contain duplicate fields and
records that refer to the same real-world entity.
The data needed to support decision making are often
scattered across heterogeneous, distributed databases.
Because heterogeneous databases are usually designed and
managed by different organizations, there may be no
common candidate key for linking the records.
If the databases use the same set of design standards,
linking can easily be done using the primary key.
In this project we are developing software that
produces the relevant links from heterogeneous
databases with reference to a keyword.
4. Abstract
Matching records that refer to the same entity across databases is
becoming an increasingly important part of many data mining
projects, as often data from multiple sources needs to be matched in
order to enrich data or improve its quality.
Record linkage is the computation of the associations among
records of multiple databases.
Matching data from heterogeneous data sources has been a real
problem.
Statistical record linkage techniques could be used to resolve
this problem, but they cause a communication bottleneck in a
distributed environment.
A matching tree is used to reduce the communication overhead while
giving the same matching decision as the conventional linkage
technique.
5. Literature Review
Existing System
There is no common candidate key for linking the records in
heterogeneous databases.
It is possible to use common non-key attributes to access
the heterogeneous databases, but the results obtained using
these attributes may not always be accurate.
When the matching records reside at a remote site,
existing techniques cannot be directly applied because
they would involve transferring the entire remote
relation, thereby incurring a huge communication
overhead.
Different ranking algorithms are used for the search,
which may be time consuming.
6. Disadvantages
It does not work online.
It is not cost effective.
It cannot reduce the communication overhead.
7. Proposed System
An efficient technique is developed to facilitate record
linkage decisions in a distributed, online setting.
A matching tree is developed for attribute acquisition
based on sequential decision making.
The proposed techniques reduce the communication
overhead considerably, and the linkage performance is
assured to be at the same level as the traditional
approach.
8. Advantages
It reduces the communication overhead in a distributed
environment.
It is a cost-effective model.
10. Principles of matching tree
Input selection
Assume that we are at some node of the
tree and are trying to decide how to branch
from there. At that point, we would be
interested in finding the next best attribute to
be acquired from the set of remaining
attributes.
Stopping
The stopping decision is made when no
realisation of the remaining attributes can
revise the current matching probability
enough to change the matching
decision.
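The stopping principle above can be illustrated with a minimal sketch. The slides do not give the probability model, so everything here is an assumption made for illustration: attributes are treated as conditionally independent, each hypothetical Attribute carries an agreement probability for matching pairs (m) and for non-matching pairs (u), and a pair is declared a match when its probability reaches a threshold. The functions probability_bounds and should_stop are illustrative names, not part of the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribute:
    name: str
    m: float  # assumed P(values agree | records match), 0 < m < 1
    u: float  # assumed P(values agree | records do not match), 0 < u < 1


def probability_bounds(p_current, remaining):
    """Lowest and highest matching probability reachable after observing
    every remaining attribute (assumes conditional independence)."""
    low = high = p_current / (1.0 - p_current)  # work in odds
    for a in remaining:
        agree = a.m / a.u                  # odds multiplier if values agree
        disagree = (1 - a.m) / (1 - a.u)   # odds multiplier if they disagree
        low *= min(agree, disagree)
        high *= max(agree, disagree)
    return low / (1.0 + low), high / (1.0 + high)


def should_stop(p_current, remaining, threshold=0.5):
    """Stop when no realisation of the remaining attributes can move the
    probability across the threshold, i.e. the decision cannot change."""
    low, high = probability_bounds(p_current, remaining)
    if p_current >= threshold:   # current decision: match
        return low >= threshold
    return high < threshold      # current decision: non-match


if __name__ == "__main__":
    remaining = [Attribute("zip", m=0.9, u=0.1)]
    print(should_stop(0.99, remaining))  # decision is already settled
    print(should_stop(0.60, remaining))  # zip could still flip the decision
```

Input selection is not shown here; under the same assumptions it would pick, among the remaining attributes, the one expected to be most informative at the current node.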
11. Tree based linkage techniques
◦ Here we develop efficient online record linkage techniques
based on the matching tree
◦ The first two stages in this process are performed offline,
using the training data.
◦ Once the matching tree has been built, the online linkage is
done as the final step.
◦ We can now characterize the different techniques that can
be employed in the last step.
◦ Given a local enquiry record, the ultimate goal of any
linkage technique is to identify and fetch all the records
from the remote site that have a sufficiently high matching probability.
◦ The partitioning itself can be done in one of two possible
ways:
1) Sequential
2) Concurrent
12. Sequential partitioning
The set of remote records is partitioned
recursively, until we obtain the desired
partition of all the relevant records.
This partitioning can be done in one of
two ways:
i). Sequential Attribute Acquisition
ii). Sequential Identifier Acquisition
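A rough sketch of sequential attribute acquisition follows. The slides name the technique without detailing it, so this is an interpretation under stated assumptions: the matching tree is a binary tree whose internal nodes name an attribute and branch on agree/disagree, the remote site is stood in for by an in-memory dictionary, and only the attribute needed at the current node is fetched for the records still under consideration. Node and link_sequentially are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    attribute: Optional[str] = None   # None marks a leaf
    is_match: bool = False            # decision stored at a leaf
    agree: Optional["Node"] = None    # branch when values agree
    disagree: Optional["Node"] = None # branch when values disagree


def link_sequentially(tree, enquiry, remote):
    """Partition remote record ids recursively, one attribute at a time.

    remote maps record id -> attribute dict and stands in for the remote
    site. Returns the matching ids and the number of attribute values
    that had to be transferred.
    """
    matches, transferred = [], 0
    stack = [(tree, list(remote))]
    while stack:
        node, ids = stack.pop()
        if not ids:
            continue
        if node.attribute is None:
            if node.is_match:
                matches.extend(ids)
            continue
        # "Fetch" only this attribute, only for the surviving records.
        values = {rid: remote[rid][node.attribute] for rid in ids}
        transferred += len(values)
        target = enquiry[node.attribute]
        stack.append((node.agree, [r for r in ids if values[r] == target]))
        stack.append((node.disagree, [r for r in ids if values[r] != target]))
    return sorted(matches), transferred


if __name__ == "__main__":
    tree = Node("surname",
                agree=Node("zip", agree=Node(is_match=True), disagree=Node()),
                disagree=Node())
    remote = {
        1: {"surname": "Rao", "zip": "600001", "city": "Chennai"},
        2: {"surname": "Rao", "zip": "600002", "city": "Chennai"},
        3: {"surname": "Kumar", "zip": "600001", "city": "Chennai"},
    }
    print(link_sequentially(tree, {"surname": "Rao", "zip": "600001"}, remote))
```

In this toy data only five attribute values cross the network, against nine if the whole remote relation were transferred.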
13. Concurrent Partitioning
The tree is used to formulate a database
query that selects the relevant remote
records directly, in one single step.
Once the relevant records are identified,
all their attribute values are transferred.
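One way to picture concurrent partitioning is to turn every root-to-leaf path that ends in a match into a conjunction, and to join those conjunctions with OR in a single query. The sketch below assumes a binary agree/disagree tree, exact-equality comparison, and a hypothetical remote table named remote_records; Node, match_paths and build_query are illustrative names, not the project's code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    attribute: Optional[str] = None   # None marks a leaf
    is_match: bool = False            # decision stored at a leaf
    agree: Optional["Node"] = None
    disagree: Optional["Node"] = None


def match_paths(node, path=()):
    """All root-to-leaf paths ending in a match leaf, as tuples of
    (attribute, agrees) pairs."""
    if node.attribute is None:
        return [path] if node.is_match else []
    return (match_paths(node.agree, path + ((node.attribute, True),))
            + match_paths(node.disagree, path + ((node.attribute, False),)))


def build_query(tree, enquiry, table="remote_records"):
    """Build one parameterised SELECT that picks the relevant records.

    Attribute and table names are interpolated directly, so they must
    come from the trusted tree definition, never from user input.
    """
    clauses, params = [], []
    for path in match_paths(tree):
        terms = []
        for attribute, agrees in path:
            terms.append(f"{attribute} {'=' if agrees else '<>'} ?")
            params.append(enquiry[attribute])
        clauses.append("(" + (" AND ".join(terms) or "1") + ")")
    where = " OR ".join(clauses) if clauses else "0"
    return f"SELECT * FROM {table} WHERE {where}", params


if __name__ == "__main__":
    tree = Node("surname",
                agree=Node("zip",
                           agree=Node(is_match=True),
                           disagree=Node("dob",
                                         agree=Node(is_match=True),
                                         disagree=Node())),
                disagree=Node())
    enquiry = {"surname": "Rao", "zip": "600001", "dob": "1990-01-01"}
    sql, params = build_query(tree, enquiry)
    print(sql)
    print(params)
```

The query is sent to the remote site once, and only the rows it selects are transferred in full.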
14. Record Linkage
Record linkage refers to the task of
finding records in a data set that refer to the
same entity across different data sources (e.g.,
data files, books, websites, databases).
Record linkage is necessary when joining data
sets based on entities that may or may not
share a common identifier.
Record linkage has applications in customer
systems for marketing, relationship
management, fraud detection, law
enforcement and government administration.
18. Multiple Source Data
Matching records that refer to the same entity across databases is
becoming an increasingly important part of many data mining projects,
as data from multiple sources often needs to be matched in order to
enrich data or improve its quality.
The data needed to support decision making are often scattered
across heterogeneous, distributed databases.
In such cases, it may be necessary to link records in multiple
databases so that one can consolidate and use the data
pertaining to the same real-world entity.
Entity matching is a crucial task for data integration and data
cleaning. It is the task of identifying entities (objects, data
instances) referring to the same real-world entity.
The entity matching problem arises when there is no common
identifier across the heterogeneous data sources.
19. Detecting Overhead by Matching
Tree
To integrate or link the data stored in
heterogeneous data sources, a critical
problem is entity matching, i.e., matching
records representing corresponding
entities in the real world, across the
sources.
In this paper, we describe how the
matching tree can be applied to entity
matching across heterogeneous databases.
20. Overheads Eliminated
We develop a matching tree, similar to
a decision tree, and use it to reduce
the communication overhead
significantly.
The online record linkage process
becomes more efficient by reducing the
communication overhead in a
distributed environment.
25. Conclusion
Record linkage is an important issue in
heterogeneous database systems where
the records representing the same real-
world entity type are identified using
different identifiers in different databases.
In the absence of a common identifier, it
is often difficult to find records in a
remote database that are similar to a local
enquiry record.
26. References
A.K. Elmagarmid, P.G. Ipeirotis, and V.S. Verykios, “Duplicate
Record Detection: A Survey,” IEEE Trans.
Knowledge and Data Eng., vol. 19, no. 1, pp. 1-16, Jan. 2007.
B. Tepping, “A Model for Optimum Linkage of Records,” J. Am.
Statistical Assoc., vol. 63, pp. 1321-1332, 1968.
C. Batini, M. Lenzerini, and S.B. Navathe, “A Comparative
Analysis of Methodologies for Database Schema
Integration,” ACM Computing Surveys, vol. 18, no. 4, pp. 323-364,
1986.
D. Dey, “Record Matching in Data Warehouses: A Decision
Model for Data Consolidation,” Operations Research, vol. 51,
no. 2, pp. 240-254,
D. Dey, S. Sarkar, and P. De, “A Distance-Based Approach to
Entity Reconciliation in Heterogeneous Databases,” IEEE
Trans. Knowledge and Data Eng., vol. 14, no. 3, pp. 567-582,
May/June 2002.