online Record Linkage

Efficient Techniques for Online
Record Linkage

Guided By,
Mrs. K. Sujatha, B.E., M.Tech., (Ph.D.)

Presented by,
D. Angelin Chitra,
A. Jainambu Sariba,
S. Sathiya Priya,
K. Sridevi.

Overview
Introduction
Literature Review
Record Linkage
System Design
Modules
Screen Shots
Conclusion

Introduction
 Databases frequently contain duplicate fields and
records that refer to the same real-world entity.
 The data needed to support decision making are often
scattered in heterogeneous distributed databases.
 Because heterogeneous databases are usually designed and
managed by different organizations, there may be no
common candidate key for linking the records.
 If the databases use the same set of design standards,
linking can easily be done using the primary key.
 In this project, we develop software that produces
the relevant links across heterogeneous databases
with reference to a keyword.

Abstract
 Matching records that refer to the same entity across databases is
becoming an increasingly important part of many data mining
projects, as often data from multiple sources needs to be matched in
order to enrich data or improve its quality.
 Record linkage is the computation of the associations among
records of multiple databases.
 Matching data from heterogeneous data sources has long been a
difficult problem.
 Statistical record linkage techniques can resolve this problem,
but they cause a communication bottleneck in a distributed
environment.
 A matching tree is used to reduce the communication overhead while
giving the same matching decision as the conventional linkage
technique.
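
As a rough illustration of the statistical approach mentioned above, the sketch below scores a record pair from per-attribute agreement, in the style of the classical probabilistic (Fellegi-Sunter) model. The attribute names, the m/u probabilities, and the prior are made-up values for illustration only; they are not taken from this project.

```python
import math

# Hypothetical per-attribute probabilities (illustrative values only):
# m = P(attribute agrees | records match)
# u = P(attribute agrees | records do not match)
M_U = {
    "last_name": (0.95, 0.01),
    "birth_year": (0.90, 0.05),
    "city": (0.80, 0.20),
}

def match_weight(agreement):
    """Sum of log-likelihood ratios over the compared attributes."""
    total = 0.0
    for attr, agrees in agreement.items():
        m, u = M_U[attr]
        total += math.log(m / u) if agrees else math.log((1 - m) / (1 - u))
    return total

def match_probability(agreement, prior=0.01):
    """Posterior match probability, assuming conditionally independent attributes."""
    log_odds = math.log(prior / (1 - prior)) + match_weight(agreement)
    return 1 / (1 + math.exp(-log_odds))

all_agree = {"last_name": True, "birth_year": True, "city": True}
none_agree = {"last_name": False, "birth_year": False, "city": False}
print(match_probability(all_agree))   # close to 1
print(match_probability(none_agree))  # close to 0
```

Scoring a pair this way needs every attribute of the remote record, which is where the communication bottleneck in a distributed setting comes from.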

Literature Review

Existing System
 There is no common candidate key for linking the records in
heterogeneous databases.
 It is possible to use common non-key attributes to access
the heterogeneous databases, but the results obtained using
these attributes may not always be accurate.
 When the matching records reside at a remote site,
existing techniques cannot be directly applied because
they would involve transferring the entire remote
relation, thereby incurring a huge communication
overhead.
 Different ranking algorithms are used for the search,
which may be time consuming.

Disadvantages

 It does not work online.
 It is not cost effective.
 It cannot reduce the communication overhead.

Proposed System
 An efficient technique is developed to facilitate record
linkage decisions in a distributed, online setting.
 A matching tree is developed for attribute acquisition
based on sequential decision making.
 The proposed techniques reduce the communication
overhead considerably, and the linkage performance is
assured to be at the same level as the traditional
approach.

Advantages

 It reduces the communication overhead in a distributed
environment.
 It is a cost-effective model.

Sequential record linkage

Principles of matching tree
Input selection
Assume that we are at some node of the
tree and are trying to decide how to branch
from there. At that point, we would be
interested in finding the next best attribute to
be acquired from the set of remaining
attributes.
Stopping
The stopping decision is made when no
realisation of the remaining attributes can
revise the current matching probability
enough to change the matching
decision.
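
The two principles above can be sketched as follows. This is a minimal illustration, assuming the matching probability is tracked as log-odds and that each attribute contributes an independent agree/disagree weight. The weights, the attribute names, and the "largest spread" selection heuristic are assumptions made for the example, not the criteria used in the project.

```python
import math

# Hypothetical (agree_weight, disagree_weight) log-likelihood ratios per attribute.
WEIGHTS = {
    "last_name": (math.log(0.95 / 0.01), math.log(0.05 / 0.99)),
    "birth_year": (math.log(0.90 / 0.05), math.log(0.10 / 0.95)),
    "city": (math.log(0.80 / 0.20), math.log(0.20 / 0.80)),
}

def can_stop(log_odds, remaining, threshold):
    """True if no outcome of the remaining attributes can change the decision."""
    best = log_odds + sum(max(WEIGHTS[a]) for a in remaining)
    worst = log_odds + sum(min(WEIGHTS[a]) for a in remaining)
    decided_match = worst >= threshold     # match even in the worst case
    decided_non_match = best < threshold   # non-match even in the best case
    return decided_match or decided_non_match

def next_attribute(remaining):
    """Input selection stand-in: the attribute that can move the log-odds the most."""
    return max(remaining, key=lambda a: WEIGHTS[a][0] - WEIGHTS[a][1])

prior = math.log(0.01 / 0.99)   # assumed prior log-odds of a match
threshold = 0.0                 # log-odds 0 corresponds to probability 0.5
attrs = ["last_name", "birth_year", "city"]

first = next_attribute(attrs)
print(first)
print(can_stop(prior, attrs, threshold))
# If the first attribute disagrees, the other two cannot rescue the match:
print(can_stop(prior + WEIGHTS[first][1], ["birth_year", "city"], threshold))
```

With these made-up numbers, a disagreement on the first attribute already settles the decision, so the other two attributes never need to be acquired for that record.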

Tree based linkage techniques
◦ Here we develop efficient online record linkage techniques
based on the matching tree.
◦ The first two stages in this process are performed offline,
using the training data.
◦ Once the matching tree has been built, the online linkage is
done as the final step.
◦ We can now characterize the different techniques that can
be employed in the last step.
◦ Given a local enquiry record, the ultimate goal of any
linkage technique is to identify and fetch all the records
from the remote site that have a matching probability
◦ The partitioning itself can be done in one of two possible
ways:
1) Sequential
2) Concurrent

Sequential partitioning
The set of remote records is partitioned
recursively until we obtain the desired
partition of all the relevant records.
This partitioning can be done in one of
two ways:
i) Sequential Attribute Acquisition
ii) Sequential Identifier Acquisition
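
A minimal sketch of the recursive partitioning idea, in the spirit of sequential attribute acquisition: at each tree node, only one attribute is fetched, and only for the records still under consideration. The tree shape, the record values, and the exact-equality comparison are assumptions made for illustration; sequential identifier acquisition is not shown.

```python
# Hypothetical matching tree: internal nodes name an attribute to acquire,
# leaves are the decision strings "match" / "non-match".
TREE = {
    "attr": "last_name",
    "agree": {"attr": "birth_year", "agree": "match", "disagree": "non-match"},
    "disagree": "non-match",
}

REMOTE = [  # stands in for the remote relation; values are made up
    {"id": 1, "last_name": "Rao", "birth_year": 1980, "city": "Chennai"},
    {"id": 2, "last_name": "Rao", "birth_year": 1975, "city": "Madurai"},
    {"id": 3, "last_name": "Iyer", "birth_year": 1980, "city": "Chennai"},
    {"id": 4, "last_name": "Khan", "birth_year": 1990, "city": "Salem"},
]

def sequential_acquisition(node, enquiry, ids, stats):
    """Recursively partition remote record ids, fetching one attribute per tree node."""
    if node == "match":
        return list(ids)
    if node == "non-match" or not ids:
        return []
    attr = node["attr"]
    by_id = {r["id"]: r for r in REMOTE}
    stats["values_transferred"] += len(ids)  # one attribute value per candidate
    agree = [i for i in ids if by_id[i][attr] == enquiry[attr]]
    disagree = [i for i in ids if by_id[i][attr] != enquiry[attr]]
    return (sequential_acquisition(node["agree"], enquiry, agree, stats)
            + sequential_acquisition(node["disagree"], enquiry, disagree, stats))

enquiry = {"last_name": "Rao", "birth_year": 1980}
stats = {"values_transferred": 0}
matches = sequential_acquisition(TREE, enquiry, [r["id"] for r in REMOTE], stats)
print(matches)                       # ids of the relevant remote records
print(stats["values_transferred"])   # attribute values fetched along the way
```

In this toy run, 6 attribute values are fetched, compared with 12 if all three non-key attributes of all four remote records were transferred.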

Concurrent Partitioning
The tree is used to formulate a database
query that selects the relevant remote
records directly, in one single step.
Once the relevant records are identified,
all their attribute values are transferred.
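
One way to picture the concurrent approach is to turn every root-to-leaf path of the tree that ends in a match into a condition, and combine the paths into a single query. The sketch below does this against an in-memory SQLite table. The tree, the table name `remote_records`, the column names, and the exact-equality conditions are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical matching tree: internal nodes name an attribute,
# leaves are the decision strings "match" / "non-match".
TREE = {
    "attr": "last_name",
    "agree": {"attr": "birth_year", "agree": "match", "disagree": "non-match"},
    "disagree": "non-match",
}

def match_paths(node, path=()):
    """Yield every root-to-leaf path that ends in a "match" decision."""
    if node == "match":
        yield path
    elif isinstance(node, dict):
        yield from match_paths(node["agree"], path + ((node["attr"], True),))
        yield from match_paths(node["disagree"], path + ((node["attr"], False),))

def build_query(tree, enquiry, table="remote_records"):
    """Turn the match paths into one parameterized SELECT (OR of ANDed conditions)."""
    clauses, params = [], []
    for path in match_paths(tree):
        conds = []
        for attr, agrees in path:
            # Attribute names come from the tree itself, not from user input.
            conds.append(f"{attr} {'=' if agrees else '<>'} ?")
            params.append(enquiry[attr])
        clauses.append("(" + " AND ".join(conds) + ")" if conds else "1")
    where = " OR ".join(clauses) if clauses else "0"
    return f"SELECT * FROM {table} WHERE {where}", params

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE remote_records (id INTEGER, last_name TEXT, birth_year INTEGER, city TEXT)"
)
conn.executemany(
    "INSERT INTO remote_records VALUES (?, ?, ?, ?)",
    [(1, "Rao", 1980, "Chennai"), (2, "Rao", 1975, "Madurai"),
     (3, "Iyer", 1980, "Chennai"), (4, "Khan", 1990, "Salem")],
)

sql, params = build_query(TREE, {"last_name": "Rao", "birth_year": 1980})
rows = conn.execute(sql, params).fetchall()
print(sql)
print(rows)
```

The remote site evaluates the query once and returns only the selected rows, so the partitioning happens in a single round trip.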

Record Linkage
Record linkage refers to the task of
finding records in a data set that refer to the
same entity across different data sources (e.g.,
data files, books, websites, databases).
Record linkage is necessary when joining data
sets based on entities that may or may not
share a common identifier.
Record linkage has applications in customer
systems for marketing, relationship
management, fraud detection, law
enforcement, and government administration.
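
As a small illustration of linking without a shared identifier, the sketch below pairs records from two sources by name similarity, using Python's standard `difflib`. The two record lists, their field names, and the 0.85 threshold are invented for the example.

```python
from difflib import SequenceMatcher

# Two hypothetical sources describing people under different identifiers.
CUSTOMERS = [
    {"cust_no": "C-17", "name": "Jon Smith"},
    {"cust_no": "C-42", "name": "Mary Jones"},
]
ACCOUNTS = [
    {"acct_id": 9001, "name": "john smith"},
    {"acct_id": 9002, "name": "Peter Brown"},
]

def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def link(left, right, threshold=0.85):
    """Pair records whose names are similar enough; the threshold is an arbitrary choice."""
    return [(a["cust_no"], b["acct_id"])
            for a in left for b in right
            if similarity(a["name"], b["name"]) >= threshold]

links = link(CUSTOMERS, ACCOUNTS)
print(links)
```

A single fuzzy attribute is rarely enough in practice; it stands in here for the comparison of several common non-key attributes.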

Modules

 Multiple Source Data.
 Detecting Overhead By Matching Tree.
 Overheads Eliminated.

MODULE DESCRIPTION

Multiple Source Data
 Matching records across databases is becoming an increasingly
important part of many data mining projects, as data from multiple
sources often needs to be matched in order to enrich data or
improve its quality.
 The data needed to support decision making are often scattered
in heterogeneous distributed databases.
 In such cases, it may be necessary to link records in multiple
databases so that one can consolidate and use the data
pertaining to the same real-world entity.
 Entity matching is a crucial task for data integration and data
cleaning. It is the task of identifying entities (objects, data
instances) referring to the same real-world entity.
 The entity matching problem arises when there is no common
identifier across the heterogeneous data sources.

Detecting Overhead by Matching
Tree
To integrate or link the data stored in
heterogeneous data sources, a critical
problem is entity matching, i.e., matching
records representing corresponding
entities in the real world, across the
sources.
In this project, we describe how the
matching tree method can be applied to
entity matching across heterogeneous
databases.

Overheads Eliminated
We develop a matching tree, similar to
a decision tree, and use it to reduce
the communication overhead
significantly.
The online record linkage process
becomes more efficient by reducing the
communication overhead in a
distributed environment.

Screen Shots
Login page

Registration page

Conclusion
Record linkage is an important issue in
heterogeneous database systems where
the records representing the same real-
world entity type are identified using
different identifiers in different databases.
In the absence of a common identifier, it
is often difficult to find records in a
remote database that are similar to a local
enquiry record.
