Entity matching and entity resolution are becoming increasingly important disciplines in data management, driven by the growing number of data sources that must be addressed in an economy undergoing digital transformation, by growing data volumes and by increasing data privacy requirements. The data matching process is also called record linkage, entity matching or entity resolution in the literature. For a long time, research on the process focused on matching entities from the same dataset (i.e. deduplication) or from two datasets. Different algorithms for matching different types of attributes have been described in the literature, developed and implemented in data matching and data cleansing platforms. Entity resolution is an element of the larger entity integration process, which includes data acquisition, data profiling, data cleansing, schema alignment, data matching and data merging (fusion).
As a motivating example, consider a global pharmaceutical company with offices in more than 60 countries worldwide that migrated customer data from various legacy systems in different countries to a new common CRM system in the cloud. The migration was phased by regions and countries, with new sources and data incrementally added and merged with data already migrated in previous phases. Entity integration in such a case requires a deep understanding of the data architectures, the data content and each step of the process. Even with such understanding, designing and implementing the solution requires many iterations in the development process that consume human, time and financial resources. Reducing the number of iterations by automating and optimizing steps in the process can therefore save a vast amount of resources. There is a lot of literature addressing individual steps of the process and proposing different options for improving results or optimizing processing, but the whole process still requires a lot of human work, subject-matter knowledge and many iterations to produce results with a high F-measure (both high precision and high recall). Most of the algorithms used in the various steps of the process are human-in-the-loop (HITL) algorithms that require human interaction: a human is always part of the loop and consequently influences the outcome.
This paper is part of work in progress aimed at defining a conceptual framework that automates and optimizes some steps of the entity integration process and reduces the need for human involvement. The focus of this paper is on the conceptual process definition, a recommended data architecture and the use of existing open source solutions for automating and optimizing the entity integration process.
1. Computer Science
Conceptual Framework for entity integration from multiple data sources
Dražen Oreščanin
Data Science Conference 4.0
Belgrade, 18.9.2018
3.
• Digital transformation
• Growing data volumes
• Increasing requirements related to data privacy
4. What is entity resolution and integration?
• Entity resolution is an operational intelligence process whereby organizations can connect data from disparate sources with a view to understanding possible entity matches and non-obvious relationships across multiple data silos. It analyzes all of the information relating to individuals and/or entities from multiple sources of data, and then applies likelihood and probability scoring to determine which identities are a match and what, if any, non-obvious relationships exist between those identities.
• Entity resolution is an element of the larger entity integration process, which includes data acquisition, data profiling, data cleansing, schema alignment, data matching and data fusion
5. Entity resolution and integration
[Figure: cleansing and fusion pipeline (Profile → Parse → Correct → Standardize → Match → Merge → Enhance). Source 1, the raw record "james smith, 1008 6th avenue suite 7, Manhattten, newyourk 10002", is parsed, corrected and standardized to "James Smith, 1008 Avenues of the Americas, Suite 7, Manhattan, New York 10018". Source 2, "Jim J. Smyth, Manhattan, NY 10018, jsmyth@mywork.com, (212) 755-2551", is standardized to the same address. The two records are then matched, merged and enhanced into a single golden record with a ZIP+4 code, geocoordinates and appended segmentation attributes (C_Category: Affluent Couples & Families, C_Group: Affluent Families).]
6. OK, and what is the problem?
• Data is stored in many different locations, sources and formats …
• … and it is not easy to find all the data that describe the same entity …
• … and it is even harder to resolve and consolidate that data into one golden record describing the entity …
• … while respecting different data privacy regulations …
• … and to make sure that this consolidated data will be regularly updated from new data in existing sources and from new sources
7. Data matching process
• Generic process for matching data from two datasets / databases
• Indexing and matching are based on rules implemented in algorithms
• Basic process that is always used as a base for entity integration

Source: Peter Christen, Data Matching – Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection, Springer Publishing, 2012.
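The generic two-dataset process described on this slide can be sketched in a few lines: indexing (blocking) partitions records so that only pairs sharing a block key are compared, and a rule-based matcher classifies the candidate pairs. The sample records, the ZIP-code blocking key and the first-two-letters surname rule are illustrative assumptions, not from the talk:

```python
from collections import defaultdict

def block(records, key):
    """Index records into blocks by a blocking key to avoid all-pairs comparison."""
    blocks = defaultdict(list)
    for r in records:
        blocks[key(r)].append(r)
    return blocks

def match(a_records, b_records, key, similar):
    """Compare only record pairs that share a block; return matching pairs."""
    a_blocks = block(a_records, key)
    pairs = []
    for b in b_records:
        for a in a_blocks.get(key(b), []):
            if similar(a, b):
                pairs.append((a, b))
    return pairs

A = [{"name": "James Smith", "zip": "10018"}]
B = [{"name": "Jim Smyth", "zip": "10018"}, {"name": "Ann Lee", "zip": "10002"}]

# Block on ZIP code; match when surnames agree on the first two letters.
result = match(A, B, key=lambda r: r["zip"],
               similar=lambda a, b: a["name"].split()[-1][:2] == b["name"].split()[-1][:2])
```

With this toy rule only the James Smith / Jim Smyth pair survives; Ann Lee is never even compared because she falls into a different block.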
8. Requirements for data matching
• Effectiveness: the main goal of entity matching is to achieve a high-quality match result with respect to recall and precision
• Efficiency: entity matching should be fast even for voluminous datasets
• Genericity, offline/online matching
• Low manual effort / self-tuning

Source: H. Köpcke and E. Rahm, "Frameworks for entity matching: A comparison," Data Knowl. Eng., vol. 69, no. 2, pp. 197–210, 2010.
9. Data matching research challenges
• Develop new and better algorithms for blocking and matching
• Improve F1 score and processing performance
• Many research teams are working on distinct algorithms / elements
• There is no complete "big picture"
10. Data matching real-life challenges
• Matching of more than two datasets
• Matching of datasets with undefined semantics
• Schema alignment of datasets with different structure
• Matching of real-time and streaming data
• Incremental matching
• Changes of matching rules
• Performance of matching large datasets
11. Human-in-the-loop (HITL)
• Humans are involved in a cycle where they train, tune and test a
particular algorithm
• Entity resolution and data matching are HITL problems!
• Human interaction is required for:
• Data labeling
• Algorithm selection
• Algorithm tuning
• Testing and validation of results
12. Some questions about important stuff
• How important is it to have a 1% increase in F1 score or a 5% performance improvement in a matching algorithm implemented in Python?
• What will be the savings and improvements in the time needed to deliver a solution to a business issue?
• Do business users know what they need to do, what the content of the data is and what algorithms they should use?
• The problems that users face in development (precision) and in production (performance) are different
• Academia and real-life activities are in many cases not synchronized!
13. Magellan project
• Open source entity matching (EM) system developed by the team at the University of Wisconsin-Madison
• https://sites.google.com/site/anhaidgroup/projects/magellan
• Magellan: Toward Building Entity Matching Management Systems (VLDB, 2016)
• Human-in-the-Loop Challenges for Entity Matching: A Midterm Report (HILDA, 2017)
14. Magellan findings and recommendations
• End users need a step-by-step, end-to-end how-to guide
• Tools for pain points (sampling, debugging …) should be developed
• Tools in the loop shall be combined with HITL activities
• Develop a how-to guide for concrete, complex real-life scenarios

Source: Doan, A. et al.: Human-in-the-Loop Challenges for Entity Matching: A Midterm Report, HILDA, 2017.
15. Frameworks for entity matching
• There are several existing frameworks focused on matching two datasets (algorithms for blocking and matching)
• Future work on frameworks should address the other important steps in the process, creating scenarios and tools for a complete solution that will enable business users to solve their problems in a fast and efficient way, without needing to know programming and algorithms
16. Scenarios are complex!
• Search engines
• Legacy and cloud migrations
• … it is much more than matching two datasets in real life:
• Schema alignment
• Data cleansing and preparation
• Order of resolution
• Incremental processing
• …
17. Real-life example
A global pharmaceutical company with offices in more than 60 countries worldwide has migrated customer data from various legacy systems in different countries to a new common CRM system in the cloud. The migration was phased by regions and countries, with new sources and data incrementally added and merged with data already migrated in previous phases. Challenges included:
• Different source schemas – source datasets with different attributes
• Different levels of data quality
• Dataset sizes differing by several orders of magnitude
• Different languages
18. Order of resolution challenge
• You have many datasets that must be resolved in initial processing, or new datasets arriving for incremental processing
• How do you determine the order of resolution, i.e. which datasets should be matched and resolved first to get the best score and performance?
19. Simplified sample scenario
• Defined target schema
• Three initial source datasets with various attributes
• Attributes common to more than one dataset are used for matching
Attribute matrix (x = attribute present in the source; one column assignment consistent with the overlap factors on the following slides):

     D1 D2 D3 D4 D5 D6 D7 D8 D9 D10
S1    x  x  x  x  x  x  x
S2             x  x  x  x  x  x
S3                         x  x  x

Dataset  Attributes  Size
S1       7           250
S2       6           500
S3       3           1,000
M        10          1,750
20. Important attributes of datasets
• A source dataset with more attributes is a better candidate to be matched earlier in the process – completeness
• When matching two datasets, the more common attributes we can match in both datasets, the higher the probability of getting higher scores – overlap
• Datasets can have very different numbers of records. Matching a large dataset many times requires more computing power, so for optimal processing smaller datasets should be matched first – inverse size
• We need to take into account the quality and completeness of data in each source dataset; datasets with higher quality should be matched first – accuracy
21. Source dataset completeness
The source dataset completeness factor CSi of dataset Si is the ratio ni / m of the number of attributes in source schema Si that can be mapped to attributes in the unified schema and the number of attributes in the unified schema M:

CSi = |Si| / |M|

Source dataset completeness factors:
• CS1 = 7 / 10 = 0.7
• CS2 = 6 / 10 = 0.6
• CS3 = 3 / 10 = 0.3
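The completeness factors above are a direct ratio, so the computation for the sample schemas is a one-liner:

```python
# Completeness factor C_Si = |Si| / |M| for the sample scenario:
# M has 10 attributes; S1, S2, S3 map 7, 6 and 3 of them.
M = 10
mapped = {"S1": 7, "S2": 6, "S3": 3}
completeness = {s: n / M for s, n in mapped.items()}
# completeness == {"S1": 0.7, "S2": 0.6, "S3": 0.3}
```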
22. Source dataset overlap factor
The source dataset overlap factor OSij between two source datasets is the number of attributes common to both source datasets over the number of attributes in the dataset with more attributes:

OSij = |Si ∩ Sj| / max(|Si|, |Sj|)

Source dataset overlap factors:
• OS12 = 4 / 7 = 0.5714
• OS23 = 2 / 6 = 0.3333
• OS13 = 0 / 7 = 0
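The overlap factor reproduces these values when the attributes are held as Python sets; the concrete column assignment (S1 = D1–D7, S2 = D4–D9, S3 = D8–D10) is an assumption consistent with the stated factors:

```python
# Overlap factor O_Sij = |Si ∩ Sj| / max(|Si|, |Sj|).
S1 = {"D1", "D2", "D3", "D4", "D5", "D6", "D7"}
S2 = {"D4", "D5", "D6", "D7", "D8", "D9"}
S3 = {"D8", "D9", "D10"}

def overlap(a, b):
    """Shared attributes relative to the larger of the two schemas."""
    return len(a & b) / max(len(a), len(b))

# overlap(S1, S2) ≈ 0.5714, overlap(S2, S3) ≈ 0.3333, overlap(S1, S3) == 0.0
```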
23. Source dataset inverse size factor
The source dataset inverse size factor ISSi of dataset Si is the inverse ratio of the number of records in Si to the sum of the numbers of records in all source datasets, so that smaller datasets are scored higher. The merge coefficient µ is a correction for the number of assumed duplicates in the merging process:

ISSi = 1 − len(Si) / (Σ j=1..n len(Sj) × µ)

Source dataset inverse size factors (µ = 1):
• ISS1 = 1 – (250 / 1750) = 0.8571
• ISS2 = 1 – (500 / 1750) = 0.7143
• ISS3 = 1 – (1000 / 1750) = 0.4286
24. Source dataset accuracy factor
The source dataset accuracy factor ASi ∈ (0..1] is a correction factor used to weight the quality of data in a source dataset, with 1 representing data with the highest possible quality.

The dataset accuracy factor can optionally be used to weight the completeness, overlap and inverse size of source datasets when establishing the optimal order of resolution. Accuracy can be obtained from the team responsible for the data source, or it can be calculated from source profiling scores.
25. Order of resolution algorithm
• Based on the defined dataset factors, the algorithm finds the optimal order of resolution
• Each pair of resolved datasets is removed from the set of datasets and the resulting dataset is added
• The order of resolution is agnostic to the blocking and matching algorithms used for entity resolution
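The greedy shape of the algorithm can be sketched as follows. Each remaining pair of datasets is scored by combining the four factors, the best pair is resolved, and the merged result re-enters the pool. The equal-weight product used as the pair score, the pair-level variants of completeness and inverse size, and taking the minimum accuracy for a merged result are all assumptions for illustration; the talk does not fix how the factors are combined:

```python
def resolution_order(datasets, M, mu=1.0):
    """Greedy order of resolution.

    datasets: name -> (attribute set, record count, accuracy in (0..1]).
    M: number of attributes in the unified schema.
    """
    ds = dict(datasets)
    order = []
    step = 0
    while len(ds) > 1:
        total = sum(n for _, n, _ in ds.values()) * mu

        def score(a, b):
            sa, na, aa = ds[a]
            sb, nb, ab = ds[b]
            completeness = (len(sa) + len(sb)) / (2 * M)       # pair-level C
            overlap = len(sa & sb) / max(len(sa), len(sb))     # O_Sij
            inv_size = 1 - (na + nb) / (2 * total)             # pair-level IS
            return completeness * overlap * inv_size * aa * ab

        names = list(ds)
        a, b = max(((x, y) for i, x in enumerate(names) for y in names[i + 1:]),
                   key=lambda p: score(*p))
        order.append((a, b))
        sa, na, aa = ds.pop(a)
        sb, nb, ab = ds.pop(b)
        step += 1
        # The merged result inherits the union of attributes, the combined
        # record count and (pessimistically) the lower accuracy.
        ds["R%d" % step] = (sa | sb, na + nb, min(aa, ab))
    return order

# The sample scenario: S1/S2 overlap strongly and S1/S3 not at all,
# so S1 and S2 are resolved first, then the result is matched with S3.
datasets = {
    "S1": ({"D1", "D2", "D3", "D4", "D5", "D6", "D7"}, 250, 1.0),
    "S2": ({"D4", "D5", "D6", "D7", "D8", "D9"}, 500, 1.0),
    "S3": ({"D8", "D9", "D10"}, 1000, 1.0),
}
order = resolution_order(datasets, M=10)
```

Because each resolved pair is replaced by its merged dataset, the pool shrinks by one dataset per step and the loop terminates after n − 1 resolutions.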
26. Work in progress
• Testing and experiments
• Publication
• Integration with other processes (schema discovery, data cleansing,
performance optimization, incremental resolution)
27. Conclusions
• Entity resolution is a widely used and complex process
• Academic and real-life challenges are diverse
• A lot of work should be invested in building HITL tools and scenarios that will make real-life problems easier and faster to solve