Entity matching and entity resolution are becoming increasingly important disciplines in data management, driven by the growing number of data sources that must be addressed in an economy undergoing digital transformation, by growing data volumes and by increasing data privacy requirements. The data matching process is also called record linkage, entity matching or entity resolution in the literature. For a long time, research on the process focused on matching entities from the same dataset (i.e. deduplication) or from two datasets. Different algorithms for matching different types of attributes have been described in the literature, developed and implemented in data matching and data cleansing platforms. Entity resolution is an element of the larger entity integration process, which includes data acquisition, data profiling, data cleansing, schema alignment, data matching and data merging (fusion).
As a motivating example, consider a global pharmaceutical company with offices in more than 60 countries worldwide that migrated customer data from various legacy systems in different countries to a new common CRM system in the cloud. The migration was phased by regions and countries, with new sources and data incrementally added and merged with data already migrated in previous phases. Entity integration in such a case requires a deep understanding of the data architectures, the data content and each step of the process. Even with such understanding, designing and implementing the solution requires many iterations in the development process that consume human, time and financial resources. Reducing the number of iterations by automating and optimizing steps in the process can therefore save a vast amount of resources. There is a lot of literature addressing individual steps of the process and proposing different options for improving results or optimizing processing, but the whole process still requires a lot of human work, subject-matter knowledge and many iterations to produce results with a high F-measure (both high precision and high recall). Most of the algorithms used in the various steps of the process are human-in-the-loop (HITL) algorithms that require human interaction: a human is always part of the loop and consequently influences the outcome.
This paper is part of work in progress aimed at defining a conceptual framework that automates and optimizes some steps of the entity integration process and reduces the need for human involvement. The focus of this paper is on the conceptual process definition, a recommended data architecture and the use of existing open source solutions for automating and optimizing the entity integration process.
1. Computer Science
Conceptual Framework for entity integration from multiple data sources
Dražen Oreščanin
Data Science Conference 4.0
Belgrade, 18.9.2018
3.
• Digital transformation
• Growing data volumes
• Increasing requirements related to data privacy
4. What is entity resolution and integration?
• Entity resolution is an operational intelligence process whereby organizations can connect data from disparate sources with a view to understanding possible entity matches and non-obvious relationships across multiple data silos. It analyzes all of the information relating to individuals and/or entities from multiple sources of data, and then applies likelihood and probability scoring to determine which identities are a match and what, if any, non-obvious relationships exist between those identities.
• Entity resolution is an element of the larger entity integration process, which includes data acquisition, data profiling, data cleansing, schema alignment, data matching and data fusion
5. Entity resolution and integration
[Figure: cleansing and fusion pipeline (Profile → Parse → Correct → Standardize → Match → Merge → Enhance). Source 1, the raw record "james smith, 1008 6th avenue suite 7, Manhattten, newyourk 10002", is parsed, corrected and standardized to "James Smith, 1008 Avenues of the Americas, Suite 7, Manhattan, New York 10018". Source 2, "Jim J. Smyth, Manhattan, NY 10018, jsmyth@mywork.com, (212) 755-2551", is standardized to the same address. The two records are then matched, merged and enhanced into a single golden record with a ZIP+4 code, geocoordinates and appended segmentation attributes (C_Category: Affluent Couples & Families, C_Group: Affluent Families).]
6. OK, and what is the problem?
• Data is stored in many different locations, sources and formats …
• … and it is not easy to find all the data that describe the same entity …
• … and it is even harder to resolve and consolidate that data into one golden record describing the entity …
• … while respecting different data privacy regulations …
• … and to make sure that this consolidated data will be regularly updated from new data in existing sources and from new sources
7. Data matching process
• Generic process for matching data from two datasets / databases
• Indexing and matching are based on rules implemented in algorithms
• Basic process that is always used as a base for entity integration

Source: Peter Christen, Data Matching – Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection, Springer Publishing, 2012.
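The generic two-dataset process described on this slide can be sketched in a few lines: indexing (blocking) partitions records so that only pairs sharing a block key are compared, and a rule-based matcher classifies the candidate pairs. The sample records, the ZIP-code blocking key and the first-two-letters surname rule are illustrative assumptions, not from the talk:

```python
from collections import defaultdict

def block(records, key):
    """Index records into blocks by a blocking key to avoid all-pairs comparison."""
    blocks = defaultdict(list)
    for r in records:
        blocks[key(r)].append(r)
    return blocks

def match(a_records, b_records, key, similar):
    """Compare only record pairs that share a block; return matching pairs."""
    a_blocks = block(a_records, key)
    pairs = []
    for b in b_records:
        for a in a_blocks.get(key(b), []):
            if similar(a, b):
                pairs.append((a, b))
    return pairs

A = [{"name": "James Smith", "zip": "10018"}]
B = [{"name": "Jim Smyth", "zip": "10018"}, {"name": "Ann Lee", "zip": "10002"}]

# Block on ZIP code; match when surnames agree on the first two letters.
result = match(A, B, key=lambda r: r["zip"],
               similar=lambda a, b: a["name"].split()[-1][:2] == b["name"].split()[-1][:2])
```

With this toy rule only the James Smith / Jim Smyth pair survives; Ann Lee is never even compared because she falls into a different block.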
8. Requirements for data matching
• Effectiveness: the main goal of entity matching is to achieve a high-quality match result with respect to recall and precision
• Efficiency: entity matching should be fast even for voluminous datasets
• Genericity, offline/online matching
• Low manual effort / self-tuning

Source: H. Köpcke and E. Rahm, "Frameworks for entity matching: A comparison," Data Knowl. Eng., vol. 69, no. 2, pp. 197–210, 2010.
9. Data matching research challenges
• Develop new and better algorithms for blocking and matching
• Improve F1 score and processing performance
• Many research teams are working on distinct algorithms / elements
• There is no complete "big picture"
10. Data matching real-life challenges
• Matching of more than two datasets
• Matching of datasets with undefined semantics
• Schema alignment of datasets with different structure
• Matching of real-time and streaming data
• Incremental matching
• Changes of matching rules
• Performance of matching large datasets
11. Human-in-the-loop (HITL)
• Humans are involved in a cycle where they train, tune and test a
particular algorithm
• Entity resolution and data matching are HITL problems!
• Human interaction is required for:
• Data labeling
• Algorithm selection
• Algorithm tuning
• Testing and validation of results
12. Some questions about important stuff
• How important is it to have a 1% increase in F1 score or a 5% performance improvement in a matching algorithm implemented in Python?
• What will be the savings and improvements in the time needed to deliver a solution to a business issue?
• Do business users know what they need to do, what the content of the data is and what algorithms they should use?
• The problems that users face in development (precision) and in production (performance) are different
• Academia and real-life activities are in many cases not synchronized!
13. Magellan project
• Open source entity matching (EM) system developed by the team at the University of Wisconsin-Madison
• https://sites.google.com/site/anhaidgroup/projects/magellan
• Magellan: Toward Building Entity Matching Management Systems (VLDB, 2016)
• Human-in-the-Loop Challenges for Entity Matching: A Midterm Report (HILDA, 2017)
14. Magellan findings and recommendations
• End users need a step-by-step, end-to-end how-to guide
• Tools for pain points (sampling, debugging …) should be developed
• Tools in the loop shall be combined with HITL activities
• Develop a how-to guide for concrete, complex real-life scenarios

Source: Doan, A. et al.: Human-in-the-Loop Challenges for Entity Matching: A Midterm Report, HILDA, 2017.
15. Frameworks for entity matching
• There are several existing frameworks focused on matching two datasets (algorithms for blocking and matching)
• Future work on frameworks should address the other important steps in the process, creating scenarios and tools for a complete solution that will enable business users to solve their problems in a fast and efficient way, without needing to know programming and algorithms
16. Scenarios are complex!
• Search engines
• Legacy and cloud migrations
• … it is much more than matching two datasets in real life:
• Schema alignment
• Data cleansing and preparation
• Order of resolution
• Incremental processing
• …
17. Real-life example
A global pharmaceutical company with offices in more than 60 countries worldwide has migrated customer data from various legacy systems in different countries to a new common CRM system in the cloud. The migration was phased by regions and countries, with new sources and data incrementally added and merged with data already migrated in previous phases. Challenges included:
• Different source schemas – source datasets with different attributes
• Different levels of data quality
• Dataset sizes differing by several orders of magnitude
• Different languages
18. Order of resolution challenge
• You have many datasets that must be resolved in initial processing, or new datasets arriving for incremental processing
• How do you determine the order of resolution, i.e. which datasets should be matched and resolved first to get the best score and performance?
19. Simplified sample scenario
• Defined target schema
• Three initial source datasets with various attributes
• Attributes common to more than one dataset are used for matching
Attribute matrix (x = attribute present in the source; one column assignment consistent with the overlap factors on the following slides):

     D1 D2 D3 D4 D5 D6 D7 D8 D9 D10
S1    x  x  x  x  x  x  x
S2             x  x  x  x  x  x
S3                         x  x  x

Dataset  Attributes  Size
S1       7           250
S2       6           500
S3       3           1,000
M        10          1,750
20. Important attributes of datasets
• A source dataset with more attributes is a better candidate to be matched earlier in the process – completeness
• When matching two datasets, the more common attributes we can match in both datasets, the higher the probability of getting higher scores – overlap
• Datasets can have very different numbers of records. Matching a large dataset many times requires more computing power, so for optimal processing smaller datasets should be matched first – inverse size
• We need to take into account the quality and completeness of data in each source dataset; datasets with higher quality should be matched first – accuracy
21. Source dataset completeness
The source dataset completeness factor CSi of dataset Si is the ratio ni / m of the number of attributes in source schema Si that can be mapped to attributes in the unified schema and the number of attributes in the unified schema M:

CSi = |Si| / |M|

Source dataset completeness factors:
• CS1 = 7 / 10 = 0.7
• CS2 = 6 / 10 = 0.6
• CS3 = 3 / 10 = 0.3
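The completeness factors above are a direct ratio, so the computation for the sample schemas is a one-liner:

```python
# Completeness factor C_Si = |Si| / |M| for the sample scenario:
# M has 10 attributes; S1, S2, S3 map 7, 6 and 3 of them.
M = 10
mapped = {"S1": 7, "S2": 6, "S3": 3}
completeness = {s: n / M for s, n in mapped.items()}
# completeness == {"S1": 0.7, "S2": 0.6, "S3": 0.3}
```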
22. Source dataset overlap factor
The source dataset overlap factor OSij between two source datasets is the number of attributes common to both source datasets over the number of attributes in the dataset with more attributes:

OSij = |Si ∩ Sj| / max(|Si|, |Sj|)

Source dataset overlap factors:
• OS12 = 4 / 7 = 0.5714
• OS23 = 2 / 6 = 0.3333
• OS13 = 0 / 7 = 0
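The overlap factor reproduces these values when the attributes are held as Python sets; the concrete column assignment (S1 = D1–D7, S2 = D4–D9, S3 = D8–D10) is an assumption consistent with the stated factors:

```python
# Overlap factor O_Sij = |Si ∩ Sj| / max(|Si|, |Sj|).
S1 = {"D1", "D2", "D3", "D4", "D5", "D6", "D7"}
S2 = {"D4", "D5", "D6", "D7", "D8", "D9"}
S3 = {"D8", "D9", "D10"}

def overlap(a, b):
    """Shared attributes relative to the larger of the two schemas."""
    return len(a & b) / max(len(a), len(b))

# overlap(S1, S2) ≈ 0.5714, overlap(S2, S3) ≈ 0.3333, overlap(S1, S3) == 0.0
```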
23. Source dataset inverse size factor
The source dataset inverse size factor ISSi of dataset Si is the inverse ratio of the number of records in Si to the sum of the numbers of records in all source datasets, so that smaller datasets are scored higher. The merge coefficient µ is a correction for the number of assumed duplicates in the merging process:

ISSi = 1 − len(Si) / (Σ j=1..n len(Sj) × µ)

Source dataset inverse size factors (µ = 1):
• ISS1 = 1 – (250 / 1750) = 0.8571
• ISS2 = 1 – (500 / 1750) = 0.7143
• ISS3 = 1 – (1000 / 1750) = 0.4286
24. Source dataset accuracy factor
The source dataset accuracy factor ASi ∈ (0..1] is a correction factor used to weight the quality of data in a source dataset, with 1 representing data with the highest possible quality.

The dataset accuracy factor can optionally be used to weight the completeness, overlap and inverse size of source datasets when establishing the optimal order of resolution. Accuracy can be obtained from the team responsible for the data source, or it can be calculated from source profiling scores.
25. Order of resolution algorithm
• Based on the defined dataset factors, the algorithm finds the optimal order of resolution
• Each pair of resolved datasets is removed from the set of datasets and the resulting dataset is added
• The order of resolution is agnostic to the blocking and matching algorithms used for entity resolution
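The greedy shape of the algorithm can be sketched as follows. Each remaining pair of datasets is scored by combining the four factors, the best pair is resolved, and the merged result re-enters the pool. The equal-weight product used as the pair score, the pair-level variants of completeness and inverse size, and taking the minimum accuracy for a merged result are all assumptions for illustration; the talk does not fix how the factors are combined:

```python
def resolution_order(datasets, M, mu=1.0):
    """Greedy order of resolution.

    datasets: name -> (attribute set, record count, accuracy in (0..1]).
    M: number of attributes in the unified schema.
    """
    ds = dict(datasets)
    order = []
    step = 0
    while len(ds) > 1:
        total = sum(n for _, n, _ in ds.values()) * mu

        def score(a, b):
            sa, na, aa = ds[a]
            sb, nb, ab = ds[b]
            completeness = (len(sa) + len(sb)) / (2 * M)       # pair-level C
            overlap = len(sa & sb) / max(len(sa), len(sb))     # O_Sij
            inv_size = 1 - (na + nb) / (2 * total)             # pair-level IS
            return completeness * overlap * inv_size * aa * ab

        names = list(ds)
        a, b = max(((x, y) for i, x in enumerate(names) for y in names[i + 1:]),
                   key=lambda p: score(*p))
        order.append((a, b))
        sa, na, aa = ds.pop(a)
        sb, nb, ab = ds.pop(b)
        step += 1
        # The merged result inherits the union of attributes, the combined
        # record count and (pessimistically) the lower accuracy.
        ds["R%d" % step] = (sa | sb, na + nb, min(aa, ab))
    return order

# The sample scenario: S1/S2 overlap strongly and S1/S3 not at all,
# so S1 and S2 are resolved first, then the result is matched with S3.
datasets = {
    "S1": ({"D1", "D2", "D3", "D4", "D5", "D6", "D7"}, 250, 1.0),
    "S2": ({"D4", "D5", "D6", "D7", "D8", "D9"}, 500, 1.0),
    "S3": ({"D8", "D9", "D10"}, 1000, 1.0),
}
order = resolution_order(datasets, M=10)
```

Because each resolved pair is replaced by its merged dataset, the pool shrinks by one dataset per step and the loop terminates after n − 1 resolutions.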
26. Work in progress
• Testing and experiments
• Publication
• Integration with other processes (schema discovery, data cleansing,
performance optimization, incremental resolution)
27. Conclusions
• Entity resolution is a widely used and complex process
• Academic and real-life challenges are diverse
• A lot of work should be invested in building HITL tools and scenarios that will make real-life problems easier and faster to solve