Computer Science
Conceptual Framework for entity
integration from multiple data sources
Dražen Oreščanin
Data Science Conference 4.0
Belgrade, 18.9.2018
• Digital transformation
• Growing data volumes
• Increasing requirements related to data privacy
What is entity resolution and integration?
• Entity resolution is an operational intelligence process, whereby
organizations can connect data from disparate sources with a view to
understanding possible entity matches and non-obvious
relationships across multiple data silos. It analyzes all of the
information relating to individuals and/or entities from multiple
sources of data, and then applies likelihood and probability scoring
to determine which identities are a match and what, if any, non-
obvious relationships exist between those identities.
• Entity resolution is one element of a larger entity integration process that
includes data acquisition, data profiling, data cleansing, schema
alignment, data matching and data fusion
Entity resolution and integration
[Figure: two source records passing through the stages Profile, Parse, Correct, Standardize, Match, Merge and Enhance; the phases are labelled Cleansing and Fusion]
Source 1 (raw input):
james smith
1008 6th avenue suite 7
Manhattten, newyourk 10002
Source 1 (after cleansing):
First Name: James
Last Name: Smith
AddressL1: 1008 Avenue of the Americas
AddressL2: Suite 7
City: Manhattan
State: New York
Zip Code: 10018
Source 2 (raw input):
Jim J. Smyth
Manhattan, NY 10018
jsmyth@mywork.com
(212) 755-2551
After Match and Merge:
First Name: Jim
Mid Name: J.
Last Name: Smyth
AddressL1: 1008 Avenue of the Americas
AddressL2: Suite 7
City: Manhattan
State: New York
Zip Code: 10018
Phone: (212) 755-2551
Email: jsmyth@mywork.com
After Enhance:
First Name: Jim
Mid Name: J.
Last Name: Smyth
AddressL1: 1008 Avenue of the Americas
AddressL2: Suite 7
City: Manhattan
State: New York
Zip Code: 10018-5402
Latitude: 40.7325525
Longitude: -74.004970
Phone: (212) 755-2551
Email: jsmyth@mywork.com
C_Category: Affluent Couples & Families
C_Group: Affluent Families
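The standardize-then-compare idea behind the example above can be sketched in a few lines of Python. This is only an illustration, not the tooling used in the talk: the helper names `standardize` and `similarity` are hypothetical, and the standard library's `difflib.SequenceMatcher` stands in for whatever string similarity measure a real project would choose.

```python
from difflib import SequenceMatcher


def standardize(value: str) -> str:
    """Lowercase, trim and collapse whitespace so formatting noise does not affect the score."""
    return " ".join(value.lower().split())


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0 and 1 for two standardized strings."""
    return SequenceMatcher(None, standardize(a), standardize(b)).ratio()


# Field values taken from the two raw source records on the slide.
source_1 = {"name": "james smith", "city": "Manhattten", "zip": "10002"}
source_2 = {"name": "Jim J. Smyth", "city": "Manhattan", "zip": "10018"}

# Per-field likelihood scores; a real matcher would combine them into one decision.
scores = {field: similarity(source_1[field], source_2[field]) for field in source_1}

for field, score in scores.items():
    print(f"{field}: {score:.2f}")
```

The misspelled city still scores high because only a couple of characters differ, which is why the slide's Correct and Standardize steps come before Match.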
OK, and what is the problem?
• Data is stored in many different locations, sources and formats …
• … and it is not easy to find all the data that describes the same entity …
• … and it is even harder to resolve and consolidate that data into one
golden record describing that entity …
• … while respecting different data privacy regulations …
• … and to make sure that the consolidated data is regularly updated
with new data from existing sources and from new sources
Data Matching process
• Generic process for matching data
from two datasets / databases
• Indexing and Matching are based on
rules implemented in algorithms
• Basic process that is always used as a
base for entity integration
Source: Peter Christen, Data Matching - Concepts
and Techniques for Record Linkage, Entity Resolution,
and Duplicate Detection, Springer Publishing, 2012.
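A minimal sketch of the indexing and matching steps, assuming records are plain dictionaries. The blocking key (ZIP code), the name-only comparison and the 0.5 threshold are illustrative choices, not rules from the cited book, and all function names are hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def block_key(record):
    """Hypothetical blocking rule: records are only compared if they share a ZIP code."""
    return record["zip"]


def candidate_pairs(dataset_a, dataset_b):
    """Indexing step: yield only the record pairs that fall into the same block."""
    index = defaultdict(list)
    for rec_b in dataset_b:
        index[block_key(rec_b)].append(rec_b)
    for rec_a in dataset_a:
        for rec_b in index.get(block_key(rec_a), []):
            yield rec_a, rec_b


def compare(rec_a, rec_b):
    """Comparison step: similarity of the name fields, between 0 and 1."""
    return SequenceMatcher(None, rec_a["name"].lower(), rec_b["name"].lower()).ratio()


def match(dataset_a, dataset_b, threshold=0.5):
    """Classification step: keep the candidate pairs whose score reaches the threshold."""
    return [
        (a["id"], b["id"])
        for a, b in candidate_pairs(dataset_a, dataset_b)
        if compare(a, b) >= threshold
    ]


source_a = [
    {"id": "a1", "name": "James Smith", "zip": "10018"},
    {"id": "a2", "name": "Ana Kovac", "zip": "10000"},
]
source_b = [
    {"id": "b1", "name": "Jim J. Smyth", "zip": "10018"},
    {"id": "b2", "name": "Ana Kovač", "zip": "10000"},
    {"id": "b3", "name": "James Smith", "zip": "11000"},
]

print(match(source_a, source_b))
```

Record `b3` has the same name as `a1` but is never compared with it, because the two fall into different blocks. Blocking reduces the number of comparisons at the price of possibly missing true matches.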
Requirements for data matching
• Effectiveness: The main goal of entity matching is to achieve a high-
quality match result with respect to recall and precision
• Efficiency: Entity matching should be fast even for voluminous
datasets
• Genericity, offline/online matching
• Low manual effort/self-tuning
Source: H. Köpcke and E. Rahm, “Frameworks for entity matching: A comparison,” Data Knowl. Eng.,
vol. 69, no. 2, pp. 197–210, 2010.
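Precision and recall, together with the F1 score that combines them, can be computed from a set of predicted match pairs and a set of known true pairs. The sketch below uses the standard definitions; the function name `evaluate` and the sample pairs are made up for illustration.

```python
def evaluate(predicted, truth):
    """Return (precision, recall, f1) for predicted match pairs against true match pairs."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical matcher output and labelled ground truth.
predicted = {("a1", "b1"), ("a2", "b2"), ("a3", "b9")}
truth = {("a1", "b1"), ("a2", "b2"), ("a4", "b4"), ("a5", "b5")}

precision, recall, f1 = evaluate(predicted, truth)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
# precision=0.6667 recall=0.5000 f1=0.5714
```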
Data matching research challenges
• Develop new and better algorithms for blocking
and matching
• Improve F1 score and processing performance
• Many research teams are working on distinct
algorithms / elements
• There is no complete "big picture"
Data matching real life challenges
• Matching of more than two datasets
• Matching of datasets with undefined semantics
• Schema alignment of datasets with different structure
• Matching of real-time and streaming data
• Incremental matching
• Changes of matching rules
• Performance of matching large datasets
Human-in-the-loop (HITL)
• Humans are involved in a cycle where they train, tune and test a
particular algorithm
• Entity resolution and data matching are HITL problems!
• Human interaction is required for:
• Data labeling
• Algorithm selection
• Algorithm tuning
• Testing and validation of results
Some questions about important stuff
• How important is it to have a 1% increase in F1 score or a 5% improvement
in the performance of a matching algorithm implemented in Python?
• What will be the savings and improvements in the time needed to deliver
a solution to the business issue?
• Do business users know what they need to do, what is the content of
the data and what algorithms they should use?
• Problems that users are facing in development (precision) and
production (performance) are different
• Academia and real life activities in many cases are not synchronized!
Magellan project
• Open source EM system developed by the team at University of
Wisconsin, Madison
• https://sites.google.com/site/anhaidgroup/projects/magellan
• Magellan: Toward Building Entity Matching Management Systems
(VLDB, 2016)
• Human-in-the-Loop Challenges for Entity Matching: A Midterm
Report (HILDA, 2017)
Magellan findings and recommendations
• End users need a Step-by-Step & End-to-End How-To guide
• Tools for Pain Points (sampling, debugging…) should be developed
• Tools in the Loop shall be combined with HITL activities
• Develop How-To guides for concrete, complex real-life scenarios
Source: Doan, A. et al., Human-in-the-Loop Challenges for Entity
Matching: A Midterm Report, HILDA, 2017
Frameworks for entity matching
• There are several existing frameworks focused on matching two
datasets (algorithms for blocking and matching)
• Future work on frameworks should address the other important steps in
the process and create scenarios and tools for a complete solution that
will enable business users to solve their problems quickly and efficiently,
without needing to know programming and algorithms
Scenarios are complex!
• Search engines
• Legacy and cloud migrations
• … it is much more than matching two datasets in real life:
• Schema alignment
• Data cleansing and preparation
• Order of resolution
• Incremental processing
• …
Real life example
A global pharmaceutical company with offices in more than 60 countries
worldwide has migrated customer data from various legacy systems in
different countries to a new common CRM system in the cloud.
The migration was phased by regions and countries, with new sources and
data incrementally added and merged with data already migrated in
previous phases. Challenges included:
• Different source schemas – source datasets with different attributes
• Different levels of data quality
• Dataset sizes differing by several orders of magnitude
• Different languages
Order of resolution challenge
• You have many datasets to resolve in initial processing, or
new datasets to resolve in incremental processing
• How do you determine the order of resolution, i.e. which datasets should be
matched and resolved first to get the best score and performance?
Simplified sample scenario
• Defined target schema
• Three initial source datasets with various attributes
• Attributes common to more than one dataset are used for matching
D1 D2 D3 D4 D5 D6 D7 D8 D9 D10
S1 x x x x x x x
S2 x x x x x x
S3 x x x
Dataset   Attributes   Size (records)
S1        7            250
S2        6            500
S3        3            1000
M         10           1750
Important attributes of datasets
• A source dataset with more attributes is a better candidate to be matched
earlier in the process - completeness
• When matching two datasets, the more common attributes can be matched
in both, the higher the probability of getting high scores - overlap
• Datasets can have different numbers of records. Matching a large dataset
many times requires more computing power, so for optimal processing
smaller datasets should be matched first – inverse size
• The quality and completeness of data in each source dataset must be taken into
account, and datasets with higher quality should be matched first - accuracy
Source dataset completeness
The source dataset completeness factor CSi of dataset Si is the ratio ni / m of the
number of attributes in source schema Si that can be mapped to attributes in the
unified schema to the number of attributes in the unified schema M.
CSi = |Si| / |M|
Source dataset completeness factors:
• CS1 = 7 / 10 = 0.7
• CS2 = 6 / 10 = 0.6
• CS3 = 3 / 10 = 0.3
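The completeness factor can be checked with a short Python sketch. The attribute sets below are an assumption: they are one layout that is consistent with the attribute counts and overlaps given for the sample scenario, not necessarily the exact columns marked on the original slide. The function name `completeness` is hypothetical.

```python
# Unified schema M with attributes D1..D10.
UNIFIED_SCHEMA = {f"D{i}" for i in range(1, 11)}

# Assumed attribute layout (7, 6 and 3 attributes, as in the sample scenario).
SOURCES = {
    "S1": {f"D{i}" for i in range(1, 8)},   # D1..D7
    "S2": {f"D{i}" for i in range(4, 10)},  # D4..D9
    "S3": {f"D{i}" for i in range(8, 11)},  # D8..D10
}


def completeness(source_attrs, unified_attrs):
    """CSi = |Si| / |M|, counting only attributes that map into the unified schema."""
    return len(source_attrs & unified_attrs) / len(unified_attrs)


for name, attrs in SOURCES.items():
    print(name, round(completeness(attrs, UNIFIED_SCHEMA), 4))
# S1 0.7
# S2 0.6
# S3 0.3
```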
Source dataset overlap factor
The source dataset overlap factor OSij between two source datasets is the number of
attributes common to both source datasets divided by the number of attributes in
the dataset with more attributes.
OSij = |Si ∩ Sj| / max(|Si|, |Sj|)
Source dataset overlap factors:
• OS12 = 4 / 7 = 0.5714
• OS23 = 2 / 6 = 0.3333
• OS13 = 0 / 7 = 0
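A small Python sketch of the overlap factor, using Python sets for the attribute lists. As before, the attribute layout is an assumption chosen to reproduce the counts in the sample scenario (4 attributes shared by S1 and S2, 2 by S2 and S3, none by S1 and S3), and the function name `overlap` is hypothetical.

```python
# Assumed attribute layout consistent with the sample scenario.
S1 = {f"D{i}" for i in range(1, 8)}   # D1..D7
S2 = {f"D{i}" for i in range(4, 10)}  # D4..D9
S3 = {f"D{i}" for i in range(8, 11)}  # D8..D10


def overlap(attrs_i, attrs_j):
    """OSij = |Si ∩ Sj| / max(|Si|, |Sj|)."""
    return len(attrs_i & attrs_j) / max(len(attrs_i), len(attrs_j))


print(round(overlap(S1, S2), 4))  # 0.5714
print(round(overlap(S2, S3), 4))  # 0.3333
print(round(overlap(S1, S3), 4))  # 0.0
```

The factor is symmetric, so `overlap(S1, S2)` and `overlap(S2, S1)` give the same value.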
Source dataset inverse size factor
The source dataset inverse size factor ISSi of dataset Si is one minus the ratio of the
number of records in Si to the total number of records in all n source datasets, so
smaller datasets score higher. The merge coefficient µ corrects the total for the number
of duplicates assumed to be removed in the merging process.
$$IS_{S_i} = 1 - \frac{\mathrm{len}(S_i)}{\mu \sum_{j=1}^{n} \mathrm{len}(S_j)}$$
Source dataset inverse size factors (µ = 1):
• ISS1 = 1 – (250 / 1750) = 0.8571
• ISS2 = 1 – (500 / 1750) = 0.7143
• ISS3 = 1 – (1000 / 1750) = 0.4286
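The inverse size factor can be reproduced with the record counts from the sample scenario. The function name `inverse_size` and the dictionary layout are illustrative assumptions.

```python
# Record counts per source dataset, from the sample scenario.
SIZES = {"S1": 250, "S2": 500, "S3": 1000}


def inverse_size(sizes, name, mu=1.0):
    """ISSi = 1 - len(Si) / (mu * sum of len(Sj) over all source datasets)."""
    return 1 - sizes[name] / (mu * sum(sizes.values()))


for name in SIZES:
    print(name, round(inverse_size(SIZES, name), 4))
# S1 0.8571
# S2 0.7143
# S3 0.4286
```

With µ below 1 the denominator shrinks, so every dataset's factor drops; the relative ordering of the datasets stays the same.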
Source dataset accuracy factor
The source dataset accuracy factor ASi ∈ (0..1) is a correction factor used to weight
the quality of data in a source dataset, with 1 representing data of the highest possible
quality.
The dataset accuracy factor is optionally used to weight the completeness, overlap
and inverse size of source datasets to establish the optimal order of resolution.
Accuracy can be retrieved from the team responsible for the data source, or it can be
calculated from source profiling scores.
Order of resolution algorithm
• Based on the dataset attributes defined above,
the algorithm finds the optimal order of
resolution
• Each pair of resolved datasets is removed
from the set of datasets and the resulting dataset is
added
• The order of resolution is agnostic to the blocking
and matching algorithms used for entity
resolution
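The loop described above (pick a pair, resolve it, put the result back) can be sketched as a greedy procedure. The slides do not say how the factors are combined into a single score, so the product used in `pair_score` is purely a hypothetical choice, as are the function names, the attribute layout and the way µ shrinks the merged dataset. The accuracy factor is left out for brevity.

```python
from itertools import combinations


def pair_score(a, b, unified_size, total_size, mu=1.0):
    """Hypothetical score: overlap x mean completeness x mean inverse size."""
    attrs_a, size_a = a
    attrs_b, size_b = b
    overlap = len(attrs_a & attrs_b) / max(len(attrs_a), len(attrs_b))
    completeness = (len(attrs_a) + len(attrs_b)) / (2 * unified_size)
    inverse_size = 1 - (size_a + size_b) / (2 * mu * total_size)
    return overlap * completeness * inverse_size


def resolution_order(datasets, unified_size, mu=1.0):
    """Greedily resolve the best-scoring pair until one dataset remains."""
    pool = dict(datasets)
    order = []
    while len(pool) > 1:
        total = sum(size for _, size in pool.values())
        left, right = max(
            combinations(sorted(pool), 2),
            key=lambda p: pair_score(pool[p[0]], pool[p[1]], unified_size, total, mu),
        )
        attrs_l, size_l = pool.pop(left)
        attrs_r, size_r = pool.pop(right)
        # Assumption: the merged dataset has the union of attributes and a
        # record count reduced by the merge coefficient mu.
        pool[f"{left}+{right}"] = (attrs_l | attrs_r, round((size_l + size_r) * mu))
        order.append((left, right))
    return order


# Assumed attribute layout consistent with the sample scenario.
DATASETS = {
    "S1": ({f"D{i}" for i in range(1, 8)}, 250),
    "S2": ({f"D{i}" for i in range(4, 10)}, 500),
    "S3": ({f"D{i}" for i in range(8, 11)}, 1000),
}

print(resolution_order(DATASETS, unified_size=10))
# [('S1', 'S2'), ('S1+S2', 'S3')]
```

Under these assumptions S1 and S2 are resolved first because they share the most attributes and are the smallest datasets; the merged result is then resolved against S3. The blocking and matching algorithm used inside each resolution step does not appear in the sketch at all.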
Work in progress
• Testing and experiments
• Publication
• Integration with other processes (schema discovery, data cleansing,
performance optimization, incremental resolution)
Conclusions
• Entity resolution is a widely used and complex process
• Academic and real-life challenges are diverse
• A lot of work should be invested in building HITL tools and scenarios
that will make real-life problems easier and faster to solve
Dražen Oreščanin
drazen.orescanin@fer.hr
drazen.orescanin@inteligencija.com

Más contenido relacionado

La actualidad más candente

Big data issues and challenges
Big data issues and challengesBig data issues and challenges
Big data issues and challengesDilpreet kaur Virk
 
Data Mining vs Statistics
Data Mining vs StatisticsData Mining vs Statistics
Data Mining vs StatisticsAndry Alamsyah
 
Tutorial Data Management and workflows
Tutorial Data Management and workflowsTutorial Data Management and workflows
Tutorial Data Management and workflowsSSSW
 
TTG Int.LTD Data Mining Technique
TTG Int.LTD Data Mining TechniqueTTG Int.LTD Data Mining Technique
TTG Int.LTD Data Mining TechniqueMehmet Beyaz
 
Big Data and Data Science: The Technologies Shaping Our Lives
Big Data and Data Science: The Technologies Shaping Our LivesBig Data and Data Science: The Technologies Shaping Our Lives
Big Data and Data Science: The Technologies Shaping Our LivesRukshan Batuwita
 
The Rensselaer IDEA: Data Exploration
The Rensselaer IDEA: Data Exploration The Rensselaer IDEA: Data Exploration
The Rensselaer IDEA: Data Exploration James Hendler
 
Big data and data science overview
Big data and data science overviewBig data and data science overview
Big data and data science overviewColleen Farrelly
 
Session 01 designing and scoping a data science project
Session 01 designing and scoping a data science projectSession 01 designing and scoping a data science project
Session 01 designing and scoping a data science projectbodaceacat
 
Big-data analytics: challenges and opportunities
Big-data analytics: challenges and opportunitiesBig-data analytics: challenges and opportunities
Big-data analytics: challenges and opportunities台灣資料科學年會
 
Big Data Science: Intro and Benefits
Big Data Science: Intro and BenefitsBig Data Science: Intro and Benefits
Big Data Science: Intro and BenefitsChandan Rajah
 
Issues, challenges, and solutions
Issues, challenges, and solutionsIssues, challenges, and solutions
Issues, challenges, and solutionscsandit
 
Demystifying Data Science with an introduction to Machine Learning
Demystifying Data Science with an introduction to Machine LearningDemystifying Data Science with an introduction to Machine Learning
Demystifying Data Science with an introduction to Machine LearningJulian Bright
 
Recommendation system using bloom filter in mapreduce
Recommendation system using bloom filter in mapreduceRecommendation system using bloom filter in mapreduce
Recommendation system using bloom filter in mapreduceIJDKP
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceANOOP V S
 
Lect 1 introduction
Lect 1 introductionLect 1 introduction
Lect 1 introductionhktripathy
 
Misusability Measure Based Sanitization of Big Data for Privacy Preserving Ma...
Misusability Measure Based Sanitization of Big Data for Privacy Preserving Ma...Misusability Measure Based Sanitization of Big Data for Privacy Preserving Ma...
Misusability Measure Based Sanitization of Big Data for Privacy Preserving Ma...IJECEIAES
 
An Efficient Approach for Clustering High Dimensional Data
An Efficient Approach for Clustering High Dimensional DataAn Efficient Approach for Clustering High Dimensional Data
An Efficient Approach for Clustering High Dimensional DataIJSTA
 
Data science presentation
Data science presentationData science presentation
Data science presentationMSDEVMTL
 

La actualidad más candente (20)

Big data issues and challenges
Big data issues and challengesBig data issues and challenges
Big data issues and challenges
 
Data Mining
Data MiningData Mining
Data Mining
 
Data Mining vs Statistics
Data Mining vs StatisticsData Mining vs Statistics
Data Mining vs Statistics
 
Tutorial Data Management and workflows
Tutorial Data Management and workflowsTutorial Data Management and workflows
Tutorial Data Management and workflows
 
TTG Int.LTD Data Mining Technique
TTG Int.LTD Data Mining TechniqueTTG Int.LTD Data Mining Technique
TTG Int.LTD Data Mining Technique
 
Big Data and Data Science: The Technologies Shaping Our Lives
Big Data and Data Science: The Technologies Shaping Our LivesBig Data and Data Science: The Technologies Shaping Our Lives
Big Data and Data Science: The Technologies Shaping Our Lives
 
Introduction to Data Analytics
Introduction to Data AnalyticsIntroduction to Data Analytics
Introduction to Data Analytics
 
The Rensselaer IDEA: Data Exploration
The Rensselaer IDEA: Data Exploration The Rensselaer IDEA: Data Exploration
The Rensselaer IDEA: Data Exploration
 
Big data and data science overview
Big data and data science overviewBig data and data science overview
Big data and data science overview
 
Session 01 designing and scoping a data science project
Session 01 designing and scoping a data science projectSession 01 designing and scoping a data science project
Session 01 designing and scoping a data science project
 
Big-data analytics: challenges and opportunities
Big-data analytics: challenges and opportunitiesBig-data analytics: challenges and opportunities
Big-data analytics: challenges and opportunities
 
Big Data Science: Intro and Benefits
Big Data Science: Intro and BenefitsBig Data Science: Intro and Benefits
Big Data Science: Intro and Benefits
 
Issues, challenges, and solutions
Issues, challenges, and solutionsIssues, challenges, and solutions
Issues, challenges, and solutions
 
Demystifying Data Science with an introduction to Machine Learning
Demystifying Data Science with an introduction to Machine LearningDemystifying Data Science with an introduction to Machine Learning
Demystifying Data Science with an introduction to Machine Learning
 
Recommendation system using bloom filter in mapreduce
Recommendation system using bloom filter in mapreduceRecommendation system using bloom filter in mapreduce
Recommendation system using bloom filter in mapreduce
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Lect 1 introduction
Lect 1 introductionLect 1 introduction
Lect 1 introduction
 
Misusability Measure Based Sanitization of Big Data for Privacy Preserving Ma...
Misusability Measure Based Sanitization of Big Data for Privacy Preserving Ma...Misusability Measure Based Sanitization of Big Data for Privacy Preserving Ma...
Misusability Measure Based Sanitization of Big Data for Privacy Preserving Ma...
 
An Efficient Approach for Clustering High Dimensional Data
An Efficient Approach for Clustering High Dimensional DataAn Efficient Approach for Clustering High Dimensional Data
An Efficient Approach for Clustering High Dimensional Data
 
Data science presentation
Data science presentationData science presentation
Data science presentation
 

Similar a Conceptual framework for entity integration from multiple data sources - Drazen Orescanin

Introducition to Data scinece compiled by hu
Introducition to Data scinece compiled by huIntroducition to Data scinece compiled by hu
Introducition to Data scinece compiled by huwekineheshete
 
Introduction to Business and Data Analysis Undergraduate.pdf
Introduction to Business and Data Analysis Undergraduate.pdfIntroduction to Business and Data Analysis Undergraduate.pdf
Introduction to Business and Data Analysis Undergraduate.pdfAbdulrahimShaibuIssa
 
DataSpryng Overview
DataSpryng OverviewDataSpryng Overview
DataSpryng Overviewjkvr
 
Data science.chapter-1,2,3
Data science.chapter-1,2,3Data science.chapter-1,2,3
Data science.chapter-1,2,3varshakumar21
 
Solving the Disconnected Data Problem in Healthcare Using MongoDB
Solving the Disconnected Data Problem in Healthcare Using MongoDBSolving the Disconnected Data Problem in Healthcare Using MongoDB
Solving the Disconnected Data Problem in Healthcare Using MongoDBMongoDB
 
Gse uk-cedrinemadera-2018-shared
Gse uk-cedrinemadera-2018-sharedGse uk-cedrinemadera-2018-shared
Gse uk-cedrinemadera-2018-sharedcedrinemadera
 
Data Collaboration Stack
Data Collaboration StackData Collaboration Stack
Data Collaboration StackPierre Brunelle
 
Big Data Matching - How to Find Two Similar Needles in a Really Big Haystack
Big Data Matching - How to Find Two Similar Needles in a Really Big HaystackBig Data Matching - How to Find Two Similar Needles in a Really Big Haystack
Big Data Matching - How to Find Two Similar Needles in a Really Big HaystackPrecisely
 
Pragmatics Driven Issues in Data and Process Integrity in Enterprises
Pragmatics Driven Issues in Data and Process Integrity in EnterprisesPragmatics Driven Issues in Data and Process Integrity in Enterprises
Pragmatics Driven Issues in Data and Process Integrity in EnterprisesAmit Sheth
 
Square Pegs In Round Holes: Rethinking Data Availability in the Age of Automa...
Square Pegs In Round Holes: Rethinking Data Availability in the Age of Automa...Square Pegs In Round Holes: Rethinking Data Availability in the Age of Automa...
Square Pegs In Round Holes: Rethinking Data Availability in the Age of Automa...Denodo
 
Data Profiling: The First Step to Big Data Quality
Data Profiling: The First Step to Big Data QualityData Profiling: The First Step to Big Data Quality
Data Profiling: The First Step to Big Data QualityPrecisely
 
Data-Ed: Unlock Business Value through Data Quality Engineering
Data-Ed: Unlock Business Value through Data Quality Engineering Data-Ed: Unlock Business Value through Data Quality Engineering
Data-Ed: Unlock Business Value through Data Quality Engineering Data Blueprint
 
Data-Ed: Unlock Business Value through Data Quality Engineering
Data-Ed: Unlock Business Value through Data Quality EngineeringData-Ed: Unlock Business Value through Data Quality Engineering
Data-Ed: Unlock Business Value through Data Quality EngineeringDATAVERSITY
 
From Data Platforms to Dataspaces: Enabling Data Ecosystems for Intelligent S...
From Data Platforms to Dataspaces: Enabling Data Ecosystems for Intelligent S...From Data Platforms to Dataspaces: Enabling Data Ecosystems for Intelligent S...
From Data Platforms to Dataspaces: Enabling Data Ecosystems for Intelligent S...Edward Curry
 
Data Science Introduction: Concepts, lifecycle, applications.pptx
Data Science Introduction: Concepts, lifecycle, applications.pptxData Science Introduction: Concepts, lifecycle, applications.pptx
Data Science Introduction: Concepts, lifecycle, applications.pptxsumitkumar600840
 
Zen and the Art of Datanauting
Zen and the Art of DatanautingZen and the Art of Datanauting
Zen and the Art of DatanautingOntologySystems
 
Azure Databricks for Data Scientists
Azure Databricks for Data ScientistsAzure Databricks for Data Scientists
Azure Databricks for Data ScientistsRichard Garris
 
Analyst Keynote: Delivering Faster Insights with a Logical Data Fabric in a H...
Analyst Keynote: Delivering Faster Insights with a Logical Data Fabric in a H...Analyst Keynote: Delivering Faster Insights with a Logical Data Fabric in a H...
Analyst Keynote: Delivering Faster Insights with a Logical Data Fabric in a H...Denodo
 

Similar a Conceptual framework for entity integration from multiple data sources - Drazen Orescanin (20)

Introducition to Data scinece compiled by hu
Introducition to Data scinece compiled by huIntroducition to Data scinece compiled by hu
Introducition to Data scinece compiled by hu
 
Introduction to Business and Data Analysis Undergraduate.pdf
Introduction to Business and Data Analysis Undergraduate.pdfIntroduction to Business and Data Analysis Undergraduate.pdf
Introduction to Business and Data Analysis Undergraduate.pdf
 
DataSpryng Overview
DataSpryng OverviewDataSpryng Overview
DataSpryng Overview
 
Data science unit1
Data science unit1Data science unit1
Data science unit1
 
Data science.chapter-1,2,3
Data science.chapter-1,2,3Data science.chapter-1,2,3
Data science.chapter-1,2,3
 
Solving the Disconnected Data Problem in Healthcare Using MongoDB
Solving the Disconnected Data Problem in Healthcare Using MongoDBSolving the Disconnected Data Problem in Healthcare Using MongoDB
Solving the Disconnected Data Problem in Healthcare Using MongoDB
 
Gse uk-cedrinemadera-2018-shared
Gse uk-cedrinemadera-2018-sharedGse uk-cedrinemadera-2018-shared
Gse uk-cedrinemadera-2018-shared
 
Data Collaboration Stack
Data Collaboration StackData Collaboration Stack
Data Collaboration Stack
 
Big Data Matching - How to Find Two Similar Needles in a Really Big Haystack
Big Data Matching - How to Find Two Similar Needles in a Really Big HaystackBig Data Matching - How to Find Two Similar Needles in a Really Big Haystack
Big Data Matching - How to Find Two Similar Needles in a Really Big Haystack
 
Pragmatics Driven Issues in Data and Process Integrity in Enterprises
Pragmatics Driven Issues in Data and Process Integrity in EnterprisesPragmatics Driven Issues in Data and Process Integrity in Enterprises
Pragmatics Driven Issues in Data and Process Integrity in Enterprises
 
Square Pegs In Round Holes: Rethinking Data Availability in the Age of Automa...
Square Pegs In Round Holes: Rethinking Data Availability in the Age of Automa...Square Pegs In Round Holes: Rethinking Data Availability in the Age of Automa...
Square Pegs In Round Holes: Rethinking Data Availability in the Age of Automa...
 
Data Profiling: The First Step to Big Data Quality
Data Profiling: The First Step to Big Data QualityData Profiling: The First Step to Big Data Quality
Data Profiling: The First Step to Big Data Quality
 
Data-Ed: Unlock Business Value through Data Quality Engineering
Data-Ed: Unlock Business Value through Data Quality Engineering Data-Ed: Unlock Business Value through Data Quality Engineering
Data-Ed: Unlock Business Value through Data Quality Engineering
 
Data-Ed: Unlock Business Value through Data Quality Engineering
Data-Ed: Unlock Business Value through Data Quality EngineeringData-Ed: Unlock Business Value through Data Quality Engineering
Data-Ed: Unlock Business Value through Data Quality Engineering
 
From Data Platforms to Dataspaces: Enabling Data Ecosystems for Intelligent S...
From Data Platforms to Dataspaces: Enabling Data Ecosystems for Intelligent S...From Data Platforms to Dataspaces: Enabling Data Ecosystems for Intelligent S...
From Data Platforms to Dataspaces: Enabling Data Ecosystems for Intelligent S...
 
Data mining
Data miningData mining
Data mining
 
Data Science Introduction: Concepts, lifecycle, applications.pptx
Data Science Introduction: Concepts, lifecycle, applications.pptxData Science Introduction: Concepts, lifecycle, applications.pptx
Data Science Introduction: Concepts, lifecycle, applications.pptx
 
Zen and the Art of Datanauting
Zen and the Art of DatanautingZen and the Art of Datanauting
Zen and the Art of Datanauting
 
Azure Databricks for Data Scientists
Azure Databricks for Data ScientistsAzure Databricks for Data Scientists
Azure Databricks for Data Scientists
 
Analyst Keynote: Delivering Faster Insights with a Logical Data Fabric in a H...
Analyst Keynote: Delivering Faster Insights with a Logical Data Fabric in a H...Analyst Keynote: Delivering Faster Insights with a Logical Data Fabric in a H...
Analyst Keynote: Delivering Faster Insights with a Logical Data Fabric in a H...
 

Más de Institute of Contemporary Sciences

Building valuable (online and offline) Data Science communities - Experience ...
Building valuable (online and offline) Data Science communities - Experience ...Building valuable (online and offline) Data Science communities - Experience ...
Building valuable (online and offline) Data Science communities - Experience ...Institute of Contemporary Sciences
 
Data Science Master 4.0 on Belgrade University - Drazen Draskovic
Data Science Master 4.0 on Belgrade University - Drazen DraskovicData Science Master 4.0 on Belgrade University - Drazen Draskovic
Data Science Master 4.0 on Belgrade University - Drazen DraskovicInstitute of Contemporary Sciences
 
Deep learning fast and slow, a responsible and explainable AI framework - Ahm...
Deep learning fast and slow, a responsible and explainable AI framework - Ahm...Deep learning fast and slow, a responsible and explainable AI framework - Ahm...
Deep learning fast and slow, a responsible and explainable AI framework - Ahm...Institute of Contemporary Sciences
 
Solving churn challenge in Big Data environment - Jelena Pekez
Solving churn challenge in Big Data environment  - Jelena PekezSolving churn challenge in Big Data environment  - Jelena Pekez
Solving churn challenge in Big Data environment - Jelena PekezInstitute of Contemporary Sciences
 
Application of Business Intelligence in bank risk management - Dimitar Dilov
Application of Business Intelligence in bank risk management - Dimitar DilovApplication of Business Intelligence in bank risk management - Dimitar Dilov
Application of Business Intelligence in bank risk management - Dimitar DilovInstitute of Contemporary Sciences
 
Trends and practical applications of AI/ML in Fin Tech industry - Milos Kosan...
Trends and practical applications of AI/ML in Fin Tech industry - Milos Kosan...Trends and practical applications of AI/ML in Fin Tech industry - Milos Kosan...
Trends and practical applications of AI/ML in Fin Tech industry - Milos Kosan...Institute of Contemporary Sciences
 
Recommender systems for personalized financial advice from concept to product...
Recommender systems for personalized financial advice from concept to product...Recommender systems for personalized financial advice from concept to product...
Recommender systems for personalized financial advice from concept to product...Institute of Contemporary Sciences
 
Advanced tools in real time analytics and AI in customer support - Milan Sima...
Advanced tools in real time analytics and AI in customer support - Milan Sima...Advanced tools in real time analytics and AI in customer support - Milan Sima...
Advanced tools in real time analytics and AI in customer support - Milan Sima...Institute of Contemporary Sciences
 
Complex AI forecasting methods for investments portfolio optimization - Pawel...
Complex AI forecasting methods for investments portfolio optimization - Pawel...Complex AI forecasting methods for investments portfolio optimization - Pawel...
Complex AI forecasting methods for investments portfolio optimization - Pawel...Institute of Contemporary Sciences
 
Reality and traps of real time data engineering - Milos Solujic
Reality and traps of real time data engineering - Milos SolujicReality and traps of real time data engineering - Milos Solujic
Reality and traps of real time data engineering - Milos SolujicInstitute of Contemporary Sciences
 
Sensor networks for personalized health monitoring - Vladimir Brusic
Sensor networks for personalized health monitoring - Vladimir BrusicSensor networks for personalized health monitoring - Vladimir Brusic
Sensor networks for personalized health monitoring - Vladimir BrusicInstitute of Contemporary Sciences
 
Prediction of good patterns for future sales using image recognition
Prediction of good patterns for future sales using image recognitionPrediction of good patterns for future sales using image recognition
Prediction of good patterns for future sales using image recognitionInstitute of Contemporary Sciences
 
Using data to fight corruption: full budget transparency in local government
Using data to fight corruption: full budget transparency in local governmentUsing data to fight corruption: full budget transparency in local government
Using data to fight corruption: full budget transparency in local governmentInstitute of Contemporary Sciences
 

Conceptual framework for entity integration from multiple data sources - Drazen Orescanin

  • 1. Computer Science Conceptual Framework for entity integration from multiple data sources Dražen Oreščanin Data Science Conference 4.0 Belgrade, 18.9.2018
  • 3. Digital transformation • Growing data volumes • Increasing requirements related to data privacy
  • 4. What is entity resolution and integration? • Entity resolution is an operational intelligence process whereby organizations connect data from disparate sources in order to understand possible entity matches and non-obvious relationships across multiple data silos. It analyzes all of the information relating to individuals and/or entities from multiple sources of data, and then applies likelihood and probability scoring to determine which identities are a match and what, if any, non-obvious relationships exist between those identities. • Entity resolution is one element of a larger entity integration process that includes data acquisition, data profiling, data cleansing, schema alignment, data matching and data fusion
  • 5. Entity resolution and integration [Figure: two raw source records are processed through the steps Profile, Parse, Correct, Standardize, Match, Merge and Enhance, with the stages labeled Cleansing and Fusion. Source 1: "james smith, 1008 6th avenue suite 7, Manhattten, newyourk 10002". Source 2: "Jim J. Smyth, Manhattan, NY 10018, jsmyth@mywork.com, (212) 755-2551". Resulting record: First Name: Jim; Mid Name: J.; Last Name: Smyth; AddressL1: 1008 Avenues of the Americas; AddressL2: Suite 7; City: Manhattan; State: New York; Zip Code: 10018-5402; Longitude: 40.7325525; Latitude: -74.004970; Phone: (212) 755-2551; Email: jsmyth@mywork.com; C_Category: Affluent Couples & Families; C_Group: Affluent Families]
  • 6. OK, and what is the problem? • Data is stored in many different locations, sources and formats … • … and it is not easy to find all the data that describe the same entity … • … and it is even harder to resolve and consolidate that data into one golden record describing the entity … • … while respecting different data privacy regulations … • … and making sure that the consolidated data is regularly updated with new data from existing sources and from new sources
  • 7. Data Matching process • Generic process for matching data from two datasets / databases • Indexing and matching are based on rules implemented in algorithms • Basic process that is always used as the base for entity integration Source: Peter Christen, Data Matching - Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection, Springer, 2012.
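The two rule-based steps named on this slide, indexing (blocking) and matching, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and does not come from the talk: the sample records, blocking on an exact `zip` value, comparing names with the standard-library `difflib.SequenceMatcher`, and the 0.8 threshold.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical sample records; field names and values are illustrative only.
left = [
    {"id": "a1", "name": "james smith", "zip": "10018"},
    {"id": "a2", "name": "anna jones", "zip": "10002"},
]
right = [
    {"id": "b1", "name": "james smyth", "zip": "10018"},
    {"id": "b2", "name": "peter brown", "zip": "10018"},
    {"id": "b3", "name": "ana jones", "zip": "10002"},
]


def block_by(records, key):
    """Indexing step: group records by a blocking key."""
    blocks = defaultdict(list)
    for record in records:
        blocks[record[key]].append(record)
    return blocks


def candidate_pairs(left_records, right_records, key):
    """Yield only the pairs that share a blocking key value."""
    right_blocks = block_by(right_records, key)
    for l in left_records:
        for r in right_blocks.get(l[key], []):
            yield l, r


def similarity(a, b):
    """Matching step: string similarity in the range [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


THRESHOLD = 0.8  # assumed cut-off, would need tuning on labeled data

candidates = list(candidate_pairs(left, right, "zip"))
matches = [
    (l["id"], r["id"])
    for l, r in candidates
    if similarity(l["name"], r["name"]) >= THRESHOLD
]
print(len(candidates), matches)
```

Blocking reduces the comparisons here from the full 2 × 3 cross product to the pairs that share a zip code; the price is that true matches with differing blocking keys are never compared.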
  • 8. Requirements for data matching • Effectiveness: The main goal of entity matching is to achieve a high-quality match result with respect to recall and precision • Efficiency: Entity matching should be fast even for voluminous datasets • Genericity, offline/online matching • Low manual effort/self-tuning Source: H. Köpcke and E. Rahm, “Frameworks for entity matching: A comparison,” Data Knowl. Eng., vol. 69, no. 2, pp. 197–210, 2010.
  • 9. Data matching research challenges • Develop new and better algorithms for blocking and matching • Improve F1 score and processing performance • Many research teams are working on distinct algorithms / elements • There is no complete "big picture"
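For readers less familiar with the quality measures mentioned on the last two slides, the following sketch computes precision, recall and the F1 score (their harmonic mean) for a set of predicted matches. The record pairs are made-up illustrative data, and `match_quality` is a hypothetical helper name.

```python
def match_quality(predicted, actual):
    """Return (precision, recall, f1) for sets of matched record pairs."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative data: pairs of record ids that truly refer to the same entity,
# and the pairs that a matcher reported.
actual = {(1, 2), (3, 4), (5, 6), (7, 8)}
predicted = {(1, 2), (3, 4), (5, 9)}

precision, recall, f1 = match_quality(predicted, actual)
print(round(precision, 4), round(recall, 4), round(f1, 4))
```

Two of the three reported pairs are correct (precision 2/3) and two of the four true pairs were found (recall 1/2), which gives an F1 score of 4/7.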
  • 10. Data matching real life challenges • Matching of more than two datasets • Matching of datasets with undefined semantics • Schema alignment of datasets with different structure • Matching of real-time and streaming data • Incremental matching • Changes of matching rules • Performance of matching large datasets
  • 11. Human-in-the-loop (HITL) • Humans are involved in a cycle where they train, tune and test a particular algorithm • Entity resolution and data matching are HITL problems! • Human interaction is required for: • Data labeling • Algorithm selection • Algorithm tuning • Testing and validation of results
  • 12. Some questions about important stuff • How important is it to have a 1% increase in F1 score or a 5% improvement in the performance of a matching algorithm implemented in Python? • What will be the savings and improvements in the time needed to deliver a solution for the business issue? • Do business users know what they need to do, what the content of the data is and what algorithms they should use? • The problems that users face in development (precision) and production (performance) are different • Academic and real-life activities are in many cases not synchronized!
  • 13. Magellan project • Open source EM system developed by the team at University of Wisconsin, Madison • https://sites.google.com/site/anhaidgroup/projects/magellan • Magellan: Toward Building Entity Matching Management Systems (VLDB, 2016) • Human-in-the-Loop Challenges for Entity Matching: A Midterm Report (HILDA, 2017)
  • 14. Magellan findings and recommendations • End users need a step-by-step, end-to-end how-to guide • Tools for pain points (sampling, debugging…) should be developed • Tools in the loop should be combined with HITL activities • Develop a how-to guide for concrete, complex real-life scenarios Source: Doan, A. et al.: Human-in-the-Loop Challenges for Entity Matching: A Midterm Report, HILDA, 2017
  • 15. Frameworks for entity matching • There are several existing frameworks focused on matching two datasets (algorithms for blocking and matching) • Future work on frameworks should address the other important steps in the process, to create scenarios and tools for a complete solution that will enable business users to solve their problems quickly and efficiently, without needing to know programming and algorithms
  • 16. Scenarios are complex! • Search engines • Legacy and cloud migrations • … it is much more than matching two datasets in real life: • Schema alignment • Data cleansing and preparation • Order of resolution • Incremental processing • …
  • 17. Real life example A global pharmaceutical company with offices in more than 60 countries worldwide migrated customer data from various legacy systems in different countries to a new common CRM system in the cloud. The migration was phased by regions and countries, with new sources and data incrementally added and merged with data already migrated in previous phases. Challenges included: • Different source schemas – source datasets with different attributes • Different levels of data quality • Dataset sizes differing by several orders of magnitude • Different languages
  • 18. Order of resolution challenge • You have many datasets that must be resolved in initial processing, or new datasets for incremental processing • How do you determine the order of resolution, i.e. which datasets should be matched and resolved first to get the best score and performance?
  • 19. Simplified sample scenario • Defined target schema • Three initial source datasets with various attributes • Attributes common to more than one dataset are used for matching
      Attribute matrix (target attributes D1 to D10; the column positions of the marks were lost in extraction): S1 maps 7 attributes, S2 maps 6, S3 maps 3.
      Dataset | Attributes | Size
      S1      | 7          | 250
      S2      | 6          | 500
      S3      | 3          | 1,000
      M       | 10         | 1,750
  • 20. Important attributes of datasets • A source dataset that has more attributes is a better candidate to be matched earlier in the process - completeness • When matching two datasets, the more common attributes we can match in both datasets, the higher the probability of getting higher scores - overlap • Datasets can have different numbers of records. If we match a large dataset many times, that will require more computing power. For optimal processing, smaller datasets should be matched first - inverse size • We need to take into account the quality and completeness of data in each source dataset; datasets with higher quality should be matched first - accuracy
  • 21. Source dataset completeness The source dataset completeness factor CSi of dataset Si is the ratio ni / m of the number of attributes in source schema Si that can be mapped to attributes in the unified schema to the number of attributes in the unified schema M. CSi = |Si| / |M| Source dataset completeness factors: • CS1 = 7 / 10 = 0.7 • CS2 = 6 / 10 = 0.6 • CS3 = 3 / 10 = 0.3
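The completeness factors of the sample scenario can be reproduced with a few lines of Python. This is a minimal sketch: the attribute counts are the ones given on the slide, while the function and variable names are hypothetical.

```python
# Number of attributes each source maps into the unified schema M
# (values from the sample scenario).
mapped_attributes = {"S1": 7, "S2": 6, "S3": 3}
unified_schema_size = 10  # |M|


def completeness(mapped, unified):
    """Completeness factor: mapped attributes divided by unified schema size."""
    if unified <= 0:
        raise ValueError("unified schema must have at least one attribute")
    return mapped / unified


completeness_factors = {
    name: completeness(count, unified_schema_size)
    for name, count in mapped_attributes.items()
}
print(completeness_factors)
```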
  • 22. Source dataset overlap factor The source dataset overlap factor OSij between two source datasets is the number of attributes common to both source datasets divided by the number of attributes in the dataset with more attributes. OSij = |Si ∩ Sj| / max(|Si|, |Sj|) Source dataset overlap factors: • OS12 = 4 / 7 = 0.5714 • OS23 = 2 / 6 = 0.3333 • OS13 = 0 / 7 = 0
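The overlap factor maps naturally onto set intersection. In the sketch below, the assignment of attributes D1 to D10 to each source is hypothetical: it was chosen only so that the attribute counts and pairwise overlaps agree with the numbers on the slides, and it is not the talk's actual attribute matrix.

```python
# Hypothetical attribute assignment, chosen to reproduce the slide's counts
# (7, 6 and 3 attributes; pairwise overlaps of 4, 2 and 0).
S1 = {"D1", "D2", "D3", "D4", "D5", "D6", "D7"}
S2 = {"D4", "D5", "D6", "D7", "D8", "D9"}
S3 = {"D8", "D9", "D10"}


def overlap(a, b):
    """Overlap factor: shared attributes over the size of the larger schema."""
    return len(a & b) / max(len(a), len(b))


o12 = overlap(S1, S2)
o23 = overlap(S2, S3)
o13 = overlap(S1, S3)
print(round(o12, 4), round(o23, 4), round(o13, 4))
```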
  • 23. Source dataset inverse size factor The source dataset inverse size factor ISSi of dataset Si is the inverse of the number of records in Si relative to the sum of the number of records in all source datasets, so that smaller datasets are scored higher. The merge coefficient µ corrects for the number of assumed duplicates in the merging process. $IS_{S_i} = 1 - \mathrm{len}(S_i) / \left(\sum_{j=1}^{n} \mathrm{len}(S_j) \times \mu\right)$ Source dataset inverse size factors (µ = 1): • ISS1 = 1 - (250 / 1750) = 0.8571 • ISS2 = 1 - (500 / 1750) = 0.7143 • ISS3 = 1 - (1000 / 1750) = 0.4286
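A short sketch of the inverse size factor follows, using the record counts from the sample scenario. It assumes that the merge coefficient µ scales the total record count, which is one reading of the formula; with µ = 1, as on the slide, the reading makes no difference. The function name is hypothetical.

```python
# Record counts from the sample scenario.
sizes = {"S1": 250, "S2": 500, "S3": 1000}


def inverse_size(sizes, name, mu=1.0):
    """Inverse size factor: 1 minus the dataset's share of all records.

    Assumption: mu scales the total record count.
    """
    total = sum(sizes.values()) * mu
    return 1 - sizes[name] / total


inverse_size_factors = {name: round(inverse_size(sizes, name), 4) for name in sizes}
print(inverse_size_factors)
```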
  • 24. Source dataset accuracy factor The source dataset accuracy factor ASi ∈ (0..1) is a correction factor used to weight the quality of data in a source dataset, with 1 representing data of the highest possible quality. The dataset accuracy factor can optionally be used to weight the completeness, overlap and inverse size of source datasets to establish the optimal order of resolution. Accuracy can be retrieved from the team responsible for the data source, or it can be calculated from source profiling scores.
  • 25. Order of resolution algorithm • Based on the defined dataset attributes, the algorithm finds the optimal order of resolution • Each pair of resolved datasets is removed from the set of datasets and the resulting dataset is added • The order of resolution is agnostic to the blocking and matching algorithms used for entity resolution
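The slides do not say how the factors are combined into a single score, so the following is only a greedy sketch under stated assumptions, not the author's algorithm. A pair's score is assumed to be its overlap multiplied by the mean inverse size of the two datasets; the merged dataset is assumed to carry the union of attributes and the sum of record counts (no duplicates removed). The attribute assignment is hypothetical, chosen to match the counts of the sample scenario.

```python
from itertools import combinations


def overlap(a, b):
    """Overlap factor: shared attributes over the size of the larger schema."""
    return len(a & b) / max(len(a), len(b))


def resolution_order(datasets):
    """Greedy order of resolution.

    datasets maps a name to (attribute set, record count). The scoring rule
    (overlap times mean inverse size) is an assumption for illustration.
    """
    pool = dict(datasets)
    order = []
    while len(pool) > 1:
        total = sum(size for _, size in pool.values())

        def score(pair):
            attrs_a, size_a = pool[pair[0]]
            attrs_b, size_b = pool[pair[1]]
            mean_inverse_size = 1 - (size_a + size_b) / (2 * total)
            return overlap(attrs_a, attrs_b) * mean_inverse_size

        # Pick the best-scoring pair; sorting the names keeps ties deterministic.
        a, b = max(combinations(sorted(pool), 2), key=score)
        attrs_a, size_a = pool.pop(a)
        attrs_b, size_b = pool.pop(b)
        # The two resolved datasets are replaced by the resulting dataset.
        pool[f"{a}+{b}"] = (attrs_a | attrs_b, size_a + size_b)
        order.append((a, b))
    return order


# Hypothetical attribute assignment matching the sample scenario's counts.
datasets = {
    "S1": ({"D1", "D2", "D3", "D4", "D5", "D6", "D7"}, 250),
    "S2": ({"D4", "D5", "D6", "D7", "D8", "D9"}, 500),
    "S3": ({"D8", "D9", "D10"}, 1000),
}

order = resolution_order(datasets)
print(order)
```

Under these assumptions the small, strongly overlapping datasets S1 and S2 are resolved first, and the large dataset S3 is matched only once, against their merged result. An accuracy factor could be added as a further multiplier in `score`.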
  • 26. Work in progress • Testing and experiments • Publication • Integration with other processes (schema discovery, data cleansing, performance optimization, incremental resolution)
  • 27. Conclusions • Entity resolution is a widely used and complex process • Academic and real-life challenges are diverse • A lot of work should be invested in building HITL tools and scenarios that will make real-life problems easier and faster to solve