SlideShare una empresa de Scribd logo
1 de 28
Mustafa Jarrar
Lecture Notes, Web Data Management (MCOM7348)
University of Birzeit, Palestine
1st Semester, 2013

Data Schema Integration

Dr. Mustafa Jarrar
University of Birzeit
mjarrar@birzeit.edu
www.jarrar.info
Jarrar © 2013

1
Watch this lecture and download the slides from
http://jarrar-courses.blogspot.com/2013/11/web-data-management.html

Jarrar © 2013

2
Data Schema Integration: A simple example
In ORM:
bornIn/

locatedIn/

/WorksIn

Employee

City

Region

Integrated
schema

locatedIn/

Organization

/WorksIn

Employee

Organization

Municipality
locatedIn/

bornIn/
Worker

City
Schema 2

Region
Organization

locatedIn
/

Schema 3

Schema 1
Jarrar © 2013

3
Data Schema Integration: A simple example
Source: Carlo Batini

In ER:
Employee

born

City

in

works
Organization

Region

Integrated
schema

in

Employee
works
Organization

Empoloyee

born

City

in

Schema 2

Schema 1

Region

Municipality

Organization

in

Schema 3
Jarrar © 2013

4
Challenges of Data Schema Integration
Source: Carlo Batini

Schema Integration has two major challenges:
1.  Identification of all portions of schemas that pertain to the
same concept, in such a way to unify such different
representations in the global schema.
2.  Identification, analysis and resolution of the different
types of conflicts (heterogeneities) in different schemas.

Jarrar © 2013

5
Framework for Schema Integration
Local
Schemas

Schemas
Transformation

Transformation
Rules

Schemas
Matching

Matching
Rules

Schemas
Integration

Integration
Rules

Integrated Schema
and mappings

Source: Advances in Object-Oriented Data Modeling, M. P. Papazoglou, S. Spaccapietra, Z. Tari (Eds.), The MIT Press, 2000

Jarrar © 2013

6
Framework for Schema Integration
0. Define the integration strategy
If the number of local schemas to be integrated is large, the order of
schema integration becomes important. Several strategies can be
adopted.
Input: n source schemas
Output: n source schemas + integration strategies
Method used: heuristics
S1

S2

S3

S1

S2

S3 S4

One shot strategy
- Efficient integration
process

S3 S4

S2
IS1

IS1
IS

S1

IS2

IS2
IS

…

IS
Pair at a time strategy
- Priority to most relevant
and stable schemas.

- The integration process is
- Many correspondences
more efficient
between concepts have to
be considered together.
Jarrar © 2013

Balanced Strategy
-e.g.: Production, Marketing,
Sales.
-To be preferred when the
cohesion among schemas is high.
7
Framework for Schema Integration
Source: Stefano Spaccapietra

1. Schema transformation (or Pre-integration)
Input: n source schemas
Output: n source schemas homogeneized
Methods used: Model and Design Homogeneization

Reduce model heterogeneities as much as possible to make
the sources more suitable for integration.
Goal: use a single, common data model and format.
Transformation

Source DBs

Integration

Homogeneized DBs
Jarrar © 2013

DW
8
Schema Transformation
Schema Transformation involves:
•  Data model homogeneization
–  Where all data sources are described using the same data model.

•  Design homogeneization
–  Enforce standard design rules to reduce the number of structural
conflicts (e.g., Normalization: one fact in one place)

•  Reverse Engineering
–  Reverse engineer the schema from existing data (such as COBOL
files, spreadsheets, legacy relational databases, legacy objectoriented databases).

Jarrar © 2013

9
Example of Design homogeneization (Normalization)

ONE TABLE:
R1 (#Student, Name, LastName, #Course, CourseName,
Grade, Date)
Dependencies:
–  #Student à Name, LastName
–  #Course à CourseName
–  #Student #Course à Grade, Date)

Normalized Into 3 Tables: One Fact In One Place:
R11 (#Student, Name, LastName)
R12 (#Course, CourseName)
R13 (#Student, #Course, Grade, Date)

Jarrar © 2013

10
Example of Reverse Engineering
Source: Stefano Spaccapietra

Jarrar © 2013

11
Schema Matching
2. Schema matching (Correspondences investigation)
Input: n source schemas
Output: n source schemas + correspondences
Method used: techniques to discover correspondences

Correspondences relate (schema) elements which describe
the same phenomena of the real world.
–  This step aims at finding and describing all semantic links between
elements of the input schemas and the corresponding data.
–  By doing so, one matches between the schemas to be integrated.
–  This step fixes the conflicts found in the schema.

Jarrar © 2013

12
Semantics of Correspondences
Source: Stefano Spaccapietra

Correspondences relate (schema) elements which describe
the same phenomena of the real world.

Jarrar © 2013

13
Asserting Correspondences
Source: Stefano Spaccapietra

Finding matching correspondences is done through the use of a
rich language for expressing correspondences (matchings).

Example:

S1.Person ≡ S2.Person,
With Corresponding Identifiers: Pin,
With Corresponding Property: name
Jarrar © 2013

14
Automated Matching
•  Fully automated matching is impossible, as a computer process can
hardly make ultimate decisions about the semantics of data.
•  But even partial assistance in discovering of correspondences (to be
confirmed or guided by humans) is beneficial, due to the complexity of
the task.
•  All proposed methods rely on some similarity measures that try to
evaluate the semantic distance between two descriptions.
Some state of the art matching systems
Cupid (Microsoft Research, USA)
FOAM/QOM (University of Karlsruhe, Germany)
OLA (INRIA Rhône-Alpes, France / University of Montreal, Canada)
S-Match (University of Trento, Italy)
… many others
Jarrar © 2013

15
Examples of Correspondences
Source: Stefano Spaccapietra

Jarrar © 2013

16
Examples of Correspondences
Previous example
Employee
/WorksIn

Municipality

locatedIn/
Organization

Organization

Schema 1

Worker

Schema 3

bornIn/

City

locatedIn/

Region

Schema 2
Jarrar © 2013

17
Examples of Correspondences
Source: Stefano Spaccapietra

Jarrar © 2013

18
Schema Integration & Mapping Generation
Source: Carlo Batini

3. Schemas integration and mapping generation
Input: n source schemas + correspondences
Output: integrated schema + mapping rules btw the integrated
schema and input source schemas
Method used: New classification of conflicts + Conflict resolution
transformations
GOAL: Creating an Integrated Schema ( IS ) and the mappings to the
local databases.

Jarrar © 2013

19
GAV and LAV Integration
Research has identified two methods to set up mappings between the
integrated schema and the input schemas:
(1)  GAV (Global As View): proposes to define the integrated schema
as a view over input schemas.
GAV is usually considered simpler and more efficient for processing
queries on the integrated database, but is weaker in supporting
evolution of the global system through addition of new sources.
(2)  LAV (Local As View): proposes to define the local schemas as
views over the integrated schema.
LAV generates issues of incomplete information, which adds
complexity in handling global queries, but it better supports dynamic
addition and removal of source.
Jarrar © 2013

20
Integration Process
After we identified the correspondences (in the previous
step), we now solve the conflicts:
One can distinguish between four types of conflicts:
–  Structural conflicts
–  Classification conflicts
–  Descriptive conflicts
–  Fragmentation conflicts

Examples of conflicts among related object types
–  different classifications (sets of instances)
–  different sets of properties
–  different structures
–  different coding schemes
–  …
Jarrar © 2013

21
Integration Rules
Rules defining the strategy to solve conflicts
Example rules:
–  If an class corresponds to an attribute, keep the class
–  If the population of a class is included in the population of another
class, build an is-a hierarchy

Integration rules depend on how you want the integrated
schema to look like

Jarrar © 2013

22
Structural Conflicts
Source: Stefano Spaccapietra

Different schema element types, e.g.: class, attribute, relationship

Library example:
–  S1 : Book is a class
–  S2 : books is an attribute of Author

S1

Conflict resolution :
Choose the less constraining structure
S2
–  Integrated Schema: Book is a class

Jarrar © 2013

23
Classification Conflicts
•  Corresponding elements describe different sets of real world objects
–  S1.Faculty CONTAINS S2.PhD-advisor

•  Conflict Resolution:
–  Generalization / Specialization hierarchy

S1

Faculty

Faculty

S2

Phd-advisor

Phd-advisor

–  Merging

Faculty
Jarrar © 2013

24
Descriptive Conflicts
Corresponding types have different properties, or corresponding
properties are described in different ways
Object / Entity / Relationship type:
–  Naming conflicts :
•  synonyms Node , Extremity
•  homonyms Highway (EU) , Highway (USA)

–  Composition conflicts : different attributes and methods
•  Employee ( E# , name , address )
•  Employee ( E# , position , salary , department )

Jarrar © 2013

25
Integration Methods: Manual
Source: Stefano Spaccapietra

First method: manual integration
“ do it yourself ”
a language
mapping
rules
integrated
schema

schemas

DBA

Easy to implement , Flexible
BUT
time consuming for the DBA
Jarrar © 2013

26
Integration Methods: Semi-Automatic
Second method : semi-automatic integration
“ tell me about the problem, I will try to fix it “

correspondences

mapping rules
TOOL
integrated
schema

schemas

DBA
Opens to visual CASE tools, integration servers
BUT knowledge acquisition can be painful
Jarrar © 2013

27
References and Acknowledgement
•  Carlo Batini: Course on Data Integration. BZU IT Summer School
2011.
•  Stefano Spaccapietra: Information Integration. Presentation at the IFIP
Academy. Porto Alegre. 2005.
•  Chris Bizer: The Emerging Web of Linked Data. Presentation at SRI
International, Artificial Intelligence Center. Menlo Park, USA. 2009.

Thanks to Anton Deik for helping me preparing this lecture

Jarrar © 2013

28

Más contenido relacionado

La actualidad más candente

Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream Processing
Guido Schmutz
 
Fundamentals of Database ppt ch02
Fundamentals of Database ppt ch02Fundamentals of Database ppt ch02
Fundamentals of Database ppt ch02
Jotham Gadot
 

La actualidad más candente (20)

Data quality and data profiling
Data quality and data profilingData quality and data profiling
Data quality and data profiling
 
Application of Clustering in Data Science using Real-life Examples
Application of Clustering in Data Science using Real-life Examples Application of Clustering in Data Science using Real-life Examples
Application of Clustering in Data Science using Real-life Examples
 
chapter 6 data visualization ppt.pptx
chapter 6 data visualization ppt.pptxchapter 6 data visualization ppt.pptx
chapter 6 data visualization ppt.pptx
 
Applied Data Science Part 3: Getting dirty; data preparation and feature crea...
Applied Data Science Part 3: Getting dirty; data preparation and feature crea...Applied Data Science Part 3: Getting dirty; data preparation and feature crea...
Applied Data Science Part 3: Getting dirty; data preparation and feature crea...
 
Data Mesh for Dinner
Data Mesh for DinnerData Mesh for Dinner
Data Mesh for Dinner
 
Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream Processing
 
Best Practices with the DMM
Best Practices with the DMMBest Practices with the DMM
Best Practices with the DMM
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Introduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OKIntroduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OK
 
Machine learning life cycle
Machine learning life cycleMachine learning life cycle
Machine learning life cycle
 
Introduction to Graph Databases
Introduction to Graph DatabasesIntroduction to Graph Databases
Introduction to Graph Databases
 
The Data Science Process
The Data Science ProcessThe Data Science Process
The Data Science Process
 
Data Architecture Strategies: Data Architecture for Digital Transformation
Data Architecture Strategies: Data Architecture for Digital TransformationData Architecture Strategies: Data Architecture for Digital Transformation
Data Architecture Strategies: Data Architecture for Digital Transformation
 
NoSQL Databases: Why, what and when
NoSQL Databases: Why, what and whenNoSQL Databases: Why, what and when
NoSQL Databases: Why, what and when
 
Fundamentals of Database ppt ch02
Fundamentals of Database ppt ch02Fundamentals of Database ppt ch02
Fundamentals of Database ppt ch02
 
Mdm: why, when, how
Mdm: why, when, howMdm: why, when, how
Mdm: why, when, how
 
Overcoming the Challenges of your Master Data Management Journey
Overcoming the Challenges of your Master Data Management JourneyOvercoming the Challenges of your Master Data Management Journey
Overcoming the Challenges of your Master Data Management Journey
 
Data Quality With or Without Apache Spark and Its Ecosystem
Data Quality With or Without Apache Spark and Its EcosystemData Quality With or Without Apache Spark and Its Ecosystem
Data Quality With or Without Apache Spark and Its Ecosystem
 
data modeling and models
data modeling and modelsdata modeling and models
data modeling and models
 
Data mining concepts and work
Data mining concepts and workData mining concepts and work
Data mining concepts and work
 

Destacado

Pal gov.tutorial2.session13 1.data schema integration
Pal gov.tutorial2.session13 1.data schema integrationPal gov.tutorial2.session13 1.data schema integration
Pal gov.tutorial2.session13 1.data schema integration
Mustafa Jarrar
 
Pal gov.tutorial2.session15 1.linkeddata
Pal gov.tutorial2.session15 1.linkeddataPal gov.tutorial2.session15 1.linkeddata
Pal gov.tutorial2.session15 1.linkeddata
Mustafa Jarrar
 
Jarrar: Sparql Project
Jarrar: Sparql ProjectJarrar: Sparql Project
Jarrar: Sparql Project
Mustafa Jarrar
 
Jarrar: Web 2 Data Mashups
Jarrar: Web 2 Data MashupsJarrar: Web 2 Data Mashups
Jarrar: Web 2 Data Mashups
Mustafa Jarrar
 
Jarrar: Data Integration and Fusion using RDF
Jarrar: Data Integration and Fusion using RDFJarrar: Data Integration and Fusion using RDF
Jarrar: Data Integration and Fusion using RDF
Mustafa Jarrar
 
Jarrar: The Next Generation of the Web 3.0: The Semantic Web Vesion
Jarrar: The Next Generation of the Web 3.0: The Semantic Web VesionJarrar: The Next Generation of the Web 3.0: The Semantic Web Vesion
Jarrar: The Next Generation of the Web 3.0: The Semantic Web Vesion
Mustafa Jarrar
 

Destacado (20)

Introduction to ETL and Data Integration
Introduction to ETL and Data IntegrationIntroduction to ETL and Data Integration
Introduction to ETL and Data Integration
 
Data Integration (ETL)
Data Integration (ETL)Data Integration (ETL)
Data Integration (ETL)
 
Pay-as-you-go Reconciliation in Schema Matching Networks
Pay-as-you-go Reconciliation in Schema Matching NetworksPay-as-you-go Reconciliation in Schema Matching Networks
Pay-as-you-go Reconciliation in Schema Matching Networks
 
On Leveraging Crowdsourcing Techniques for Schema Matching Networks
On Leveraging Crowdsourcing Techniques for Schema Matching NetworksOn Leveraging Crowdsourcing Techniques for Schema Matching Networks
On Leveraging Crowdsourcing Techniques for Schema Matching Networks
 
Pal gov.tutorial2.session13 1.data schema integration
Pal gov.tutorial2.session13 1.data schema integrationPal gov.tutorial2.session13 1.data schema integration
Pal gov.tutorial2.session13 1.data schema integration
 
Pal gov.tutorial2.session15 1.linkeddata
Pal gov.tutorial2.session15 1.linkeddataPal gov.tutorial2.session15 1.linkeddata
Pal gov.tutorial2.session15 1.linkeddata
 
Jarrar: Zinnar
Jarrar: ZinnarJarrar: Zinnar
Jarrar: Zinnar
 
Jarrar: SPARQL - RDF Query Language
Jarrar: SPARQL - RDF Query LanguageJarrar: SPARQL - RDF Query Language
Jarrar: SPARQL - RDF Query Language
 
Jarrar: Sparql Project
Jarrar: Sparql ProjectJarrar: Sparql Project
Jarrar: Sparql Project
 
Jarrar: Web 2 Data Mashups
Jarrar: Web 2 Data MashupsJarrar: Web 2 Data Mashups
Jarrar: Web 2 Data Mashups
 
Jarrar: Subtype Relations and Constraints
Jarrar: Subtype Relations and ConstraintsJarrar: Subtype Relations and Constraints
Jarrar: Subtype Relations and Constraints
 
Jarrar: Data Integration and Fusion using RDF
Jarrar: Data Integration and Fusion using RDFJarrar: Data Integration and Fusion using RDF
Jarrar: Data Integration and Fusion using RDF
 
Jarrar: Architectural Solutions in Data Integration
Jarrar: Architectural Solutions in Data IntegrationJarrar: Architectural Solutions in Data Integration
Jarrar: Architectural Solutions in Data Integration
 
Jarrar: Linked Data
Jarrar: Linked DataJarrar: Linked Data
Jarrar: Linked Data
 
Jarrar: Knowledge Engineering- Course Outline
Jarrar: Knowledge Engineering- Course OutlineJarrar: Knowledge Engineering- Course Outline
Jarrar: Knowledge Engineering- Course Outline
 
Jarrar: Introduction to Data Integration
Jarrar: Introduction to Data IntegrationJarrar: Introduction to Data Integration
Jarrar: Introduction to Data Integration
 
Jarrar: RDF Stores: Challenges and Solutions
Jarrar: RDF Stores: Challenges and SolutionsJarrar: RDF Stores: Challenges and Solutions
Jarrar: RDF Stores: Challenges and Solutions
 
Jarrar: RDFs -RDF Schema
Jarrar: RDFs -RDF SchemaJarrar: RDFs -RDF Schema
Jarrar: RDFs -RDF Schema
 
Jarrar: Data Fusion using RDF
Jarrar: Data Fusion using RDFJarrar: Data Fusion using RDF
Jarrar: Data Fusion using RDF
 
Jarrar: The Next Generation of the Web 3.0: The Semantic Web Vesion
Jarrar: The Next Generation of the Web 3.0: The Semantic Web VesionJarrar: The Next Generation of the Web 3.0: The Semantic Web Vesion
Jarrar: The Next Generation of the Web 3.0: The Semantic Web Vesion
 

Similar a Jarrar: Data Schema Integration

Mapping objects to_relational_databases
Mapping objects to_relational_databasesMapping objects to_relational_databases
Mapping objects to_relational_databases
Ivan Paredes
 
Improved Presentation and Facade Layer Operations for Software Engineering Pr...
Improved Presentation and Facade Layer Operations for Software Engineering Pr...Improved Presentation and Facade Layer Operations for Software Engineering Pr...
Improved Presentation and Facade Layer Operations for Software Engineering Pr...
Dr. Amarjeet Singh
 
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)
ijceronline
 
Object relationship mapping and hibernate
Object relationship mapping and hibernateObject relationship mapping and hibernate
Object relationship mapping and hibernate
Joe Jacob
 

Similar a Jarrar: Data Schema Integration (20)

Jarrar: Data Schema Integration
Jarrar: Data Schema Integration Jarrar: Data Schema Integration
Jarrar: Data Schema Integration
 
Data Modeling.docx
Data Modeling.docxData Modeling.docx
Data Modeling.docx
 
Data models
Data modelsData models
Data models
 
The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)
 
Mapping objects to_relational_databases
Mapping objects to_relational_databasesMapping objects to_relational_databases
Mapping objects to_relational_databases
 
Database Management System, Lecture-1
Database Management System, Lecture-1Database Management System, Lecture-1
Database Management System, Lecture-1
 
Unified modeling language
Unified modeling languageUnified modeling language
Unified modeling language
 
Comparison of Relational Database and Object Oriented Database
Comparison of Relational Database and Object Oriented DatabaseComparison of Relational Database and Object Oriented Database
Comparison of Relational Database and Object Oriented Database
 
Improved Presentation and Facade Layer Operations for Software Engineering Pr...
Improved Presentation and Facade Layer Operations for Software Engineering Pr...Improved Presentation and Facade Layer Operations for Software Engineering Pr...
Improved Presentation and Facade Layer Operations for Software Engineering Pr...
 
OOM Unit I - III.pdf
OOM Unit I - III.pdfOOM Unit I - III.pdf
OOM Unit I - III.pdf
 
ITB - UNIT 3.pdf
ITB - UNIT 3.pdfITB - UNIT 3.pdf
ITB - UNIT 3.pdf
 
DBMS-Unit-1.pptx
DBMS-Unit-1.pptxDBMS-Unit-1.pptx
DBMS-Unit-1.pptx
 
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)
 
MAPPING COMMON ERRORS IN ENTITY RELATIONSHIP DIAGRAM DESIGN OF NOVICE DESIGNERS
MAPPING COMMON ERRORS IN ENTITY RELATIONSHIP DIAGRAM DESIGN OF NOVICE DESIGNERSMAPPING COMMON ERRORS IN ENTITY RELATIONSHIP DIAGRAM DESIGN OF NOVICE DESIGNERS
MAPPING COMMON ERRORS IN ENTITY RELATIONSHIP DIAGRAM DESIGN OF NOVICE DESIGNERS
 
Object oriented analysis and design unit- iv
Object oriented analysis and design unit- ivObject oriented analysis and design unit- iv
Object oriented analysis and design unit- iv
 
Object relationship mapping and hibernate
Object relationship mapping and hibernateObject relationship mapping and hibernate
Object relationship mapping and hibernate
 
ONTOLOGICAL TREE GENERATION FOR ENHANCED INFORMATION RETRIEVAL
ONTOLOGICAL TREE GENERATION FOR ENHANCED INFORMATION RETRIEVALONTOLOGICAL TREE GENERATION FOR ENHANCED INFORMATION RETRIEVAL
ONTOLOGICAL TREE GENERATION FOR ENHANCED INFORMATION RETRIEVAL
 
chapter 2-DATABASE SYSTEM CONCEPTS AND architecture [Autosaved].pdf
chapter 2-DATABASE SYSTEM CONCEPTS AND architecture [Autosaved].pdfchapter 2-DATABASE SYSTEM CONCEPTS AND architecture [Autosaved].pdf
chapter 2-DATABASE SYSTEM CONCEPTS AND architecture [Autosaved].pdf
 
Chapter – 2 Data Models.pdf
Chapter – 2 Data Models.pdfChapter – 2 Data Models.pdf
Chapter – 2 Data Models.pdf
 
DBMS-2.pptx
DBMS-2.pptxDBMS-2.pptx
DBMS-2.pptx
 

Último

Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
PECB
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
heathfieldcps1
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
QucHHunhnh
 

Último (20)

microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdf
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptx
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..
 
psychiatric nursing HISTORY COLLECTION .docx
psychiatric  nursing HISTORY  COLLECTION  .docxpsychiatric  nursing HISTORY  COLLECTION  .docx
psychiatric nursing HISTORY COLLECTION .docx
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 

Jarrar: Data Schema Integration

  • 1. Mustafa Jarrar Lecture Notes, Web Data Management (MCOM7348) University of Birzeit, Palestine 1st Semester, 2013 Data Schema Integration Dr. Mustafa Jarrar University of Birzeit mjarrar@birzeit.edu www.jarrar.info Jarrar © 2013 1
  • 2. Watch this lecture and download the slides from http://jarrar-courses.blogspot.com/2013/11/web-data-management.html Jarrar © 2013 2
  • 3. Data Schema Integration: A simple example In ORM: bornIn/ locatedIn/ /WorksIn Employee City Region Integrated schema locatedIn/ Organization /WorksIn Employee Organization Municipality locatedIn/ bornIn/ Worker City Schema 2 Region Organization locatedIn / Schema 3 Schema 1 Jarrar © 2013 3
  • 4. Data Schema Integration: A simple example Source: Carlo Batini In ER: Employee born City in works Organization Region Integrated schema in Employee works Organization Empoloyee born City in Schema 2 Schema 1 Region Municipality Organization in Schema 3 Jarrar © 2013 4
  • 5. Challenges of Data Schema Integration Source: Carlo Batini Schema Integration has two major challenges: 1.  Identification of all portions of schemas that pertain to the same concept, in such a way to unify such different representations in the global schema. 2.  Identification, analysis and resolution of the different types of conflicts (heterogeneities) in different schemas. Jarrar © 2013 5
  • 6. Framework for Schema Integration Local Schemas Schemas Transformation Transformation Rules Schemas Matching Matching Rules Schemas Integration Integration Rules Integrated Schema and mappings Source: Advances in Object-Oriented Data Modeling, M. P. Papazoglou, S. Spaccapietra, Z. Tari (Eds.), The MIT Press, 2000 Jarrar © 2013 6
  • 7. Framework for Schema Integration 0. Define the integration strategy If the number of local schemas to be integrated is large, the order of schema integration becomes important. Several strategies can be adopted. Input: n source schemas Output: n source schemas + integration strategies Method used: heuristics S1 S2 S3 S1 S2 S3 S4 One shot strategy - Efficient integration process S3 S4 S2 IS1 IS1 IS S1 IS2 IS2 IS … IS Pair at a time strategy - Priority to most relevant and stable schemas. - The integration process is - Many correspondences more efficient between concepts have to be considered together. Jarrar © 2013 Balanced Strategy -e.g.: Production, Marketing, Sales. -To be preferred when the cohesion among schemas is high. 7
  • 8. Framework for Schema Integration Source: Stefano Spaccapietra 1. Schema transformation (or Pre-integration) Input: n source schemas Output: n source schemas homogeneized Methods used: Model and Design Homogeneization Reduce model heterogeneities as much as possible to make the sources more suitable for integration. Goal: use a single, common data model and format. Transformation Source DBs Integration Homogeneized DBs Jarrar © 2013 DW 8
  • 9. Schema Transformation Schema Transformation involves: •  Data model homogeneization –  Where all data sources are described using the same data model. •  Design homogeneization –  Enforce standard design rules to reduce the number of structural conflicts (e.g., Normalization: one fact in one place) •  Reverse Engineering –  Reverse engineer the schema from existing data (such as COBOL files, spreadsheets, legacy relational databases, legacy objectoriented databases). Jarrar © 2013 9
  • 10. Example of Design homogeneization (Normalization) ONE TABLE: R1 (#Student, Name, LastName, #Course, CourseName, Grade, Date) Dependencies: –  #Student à Name, LastName –  #Course à CourseName –  #Student #Course à Grade, Date) Normalized Into 3 Tables: One Fact In One Place: R11 (#Student, Name, LastName) R12 (#Course, CourseName) R13 (#Student, #Course, Grade, Date) Jarrar © 2013 10
  • 11. Example of Reverse Engineering Source: Stefano Spaccapietra Jarrar © 2013 11
  • 12. Schema Matching 2. Schema matching (Correspondences investigation) Input: n source schemas Output: n source schemas + correspondences Method used: techniques to discover correspondences Correspondences relate (schema) elements which describe the same phenomena of the real world. –  This step aims at finding and describing all semantic links between elements of the input schemas and the corresponding data. –  By doing so, one matches between the schemas to be integrated. –  This step fixes the conflicts found in the schema. Jarrar © 2013 12
  • 13. Semantics of Correspondences Source: Stefano Spaccapietra Correspondences relate (schema) elements which describe the same phenomena of the real world. Jarrar © 2013 13
  • 14. Asserting Correspondences Source: Stefano Spaccapietra Finding matching correspondences is done through the use of a rich language for expressing correspondences (matchings). Example: S1.Person ≡ S2.Person, With Corresponding Identifiers: Pin, With Corresponding Property: name Jarrar © 2013 14
  • 15. Automated Matching •  Fully automated matching is impossible, as a computer process can hardly make ultimate decisions about the semantics of data. •  But even partial assistance in discovering of correspondences (to be confirmed or guided by humans) is beneficial, due to the complexity of the task. •  All proposed methods rely on some similarity measures that try to evaluate the semantic distance between two descriptions. Some state of the art matching systems Cupid (Microsoft Research, USA) FOAM/QOM (University of Karlsruhe, Germany) OLA (INRIA Rhône-Alpes, France / University of Montreal, Canada) S-Match (University of Trento, Italy) … many others Jarrar © 2013 15
  • 16. Examples of Correspondences Source: Stefano Spaccapietra Jarrar © 2013 16
  • 17. Examples of Correspondences Previous example Employee /WorksIn Municipality locatedIn/ Organization Organization Schema 1 Worker Schema 3 bornIn/ City locatedIn/ Region Schema 2 Jarrar © 2013 17
  • 18. Examples of Correspondences Source: Stefano Spaccapietra Jarrar © 2013 18
  • 19. Schema Integration & Mapping Generation Source: Carlo Batini 3. Schemas integration and mapping generation Input: n source schemas + correspondences Output: integrated schema + mapping rules btw the integrated schema and input source schemas Method used: New classification of conflicts + Conflict resolution transformations GOAL: Creating an Integrated Schema ( IS ) and the mappings to the local databases. Jarrar © 2013 19
  • 20. GAV and LAV Integration Research has identified two methods to set up mappings between the integrated schema and the input schemas: (1)  GAV (Global As View): proposes to define the integrated schema as a view over input schemas. GAV is usually considered simpler and more efficient for processing queries on the integrated database, but is weaker in supporting evolution of the global system through addition of new sources. (2)  LAV (Local As View): proposes to define the local schemas as views over the integrated schema. LAV generates issues of incomplete information, which adds complexity in handling global queries, but it better supports dynamic addition and removal of source. Jarrar © 2013 20
  • 21. Integration Process After we identified the correspondences (in the previous step), we now solve the conflicts: One can distinguish between four types of conflicts: –  Structural conflicts –  Classification conflicts –  Descriptive conflicts –  Fragmentation conflicts Examples of conflicts among related object types –  different classifications (sets of instances) –  different sets of properties –  different structures –  different coding schemes –  … Jarrar © 2013 21
  • 22. Integration Rules Rules defining the strategy to solve conflicts Example rules: –  If an class corresponds to an attribute, keep the class –  If the population of a class is included in the population of another class, build an is-a hierarchy Integration rules depend on how you want the integrated schema to look like Jarrar © 2013 22
  • 23. Structural Conflicts Source: Stefano Spaccapietra Different schema element types, e.g.: class, attribute, relationship Library example: –  S1 : Book is a class –  S2 : books is an attribute of Author S1 Conflict resolution : Choose the less constraining structure S2 –  Integrated Schema: Book is a class Jarrar © 2013 23
  • 24. Classification Conflicts •  Corresponding elements describe different sets of real world objects –  S1.Faculty CONTAINS S2.PhD-advisor •  Conflict Resolution: –  Generalization / Specialization hierarchy S1 Faculty Faculty S2 Phd-advisor Phd-advisor –  Merging Faculty Jarrar © 2013 24
  • 25. Descriptive Conflicts Corresponding types have different properties, or corresponding properties are described in different ways Object / Entity / Relationship type: –  Naming conflicts : •  synonyms Node , Extremity •  homonyms Highway (EU) , Highway (USA) –  Composition conflicts : different attributes and methods •  Employee ( E# , name , address ) •  Employee ( E# , position , salary , department ) Jarrar © 2013 25
  • 26. Integration Methods: Manual Source: Stefano Spaccapietra First method: manual integration “ do it yourself ” a language mapping rules integrated schema schemas DBA Easy to implement , Flexible BUT time consuming for the DBA Jarrar © 2013 26
  • 27. Integration Methods: Semi-Automatic Second method : semi-automatic integration “ tell me about the problem, I will try to fix it “ correspondences mapping rules TOOL integrated schema schemas DBA Opens to visual CASE tools, integration servers BUT knowledge acquisition can be painful Jarrar © 2013 27
  • 28. References and Acknowledgement •  Carlo Batini: Course on Data Integration. BZU IT Summer School 2011. •  Stefano Spaccapietra: Information Integration. Presentation at the IFIP Academy. Porto Alegre. 2005. •  Chris Bizer: The Emerging Web of Linked Data. Presentation at SRI International, Artificial Intelligence Center. Menlo Park, USA. 2009. Thanks to Anton Deik for helping me preparing this lecture Jarrar © 2013 28