Introduction   Mapping method: classification based on instance similarity   Experiments and results   Summary




Learning Concept Mappings from Instance Similarity

Shenghui Wang¹, Gwenn Englebienne², Stefan Schlobach¹
¹ Vrije Universiteit Amsterdam
² Universiteit van Amsterdam

ISWC 2008
Outline

1. Introduction
     Thesaurus mapping
     Instance-based techniques
2. Mapping method: classification based on instance similarity
     Representing concepts and the similarity between them
     Classification based on instance similarity
3. Experiments and results
     Experiment setup
     Results
4. Summary
Thesaurus mapping


SemanTic Interoperability To access Cultural Heritage (STITCH)
through mappings between thesauri
    e.g. “plankzeilen” vs. “surfsport”
    e.g. “griep” vs. “influenza”
Scope of the problem:
    Big thesauri with tens of thousands of concepts
    Huge collections (e.g., KB: 80 km of books in one collection)
    Heterogeneous (e.g., books, manuscripts, illustrations, etc.)
    Multi-lingual problem
Instance-based techniques: common instance based
Pros and cons




Advantages
    Simple to implement
    Interesting results
Disadvantages
    Requires a sufficient number of common instances
    Uses only part of the available information
Instance-based techniques: Instance similarity based
Representing concepts and the similarity between them

[Figure: each concept is represented by its annotated instances. The text in
each metadata field (Creator, Title, Publisher, ...) is aggregated into a
bag-of-words term vector per concept (the concept features). For a pair of
concepts, the cosine distance between the corresponding field vectors yields
one similarity value per field: the pair features f1, f2, f3, ...]
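The pipeline in the figure can be sketched as follows. The field names and toy records below are illustrative assumptions, not data from the paper, and cosine similarity stands in for the figure's "Cos. dist.":

```python
from collections import Counter
import math

def bag_of_words(texts):
    """Aggregate all text of one metadata field into a term-count vector."""
    bag = Counter()
    for text in texts:
        bag.update(text.lower().split())
    return bag

def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def pair_features(instances1, instances2, fields):
    """One similarity feature per metadata field: f1, f2, f3, ..."""
    features = []
    for field in fields:
        bag1 = bag_of_words(r.get(field, "") for r in instances1)
        bag2 = bag_of_words(r.get(field, "") for r in instances2)
        features.append(cosine_similarity(bag1, bag2))
    return features

# Toy instances annotated with two candidate concepts (hypothetical data)
concept1 = [{"creator": "jansen", "title": "zeilen op zee"}]
concept2 = [{"creator": "jansen", "title": "surfsport en zeilen"}]
print(pair_features(concept1, concept2, ["creator", "title"]))
```

The resulting feature vector is the point in "similarity space" that the classifier on the next slides operates on.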
Classification based on instance similarity

Each pair of concepts is treated as a point in a “similarity space”:
    its position is defined by the features of the pair;
    the features of the pair are the different measures of similarity
    between the concepts’ instances.
Hypothesis: the label of a point (i.e., whether the pair is a positive or
a negative mapping) is correlated with the point’s position in this space.
Given already labelled points, a new point can then be classified, i.e.,
assigned the correct label, based on its location as determined by the
actual similarity values of the concepts involved.
The classifier used: Markov Random Field

Let T = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N} be the training set, where
    x^{(i)} \in \mathbb{R}^{K} are the features and
    y^{(i)} \in Y = \{\text{positive}, \text{negative}\} is the label.
The conditional probability of a label given the input is modelled as

    p(y^{(i)} \mid x^{(i)}, \theta) = \frac{1}{Z(x^{(i)}, \theta)}
        \exp\Big( \sum_{j=1}^{K} \lambda_j \phi_j(y^{(i)}, x^{(i)}) \Big),    (1)

where \theta = \{\lambda_j\}_{j=1}^{K} are the weights associated with the
feature functions \phi_j, and Z(x^{(i)}, \theta) is a normalisation constant.
The classifier used: Markov Random Field (cont’d)

The likelihood of the data set for given model parameters, p(T \mid \theta),
is given by:

    p(T \mid \theta) = \prod_{i=1}^{N} p(y^{(i)} \mid x^{(i)})    (2)

During learning, our objective is to find the most likely values of \theta
for the given training data.
The decision criterion for assigning a label y^{(i)} to a new pair of
concepts i is then simply:

    y^{(i)} = \arg\max_{y} \, p(y \mid x^{(i)})    (3)
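For the binary case, Equations (1)–(3) can be sketched as a small log-linear (maximum-entropy) classifier trained by gradient ascent on the log-likelihood. The choice of feature functions, the learning rate, and the toy data below are illustrative assumptions, not the authors' implementation:

```python
import math

def p_positive(x, weights):
    """p(y = positive | x, theta) for a log-linear model with feature
    functions phi_j(y, x) = x_j * [y == positive]; the two-label
    normalisation Z of Eq. (1) reduces to the logistic function."""
    score = sum(w * xj for w, xj in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-score))

def train(X, y, epochs=2000, lr=0.5):
    """Gradient ascent on the log-likelihood of Eq. (2)."""
    weights = [0.0] * len(X[0])
    for _ in range(epochs):
        for x, label in zip(X, y):
            err = label - p_positive(x, weights)
            weights = [w + lr * err * xj for w, xj in zip(weights, x)]
    return weights

def classify(x, weights):
    """Eq. (3): pick the label with the higher conditional probability."""
    return 1 if p_positive(x, weights) >= 0.5 else 0

# Toy pair features (the last component is a constant bias feature);
# high similarity values correspond to positive mappings.
X = [[0.9, 0.8, 1.0], [0.8, 0.9, 1.0], [0.1, 0.2, 1.0], [0.2, 0.1, 1.0]]
y = [1, 1, 0, 0]
w = train(X, y)
print([classify(x, w) for x in X])  # [1, 1, 0, 0]
```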
Experiment setup



Two cases:
    mapping GTT and Brinkman, both used in the Koninklijke Bibliotheek
    (KB) — homogeneous collections
    mapping GTT/Brinkman and GTAA, used in Beeld en Geluid (BG)
    — heterogeneous collections
Evaluation:
    Measure: misclassification (error) rate
    10-fold cross-validation
    Testing on special data sets
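The evaluation loop above can be sketched as follows. The nearest-centroid classifier and the toy data are stand-ins for illustration, not the paper's MRF classifier or datasets:

```python
def cross_validation_error(X, y, train_fn, predict_fn, k=10):
    """Mean misclassification rate over k folds
    (assumes len(X) is divisible by k, for simplicity)."""
    fold = len(X) // k
    rates = []
    for f in range(k):
        lo, hi = f * fold, (f + 1) * fold
        model = train_fn(X[:lo] + X[hi:], y[:lo] + y[hi:])
        wrong = sum(predict_fn(model, x) != t
                    for x, t in zip(X[lo:hi], y[lo:hi]))
        rates.append(wrong / fold)
    return sum(rates) / k

# Stand-in nearest-centroid classifier on a single feature
def train_centroid(X, y):
    pos = [x[0] for x, t in zip(X, y) if t == 1]
    neg = [x[0] for x, t in zip(X, y) if t == 0]
    return sum(pos) / len(pos), sum(neg) / len(neg)

def predict_centroid(model, x):
    mu_pos, mu_neg = model
    return 1 if abs(x[0] - mu_pos) < abs(x[0] - mu_neg) else 0

# Two well-separated clusters of similarity values (made-up data)
X = [[i / 100] for i in range(10)] + [[0.9 + i / 100] for i in range(10)]
y = [0] * 10 + [1] * 10
print(cross_validation_error(X, y, train_centroid, predict_centroid))  # 0.0
```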
Results

Experiment I: Feature-similarity based mapping versus
existing methods

Does the feature-similarity of instances bring significant benefits to
extensional mapping, compared to existing methods?

                            Mapping method                                    Error rate
                                  Falcon                                       0.28895
                                   Slex                                  0.42620 ± 0.049685
                                  Sjacc80                                0.44643 ± 0.059524
                                   Sbag                                  0.57380 ± 0.049685
               {f1 , . . . f28 } (our new approach)                     0.20491 ± 0.026158

          Table: Comparison between existing methods and similarities-based
          mapping, in KB case
Experiment II: Extending to corpora without joint instances

          Can our approach be applied to corpora for which there are no
          doubly annotated instances, i.e., for which there are no joint
          instances?
         Collections                        Testing set                 Error rate
       Joint instances                    gold standard           0.20491 ± 0.026158
    (original KB corpus)                   lexical only                 0.137871
     No joint instances                   gold standard           0.28378 ± 0.026265
 (double instances removed)                lexical only                 0.161867

          Table: Comparison between classifiers using joint and disjoint instances,
          in KB case

Experiment III: Extending to heterogeneous collections



Can our approach be applied to corpora in which instances are described in a
heterogeneous way?
    Feature selection:
        exhaustive combination: compute the similarity between all
        possible pairs of fields
            requires more training data to avoid over-fitting
        manual selection of corresponding metadata field pairs
        mutual information to select the most informative field pairs
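The mutual-information option above can be sketched as ranking candidate field pairs by the mutual information between their (binarised) similarity values and the mapping labels. The field names and values below are made-up toy data:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """I(X;Y) in bits between two discrete sequences of equal length."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

# Binarised field-pair similarities vs. gold mapping labels (toy data)
labels = [1, 1, 1, 0, 0, 0]
candidates = {
    ("kb:title", "bg:subject"): [1, 1, 1, 0, 0, 0],  # tracks the labels
    ("kb:Date", "bg:subject"):  [1, 0, 1, 0, 1, 0],  # nearly uninformative
}
ranked = sorted(candidates,
                key=lambda pair: mutual_information(candidates[pair], labels),
                reverse=True)
print(ranked[0])  # ('kb:title', 'bg:subject')
```

The top-ranked field pairs are then the ones whose similarity features are fed to the classifier.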
Feature selection
          Can we maintain high mapping quality when features are selected
          (semi)-automatically?


                 Thesaurus                     Feature selection                     Error rate
                                               manual selection                 0.11290 ± 0.025217
           GTAA vs. Brinkman                  mutual information               0.09355 ± 0.044204
                                                  exhaustive                    0.10323 ± 0.031533
                                               manual selection                 0.10000 ± 0.050413
               GTAA vs. GTT                   mutual information               0.07826 ± 0.044904
                                                  exhaustive                    0.11304 ± 0.046738


Table: Comparison of performance with different feature-selection methods,
using the non-lexical dataset
Training set


manually built gold standard (751)
               lexical seeding
               background seeding

                             Thesauri                          lexical         non-lexical
                           GTAA vs. GTT                         2720              116
                         GTAA vs. Brinkman                      1372              323

                  Table: Numbers of positive examples in the training sets
Bias in the training sets

               Thesauri       Training set              Test set                    Error rate
                              non-lexical              non-lexical             0.09355 ± 0.044204
               GTAA vs.          lexical               non-lexical                   0.11501
               Brinkman       non-lexical                lexical                     0.07124
                                 lexical                 lexical               0.04871 ± 0.029911
                              non-lexical              non-lexical             0.07826 ± 0.044904
               GTAA vs.          lexical               non-lexical                  0.098712
                 GTT          non-lexical                lexical                    0.088603
                                 lexical                 lexical               0.06195 ± 0.008038

Table: Comparison using different datasets (features selected using mutual
information)
Positive-negative ratios in the training sets


[Figure: (a) testing on 1:1 data; (b) testing on 1:1000 data. Each panel
plots the error rate, F-measure_pos, Precision_pos and Recall_pos against
the ratio between positive and negative examples in the training set
(0–1000).]
          Figure: The influence of positive-negative ratios — Brinkman vs. GTAA
Positive-negative ratios in the training sets (cont’d)




          In practice, the training data should be chosen so as to contain a
          representative ratio of positive and negative examples, while still
          providing enough material for the classifier to have good
          predictive capacity on both types of examples.
Experiment IV: Qualitative use of feature weights
The learned weight λj reflects the importance of the feature fj in
determining the similarity (mappings) between concepts.

                        KB fields                           BG fields
                        kb:title                           bg:subject
                        kb:abstract                        bg:subject
                        kb:annotation                      bg:LOCATIES
                        kb:annotation                      bg:SUBSIDIE
                        kb:creator                         bg:contributor
                        kb:creator                         bg:PERSOONSNAMEN
                        kb:Date                            bg:OPNAMEDATUM
                        kb:dateCopyrighted                 bg:date
                        kb:description                     bg:subject
                        kb:publisher                       bg:NAMEN
                        kb:temporal                        bg:date
Summary

We use a machine learning method that automatically exploits the similarity
between instances to determine mappings between concepts from different
thesauri/ontologies.
    Enables mappings between thesauri used for very heterogeneous collections
    Does not require dually annotated instances
    Not limited by the language barrier
    A contribution to the field of metadata mapping

Future work:
    More heterogeneous collections
    Smarter measures of similarity between instance metadata
    More similarity dimensions between concepts, e.g., lexical and structural
       Thank you
