SERIMI: Class-based
 Disambiguation for Effective
 Instance Matching over
 Heterogeneous Web Data

Samur Araujo, DucThanh Tran, Arjen de Vries,
Jan Hidders, Daniel Schwabe

Delft University of Technology
WebDB 2012

Me                                                      You




Apple
Me




You




?
                                                       You

Ambiguous

Me




Me




Me




You




My Apple                           Your Apple
Spherical Shape                    Round Shape
Red Color                          Green Color
Eatable                            Eatable




My Apple                           Your Apple
Shape                              Shape
Color                              Color
Eatable                            Eatable




My Apple                           Your Apple
Spherical Shape                    Round Shape
Red Color          Fruit           Green Color
Eatable                            Eatable




My Apple                                   Your Apple



Instance Matching




Source                                                   Target

“Instance matching uses a
direct comparison paradigm”.




Is your Apple like my Apple?
Hmm... Maybe!

Source                                                   Target
Homogeneous data and schema.




The source and target
descriptions overlap.

                                                     Source   Target



Syntactic Overlap



Population = TotalPopulation




Semantic Overlap



Population = Num_Inhabitants




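The difference between syntactic and semantic overlap can be illustrated with a plain string-similarity check. This is a minimal sketch, assuming Python's standard `difflib` as the similarity measure (an illustrative choice, not something the slides prescribe); the helper name `name_similarity` is hypothetical.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] between two predicate names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Syntactic overlap: the names share most of their characters.
syntactic = name_similarity("Population", "TotalPopulation")

# Semantic overlap: same meaning, but the names barely share characters,
# so a purely string-based comparison does not detect the correspondence.
semantic = name_similarity("Population", "Num_Inhabitants")

print(f"Population vs TotalPopulation: {syntactic:.2f}")
print(f"Population vs Num_Inhabitants: {semantic:.2f}")
```

The first pair scores high and the second low, which is why semantic overlap needs more than name comparison to be recognised.
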
Web of Data: heterogeneous
data and schema




No or limited overlap
between schemas

                                  Source                 Target



Instances do not properly
instantiate the schema.

                                  Source                 Target



Apple



Nutritional Information                Botanical Information




“Direct comparison paradigm
does not apply”.


                                    Source                 Target



Apple
Me




Apple
                              Orange
Me
                              Pineapple


Apple

Me    Orange
                                                                You

      Pineapple
You



Food
Eatable
                                                   Food
Source




My Apple                                    Your Apple


 My Orange                                   Your Orange


My Pineapple                                Your Pineapple

“We use a class-based
disambiguation paradigm …”

“… when there is no overlap
between schemas.”

Instance Matching with SERIMI




Source                                                   Target

Instance Representation

  Instance    Predicate    Value




Instance Representation

  Instance    Predicate    Value
  Apple1      shape        Round
  Apple1      title        Apple
  Apple1      color        Red
  Apple1      category     Eatable




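The instance/predicate/value representation can be written down directly as a list of triples. This is a minimal sketch using the Apple1 rows from the slide; the `Triple` type and the `group_by_instance` helper are hypothetical names introduced only for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    instance: str
    predicate: str
    value: str


# The Apple1 example from the slide, one triple per row.
triples = [
    Triple("Apple1", "shape", "Round"),
    Triple("Apple1", "title", "Apple"),
    Triple("Apple1", "color", "Red"),
    Triple("Apple1", "category", "Eatable"),
]


def group_by_instance(rows):
    """Collect the predicate/value pairs of each instance."""
    grouped = defaultdict(dict)
    for t in rows:
        # A set per predicate keeps multi-valued predicates intact.
        grouped[t.instance].setdefault(t.predicate, set()).add(t.value)
    return dict(grouped)


instances = group_by_instance(triples)
print(instances["Apple1"]["color"])
```
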
Instance Representation




      [P(hi), D(hi), O(hi), T(hi)]



Step 1: Cluster the source

Source clusters: Cars, Companies, Fruits
Step 2: Blocking Key Selection

Source instances  ->  Key Selection  ->  key, key, key (e.g. Title)

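The slides do not state how the blocking key is chosen, only that a predicate such as Title comes out. The sketch below therefore uses an assumed heuristic (prefer predicates that most instances have and whose values differ between instances); it is an illustration of the idea, not SERIMI's actual criterion, and `select_blocking_key` is a hypothetical name.

```python
def select_blocking_key(instances: dict) -> str:
    """Pick the predicate that is present on most instances and whose
    values best tell the instances apart (assumed heuristic)."""
    predicates = {p for props in instances.values() for p in props}

    def score(predicate: str) -> float:
        values = [props[predicate] for props in instances.values() if predicate in props]
        coverage = len(values) / len(instances)          # share of instances having it
        distinctness = len(set(values)) / len(values)    # share of distinct values
        return coverage * distinctness

    # Sort the names first so ties are broken deterministically.
    return max(sorted(predicates), key=score)


# Hypothetical source instances in the spirit of the fruit example.
source = {
    "Apple1": {"title": "Apple", "color": "Red", "category": "Eatable"},
    "Orange1": {"title": "Orange", "color": "Orange", "category": "Eatable"},
    "Pineapple1": {"title": "Pineapple", "category": "Eatable"},
}

key = select_blocking_key(source)
print(key)
```

Here `title` wins because every instance has it and every value is different, while `category` is shared by all instances and `color` is missing on one of them.
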
Step 3: Pseudo-Homonyms Builder

Input:   key values of the source instances (Title=apple, Title=orange, Title=pineapple) and the Target
Output:  pseudo-homonym sets (e.g. everything in the Target called Apple)

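Building the pseudo-homonym sets amounts to a label lookup in the target for every source key value. This is a minimal sketch, assuming a simple case-insensitive token match; the target identifiers and labels are made up for illustration, and `build_pseudo_homonym_sets` is a hypothetical name.

```python
def build_pseudo_homonym_sets(source_keys: dict, target: dict) -> dict:
    """Map each source instance to the target instances whose label
    contains the source key value as a whitespace-separated token."""
    sets = {}
    for source_id, key_value in source_keys.items():
        needle = key_value.lower()
        sets[source_id] = sorted(
            target_id
            for target_id, label in target.items()
            if needle in label.lower().split()
        )
    return sets


# Hypothetical key values (Step 2) and target labels.
source_keys = {"Apple1": "apple", "Orange1": "orange", "Pineapple1": "pineapple"}
target = {
    "t:Apple_(fruit)": "Apple",
    "t:Apple_Inc": "Apple Inc.",
    "t:Orange_(fruit)": "Orange",
    "t:Orange_SA": "Orange S.A.",
    "t:Pineapple": "Pineapple",
}

homonyms = build_pseudo_homonym_sets(source_keys, target)
for source_id, candidates in homonyms.items():
    print(source_id, candidates)
```

The set for Apple1 contains both the fruit and the company, which is exactly the ambiguity that Step 4 has to resolve.
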
Step 4: Class-based disambiguation

Input:      Target, pseudo-homonym sets
Component:  Class-based Disambiguator

Step 4: Class-based disambiguation

Pseudo-homonym sets (one per source instance):

  H1     H2     H3
  h11    h21    h31
  h12    h22    h32
  h13           h33
  h14

  [P(hi11), D(hi11), O(hi11), T(hi11)]

Step 4: Class-based disambiguation

Pseudo-homonym sets and the score of each candidate:

  H1            H2            H3
  h11  0.98     h21  0.95     h31  0.94
  h12  0.32     h22  0.53     h32  0.91
  h13  0.32                   h33  0.87
  h14  0.76

Step 4: Class-based disambiguation

Candidate h11 receives a score of 0.98:

  H1            H2     H3
  h11  0.98     h21    h31
  h12           h22    h32
  h13                  h33
  h14

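The intuition behind class-based disambiguation is that the correct candidates for source instances of the same class should resemble each other. The sketch below is a simplified stand-in, not SERIMI's scoring function: it assumes each candidate is described by a set of predicate names and scores a candidate by its best Jaccard similarity to each of the other pseudo-homonym sets, averaged over those sets. All identifiers and feature values are hypothetical.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two feature sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def class_based_scores(homonym_sets: list, features: dict) -> dict:
    """Score every candidate by how well it agrees with the candidates
    found for the other source instances (simplified assumption)."""
    scores = {}
    for i, candidates in enumerate(homonym_sets):
        # Ignore the candidate's own set and any empty set.
        others = [s for j, s in enumerate(homonym_sets) if j != i and s]
        for h in candidates:
            best = [max(jaccard(features[h], features[o]) for o in other) for other in others]
            scores[h] = sum(best) / len(best) if best else 0.0
    return scores


# Hypothetical candidates, each described by the predicates it uses.
features = {
    "apple_fruit": {"label", "family", "genus"},
    "apple_inc": {"label", "founder", "revenue"},
    "orange_fruit": {"label", "family", "genus"},
    "orange_sa": {"label", "revenue", "ceo"},
    "pineapple": {"label", "family", "genus", "origin"},
}
homonym_sets = [
    ["apple_fruit", "apple_inc"],
    ["orange_fruit", "orange_sa"],
    ["pineapple"],
]

scores = class_based_scores(homonym_sets, features)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

In this toy data the two companies resemble each other, but neither resembles the only candidate for the pineapple, so the fruit candidates end up with the higher scores. No predicate of the source schema is compared with a predicate of the target schema at any point.
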
Step 4: Class-based disambiguation

  H1      H2      H3
  0.98    0.95    0.94

Selection: TOP-K or Threshold

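The two selection strategies can be tried on the scores shown in the slides. This is a minimal sketch; the function names `top_k` and `above_threshold` and the threshold value 0.90 used in the example are assumptions made for illustration.

```python
# Candidate scores per pseudo-homonym set, as shown in the slides.
scores = {
    "H1": {"h11": 0.98, "h12": 0.32, "h13": 0.32, "h14": 0.76},
    "H2": {"h21": 0.95, "h22": 0.53},
    "H3": {"h31": 0.94, "h32": 0.91, "h33": 0.87},
}


def top_k(candidates: dict, k: int) -> list:
    """Keep the k best-scoring candidates of one pseudo-homonym set."""
    ranked = sorted(candidates, key=candidates.get, reverse=True)
    return ranked[:k]


def above_threshold(candidates: dict, delta: float) -> list:
    """Keep every candidate whose score is at least delta."""
    return [h for h, s in candidates.items() if s >= delta]


top1 = {name: top_k(cands, 1) for name, cands in scores.items()}
kept = {name: above_threshold(cands, 0.90) for name, cands in scores.items()}

print(top1)
print(kept)
```

Top-1 always returns exactly one match per source instance, whereas a threshold can return several (h31 and h32 for H3 here) or none at all.
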
Experiment

• Ontology Alignment Evaluation Initiative (OAEI 2010)

• Collections: the life science (LS) collection (DBpedia, Sider,
  Drugbank, LinkedCT, Dailymed, TCM, and Diseasome) and the
  Person-Restaurant (PR) collection

• 20 gigabytes of data, millions of triples

• We compared SERIMI to ObjectCoref and RiMOM

• Metrics: precision, recall and F1

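For reference, the three metrics can be computed from a set of predicted matches and a reference alignment as follows. This is a minimal sketch with made-up (source, target) pairs; it does not reproduce any numbers from the experiment.

```python
def precision_recall_f1(predicted: set, reference: set) -> tuple:
    """Precision, recall and F1 of predicted matches against a reference."""
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical (source, target) pairs, for illustration only.
predicted = {("s1", "t1"), ("s2", "t2"), ("s3", "t9")}
reference = {("s1", "t1"), ("s2", "t2"), ("s3", "t3"), ("s4", "t4")}

precision, recall, f1 = precision_recall_f1(predicted, reference)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```
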
Results




Results for Top-K

[Figure: F1 (0.00 to 1.00) per dataset pair (Sider-Daily., Sider-Drug., Drug.-Sider, P11-P12) for Top-1, Top-2, Top-5 and Top-10]

Results for δ threshold

[Figure: F1 (0.00 to 1.00) per dataset pair (Sider-Daily., Sider-Drug., Drug.-Sider, P11-P12) for δ >= δm, δ = 1.0, δ >= 0.95, δ >= 0.90 and δ >= 0.85]

Conclusion

• SERIMI is a complementary approach to direct-match based
  instance matching tools.

• SERIMI is recommended for heterogeneous data where there is
  no overlap between schemas.

• It is recommended for multi-class disambiguation.




THANK YOU!

• Samur Araujo
s.f.cardosodearaujo@tudelft.nl



Más contenido relacionado

Destacado

Keyword Search on Structured Data using Relevance Models
Keyword Search on Structured Data using Relevance ModelsKeyword Search on Structured Data using Relevance Models
Keyword Search on Structured Data using Relevance ModelsThanh Tran
 
Гастро-тур в Италию
Гастро-тур в ИталиюГастро-тур в Италию
Гастро-тур в ИталиюEasyWays
 
Linked Data Query Processing Strategies
Linked Data Query Processing StrategiesLinked Data Query Processing Strategies
Linked Data Query Processing StrategiesThanh Tran
 
Index Structures and Top-k Joins for Native Keyword Search Databases
Index Structures and Top-k Joins for Native Keyword Search DatabasesIndex Structures and Top-k Joins for Native Keyword Search Databases
Index Structures and Top-k Joins for Native Keyword Search DatabasesThanh Tran
 
Semantic Web Search - Searching Documents and Semantic Data on the Web
Semantic Web Search - Searching Documents and Semantic Data on the WebSemantic Web Search - Searching Documents and Semantic Data on the Web
Semantic Web Search - Searching Documents and Semantic Data on the WebThanh Tran
 
KOIOS: Utilizing Semantic Search for Easy-Access and Visualization of Structu...
KOIOS: Utilizing Semantic Search for Easy-Access and Visualization of Structu...KOIOS: Utilizing Semantic Search for Easy-Access and Visualization of Structu...
KOIOS: Utilizing Semantic Search for Easy-Access and Visualization of Structu...Thanh Tran
 
Recent Trends in Semantic Search Technologies
Recent Trends in Semantic Search TechnologiesRecent Trends in Semantic Search Technologies
Recent Trends in Semantic Search TechnologiesThanh Tran
 
Query Processing Using Structure Index for RDF Data on the Web
Query Processing Using Structure Index for RDF Data on the WebQuery Processing Using Structure Index for RDF Data on the Web
Query Processing Using Structure Index for RDF Data on the WebThanh Tran
 
поляризация диэлектриков
поляризация диэлектриковполяризация диэлектриков
поляризация диэлектриковAndronovaAnna
 
Lifecycle support in architectures for ontology-based information systems - iswc
Lifecycle support in architectures for ontology-based information systems - iswcLifecycle support in architectures for ontology-based information systems - iswc
Lifecycle support in architectures for ontology-based information systems - iswcThanh Tran
 

Destacado (10)

Keyword Search on Structured Data using Relevance Models
Keyword Search on Structured Data using Relevance ModelsKeyword Search on Structured Data using Relevance Models
Keyword Search on Structured Data using Relevance Models
 
Гастро-тур в Италию
Гастро-тур в ИталиюГастро-тур в Италию
Гастро-тур в Италию
 
Linked Data Query Processing Strategies
Linked Data Query Processing StrategiesLinked Data Query Processing Strategies
Linked Data Query Processing Strategies
 
Index Structures and Top-k Joins for Native Keyword Search Databases
Index Structures and Top-k Joins for Native Keyword Search DatabasesIndex Structures and Top-k Joins for Native Keyword Search Databases
Index Structures and Top-k Joins for Native Keyword Search Databases
 
Semantic Web Search - Searching Documents and Semantic Data on the Web
Semantic Web Search - Searching Documents and Semantic Data on the WebSemantic Web Search - Searching Documents and Semantic Data on the Web
Semantic Web Search - Searching Documents and Semantic Data on the Web
 
KOIOS: Utilizing Semantic Search for Easy-Access and Visualization of Structu...
KOIOS: Utilizing Semantic Search for Easy-Access and Visualization of Structu...KOIOS: Utilizing Semantic Search for Easy-Access and Visualization of Structu...
KOIOS: Utilizing Semantic Search for Easy-Access and Visualization of Structu...
 
Recent Trends in Semantic Search Technologies
Recent Trends in Semantic Search TechnologiesRecent Trends in Semantic Search Technologies
Recent Trends in Semantic Search Technologies
 
Query Processing Using Structure Index for RDF Data on the Web
Query Processing Using Structure Index for RDF Data on the WebQuery Processing Using Structure Index for RDF Data on the Web
Query Processing Using Structure Index for RDF Data on the Web
 
поляризация диэлектриков
поляризация диэлектриковполяризация диэлектриков
поляризация диэлектриков
 
Lifecycle support in architectures for ontology-based information systems - iswc
Lifecycle support in architectures for ontology-based information systems - iswcLifecycle support in architectures for ontology-based information systems - iswc
Lifecycle support in architectures for ontology-based information systems - iswc
 

SERIMI: Class-based Disambiguation for Effective Instance Matching over Heterogeneous Web Data

  • 1. SERIMI: Class-based Disambiguation for Effective Instance Matching over Heterogeneous Web Data Samur Araujo, DucThanh Tran, Arjen de Vries, Jan Hidders, Daniel Schwabe Delft University of Technology WebDB 2012 Delft University of Technology
  • 2. Me You SERIMI: Class-based Disambiguation for Effective Instance Matching over Heterogeneous Web Data 2
  • 3. Apple Me SERIMI: Class-based Disambiguation for Effective Instance Matching over Heterogeneous Web Data 3
  • 4. You SERIMI: Class-based Disambiguation for Effective Instance Matching over Heterogeneous Web Data 4
  • 5. ? You Ambiguous SERIMI: Class-based Disambiguation for Effective Instance Matching over Heterogeneous Web Data 5
  • 6. Me SERIMI: Class-based Disambiguation for Effective Instance Matching over Heterogeneous Web Data 6
  • 7. Me SERIMI: Class-based Disambiguation for Effective Instance Matching over Heterogeneous Web Data 7
  • 8. Me SERIMI: Class-based Disambiguation for Effective Instance Matching over Heterogeneous Web Data 8
  • 9. You SERIMI: Class-based Disambiguation for Effective Instance Matching over Heterogeneous Web Data 9
  • 10. My Apple Your Apple Spherical Shape Round Shape Red Color Green Color Eatable Eatable SERIMI: Class-based Disambiguation for Effective Instance Matching over Heterogeneous Web Data 10
  • 11. My Apple Your Apple Shape Shape Color Color Eatable Eatable SERIMI: Class-based Disambiguation for Effective Instance Matching over Heterogeneous Web Data 11
  • 12. My Apple Your Apple Spherical Shape Round Shape Red Color Fruit Green Color Eatable Eatable SERIMI: Class-based Disambiguation for Effective Instance Matching over Heterogeneous Web Data 12
  • 13. My Apple Your Apple SERIMI: Class-based Disambiguation for Effective Instance Matching over Heterogeneous Web Data 13
  • 14. Instance Matching Source Target SERIMI: Class-based Disambiguation for Effective Instance Matching over Heterogeneous Web Data 14
  • 15. “Instance matching uses a direct comparison paradigm”. SERIMI: Class-based Disambiguation for Effective Instance Matching over Heterogeneous Web Data 15
  • 16. SERIMI: Class-based Disambiguation for Effective Instance Matching over Heterogeneous Web Data 16
  • 17. Is your Apple like my Apple? Source Humm.. Maybe! Target SERIMI: Class-based Disambiguation for Effective Instance Matching over Heterogeneous Web Data 17
  • 18. Homogenous data and schema. SERIMI: Class-based Disambiguation for Effective Instance Matching over Heterogeneous Web Data 18
  • 19. The source and target descriptions overlap. Source Target SERIMI: Class-based Disambiguation for Effective Instance Matching over Heterogeneous Web Data 19
  • 20. Syntactic Overlap Population = TotalPopulation SERIMI: Class-based Disambiguation for Effective Instance Matching over Heterogeneous Web Data 20
  • 21. Semantic Overlap Population = Num_Inhabitants SERIMI: Class-based Disambiguation for Effective Instance Matching over Heterogeneous Web Data 21
  • 22. Web of Data: heterogeneous data and schema SERIMI: Class-based Disambiguation for Effective Instance Matching over Heterogeneous Web Data 22
  • 23. None or limited overlap between schemas Source Target SERIMI: Class-based Disambiguation for Effective Instance Matching over Heterogeneous Web Data 23
  • 24. Instances do not instantiate the schema, properly. Source Target SERIMI: Class-based Disambiguation for Effective Instance Matching over Heterogeneous Web Data 24
  • 25. Apple Nutritional Botanical Information Information SERIMI: Class-based Disambiguation for Effective Instance Matching over Heterogeneous Web Data 25
  • 26. “Direct comparison paradigm does not apply”. Source Target SERIMI: Class-based Disambiguation for Effective Instance Matching over Heterogeneous Web Data 26
  • 27. Apple Me SERIMI: Class-based Disambiguation for Effective Instance Matching over Heterogeneous Web Data 27
  • 28. Apple Orange Me Pineapple SERIMI: Class-based Disambiguation for Effective Instance Matching over Heterogeneous Web Data 28
  • 29. Apple Me Orange You Pineapple SERIMI: Class-based Disambiguation for Effective Instance Matching over Heterogeneous Web Data 29
  • 30. You SERIMI: Class-based Disambiguation for Effective Instance Matching over Heterogeneous Web Data 32
  • 31. Food SERIMI: Class-based Disambiguation for Effective Instance Matching over Heterogeneous Web Data 34
  • 32. SERIMI: Class-based Disambiguation for Effective Instance Matching over Heterogeneous Web Data 35
  • 33. Eatable Food SERIMI: Class-based Disambiguation for Effective Instance Matching over Heterogeneous Web Data 36
  • 34. Source SERIMI: Class-based Disambiguation for Effective Instance Matching over Heterogeneous Web Data 37
  • 35. My Apple Your Apple My Orange Your Orange My Pineapple Your Pineapple SERIMI: Class-based Disambiguation for Effective Instance Matching over Heterogeneous Web Data 38
  • 36. SERIMI: Class-based Disambiguation for Effective Instance Matching over Heterogeneous Web Data 39
  • 37. “We use a class-based disambiguation paradigm …” SERIMI: Class-based Disambiguation for Effective Instance Matching over Heterogeneous Web Data 40
  • 38. “We use a class-based disambiguation paradigm …” “… when there is no overlap between schemas.” SERIMI: Class-based Disambiguation for Effective Instance Matching over Heterogeneous Web Data 41
  • 39. Instance Matching with SERIMI Source Target
  • 40. Instance Representation
  • 41. Instance Representation Predicate Instance Value
  • 42. Instance Representation (predicate, instance, value): (shape, Apple1, Round); (title, Apple1, Apple); (color, Apple1, Red); (category, Apple1, Eatable)
  • 45. Instance Representation [P(hi), D(hi), O(hi), T(hi)]
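The instance representation on the slides above can be sketched in a few lines of Python. The slides do not define P, D, O and T, so the reading used here (predicates, literal values, object/resource values, and predicate-value pairs) is an assumption, as are the names `Triple`, `features` and `is_resource`.

```python
from collections import namedtuple

# Column order follows the slide: Predicate, Instance, Value.
Triple = namedtuple("Triple", ["predicate", "instance", "value"])

triples = [
    Triple("shape", "Apple1", "Round"),
    Triple("title", "Apple1", "Apple"),
    Triple("color", "Apple1", "Red"),
    Triple("category", "Apple1", "Eatable"),
]

def features(instance, triples, is_resource=lambda v: v.startswith("http")):
    """Return (P, D, O, T) for one instance.

    ASSUMPTION: P = predicates, D = literal values, O = resource values,
    T = (predicate, value) pairs. The slides only show the notation.
    """
    own = [t for t in triples if t.instance == instance]
    P = {t.predicate for t in own}
    D = {t.value for t in own if not is_resource(t.value)}
    O = {t.value for t in own if is_resource(t.value)}
    T = {(t.predicate, t.value) for t in own}
    return P, D, O, T

P, D, O, T = features("Apple1", triples)
```

For `Apple1` every value is a plain string, so `O` is empty and `D` holds all four values.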
  • 46. Step 1: Cluster the source. Source: Cars, Fruits, Companies
  • 47. Step 2: Blocking Key Selection. Source instances; Key Selection
  • 48. Step 2: Blocking Key Selection. Source instances; Key Selection; key, key, key
  • 49. Step 2: Blocking Key Selection. Source instances; Key Selection; key, key, key (e.g. Title)
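The slides do not say how the blocking key is chosen, only that a predicate such as Title comes out of the selection. As a rough illustration, the sketch below picks the predicate that covers the most source instances and has the most distinct values. This heuristic and the name `select_blocking_key` are assumptions, not SERIMI's actual selection criterion.

```python
from collections import defaultdict

def select_blocking_key(triples):
    """Pick one predicate as blocking key (hypothetical heuristic).

    triples: iterable of (predicate, instance, value); must not be empty.
    Score = coverage (share of instances having the predicate)
            * distinctness (distinct values per covered instance).
    """
    triples = list(triples)
    instances = {i for _, i, _ in triples}
    covered = defaultdict(set)   # predicate -> instances that have it
    values = defaultdict(set)    # predicate -> distinct values seen
    for p, i, v in triples:
        covered[p].add(i)
        values[p].add(v)

    def score(p):
        coverage = len(covered[p]) / len(instances)
        distinctness = len(values[p]) / len(covered[p])
        return coverage * distinctness

    # sorted() makes tie-breaking deterministic (alphabetical order).
    return max(sorted(covered), key=score)

source = [
    ("title", "Apple1", "Apple"), ("category", "Apple1", "Eatable"),
    ("title", "Orange1", "Orange"), ("category", "Orange1", "Eatable"),
    ("title", "Pineapple1", "Pineapple"), ("category", "Pineapple1", "Eatable"),
]
key = select_blocking_key(source)
```

Here `title` wins because every instance has a different title, while `category` has the same value everywhere and would not separate the instances.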
  • 50. Step 3: Pseudo-Homonyms Builder. Keys: Title=apple, Title=orange, Title=pineapple; Pseudo-Homonyms Builder; Target
  • 51. Step 3: Pseudo-Homonyms Builder. Source instances; Target; Pseudo-Homonyms Builder; Pseudo-homonyms sets (everything called Apple)
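A minimal sketch of the pseudo-homonyms builder, assuming the key value of each source instance is matched against whole tokens of the target labels. The token matching rule, the function name and the sample identifiers are illustrative assumptions; the slides only state that the set for "apple" holds everything called Apple in the target.

```python
def build_pseudo_homonyms(source_keys, target_labels):
    """Map each source instance to its pseudo-homonym set.

    source_keys:   {source_instance: key value, e.g. its title}
    target_labels: {target_instance: label}
    ASSUMPTION: a target instance matches when the key equals one of the
    whitespace-separated, case-folded tokens of its label.
    """
    sets = {}
    for source, key in source_keys.items():
        k = key.casefold()
        sets[source] = [
            target for target, label in target_labels.items()
            if k in label.casefold().split()
        ]
    return sets

# Hypothetical sample data.
source_keys = {"MyApple": "apple", "MyOrange": "orange"}
target_labels = {
    "t:apple_fruit": "Apple",
    "t:apple_inc": "Apple Inc.",
    "t:orange_fruit": "Orange",
    "t:orange_county": "Orange County",
    "t:pineapple": "Pineapple",
}
homonyms = build_pseudo_homonyms(source_keys, target_labels)
```

Each set is still ambiguous at this point: the fruit and the company both answer to "apple", which is what the next step has to resolve.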
  • 52. Step 4: Class-based disambiguation. Target; Pseudo-homonyms sets; Class-based Disambiguator; Disambiguation
  • 53. Step 4: Class-based disambiguation. Source instances; Target; Pseudo-homonyms sets; Class-based Disambiguator; Disambiguation
  • 56. Step 4: Class-based disambiguation. Pseudo-homonym sets and their instances: H1 = {h11, h12, h13, h14}; H2 = {h21, h22}; H3 = {h31, h32, h33}
  • 57. Step 4: Class-based disambiguation. [P(hi11), D(hi11), O(hi11), T(hi11)]. Pseudo-homonym sets and their instances: H1 = {h11, h12, h13, h14}; H2 = {h21, h22}; H3 = {h31, h32, h33}
  • 58. Instance Representation (predicate, instance, value): (shape, Apple1, Round); (title, Apple1, Apple); (color, Apple1, Red); (category, Apple1, Eatable)
  • 59. Step 4: Class-based disambiguation. Scores per instance: H1: h11 0.98, h12 0.32, h13 0.32, h14 0.76; H2: h21 0.95, h22 0.53; H3: h31 0.94, h32 0.91, h33 0.87
  • 63. Step 4: Class-based disambiguation. Left: instances of H1, H2, H3; right: the same grid with h11 replaced by its score, 0.98
  • 64. Step 4: Class-based disambiguation. Scores per set: H1: 0.98, 0.32, 0.32, 0.76; H2: 0.95, 0.53; H3: 0.94, 0.91, 0.87
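The scoring idea behind these slides can be illustrated as follows: a candidate in one pseudo-homonym set scores high when it resembles some candidate in each of the other sets, on the expectation that the correct matches share a class. The slides do not give the similarity function, so this sketch swaps in plain Jaccard similarity over feature sets and averages the best match per other set. Both choices, and all names and sample data, are assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def score_candidates(homonym_sets, features):
    """Score every candidate by its similarity to the other sets.

    homonym_sets: {set name: [candidate, ...]}
    features:     {candidate: set of features, e.g. its predicates}
    ASSUMPTION: score = mean over the other sets of the best Jaccard match.
    """
    scores = {}
    for name, members in homonym_sets.items():
        others = [m for n, m in homonym_sets.items() if n != name and m]
        for h in members:
            if not others:
                scores[h] = 0.0
                continue
            best = [max(jaccard(features[h], features[o]) for o in other)
                    for other in others]
            scores[h] = sum(best) / len(best)
    return scores

# Hypothetical candidates described by their predicates.
features = {
    "apple_fruit": {"shape", "color", "category"},
    "apple_inc": {"founder", "ticker"},
    "orange_fruit": {"shape", "color", "category"},
    "orange_county": {"population", "area"},
}
homonym_sets = {
    "H1": ["apple_fruit", "apple_inc"],
    "H2": ["orange_fruit", "orange_county"],
}
scores = score_candidates(homonym_sets, features)
```

The two fruits support each other across sets and score 1.0, while the company and the county find no counterpart and score 0.0.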
  • 66. Step 4: Class-based disambiguation. Select by TOP-K or Threshold; top scores: H1 0.98, H2 0.95, H3 0.94
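The two selection rules named on the slide can be sketched as below. The scores for H1 are the values shown earlier in the deck; pairing them with h11 to h14 follows the slide layout and is an assumption, as is treating the threshold δ as an absolute cut-off on the score.

```python
def select_top_k(scored, k=1):
    """Keep the k highest-scoring candidates of one pseudo-homonym set."""
    return sorted(scored, key=scored.get, reverse=True)[:k]

def select_by_threshold(scored, delta=0.9):
    """Keep every candidate whose score is at least delta."""
    return [h for h, s in scored.items() if s >= delta]

# Scores for set H1 as read off the slides (pairing assumed).
H1 = {"h11": 0.98, "h12": 0.32, "h13": 0.32, "h14": 0.76}

top1 = select_top_k(H1, k=1)
top2 = select_top_k(H1, k=2)
above = select_by_threshold(H1, delta=0.9)
```

Top-K always returns something, even when no candidate is convincing; a threshold can return an empty list, which trades recall for precision.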
  • 68. Step 4: Class-based disambiguation. Source instances; Target; Pseudo-homonyms sets; Class-based Disambiguator; Disambiguation
  • 69. Experiment • Ontology Alignment Evaluation Initiative (OAEI 2010) • Collections: the life science (LS) collection (DBPedia, Sider, Drugbank, LinkedCT, Dailymed, TCM, and Diseasome) and the Person-Restaurant (PR) collection • 20 gigabytes of data, millions of triples • We compared SERIMI to ObjectCoref and RiMOM • Metrics: Precision, Recall and F1
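For reference, the three metrics named on the slide can be computed over sets of (source, target) match pairs as follows. The function name and the sample pairs are made up for illustration and are not taken from the OAEI data.

```python
def precision_recall_f1(predicted, reference):
    """Precision, recall and F1 of predicted match pairs vs. a reference."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical matches: 2 of 3 predictions are right, 2 of 4 references found.
predicted = {("s1", "t1"), ("s2", "t2"), ("s3", "t9")}
reference = {("s1", "t1"), ("s2", "t2"), ("s3", "t3"), ("s4", "t4")}
p, r, f1 = precision_recall_f1(predicted, reference)
```

With these pairs, precision is 2/3, recall is 1/2 and F1 is 4/7.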
  • 70. Results
  • 75. Step 4: Class-based disambiguation. Select by TOP-K or Threshold; top scores: H1 0.98, H2 0.95, H3 0.94
  • 76. Results for Top-K (chart: F1 per dataset pair for Top-1, Top-2, Top-5 and Top-10; dataset pairs: Sider-Daily., Sider-Drug., Drug.-Sider, P11-P12)
  • 77. Results for δ threshold (chart: F1 per dataset pair for δ >= δm, δ = 1.0, δ >= 0.95, δ >= 0.90 and δ >= 0.85; dataset pairs: Sider-Daily., Sider-Drug., Drug.-Sider, P11-P12)
  • 78. Conclusion • SERIMI is a complementary approach to direct-match-based instance matching tools. • SERIMI is recommended for heterogeneous data where there is no overlap between schemas. • It is recommended for multi-class disambiguation.
  • 79. THANK YOU! • Samur Araujo s.f.cardosodearaujo@tudelft.nl