The paper presents SERIMI, a class-based disambiguation approach for instance matching over heterogeneous web data. SERIMI uses a four-step process: 1) clustering source instances into classes, 2) selecting blocking keys, 3) building pseudo-homonym sets, and 4) class-based disambiguation. It was evaluated on two OAEI 2010 collections (life science and Person-Restaurant) containing millions of triples and outperformed other approaches in precision, recall, and F1 score for instance matching when schemas have limited or no overlap.
SERIMI: Class-based Disambiguation for Effective Instance Matching over Heterogeneous Web Data
1. SERIMI: Class-based Disambiguation for Effective Instance Matching over Heterogeneous Web Data
Samur Araujo, DucThanh Tran, Arjen de Vries, Jan Hidders, Daniel Schwabe
Delft University of Technology
WebDB 2012
2. Me You
3. Apple
Me
10. My Apple: Spherical Shape, Red Color, Eatable
Your Apple: Round Shape, Green Color, Eatable
11. My Apple: Shape, Color, Eatable
Your Apple: Shape, Color, Eatable
12. My Apple: Spherical Shape, Red Color, Eatable
Your Apple: Round Shape, Green Color, Eatable
Fruit
13. My Apple Your Apple
14. Instance Matching
Source Target
15. “Instance matching uses a direct comparison paradigm”.
17. Source: Is your Apple like my Apple?
Target: Hmm... Maybe!
18. Homogeneous data and schema.
19. The source and target descriptions overlap.
Source Target
20. Syntactic Overlap
Population = TotalPopulation
21. Semantic Overlap
Population = Num_Inhabitants
22. Web of Data: heterogeneous data and schema
23. No or limited overlap between schemas
Source Target
24. Instances do not properly instantiate the schema.
Source Target
25. Apple
Nutritional Information / Botanical Information
26. “Direct comparison paradigm does not apply”.
Source Target
27. Apple
Me
28. Apple
Orange
Me
Pineapple
29. Apple
Me Orange
You
Pineapple
33. Eatable
Food
34. Source
35. My Apple Your Apple
My Orange Your Orange
My Pineapple Your Pineapple
37. “We use a class-based disambiguation paradigm …”
38. “We use a class-based disambiguation paradigm …”
“… when there is no overlap between schemas.”
39. Instance Matching with SERIMI
Source Target
40. Instance Representation
41. Instance Representation
Instance --Predicate--> Value
42. Instance Representation
Apple1 --shape--> Round
Apple1 --title--> Apple
Apple1 --color--> Red
Apple1 --category--> Eatable
45. Instance Representation
[P(hi), D(hi), O(hi), T(hi)]
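The instance representation can be illustrated with a minimal Python sketch. The slides do not define the four symbols, so the meanings used here are assumptions: P as the set of predicates, D as the set of literal values, O as the set of values that reference other instances, and T as the set of (predicate, value) pairs. The `Triple` type, the `features` function, and the `is_reference` parameter are hypothetical names introduced only for this example.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    instance: str
    predicate: str
    value: str


# The Apple1 example from the slides as (instance, predicate, value) triples.
TRIPLES = [
    Triple("Apple1", "shape", "Round"),
    Triple("Apple1", "title", "Apple"),
    Triple("Apple1", "color", "Red"),
    Triple("Apple1", "category", "Eatable"),
]


def features(instance, triples, is_reference=lambda value: False):
    """Return assumed feature sets P, D, O, T for one instance.

    is_reference decides whether a value points to another instance
    (assumption: by default every value is treated as a literal).
    """
    own = [t for t in triples if t.instance == instance]
    return {
        "P": {t.predicate for t in own},
        "D": {t.value for t in own if not is_reference(t.value)},
        "O": {t.value for t in own if is_reference(t.value)},
        "T": {(t.predicate, t.value) for t in own},
    }


apple = features("Apple1", TRIPLES)
```

With these assumptions, `apple["P"]` holds the four predicates of Apple1 and `apple["T"]` holds pairs such as `("color", "Red")`.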
46. Step 1: Cluster the source
Source clusters: Cars, Fruits, Companies
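As a rough illustration of Step 1, the sketch below groups source instances that use exactly the same set of predicates. The slides do not say which clustering criterion SERIMI applies, so this grouping rule, the function name `cluster_by_predicates`, and the sample data are all assumptions made for the example.

```python
from collections import defaultdict


def cluster_by_predicates(triples):
    """Group instances sharing the same predicate set (hypothetical criterion)."""
    predicates = defaultdict(set)
    for subject, predicate, _value in triples:
        predicates[subject].add(predicate)

    clusters = defaultdict(set)
    for subject, preds in predicates.items():
        clusters[frozenset(preds)].add(subject)
    return list(clusters.values())


# Invented sample data: two fruits and one car.
source = [
    ("Apple1", "color", "Red"), ("Apple1", "shape", "Round"),
    ("Orange1", "color", "Orange"), ("Orange1", "shape", "Round"),
    ("Beetle1", "engine", "1.6L"), ("Beetle1", "doors", "2"),
]
clusters = cluster_by_predicates(source)
```

Here the two fruits end up in one cluster and the car in another, so later steps can treat each cluster as one class of interest.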
47. Step 2: Blocking Key Selection
Diagram: Source instances, Key Selection
48. Step 2: Blocking Key Selection
Diagram: Source instances, Key Selection, keys
49. Step 2: Blocking Key Selection
Diagram: Source instances, Key Selection, keys (e.g. Title)
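One way to picture blocking key selection is to pick the predicate that most source instances have and whose values differ most between them. The slides only show that a key such as Title is chosen, so the scoring rule below (coverage times distinctness) is a hypothetical heuristic, and `select_blocking_key` is a name invented for this sketch.

```python
from collections import defaultdict


def select_blocking_key(triples):
    """Return the predicate with the highest coverage * distinctness.

    Hypothetical heuristic, not necessarily the rule used by SERIMI.
    """
    instances = {subject for subject, _p, _v in triples}
    subjects = defaultdict(set)
    values = defaultdict(set)
    for subject, predicate, value in triples:
        subjects[predicate].add(subject)
        values[predicate].add(value)

    def score(predicate):
        coverage = len(subjects[predicate]) / len(instances)
        distinctness = len(values[predicate]) / len(subjects[predicate])
        return coverage * distinctness

    # sorted() makes the result deterministic when scores tie.
    return max(sorted(subjects), key=score)


source = [
    ("Apple1", "title", "Apple"), ("Apple1", "category", "Eatable"),
    ("Orange1", "title", "Orange"), ("Orange1", "category", "Eatable"),
    ("Pineapple1", "title", "Pineapple"), ("Pineapple1", "category", "Eatable"),
]
key = select_blocking_key(source)
```

In this sample, `title` is selected because every instance has a different title, while `category` has the same value everywhere and would not narrow the search in the target.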
50. Step 3: Pseudo-Homonyms Builder
Diagram: keys (Title=apple, Title=orange, Title=pineapple), Pseudo-Homonyms Builder, Target
51. Step 3: Pseudo-Homonyms Builder
Diagram: Source instances, Pseudo-Homonyms Builder, Target, Pseudo-homonym sets (“Everything called Apple”)
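The pseudo-homonym builder can be sketched as a lookup: for each source instance, collect every target instance whose label matches the source key value. The case-insensitive exact match, the function name `build_pseudo_homonym_sets`, and the target identifiers are assumptions for illustration; a real system would likely use a more tolerant string search.

```python
def build_pseudo_homonym_sets(source_keys, target_labels):
    """Map each source instance to the target instances sharing its key value."""
    by_label = {}
    for target_id, label in target_labels.items():
        by_label.setdefault(label.strip().lower(), set()).add(target_id)

    return {
        source_id: by_label.get(key.strip().lower(), set())
        for source_id, key in source_keys.items()
    }


# Key values taken from the slide (Title=apple, orange, pineapple).
source_keys = {"Apple1": "apple", "Orange1": "orange", "Pineapple1": "pineapple"}

# Invented target data containing ambiguous labels.
target_labels = {
    "t:apple_fruit": "Apple",
    "t:apple_company": "Apple",
    "t:orange_fruit": "Orange",
    "t:orange_telecom": "Orange",
    "t:pineapple_fruit": "Pineapple",
}
homonyms = build_pseudo_homonym_sets(source_keys, target_labels)
```

Each resulting set groups everything in the target that carries the same name, which is why the candidates still need to be disambiguated in Step 4.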
52. Step 4: Class-based disambiguation
Diagram: Pseudo-homonym sets, Class-based Disambiguator, Disambiguation, Target
53. Step 4: Class-based disambiguation
Diagram: Source instances, Pseudo-homonym sets, Class-based Disambiguator, Disambiguation, Target
54. Step 4: Class-based disambiguation
55. Step 4: Class-based disambiguation
56. Step 4: Class-based disambiguation
Pseudo-homonym sets and their instances:
H1 = {h11, h12, h13, h14}
H2 = {h21, h22}
H3 = {h31, h32, h33}
57. Step 4: Class-based disambiguation
Instance representation: [P(hi11), D(hi11), O(hi11), T(hi11)]
Pseudo-homonym sets and their instances:
H1 = {h11, h12, h13, h14}
H2 = {h21, h22}
H3 = {h31, h32, h33}
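The idea behind class-based disambiguation can be sketched as follows: a candidate in one pseudo-homonym set is scored by how similar its features are to candidates in the other sets, on the assumption that the correct matches all belong to the same class. The slides do not show the scoring function, so plain Jaccard similarity is swapped in here as a stand-in; `jaccard`, `score_candidates`, and the feature strings are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def score_candidates(homonym_sets):
    """Score each candidate by its mean best similarity to the other sets.

    homonym_sets maps a set name to {candidate id: feature set}.
    """
    scores = {}
    for name, candidates in homonym_sets.items():
        others = [c for n, c in homonym_sets.items() if n != name and c]
        for candidate, feats in candidates.items():
            best = [max(jaccard(feats, f) for f in other.values())
                    for other in others]
            scores[(name, candidate)] = sum(best) / len(best) if best else 0.0
    return scores


# Invented feature sets for three pseudo-homonym sets.
homonym_sets = {
    "H1": {"apple_fruit": {"type:Fruit", "eatable"},
           "apple_company": {"type:Company"}},
    "H2": {"orange_fruit": {"type:Fruit", "eatable"},
           "orange_telecom": {"type:Company", "telecom"}},
    "H3": {"pineapple_fruit": {"type:Fruit", "eatable"}},
}
scores = score_candidates(homonym_sets)
```

In this toy data the fruit candidates reinforce each other across all three sets, so `apple_fruit` scores higher than `apple_company` even though both are called Apple.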
58. Instance Representation
Apple1 --shape--> Round
Apple1 --title--> Apple
Apple1 --color--> Red
Apple1 --category--> Eatable
64. Step 4: Class-based disambiguation
Scores per pseudo-homonym set:
H1: 0.98, 0.32, 0.32, 0.76
H2: 0.95, 0.53
H3: 0.94, 0.91, 0.87
65. Step 4: Class-based disambiguation
66. Step 4: Class-based disambiguation
Selection by TOP-K or Threshold:
H1: 0.98
H2: 0.95
H3: 0.94
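The two selection strategies named on this slide can be sketched in a few lines. The score values come from the Step 4 slides; their assignment to H1, H2 and H3 assumes the numbers are read column by column, and the function names `select_top_k` and `select_by_threshold` are invented for this example.

```python
# Scores from the slides, grouped per pseudo-homonym set
# (assumption: each column of numbers belongs to one set).
SCORES = {
    "H1": [0.98, 0.32, 0.32, 0.76],
    "H2": [0.95, 0.53],
    "H3": [0.94, 0.91, 0.87],
}


def select_top_k(scores, k):
    """Keep the k highest-scoring candidates of every set."""
    return {name: sorted(vals, reverse=True)[:k] for name, vals in scores.items()}


def select_by_threshold(scores, delta):
    """Keep every candidate whose score is at least delta."""
    return {name: [v for v in vals if v >= delta] for name, vals in scores.items()}


top1 = select_top_k(SCORES, 1)
above = select_by_threshold(SCORES, 0.90)
```

Top-1 always returns exactly one candidate per set, while a threshold can return several (H3 keeps two candidates at 0.90 here) or none at all, which is the trade-off the two result charts explore.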
67. Step 4: Class-based disambiguation
68. Step 4: Class-based disambiguation
Diagram: Source instances, Pseudo-homonym sets, Class-based Disambiguator, Disambiguation, Target
69. Experiment
• Ontology Alignment Evaluation Initiative (OAEI 2010)
• Collections: the life science (LS) collection (DBpedia, Sider, Drugbank, LinkedCT, Dailymed, TCM, and Diseasome) and the Person-Restaurant (PR) collection
• 20 gigabytes of data, millions of triples
• We compared SERIMI to ObjectCoref and RiMOM
• Metrics: Precision, Recall and F1
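For reference, the three evaluation metrics can be computed from a set of predicted matches and a reference alignment as in the sketch below. The function name `precision_recall_f1` and the sample pairs are invented for illustration and are not results from the experiment.

```python
def precision_recall_f1(predicted, reference):
    """Compute precision, recall and F1 for sets of (source, target) pairs."""
    predicted, reference = set(predicted), set(reference)
    correct = len(predicted & reference)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented example: 3 predicted matches, 4 reference matches, 2 in common.
predicted = {("s1", "t1"), ("s2", "t2"), ("s3", "t9")}
reference = {("s1", "t1"), ("s2", "t2"), ("s3", "t3"), ("s4", "t4")}
precision, recall, f1 = precision_recall_f1(predicted, reference)
```

F1 is the harmonic mean of precision and recall, so it drops quickly when either of the two is low.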
70. Results
71. Results
72. Results
74. Results
75. Step 4: Class-based disambiguation
Selection by TOP-K or Threshold:
H1: 0.98
H2: 0.95
H3: 0.94
76. Results for Top-K
(Chart: F1, on a 0.00 to 1.00 axis, per dataset pair: Sider-Daily., Sider-Drug., Drug.-Sider, P11-P12. Series: Top-1, Top-2, Top-5, Top-10.)
77. Results for δ threshold
(Chart: F1, on a 0.00 to 1.00 axis, per dataset pair: Sider-Daily., Sider-Drug., Drug.-Sider, P11-P12. Series: δ >= δm, δ = 1.0, δ >= 0.95, δ >= 0.90, δ >= 0.85.)
78. Conclusion
• SERIMI is a complementary approach to direct-match-based instance matching tools.
• SERIMI is recommended for heterogeneous data where there is no overlap between schemas.
• It is recommended for multi-class disambiguation.
79. THANK YOU!
• Samur Araujo
s.f.cardosodearaujo@tudelft.nl