Slides for the paper "A Comparison of Supervised Learning Classifiers for Link Discovery" by Tommaso Soru and Axel-Cyrille Ngonga Ngomo (AKSW, University of Leipzig), presented on September 4, 2014 at the 10th International Conference on Semantic Systems (SEMANTiCS) in Leipzig, Germany.
A Comparison of Supervised Learning Classifiers
for Link Discovery
Tommaso Soru and Axel-Cyrille Ngonga Ngomo
Agile Knowledge Engineering and Semantic Web
Department of Computer Science
University of Leipzig
Augustusplatz 10, 04109 Leipzig
{tsoru,ngonga}@informatik.uni-leipzig.de
http://aksw.org
September 4, 2014
SEMANTiCS 2014 | The 10th International Conference on Semantic Systems
Introduction/1
The 4th Linked Data Web Principle:
"Include links to other URIs, so that they can discover more things." – Tim Berners-Lee
31B triples in 2011, of which only 3% link different datasets
71B triples expected in 2014
T. Soru, A. Ngonga Ngomo, September 4, 2014 – A Comparison of Supervised Learning Classifiers for Link Discovery
Introduction/2
Link Discovery
What? Discover new links among resources.
How? Using supervised and unsupervised methods.
Why? Links are important for data integration, question
answering, knowledge extraction.
We will focus on supervised machine-learning algorithms.
Preliminaries
Link Discovery.
Given two datasets S and T, the general aim of link discovery is to find the set
of resource pairs (s, t) ∈ S × T such that R(s, t) holds, where R is a given
relation such as owl:sameAs or dbp:near.
Link Specification is a rule composed of a complex similarity function sim and
a threshold θ that defines which pairs (s, t) should be linked together:
sim(s, t) ≥ θ
Main problems
1 Naïve approaches demand quadratic time complexity.
2 Efficient algorithms ⇏ accurate link specifications.
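The definition above can be sketched in a few lines of code. This is a minimal illustration, not the paper's implementation: trigram similarity is one simple choice of sim, and the all-pairs loop makes the quadratic-complexity problem from the slide explicit.

```python
def trigrams(s):
    """Return the set of character trigrams of a string, padded at the ends."""
    s = "  " + s.lower() + "  "
    return {s[i:i + 3] for i in range(len(s) - 2)}

def sim(s, t):
    """Jaccard similarity over character trigrams, in [0, 1]."""
    a, b = trigrams(s), trigrams(t)
    return len(a & b) / len(a | b) if a | b else 1.0

def link(source, target, theta):
    """Naive link discovery: test sim(s, t) >= theta for every pair in S x T.
    This is the quadratic baseline that efficient algorithms avoid."""
    return [(s, t) for s in source for t in target if sim(s, t) >= theta]

pairs = link(["Leipzig Univ."], ["University of Leipzig", "TU Graz"], 0.2)
```

For |S| sources and |T| targets the loop performs |S| × |T| similarity computations, which is exactly why blocking and other time-efficient algorithms matter at the dataset sizes discussed later.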
Motivation
We want to answer these questions.
Q1: Which of the paradigms achieves the best F-measures?
Q2: Which of the paradigms is most robust against noise?
Q3: Which of the methods is the most time-efficient?
Overview/1
Evaluation pipeline
Alignment between properties is carried out manually.
Perfect mapping (i.e., labels)
(s, t) is a positive example iff R(s, t) holds.
Overview/2
Assumptions
The complex similarity function sim compares property values.
In case of
datatype properties: it uses text/numerical/date similarities.
object properties: it applies the similarities iteratively.
Graph structure has not been considered as a feature per se.
Cross-validation has been preferred over semi-supervised
learning because it yields more accurate results.
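The evaluation idea behind these assumptions can be sketched as follows: each candidate pair becomes a vector of per-property similarity scores with a binary label (does R(s, t) hold?), and a classifier is scored by k-fold cross-validation on F-measure. The threshold "learner" below is a toy stand-in for the Weka/LibSVM classifiers compared in the paper; all names and the data are illustrative.

```python
def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall (F1)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def train_threshold(train):
    """'Learn' a threshold on the mean similarity: the midpoint between
    the positive and negative class means (a toy stand-in for a learner)."""
    pos = [sum(x) / len(x) for x, y in train if y == 1]
    neg = [sum(x) / len(x) for x, y in train if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def cross_validate(data, k=5):
    """k-fold cross-validation, returning the mean F-measure over folds."""
    scores = []
    for i in range(k):
        test = data[i::k]
        train = [d for j, d in enumerate(data) if j % k != i]
        theta = train_threshold(train)
        tp = sum(1 for x, y in test if sum(x) / len(x) >= theta and y == 1)
        fp = sum(1 for x, y in test if sum(x) / len(x) >= theta and y == 0)
        fn = sum(1 for x, y in test if sum(x) / len(x) < theta and y == 1)
        scores.append(f_measure(tp, fp, fn))
    return sum(scores) / len(scores)

# Toy labelled pairs: (vector of property similarities, label).
data = [((0.9, 0.8), 1), ((0.7, 0.9), 1), ((0.8, 0.7), 1), ((0.9, 0.9), 1),
        ((0.1, 0.2), 0), ((0.3, 0.1), 0), ((0.2, 0.3), 0), ((0.1, 0.1), 0)] * 3
```

Swapping the threshold stand-in for a real classifier (a perceptron, an SVM, a decision tree) while keeping the same folds is what makes the F-measures of the different paradigms directly comparable.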
Evaluation Setup/1
Similarities
for string values:
Weighted trigram similarity, setting tf-idf scores as weights
Weighted edit distance, setting confusion matrices as weights
Cosine similarity
for numerical values:
Logarithmic similarity
for date values:
a day-based date similarity
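The numerical and date measures can be sketched as below. These are plausible formulations consistent with the names on the slide, not the exact definitions used in the experiments; the decay horizon for dates is an assumed parameter.

```python
import math
from datetime import date

def logarithmic_similarity(x, y):
    """Compare two non-negative numbers on a log scale: equal magnitudes
    score 1, diverging magnitudes decay towards 0."""
    lx, ly = math.log1p(abs(x)), math.log1p(abs(y))
    return 1.0 - abs(lx - ly) / max(lx, ly, 1e-9)

def date_similarity(d1, d2, horizon_days=365):
    """Day-based date similarity: 1 for the same day, decreasing
    linearly to 0 at `horizon_days` days apart (assumed horizon)."""
    delta = abs((d1 - d2).days)
    return max(0.0, 1.0 - delta / horizon_days)
```

Both functions return values in [0, 1], so they can be combined with the string similarities inside a single complex similarity function sim.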
Evaluation Setup/3
Datasets
D1-D3: synthetic datasets from the Ontology Alignment Evaluation
Initiative (OAEI) 2010 Benchmark
D4-D6: real datasets from the Benchmark for Entity Resolution, DBS
Leipzig
D5-D6: datasets having a high level of noise
#   dataset                domain         size
D1  OAEI-Persons1          personal data  250k
D2  OAEI-Persons2          personal data  240k
D3  OAEI-Restaurants       places         72k
D4  DBLP–ACM               bibliographic  6M
D5  Amazon–GoogleProducts  e-commerce     10M
D6  ABT–Buy                e-commerce     1M
Results/3
Considerations
Some average trends can be suggested, yet no algorithm significantly outperforms all others.
Multilayer Perceptrons performed best both including and excluding noisy datasets.
Random Trees seem the fastest approach overall.
The different approaches seem complementary in their behaviour.
Naïve Bayes might fail, as it considers all features as independent from each other.
Results/4
Answers
Q1: Which of the paradigms achieves the best F-measures?
A1: Multilayer Perceptrons, Linear SVMs, Decision Tables.
Q2: Which of the paradigms is most robust against noise?
A2: Logistic Regression, Random Trees, Multilayer Perceptrons.
Q3: Which of the methods is the most time-efficient?
A3: Random Trees; however, all approaches scale well.
Related Work
Time-efficient deduplication algorithms (PPJoin+, EDJoin,
PassJoin, TrieJoin)
LIMES – Link Discovery Framework for Metric Spaces
Approaches for learning link specifications (HYPPO, HR3,
EAGLE, ACIDS)
Dedicated efficient methods (RDF-AI, REEDED)
LinkLion – A Link Repository for the Web of Data
The SAIM interface
Other link discovery frameworks (SILK, LDIF)
Other machine learning frameworks (MARLIN, FEBRL,
RAVEN)
Other blocking techniques (MultiBlock, KnoFuss)
Future Work
1 Integration of Multilayer Perceptrons into the LIMES
framework.
2 Use of ensemble learning techniques.
3 Evaluation in a semi-supervised learning setting with little
training data.
4 Evaluation using a larger number of similarity measures.
5 Incorporation of a component based on Statistical Relational
Learning.
Web resources
Source code – Batch Learners Evaluation for Link Discovery
http://github.com/mommi84/BALLAD
Technical report – Batch Learners Evaluation for Link Discovery
http://mommi84.github.io/BALLAD
The OAEI 2010 Benchmark
http://oaei.ontologymatching.org/2010/benchmarks
The Benchmark for Entity Resolution, DBS Leipzig
http://goo.gl/bvWBjA
Weka – Data Mining Software in Java
http://www.cs.waikato.ac.nz/ml/weka
LibSVM – A Library for Support Vector Machines
http://www.csie.ntu.edu.tw/~cjlin/libsvm
LIMES – Link Discovery Framework for Metric Spaces
http://aksw.org/Projects/LIMES
LinkLion – A Link Repository for the Web of Data
http://www.linklion.org
Thank you for your attention.