Slides for the paper "A Comparison of Supervised Learning Classifiers for Link Discovery" by Tommaso Soru and Axel-Cyrille Ngonga Ngomo (AKSW, University of Leipzig), presented on September 4, 2014 at the 10th International Conference on Semantic Systems (SEMANTiCS) in Leipzig, Germany.
A Comparison of Supervised Learning Classifiers
for Link Discovery
Tommaso Soru and Axel-Cyrille Ngonga Ngomo
Agile Knowledge Engineering and Semantic Web
Department of Computer Science
University of Leipzig
Augustusplatz 10, 04109 Leipzig
{tsoru,ngonga}@informatik.uni-leipzig.de
http://aksw.org
September 4, 2014
SEMANTiCS 2014 | The 10th International Conference on Semantic Systems
Introduction/1
The 4th Linked Data Web Principle:
"Include links to other URIs, so that they can discover more things." – Tim Berners-Lee
31B triples in 2011, of which only 3% link different datasets
71B triples expected in 2014
T. Soru, A. Ngonga Ngomo, September 4, 2014 – A Comparison of Supervised Learning Classifiers for Link Discovery
Introduction/2
Link Discovery
What? Discover new links among resources.
How? Using supervised and unsupervised methods.
Why? Links are important for data integration, question
answering, knowledge extraction.
We will focus on supervised machine-learning algorithms.
Preliminaries
Link Discovery.
Given two datasets S and T, the general aim of link discovery is to find the set
of resource pairs (s, t) ∈ S × T such that R(s, t) holds, where R is a given
relation such as owl:sameAs or dbp:near.
Link Specification is a rule composed of a complex similarity function sim and
a threshold θ that defines which pairs (s, t) should be linked together:
sim(s, t) ≥ θ
Main problems
1 Naïve approaches demand quadratic time complexity.
2 Efficient algorithms ⇏ accurate link specifications.
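The definition above can be sketched in a few lines of code. This is a minimal illustration, not the paper's implementation: trigram similarity is one simple choice of sim, and the all-pairs loop makes the quadratic-complexity problem from the slide explicit.

```python
def trigrams(s):
    """Return the set of character trigrams of a string, padded at the ends."""
    s = "  " + s.lower() + "  "
    return {s[i:i + 3] for i in range(len(s) - 2)}

def sim(s, t):
    """Jaccard similarity over character trigrams, in [0, 1]."""
    a, b = trigrams(s), trigrams(t)
    return len(a & b) / len(a | b) if a | b else 1.0

def link(source, target, theta):
    """Naive link discovery: test sim(s, t) >= theta for every pair in S x T.
    This is the quadratic baseline that efficient algorithms avoid."""
    return [(s, t) for s in source for t in target if sim(s, t) >= theta]

pairs = link(["Leipzig Univ."], ["University of Leipzig", "TU Graz"], 0.2)
```

For |S| sources and |T| targets the loop performs |S| × |T| similarity computations, which is exactly why blocking and other time-efficient algorithms matter at the dataset sizes discussed later.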
Motivation
We want to answer these questions.
Q1: Which of the paradigms achieves the best F-measures?
Q2: Which of the paradigms is most robust against noise?
Q3: Which of the methods is the most time-efficient?
Overview/1
Evaluation pipeline
Alignment between properties is carried out manually.
Perfect mapping (i.e., labels)
(s, t) is a positive example iff R(s, t) holds.
Overview/2
Assumptions
The complex similarity function sim compares property values.
In case of
datatype properties: it uses text/numerical/date similarities.
object properties: it applies the similarities iteratively.
Graph structure has not been considered as a feature per se.
Cross-validation has been preferred over semi-supervised
learning because it yields more accurate results.
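The evaluation idea behind these assumptions can be sketched as follows: each candidate pair becomes a vector of per-property similarity scores with a binary label (does R(s, t) hold?), and a classifier is scored by k-fold cross-validation on F-measure. The threshold "learner" below is a toy stand-in for the Weka/LibSVM classifiers compared in the paper; all names and the data are illustrative.

```python
def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall (F1)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def train_threshold(train):
    """'Learn' a threshold on the mean similarity: the midpoint between
    the positive and negative class means (a toy stand-in for a learner)."""
    pos = [sum(x) / len(x) for x, y in train if y == 1]
    neg = [sum(x) / len(x) for x, y in train if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def cross_validate(data, k=5):
    """k-fold cross-validation, returning the mean F-measure over folds."""
    scores = []
    for i in range(k):
        test = data[i::k]
        train = [d for j, d in enumerate(data) if j % k != i]
        theta = train_threshold(train)
        tp = sum(1 for x, y in test if sum(x) / len(x) >= theta and y == 1)
        fp = sum(1 for x, y in test if sum(x) / len(x) >= theta and y == 0)
        fn = sum(1 for x, y in test if sum(x) / len(x) < theta and y == 1)
        scores.append(f_measure(tp, fp, fn))
    return sum(scores) / len(scores)

# Toy labelled pairs: (vector of property similarities, label).
data = [((0.9, 0.8), 1), ((0.7, 0.9), 1), ((0.8, 0.7), 1), ((0.9, 0.9), 1),
        ((0.1, 0.2), 0), ((0.3, 0.1), 0), ((0.2, 0.3), 0), ((0.1, 0.1), 0)] * 3
```

Swapping the threshold stand-in for a real classifier (a perceptron, an SVM, a decision tree) while keeping the same folds is what makes the F-measures of the different paradigms directly comparable.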
Evaluation Setup/1
Similarities
for string values:
Weighted trigram similarity, setting tf-idf scores as weights
Weighted edit distance, setting confusion matrices as weights
Cosine similarity
for numerical values:
Logarithmic similarity
for date values:
a day-based date similarity
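The numerical and date measures can be sketched as below. These are plausible formulations consistent with the names on the slide, not the exact definitions used in the experiments; the decay horizon for dates is an assumed parameter.

```python
import math
from datetime import date

def logarithmic_similarity(x, y):
    """Compare two non-negative numbers on a log scale: equal magnitudes
    score 1, diverging magnitudes decay towards 0."""
    lx, ly = math.log1p(abs(x)), math.log1p(abs(y))
    return 1.0 - abs(lx - ly) / max(lx, ly, 1e-9)

def date_similarity(d1, d2, horizon_days=365):
    """Day-based date similarity: 1 for the same day, decreasing
    linearly to 0 at `horizon_days` days apart (assumed horizon)."""
    delta = abs((d1 - d2).days)
    return max(0.0, 1.0 - delta / horizon_days)
```

Both functions return values in [0, 1], so they can be combined with the string similarities inside a single complex similarity function sim.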
Evaluation Setup/3
Datasets
D1-D3: synthetic datasets from the Ontology Alignment Evaluation
Initiative (OAEI) 2010 Benchmark
D4-D6: real datasets from the Benchmark for Entity Resolution, DBS
Leipzig
D5-D6: datasets having a high level of noise
#   dataset                domain         size
D1  OAEI-Persons1          personal data  250k
D2  OAEI-Persons2          personal data  240k
D3  OAEI-Restaurants       places         72k
D4  DBLP–ACM               bibliographic  6M
D5  Amazon–GoogleProducts  e-commerce     10M
D6  ABT–Buy                e-commerce     1M
Results/3
Considerations
Some average trends can be suggested, yet no algorithm significantly outperforms all others.
Multilayer Perceptrons performed best both including and excluding noisy datasets.
Random Trees seem the fastest approach overall.
The different approaches seem complementary in their behaviour.
Naïve Bayes might fail, as it considers all features as independent from each other.
Results/4
Answers
Q1: Which of the paradigms achieves the best F-measures?
A1: Multilayer Perceptrons, Linear SVMs, Decision Tables.
Q2: Which of the paradigms is most robust against noise?
A2: Logistic Regression, Random Trees, Multilayer Perceptrons.
Q3: Which of the methods is the most time-efficient?
A3: Random Trees; however, all approaches scale well.
Related Work
Time-efficient deduplication algorithms (PPJoin+, EDJoin,
PassJoin, TrieJoin)
LIMES – Link Discovery Framework for Metric Spaces
Approaches for learning link specifications (HYPPO, HR3,
EAGLE, ACIDS)
Dedicated efficient methods (RDF-AI, REEDED)
LinkLion – A Link Repository for the Web of Data
The SAIM interface
Other link discovery frameworks (SILK, LDIF)
Other machine learning frameworks (MARLIN, FEBRL,
RAVEN)
Other blocking techniques (MultiBlock, KnoFuss)
Future Work
1 Integration of Multilayer Perceptrons into the LIMES
framework.
2 Use of ensemble learning techniques.
3 Evaluation in a semi-supervised learning setting with little
training data.
4 Evaluation using a larger number of similarity measures.
5 Incorporation of a component based on Statistical Relational
Learning.
Web resources
Source code – Batch Learners Evaluation for Link Discovery
http://github.com/mommi84/BALLAD
Technical report – Batch Learners Evaluation for Link Discovery
http://mommi84.github.io/BALLAD
The OAEI 2010 Benchmark
http://oaei.ontologymatching.org/2010/benchmarks
The Benchmark for Entity Resolution, DBS Leipzig
http://goo.gl/bvWBjA
Weka – Data Mining Software in Java
http://www.cs.waikato.ac.nz/ml/weka
LibSVM – A Library for Support Vector Machines
http://www.csie.ntu.edu.tw/~cjlin/libsvm
LIMES – Link Discovery Framework for Metric Spaces
http://aksw.org/Projects/LIMES
LinkLion – A Link Repository for the Web of Data
http://www.linklion.org
Thank you for your attention.