A Comparison of Supervised Learning Classifiers for Link Discovery 
Tommaso Soru and Axel-Cyrille Ngonga Ngomo 
Agile Knowledge Engineering and Semantic Web 
Department of Computer Science 
University of Leipzig 
Augustusplatz 10, 04109 Leipzig 
{tsoru,ngonga}@informatik.uni-leipzig.de 
http://aksw.org 
September 4, 2014
SEMANTiCS 2014 | The 10th International Conference on Semantic Systems 
Introduction/1 
The 4th Linked Data Web Principle. 
"Include links to other URIs, so that they can discover more things." (Tim Berners-Lee) 
31B triples in 2011, of which only 3% link different datasets 
71B triples expected in 2014 
Introduction/2 
Link Discovery 
What? Discover new links among resources. 
How? Using supervised and unsupervised methods. 
Why? Links are important for data integration, question 
answering, knowledge extraction. 
We will focus on supervised machine-learning algorithms. 
Preliminaries 
Link Discovery. 
Given two datasets S and T, the general aim of link discovery is to find the set 
of resource pairs $(s, t) \in S \times T$ such that $R(s, t)$ holds, where R is a given 
relation such as owl:sameAs or dbp:near. 
Link Specification. 
A link specification is a rule composed of a complex similarity function sim and 
a threshold $\theta$ that defines which pairs $(s, t)$ should be linked together: 
$\mathrm{sim}(s, t) \geq \theta$ 
Main problems 
1 Naïve approaches demand quadratic time complexity. 
2 Efficient algorithms do not imply accurate link specifications. 
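As a minimal sketch of the definitions above, the following Python snippet applies a link specification naively by scoring every pair in S × T, which is exactly the quadratic cost named as the first problem. The similarity function (Jaccard overlap of character trigrams), the sample labels, and the threshold value are illustrative assumptions, not the setup used by the authors.

```python
from itertools import product


def trigrams(text):
    """Return the set of character trigrams of a lowercased, padded string."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def sim(s, t):
    """Illustrative similarity (an assumption): Jaccard overlap of trigrams."""
    a, b = trigrams(s), trigrams(t)
    return len(a & b) / len(a | b)


def naive_link_discovery(source, target, theta):
    """Score every pair in S x T, i.e. |S| * |T| similarity computations."""
    return [(s, t) for s, t in product(source, target) if sim(s, t) >= theta]


# Hypothetical resource labels standing in for the datasets S and T.
S = ["Leipzig", "Berlin"]
T = ["Leipzig", "Leipzig, Germany", "Munich"]
links = naive_link_discovery(S, T, theta=0.9)
```

With a strict threshold only the exact label match survives; lowering theta trades precision for recall, which is why choosing an accurate specification is a separate problem from evaluating it quickly.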
Motivation 
We want to answer these questions. 
Q1: Which of the paradigms achieves the best F-measures? 
Q2: Which of the paradigms is most robust against noise? 
Q3: Which of the methods is the most time-efficient? 
Overview/1 
Evaluation pipeline 
Alignment between properties is carried out manually. 
Perfect mapping (i.e., labels): (s, t) is a positive example iff R(s, t) holds. 
Overview/2 
Assumptions 
The complex similarity function sim compares property values. In case of 
datatype properties: it uses text/numerical/date similarities. 
object properties: it applies the similarities iteratively. 
Graph structure has not been considered as a feature per se. 
Cross-validation has been preferred over semi-supervised learning because it yields more accurate results. 
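To make the cross-validation assumption concrete, here is a minimal sketch of how labelled pairs could be split into folds. The number of folds, the seed, and the example count are assumptions for illustration; the slides do not state the values used in the evaluation.

```python
import random


def k_fold_indices(n_examples, k, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at position i.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Hypothetical: 10 labelled (s, t) pairs split into 5 folds.
splits = list(k_fold_indices(n_examples=10, k=5))
```

Each labelled pair is used for testing exactly once and for training in the remaining folds, so every classifier is scored on data it did not see during training.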
Evaluation Setup/1 
Similarities 
for string values: 
Weighted trigram similarity, setting tf-idf scores as weights 
Weighted edit distance, setting confusion matrices as weights 
Cosine similarity 
for numerical values: 
Logarithmic similarity 
for date values: 
a day-based Date similarity 
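The slides name these similarities without giving their formulas, so the sketch below uses assumed definitions purely to illustrate how each value type can be mapped to a comparable score. The unit-cost edit distance stands in for the weighted variant (no confusion matrix is used), and the forms of `log_similarity` and `date_similarity`, including the `scale_days` parameter, are hypothetical.

```python
import math
from datetime import date


def edit_distance(s, t):
    """Unit-cost Levenshtein distance (stand-in for the weighted variant)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def log_similarity(a, b):
    """Assumed form for non-negative numbers: 1 / (1 + |log(1+a) - log(1+b)|)."""
    return 1.0 / (1.0 + abs(math.log1p(a) - math.log1p(b)))


def date_similarity(d1, d2, scale_days=365):
    """Assumed form: linear decay in the day difference, clipped at zero."""
    return max(0.0, 1.0 - abs((d1 - d2).days) / scale_days)


string_score = edit_distance("kitten", "sitting")
number_score = log_similarity(1, 100)
date_score = date_similarity(date(2014, 9, 4), date(2012, 9, 4))
```

Applying one such function per aligned property pair yields a numeric feature vector for each (s, t), which is the kind of input a supervised classifier needs.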
Evaluation Setup/2 
Linear non-probabilistic classifiers 
Linear SVM* 
Polynomial SVM* 
Linear SVM with Sequential Minimal Optimization 
Linear Regression 
Probabilistic classifiers 
Logistic Regression 
Naïve Bayes 
Random Tree 
J48 
Neural networks 
Multilayer Perceptron 
Rule-based classifiers 
Decision Table 
We used classifiers from the Weka library, except (*) from LibSVM. 
Evaluation Setup/3 
Datasets 
D1-D3: synthetic datasets from the Ontology Alignment Evaluation Initiative (OAEI) 2010 Benchmark 
D4-D6: real datasets from the Benchmark for Entity Resolution, DBS Leipzig 
D5-D6: datasets having a high level of noise 

#    dataset                  domain           size
D1   OAEI-Persons1            personal data    250k
D2   OAEI-Persons2            personal data    240k
D3   OAEI-Restaurants         places           72k
D4   DBLP-ACM                 bibliographic    6M
D5   Amazon-GoogleProducts    e-commerce       10M
D6   ABT-Buy                  e-commerce       1M
Results/1 
F-measure 

Classifier               D1        D2        D3        D4        D5       D6
Linear SVM               99.40%    98.99%    97.75%    97.81%    27.06%   39.18%
Linear SMO               100.00%   98.73%    100.00%   92.58%    46.63%   31.39%
Polynomial-3 SVM         99.40%    93.76%    98.29%    97.67%    37.28%   31.69%
Multilayer Perceptron    99.50%    99.50%    100.00%   97.43%    35.58%   43.49%
Logistic Regression      99.90%    98.12%    96.67%    97.71%    40.64%   41.92%
Linear Regression        99.30%    96.92%    100.00%   96.36%    37.06%   36.84%
Naïve Bayes              97.75%    35.05%    95.19%    29.47%    2.92%    11.90%
Decision Table           97.98%    100.00%   100.00%   97.66%    42.44%   29.66%
Random Tree              97.45%    99.24%    89.89%    96.82%    39.38%   41.03%
J48                      99.50%    95.56%    98.29%    97.66%    44.28%   31.53%
State of the Art         100.00%   100.00%   100.00%   98.20%    62.10%   71.30%

F-measure calculated on the class of positive examples. 
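For reference, the F-measure on the positive class is the harmonic mean of precision and recall over the predicted links. The sketch below computes it for sets of (s, t) pairs; the pair identifiers are hypothetical and the code is not taken from the authors' evaluation pipeline.

```python
def f_measure(predicted, reference):
    """F1 score of predicted links against the reference (positive) links."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: two of three predicted links are correct,
# and two of three reference links are found.
reference = {("s1", "t1"), ("s2", "t2"), ("s3", "t3")}
predicted = {("s1", "t1"), ("s2", "t2"), ("s2", "t9")}
score = f_measure(predicted, reference)
```

Because only positive pairs enter the computation, the very large number of correctly rejected pairs in S × T does not inflate the score.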
Results/2 
Computation runtimes 

Classifier               D1      D2      D3     D4       D5         D6
Linear SVM               7.16    6.93    2.67   63.94    484.29     75.44
Linear SMO               17.07   12.93   3.77   113.40   369.20     37.16
Polynomial-3 SVM         5.67    6.18    2.63   162.82   1,091.10   103.89
Multilayer Perceptron    15.13   16.10   3.40   96.96    376.26     41.68
Logistic Regression      16.11   14.91   4.61   110.12   275.94     38.48
Linear Regression        16.04   16.21   5.02   120.54   497.43     44.50
Naïve Bayes              17.34   17.09   4.39   105.31   375.91     43.79
Decision Table           16.68   16.44   3.78   90.99    389.35     48.87
Random Tree              12.02   11.16   2.24   53.67    347.36     34.11
J48                      21.31   15.96   6.99   131.57   98.27      38.46

All values in seconds. 
Results/3 
Considerations 
Some average trends can be suggested, yet no algorithm outperforms all others significantly. 
Multilayer Perceptrons performed best including and excluding noisy datasets. 
Random Trees seem the fastest approach overall. 
The different approaches seem complementary in their behaviour. 
Naïve Bayes might fail as it considers all features as independent from each other. 
Results/4 
Answers 
Q1: Which of the paradigms achieves the best F-measures? 
A1: Multilayer Perceptrons, Linear SVMs, Decision Tables. 
Q2: Which of the paradigms is most robust against noise? 
A2: Logistic Regression, Random Trees, Multilayer Perceptrons. 
Q3: Which of the methods is the most time-efficient? 
A3: Random Trees, however all approaches scale well. 
  • 67. Related Work
        - Time-efficient deduplication algorithms (PPJoin+, EDJoin, PassJoin, TrieJoin)
        - LIMES: Link Discovery Framework for Metric Spaces
        - Approaches for learning link specifications (HYPPO, HR3, EAGLE, ACIDS)
        - Dedicated efficient methods (RDF-AI, REEDED)
        - LinkLion: A Link Repository for the Web of Data
        - The SAIM interface
        - Other link discovery frameworks (SILK, LDIF)
        - Other machine learning frameworks (MARLIN, FEBRL, RAVEN)
        - Other blocking techniques (MultiBlock, KnoFuss)
  • 70. Future Work
        1. Integration of Multilayer Perceptrons into the LIMES framework.
        2. Use of ensemble learning techniques.
        3. Evaluation in a semi-supervised learning setting with little training data.
        4. Evaluation using a larger number of similarity measures.
        5. Incorporation of a component based on Statistical Relational Learning.
  • 72. Web resources
        - Source code (Batch Learners Evaluation for Link Discovery): http://github.com/mommi84/BALLAD
        - Technical report (Batch Learners Evaluation for Link Discovery): http://mommi84.github.io/BALLAD
        - The OAEI 2010 Benchmark: http://oaei.ontologymatching.org/2010/benchmarks
        - The Benchmark for Entity Resolution, DBS Leipzig: http://goo.gl/bvWBjA
        - Weka (Data Mining Software in Java): http://www.cs.waikato.ac.nz/ml/weka
        - LibSVM (A Library for Support Vector Machines): http://www.csie.ntu.edu.tw/~cjlin/libsvm
        - LIMES (Link Discovery Framework for Metric Spaces): http://aksw.org/Projects/LIMES
        - LinkLion (A Link Repository for the Web of Data): http://www.linklion.org
  • 74. Thank you for your attention.