Presentation given by Saminda Abeyruwan at the 6th Uncertainty Reasoning for the Semantic Web Workshop at the 9th International Semantic Web Conference on November 7, 2010.
Paper: PrOntoLearn: Unsupervised Lexico-Semantic Ontology Generation using Probabilistic Methods
Abstract: Formalizing an ontology for a domain manually is well-known as a tedious and cumbersome process. It is constrained by the knowledge acquisition bottleneck. Therefore, researchers developed algorithms and systems that can help to automatize the process. Among them are systems that include text corpora for the acquisition. Our idea is also based on vast amount of text corpora. Here, we provide a novel unsupervised bottom-up ontology generation method. It is based on lexico-semantic structures and Bayesian reasoning to expedite the ontology generation process. We provide a quantitative and two qualitative results illustrating our approach using a high throughput screening assay corpus and two custom text corpora. This process could also provide evidence for domain experts to build ontologies based on top-down approaches.
1. Motivation Related work Deficiencies Research approach Results Discussion Sum. Fw Questions
PrOntoLearn: Unsupervised lexico-semantic
ontology generation using probabilistic methods
Saminda Abeyruwan1 Ubbo Visser1 Vance Lemmon2 Stephan Schürer3
1 Department of Computer Science, University of Miami
2 The Miami Project to Cure Paralysis, University of Miami Miller School of Medicine
3 Department of Molecular and Cellular Pharmacology, University of Miami Miller School of Medicine
URSW 2010 7th November, 2010
Outline
1 Motivation
2 Related work
3 Deficiencies
4 Research approach
5 Results
6 Discussion
7 Summary & Future work
8 Questions
Motivation
Why?
1 An ontology is a formal, explicit specification of a shared
conceptualisation [TRG93, RS98]
2 Knowledge-bases are represented by ontologies [UMLS09]
3 Formalizing an ontology for a domain is a tedious and cumbersome
process (Knowledge acquisition bottleneck (KAB))
4 Substantially large text corpora available to be classified into an
ontology [BAO09]
5 Text corpora of the domain of discourse contain
Redundancy
Structured and unstructured text
Noisy data (Uncertainty via Degree of belief)
Lexical ambiguities
Semantic heterogeneity problems
6 KAB is heavily investigated by the Semantic (Web)
Community
General idea
1 Reverse engineering an ontology (bottom-up) (Lexicon ⇒ An
ontology)
2 Bayesian reasoning to deal with degree of belief
3 Conceptualization is learned through probabilistic reasoning
4 Lexico-semantic structures extracted from WordNet 3.0 [WN3009]
5 Use top-down approach to check the consistency of the generated
ontology
6 Constrained by conditions and hypotheses
7 Serialize the learned ontology into OWL DL and query using
SPARQL
“A little semantics goes a long way” - Hendler hypothesis [JH03]
Probabilistic reasoning & Heterogeneity
Probabilistic reasoning
P-CLASSIC [DK97]
P-OWL extension [ZD04]
P-SHIF(D), P-SHOIN(D) & P-Pellet [TL07, PP08]
Heterogeneity
Read the web project [TM09, TM10]
SEAL, iSEAL & ASIA [RW07, RW08, RW09]
Taxonomy induction [RS06]
LOD [JB09, LD06]
Knowledge acquisition & ontology learning
Knowledge acquisition
Approaches [PC09, DSK09, LS09, HC09, JB05]
Large scale knowledge extraction
Knowledge integration
Extracting commonsensical knowledge
Textual entailment with first-order-logic
Tools [TTO00, SS09, OLSW02, TTO01, HT09]
Text-To-Onto, Text2Onto, OntoWare.org LExO & HermiT
Ontology learning
Learning [PC09, PH05, CC08, JL09, LBM08]
Dealing with uncertainty and inconsistency
Semantic concepts with unsupervised statistical learning
Semantic Web Services & folksonomy
Formal concept analysis [PC05]
Deficiencies
Related work
Pros
Learning terms, synonyms, concepts, taxonomies, rules, relations and
axioms for ontology O
NLP, dictionary parsing, statistical methods & machine learning
techniques and co-occurrence among terms
Cons
Top-down approach. Classification or an ontology is given
Uncertainty is dealt with by a domain expert
Most of the conceptualisation is learned by predefined rules
Our approach
1 Substantially large text corpora
2 Uncertainty is represented with probabilistic approach
3 Unsupervised learning
4 Hypothesis: ontology generation is much faster
5 Goal: to achieve maximum confidence
Goals
1 To generate a consistent lexico-semantic ontology O with a T-Box
and an A-Box that can be serialized into OWL DL
2 Querying via SPARQL [SPARQL08] [JENA09]
How do we start?
1 Corpus C contains documents d_i (d_i ∈ C) for i = 1, 2, 3, ...
2 Learned lexicon set L contains a finite list of words w_j
(L = {w_1, w_2, ..., w_n}) and group set G contains a finite set of groups
g_k (G = {g_1, g_2, ..., g_m})
Overall process
Definition
The lexicon L is the set of words from the English vocabulary, each
part-of-speech tagged with the Penn Treebank English POS tag set [PT10],
where the type of the word is one of:
Term Description
NN Noun, singular or mass
NNP Proper Noun, singular
NNS Noun, plural
NNPS Proper Noun, plural
JJ Adjective
JJR Adjective, comparative
JJS Adjective, superlative
VB Verb, base form
VBD Verb, past tense
VBG Verb, gerund or present participle
VBN Verb, past participle
VBP Verb, non-3rd person singular present
VBZ Verb, 3rd person singular present
Phases
1 Pre-processing
Stanford tagger (the Penn Treebank POS tagger)
Filter elements for lexicon
2 Syntactic analysis
Bootstrap algorithm to count frequencies of words, groups
Normalizing, stemming and lemmatization of words
3 Semantic analysis
Bayesian reasoning to produce concepts and relations
Subsumption hierarchy induction
Hyponym and meronym analysis
4 Representation
Serialize to OWL DL
Pre-processing
Filter
Regex ([a-zA-Z]+[- ]?\w*), Length of a word (2)
Example
1 The mevalonate pathway is comprised of three consecutive reactions
that are catalyzed by the enzymes mevalonate kinase (MK; E.C.
2.7.1.36), phosphomevalonate kinase (PMK; E.C. 2.7.4.2), and
diphosphomevalonate decarboxylase (PDM-DC; E.C. 4.1.1.33).
2 The DT mevalonate JJ pathway NN is VBZ comprised VBN of IN
three CD consecutive JJ reactions NNS that WDT are VBP
catalyzed VBN by IN the DT enzymes NNS mevalonate VBP
kinase NN -LRB- -LRB- MK NNP ; : E.C. NNP 2.7.1.36 CD
-RRB- -RRB- , , phosphomevalonate JJ kinase NN -LRB- -LRB-
PMK NNP ; : E.C. NNP 2.7.4.2 CD -RRB- -RRB- , , and CC
diphosphomevalonate JJ decarboxylase NN -LRB- -LRB-
PDM-DC NN ; : E.C. NNP 4.1.1.33 CD -RRB- -RRB- . .
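The filter above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the regex's lost escape was \w, that "Length of a word (2)" means a minimum length of two characters, and that a word must match the pattern in full. The names filter_lexicon, CONCEPT_TAGS and VERB_TAGS are hypothetical; the tag sets are copied from the lexicon definition.

```python
import re

# Tags admitted to the lexicon, copied from the Penn Treebank subset in the definition.
CONCEPT_TAGS = {"NN", "NNP", "NNS", "NNPS", "JJ", "JJR", "JJS"}
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}

# The slide's pattern, assuming the lost escape was \w.
WORD_RE = re.compile(r"[a-zA-Z]+[- ]?\w*")
MIN_LENGTH = 2  # assumption: "Length of a word (2)" is a minimum length


def filter_lexicon(tagged):
    """Keep (word, tag) pairs whose tag is admitted and whose word passes the filter."""
    kept = []
    for word, tag in tagged:
        if tag not in CONCEPT_TAGS | VERB_TAGS:
            continue  # determiners, numbers, punctuation, ...
        if len(word) < MIN_LENGTH:
            continue
        if WORD_RE.fullmatch(word) is None:
            continue  # assumption: the whole token must match
        kept.append((word, tag))
    return kept


# A few tokens from the tagged example sentence.
tagged = [("The", "DT"), ("mevalonate", "JJ"), ("pathway", "NN"), ("is", "VBZ"),
          ("comprised", "VBN"), ("of", "IN"), ("three", "CD"), ("MK", "NNP"),
          ("E.C.", "NNP"), ("2.7.1.36", "CD"), ("PDM-DC", "NN")]
lexicon = filter_lexicon(tagged)
```

Under these assumptions, tokens such as "E.C." are dropped because of the embedded periods, while hyphenated abbreviations such as "PDM-DC" are kept.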
Syntactic analysis
Bootstrap
1 d_i (d_i ∈ C) for i = 1, 2, 3, ...
2 From d_i read each sentence s_j using OpenNLP
(s_j ∈ d_i for j = 1, 2, 3, ...)
3 Generate lexicon L according to the definition of lexicon
4 Each lexis w_k ∈ L is normalized: its lemma or stem is found using
WordNet 3.0
5 Candidate semantic groups g_l using N-Gram model for lexis w_k
[SJB09]
6 Candidate binary relationships v_i(g_j, g_k), v_i, g_k ∈ L, using pattern
(NW OW VW NW OW )∗
∗ ∗ ∗ ∗
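The first four bootstrap steps amount to a nested loop over documents, sentences and words that accumulates frequencies. The sketch below shows only that loop shape. It is an illustration under stated assumptions: a naive regex splitter stands in for OpenNLP, lowercasing stands in for the WordNet lemma lookup, and the function names are hypothetical.

```python
import re
from collections import Counter


def split_sentences(document):
    """Naive stand-in for a sentence detector: split after ., ! or ? plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]


def normalize(word):
    """Placeholder normalization: lowercase only (no lemma or stem lookup)."""
    return word.lower()


def bootstrap_counts(corpus):
    """Count normalized word frequencies over every sentence of every document."""
    counts = Counter()
    for document in corpus:                                  # step 1: d_i in C
        for sentence in split_sentences(document):           # step 2: s_j in d_i
            for word in re.findall(r"[a-zA-Z]+", sentence):  # crude tokenizer
                counts[normalize(word)] += 1                 # steps 3 and 4
    return counts


corpus = ["Kinase assays measure activity. The kinase is inhibited.",
          "Assays report kinase activity."]
counts = bootstrap_counts(corpus)
```

The three nested loops mirror the corpus, sentence and word levels that the bootstrap iterates over.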
N-Gram model
3-Gram model
4-Gram model
Probability
P(w_i | g_j) where i > 0, j > 0
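One way to estimate P(w_i | g_j) from counts is by relative frequency. The sketch below assumes that the group of a word is the N-1 words preceding it in the sentence and that the probability is the maximum-likelihood ratio count(g, w) / count(g). The slides do not spell out either choice, so both are assumptions, and the function names are hypothetical.

```python
from collections import Counter


def ngram_counts(sentences, n=3):
    """Count (group, word) pairs, taking the n-1 words before each word as its group."""
    pair_counts = Counter()
    group_counts = Counter()
    for words in sentences:
        for i in range(n - 1, len(words)):
            group = tuple(words[i - n + 1:i])
            pair_counts[(group, words[i])] += 1
            group_counts[group] += 1
    return pair_counts, group_counts


def conditional_probability(word, group, pair_counts, group_counts):
    """Relative-frequency estimate of P(word | group); 0.0 for an unseen group."""
    total = group_counts[group]
    return pair_counts[(group, word)] / total if total else 0.0


sentences = [
    ["high", "throughput", "screening"],
    ["high", "throughput", "assay"],
    ["high", "throughput", "screening"],
]
pairs, groups = ngram_counts(sentences, n=3)
p = conditional_probability("screening", ("high", "throughput"), pairs, groups)
```

Here the group ("high", "throughput") occurs three times and is followed by "screening" twice, so the estimate is 2/3. Switching between the 3-gram and 4-gram models only changes n.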
T-Box subsumption model
Subsumption model
[Figure: subsumption model with Bayesian networks BN1 to BN5, words w1 to w5 and groups g1 to g4]
T-Box relations model
Relations model
Semantic mapping
p(C_1, C_2 | V) = p(C_1 | V) p(C_2 | V) → V(C_1, C_2)
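The factorization treats C_1 and C_2 as conditionally independent given the verb V, so a candidate relation V(C_1, C_2) can be scored from two per-verb frequency tables. The sketch below is a hedged illustration: the (verb, c1, c2) observation format, the relative-frequency estimates and the name relation_scores are assumptions, not details taken from the slides.

```python
from collections import Counter


def relation_scores(observations):
    """Score V(C1, C2) as p(C1 | V) * p(C2 | V) from (verb, c1, c2) observations."""
    verb_counts = Counter()
    first_counts = Counter()   # (verb, c1) occurrences
    second_counts = Counter()  # (verb, c2) occurrences
    for verb, c1, c2 in observations:
        verb_counts[verb] += 1
        first_counts[(verb, c1)] += 1
        second_counts[(verb, c2)] += 1

    scores = {}
    for verb, c1, c2 in set(observations):
        p_c1 = first_counts[(verb, c1)] / verb_counts[verb]
        p_c2 = second_counts[(verb, c2)] / verb_counts[verb]
        scores[(verb, c1, c2)] = p_c1 * p_c2
    return scores


observations = [
    ("inhibit", "compound", "kinase"),
    ("inhibit", "compound", "kinase"),
    ("inhibit", "compound", "pathway"),
    ("inhibit", "antibody", "kinase"),
]
scores = relation_scores(observations)
```

With these toy observations, inhibit(compound, kinase) scores 0.75 × 0.75 = 0.5625, while the two rarer combinations score 0.1875 each.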
Semantic analysis & representation
Semantics
1 Calculate probabilities
2 T-Box subsumption model. Pruning parameter KF
3 T-Box relations model. Pruning parameter RF
4 Antonymy pruning
5 Subsumption hierarchy induction
6 Hyponym and meronym analysis using WordNet-recognizable words
7 Serialize models to OWL DL
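The pruning and serialization steps can be illustrated with a small sketch. The slides do not say how the pruning parameters KF and RF are applied, so the code assumes a simple minimum-probability threshold. It also assumes that each surviving (sub, super) link becomes an rdfs:subClassOf axiom, and it writes Turtle by hand rather than going through Jena. The base namespace and the function names are hypothetical.

```python
def prune(probabilities, threshold):
    """Keep links whose probability reaches the threshold (assumed use of KF)."""
    return {link: p for link, p in probabilities.items() if p >= threshold}


def to_turtle(links, base="http://example.org/onto#"):
    """Emit each (sub, super) link as OWL classes joined by rdfs:subClassOf, in Turtle."""
    lines = [
        "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        f"@prefix : <{base}> .",
        "",
    ]
    # Declare every class once, then state the subsumption axioms.
    names = sorted({name for link in links for name in link})
    for name in names:
        lines.append(f":{name} a owl:Class .")
    for sub, sup in sorted(links):
        lines.append(f":{sub} rdfs:subClassOf :{sup} .")
    return "\n".join(lines)


probabilities = {("kinase", "enzyme"): 0.8, ("kinase", "pathway"): 0.1}
kept = prune(probabilities, threshold=0.5)
turtle = to_turtle(kept)
```

The resulting text can be loaded by any RDF toolkit and queried with SPARQL. A real pipeline would also need to sanitize class names that are not valid Turtle local names.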
Example: T-Box Subsumption
Example
Example: T-Box Relations
Example
Example - Subsumption hierarchy induction
Example
Datasets
1 PubChem assays, a large public high-throughput screening dataset
[BAO09] (primary, qualitative evaluation). (Semantic Web Challenge
2010, http://bioassayontology.org)
2 Sample collection of 218 web pages extracted from the University of
Miami, Dept. of Computer Science (www.cs.miami.edu) domain
(quantitative evaluation)
3 Sample collection of 38 pdf files from ISWC 2009 proceedings
(secondary)
Dataset: www.cs.miami.edu domain
Dataset
Title                Statistics  Description
Documents            218         All documents are XHTML, formatted with a given template
Unique ConceptWords  5,384       Normalized candidate concept words from NN, NNP, NNS, JJ, JJR & JJS using [a-zA-Z]+[- ]?\w*
Unique Verbs         835         Normalized verbs from VB, VBD, VBG, VBN, VBP & VBZ using [a-zA-Z]+[- ]?\w*
Total ConceptWords   39,455
Total Verbs          4,797
Total Lexicon        44,252      L = ConceptWords ∪ Verbs
Total Groups         39,455
Dataset: PubChem dataset (primary)
Dataset
Title                Statistics  Description
Documents            1,759       All documents are XHTML, formatted with a given template
Unique ConceptWords  13,017      Normalized candidate concept words from NN, NNP, NNS, JJ, JJR & JJS using [a-zA-Z]+[- ]?\w*
Unique Verbs         1,337       Normalized verbs from VB, VBD, VBG, VBN, VBP & VBZ using [a-zA-Z]+[- ]?\w*
Total ConceptWords   631,623
Total Verbs          109,421
Total Lexicon        741,044     L = ConceptWords ∪ Verbs
Total Groups         631,623
Dataset: BioAssay ontology dataset (primary)
Evaluation: qualitative
Availability of ground truth
Domain expert evaluation (Prof. Stephan Schuerer)
Results for 3-gram
Rich vocabulary
Good structure
Suitable as a seeding ontology to influence domain experts' decisions
Discussion
NLP expressions and our expression. Semantic attachment
Substantial amount of data
Distinction between concepts and individuals of the concepts
WordNet unrecognizable words. Porter stemming algorithm.
Complexity
Syntactic layer: O(M × max(sj) × max(wk))
Semantic layer: O(|L| × |SuperConcepts|)
Representation layer: complexity of Jena object model serializer
Pellet and FaCT++ reasoner output
Summary & Future work
Summary
Goal: The construction of an ontology for a random corpus
Achievement: Seed ontology construction for a random text corpus
Probabilistic reasoning to classify lexico-semantic structures
Future work
Inclusion of a set of English grammar rules to the N-gram models to
get variable window sizes
Extract information from other sources to provide human-readable
concepts and roles
Computational lexical semantics
Expand the scope by adding more Penn Treebank tags
Questions
UMLS. Unified Medical Language System http://www.nlm.nih.gov/research/umls/ , 2009
Alfresco Share Team. Alfresco BioAssayOntology University of Miami, http://share.ccs.miami.edu/share/page/site-index , 2009.
T. Berners-Lee. Linked Data W3C Design Issues, 2006.
J. Volker, P. Haase and P. Hitzler Learning Expressive Ontologies Volume 2 Studies on the Semantic Web, 2009
T. Mitchell. Populating the Semantic Web by Macro-Reading Internet Text ISWC Keynote, 2009.
A. Maedche and S. Staab The TEXT-TO-ONTO Ontology Learning Environment, ICCS, 2000.
A. Maedche. Ontology Learning for the Semantic Web, Kluwer Academic Publishers,2002.
S. Staab and R. Studer. Handbook on Ontologies International Handbooks on Information Systems, 2009
A. Maedche and R. Volz The Ontology Extraction and Maintenance Framework Text-To-Onto, In Proceedings of the ICDM'01
Workshop on Integrating Data Mining and Knowledge Management, 2001
P. Cimiano and J. Volker Text2Onto A framework for Ontology Learning and Data-driven Change Discovery, Proceedings of
the 10th International Conference on Applications of Natural Language to Information Systems (NLDB), volume 3513 of LNCS,
pages 227-238, Alicante, Spain, Springer, 2005
P. Haase and J. Volker Ontology Learning and Reasoning - Dealing with Uncertainty and Inconsistency, Proceedings of the
Workshop on Uncertainty Reasoning for the Semantic Web (URSW), pages 45-55, 2005
P. Clark and P. Harrison. Large-Scale Extraction and Use of Knowledge from Text, K-Cap, 2009.
D.S. Kim, K. Barker, and B. Porter. Knowledge Integration Across Multiple Texts, K-Cap, 2009.
L. Schubert. Can We Derive General World Knowledge from Texts? K-Cap, 2009.
H.C. Cankaya and D. Moldovan. Method for Extracting Commonsense Knowledge K-Cap, 2009.
J. Bos and K. Markert. Recognising textual entailment with logical inference, Proceedings of the conference on Human
Language Technology and Empirical Methods in Natural Language Processing, Vancouver, British Columbia, Canada Pages: 628
- 635, 2005.
C. Chemudugunta, A. Holloway, P. Smyth and M. Steyvers Modeling Documents by Combining Semantic Concepts with
Unsupervised Statistical Learning, LNCS 5318, 2008.
L.B. Marinho, K. Buza and L. Schmidt-Thieme Folksonomy-Based Collabulary Learning, LNCS 5318, 2008.
J.L. Ambite, S. Darbha, A. Goel, C.A. Knoblock,K. Lerman, R. Parundekar, T. Russ Automatically Constructing Semantic Web
Services from Online Sources, LNCS 5823, 2009.
S. Russell and P. Norvig Artificial Intelligence, A Modern Approach, 2nd ed. Prentice Hall Series in Artificial Intelligence, 2001.
S. Banerjee and T. Pedersen The Design, Implementation and Use of the Ngram Statistics Package, LNCS 2588, 2009.
T. Pedersen, M. Kayaalp and R. Bruce Significant Lexical Relationships, 13th National Conference on Artificial Intelligence,
1996.
Open NLP, http://opennlp.sourceforge.net/ ,2009.
WordNet 3.0. http://wordnet.princeton.edu/, 2009.
A. Ghazvinian, N.F. Noy, C. Jonquet, N. Shah and M.A. Musen. What Four Million Mappings Can Tell You about Two Hundred
Ontologies, LNCS 5823, 2009.
J. Dolby, A. Fokoue, A. Kalyanpur, E. Schonberg and K. Srinivas Extracting Enterprise Vocabularies Using Linked Open Data,
LNCS 5823, 2009.
R. Snow, D. Jurafsky, A.Y. Ng Semantic Taxonomy Induction from Heterogeneous Evidence, Proceedings of the 21st
International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational
Linguistics, Sydney, Australia, Pages: 801 - 808, 2006.
R. Wang and W.W. Cohen Language-Independent Set Expansion of Named Entities using the Web, Proceedings of the 2007
Seventh IEEE International Conference on Data Mining, 2007.
R. Wang and W.W. Cohen Iterative Set Expansion of Named Entities using the Web, 2008 8th International Conference on
Data Mining, 2008.
R. Wang and W.W. Cohen Automatic Set Instance Extraction using the Web, In Proceedings of the 47th Annual Meeting of
the ACL and the 4th IJCNLP of the AFNLP, 2009
SPARQL. SPARQL Query Language for RDF, W3C Recommendation 15 January 2008,
http://www.w3.org/TR/rdf-sparql-query/, 2008.
Jena. A Semantic Web Framework for Java, http://jena.sourceforge.net/ , 2009.
P. Cimiano. Ontology Learning and Population from Text: Algorithms, Evaluation and Applications, Springer, 2006
P. Mika. Social Networks and the Semantic Web, Springer, 2007.
T.R. Gruber. A Translation Approach to Portable Ontology Specifications, Knowledge Acquisition, 5(2):199-220, 1993.
R. Studer, R. Benjamins and D. Fensel. Knowledge Engineering: Principles and Methods, Data & Knowledge Engineering,
25(1-2):161-198, 1998.
J. Hendler. On beyond ontology, Keynote talk, Second International Semantic Web Conference, 2003.
P. Cimiano, A. Maedche, S. Staab and J. Volker. Ontology Learning, Handbook On Ontologies, 254-267, 2009
D. Koller and A. Levy and A. Pfeffer P-CLASSIC: A tractable probabilistic description logic, In Proceedings of AAAI-97, Pages
390–397, 1997.
Z. Ding and Yun Peng. A Probabilistic Extension to Ontology Language OWL, Proceedings of the 37th Hawaii International
Conference on System Sciences, 2004.
T. Lukasiewicz. Probabilistic description logics for the semantic web, Technical Report Nr. 1843-06-05, Institut für
Informationssysteme, Technische Universität Wien, 2007.
Pellet Pronto. Pellet Pronto, http://pellet.owldl.com/pronto/, 2008.
A. Carlson,J. Betteridge, R. C. Wang,E. R. Hruschka Jr. and T. M. Mitchell. Coupled Semi-Supervised Learning for Information
Extraction, Proceedings of the Third ACM International Conference on Web Search and Data Mining (WSDM 2010), 2010.
P. Cimiano and A. Hotho and S. Staab. Learning Concept Hierarchies from Text Corpora Using Formal Concept Analysis, Journal
of Artificial Intelligence research, Pages 305–339, 2005.
The Penn Treebank Project. The Penn Treebank Project, http://www.cis.upenn.edu/~treebank/, 2010.
HermiT. Reasoning with Large Ontologies, http://www.comlab.ox.ac.uk/projects/HermiT/, 2010.