SlideShare una empresa de Scribd logo
1 de 9
ONTOLOGY EXTRACTION FOR BUSINESS KNOWLEDGE
                   MANAGEMENT

                               Dr. Raymond Y.K. Lau
                         Department of Information Systems
                            City University of Hong Kong
                       Tat Chee Avenue, Kowloon Hong Kong
                            Email:
                                   raylau@cityu.edu.hk
                    Phone: +852 2788 8495 FAX: +852 2788 8694


                                     ABSTRACT

     Ontology plays an important role in capturing and disseminating business
information (e.g., products, services, relationships of businesses) for effective human
computer interactions. However, engineering of domain ontologies is very labor intensive
and time consuming. Some machine learning methods have been explored for automatic
or semi-automatic discovery of domain ontologies. Nevertheless, both the accuracy and
the computational efficiency of these methods need to be improved to support large scale
ontology construction for real-world business applications. This paper illustrates a novel
domain ontology discovery method which exploits contextual information of the
knowledge sources to construct domain ontology with better quality. The proposed
ontology discovery method has been empirically tested in an e-Learning environment and
the experimental results are encouraging.

Keywords: Domain Ontology, Ontology Discovery, Statistical Learning, Knowledge
Management.




                                  INTRODUCTION

Knowledge has been recognized as the most important corporate asset and it is the key
for organizations to achieve sustainable competitive advantage. Knowledge management
is a collection of processes that govern the creation, dissemination, and utilization of
knowledge [15, 16]. To be able to effectively manage the intellectual capital, businesses
need an effective approach to identify and capture information and knowledge about
business processes, products, services, markets, customers, suppliers, and competitors,
and to share this knowledge to improve the organizations’ goal achievement. Ontologies
allow domain knowledge such as products, services, markets, etc. to be captured in an
explicit and formal way such that it can be shared among human and computer systems.
The notion of ontology is becoming very useful in various fields such as intelligent
information extraction and retrieval, cooperative information systems, electronic


                                            1
commerce, and knowledge management [25]. Since Tim BernersLee, the inventor of the
World Wide Web (Web), coined the vision of a Semantic Web [1] in which background
information of Web resources is stored in the form of machine processable metadata, the
proliferation of ontologies has under tremendous growth. The success of Semantic Web
relies heavily on formal ontologies to structure data for comprehensive and transportable
machine understanding [11]. Although there is not a universal consensus on the definition
of ontology, it is generally accepted that ontology is a specification of conceptualization
[7]. In other words, ontology is a formal representation of concepts and their
interrelationships. It provides a view of the world that we wish to represent for some
purposes [18]. Ontology can take the simple form of a taxonomy (i.e., knowledge
encoded in a minimal hierarchical structure) or a vocabulary with standardized machine
interpretable terminology supplemented with natural language definitions. On the other
hand, the notion of ontology can also be used to describe a logical domain theory with
very expressive, complex, and meaningful information. Ontology is often specified in a
declarative form by using semantic markup languages such as RDF and OWL [5].

Although ontologies are useful in many areas, the engineering of ontologies turns out to
be very expensive and time consuming. Therefore, many automatic or semiautomatic
ontology engineering techniques have been proposed. Automated ontology discovery is
vital for the success of ontology engineering because it deals with the knowledge
acquisition bottleneck which is a classical knowledge engineering problem. Although
fully automatic construction of perfect domain ontology is beyond the current stateof-
the-art, we believe that the automatic ontology extraction method illustrated in this paper
can assist ontology engineers to build domain ontology quicker and more accurately.
Some learning techniques have been applied to the extraction of domain ontology [2, 6,
22]. Nevertheless, these methods are still subject to further enhancement in terms of
computational efficiency and accuracy. One of the ways to improve domain ontology
extraction is to exploit contextual information from the knowledge sources such as the
Internet news about business products, services, and markets. As domain ontology
captures domain (context) dependent information, an effective extraction method should
exploit contextual information in order to build relevant ontologies.


    AN OVERVIEW OF THE ONTOLOGY DISCOVERY METHODOLOGY


Figure 1 depicts the proposed methodology of context-sensitive domain ontology
extraction. A text corpus is parsed to analyze the lexico-syntactic elements. For instance,
stop words such as “a, an, the” are removed from the source documents since these words
appear in any contexts and they cannot provide useful information to describe a domain
concept. For our implementation, a stop word file is constructed based on the standard
stop word file used in the SMART retrieval system [20]. Lexical pattern is identified by
applying Part of Speech (POS) tagging to the source documents and then followed by
token stemming based on the Porter stemming algorithm [19]. We refer to the WordNet
lexicon [13] to tag each word during this process. During the linguistic pattern filtering
stage, certain linguistic patterns are extracted based on the specific requirements specified
by the ontology engineers. For example, the ontology engineers may only focus on the
“Noun Noun” and “Adjective Noun” patterns instead of all the linguistic patterns. This is
in fact a good way to gain computational efficiency by reducing the number of patterns
for further statistical analysis. In addition, to extract relevant domain specific concepts,
the appearances of concepts across different domains should be taken into account. The
basic intuition is that a concept frequently appears in a specific domain (corpus) rather
than many different domains is more likely to be a relevant domain concept. The
statistical Token Analysis step employs the information theoretic measure to compute the
cooccurrence statistics of the targeting linguistic patterns. Finally, taxonomy of domain
concepts is developed according to the subsumption based fuzzy computational method.
The details of the proposed ontology extraction method will be discussed in Section 4.


0100090000037c00000002001c00000000000400000003010800050000000b0200000000
050000000c02f806170e040000002e0118001c000000fb021000070000000000bc0200000
0860102022253797374656d0006170e0000f8bf1100fce2d430481a310a0c020000170e00
00040000002d0100001c000000fb02a8ff0000000000009001000000000440001254696d6
573204e657720526f6d616e0000000000000000000000000000000000040000002d01010
00400000002010100040000002d010100050000000902000000020d000000320a6000000
00100040000000000100ef50620b64b00040000002d010000030000000000
                Figure 1: Context Sensitive Ontology Extraction Process




              LEXICO-SYNTACTIC AND STATISTICAL ANALYSIS
After standard document preprocessing such as stop word removal, POS tagging, and
word stemming [21], a windowing process is conducted over the collection of documents.
This makes our method quite different from the approach developed by Sanderson and
Croft [22] which does not take into account the proximity between tokens. The proximity
factor is a key to reduce the number of noisy term relationships. For each document (e.g.,
Net news, Web page, email, etc.), a virtual window of δ words is moved from left to right
one word at a time until the end of a sentence is reached. Within each window, the
statistical information among tokens is collected to develop collocational expressions.
Such a windowing process has successfully been applied to text mining before [9]. The
windowing process is repeated for each document until the entire collection has been
processed. According to previous studies, a text window of 5 to 10 terms is effective [8,
17], and so we adopt this range as the basis to perform our windowing process. To
improve computational efficiency and filter noisy relations, only the specific linguistic
pattern (e.g., Noun Noun, and Adjective Noun) defined by an ontology engineer will be
analyzed.
For statistical token analysis, Mutual Information (MI) is adopted as the basic
computational method. Mutual Information has been applied to collocational analysis


                                             3
[17, 24] in previous research. Mutual Information is an information theoretic method to
compute the dependency between two entities and is defined by [23]:

                                                                                   Pr( ti , t j )
                                                MI ( ti , t j ) = log 2
                                                                                Pr( ti ) Pr(t j )
                                                                                                    (1)
        MI (t i , t j )                                                       Pr(ti , t j )
where              is the mutual information between term ti and term tj.                   is the
joint probability that both terms appear in a text window, and Pr(t i ) is the probability
                                                                                              wt
that a term ti appears in a text window. The probability Pr(t i ) is estimated based on w
where wt is the number of windows containing the term t and |w| the total number of
                                                                             Pr(ti , t j )
windows constructed from a textual database (i.e., a collection). Similarly,               is the
fraction of the number of windows containing both terms out of the total number of
windows.


We propose Balanced Mutual Information (BMI) to compute the association weights
among tokens. This method considers both term presence and term absence as the
evidence of the implicit term relationships.


                Ass(t i , t j ) ≈ BMI (t i , t j )
                                                Pr(t i , t j )
               = α × Pr(t i , t j ) log 2 (                       + 1) +
                                              Pr(t i ) Pr(t j )
                                                          Pr( ¬t i , ¬t j )
               (1 − α ) × Pr( ¬t i , ¬t j ) log 2 (                             + 1)
                                                        Pr( ¬t i ) Pr( ¬t j )
                                                                                                    (2)


where Ass(ti, tj) is the association weight between term ti and term tj. Such an association
                                                Pr(ti , t j )
value is approximated by the BMI score.                       is the joint probability that both terms
                                   Pr( ¬t i , ¬t j )
appear in a text window, and                         is the joint probability that both terms are
absent in a text window. The factor α > 0.5 is a weight assigned to the positively
associated mutual information. As it is counterintuitive to have a zero BMI value if two
                                                                                         Pr(t i , t j )
                                                                          Pr(t i ) Pr(t j )
terms always appear Pr(ti,tj ) together in every text window, the faction                   is
adjusted by adding the constant 1 before applying the logarithm. Since our text mining
method is applied after removing stop words, such an adjustment is reasonable to capture
the intuition of significant term co-occurrence. In Eq.(2), each MI value is then
normalized by the corresponding joint probabilities. Only a feature with an association
weight greater than a threshold µ (i.e., Ass(ti, tj) > µ) will be considered a significant
feature for representing a concept in a context vector. After computing all the BMI values
in a collection, these values are subject to linear scaling such that each term association
                                      ∀ Ass(t i , t j ) ∈[0,1]
weight is within the unit interval ti ,tj                      . In should be noted that the
constituent terms of a concept are always implicitly included in the underlying context
vector with a default association weight of 1.
The final stage towards our ontology extraction method is taxonomy generation based on
subsumption relations among extracted concepts. Let Spec(cx, cy) denotes that concept
cx is a specialization (subclass) of another concept cy . The degree of such a
specialization is derived by:



                                           ∑ Ass(t
                                   tx ∈cx ,ty ∈cy ,tx = ty
                                                              x   , c x ) ⊗ Ass(t y , c y )
             Spec( c x , c y ) =
                                                      ∑ Ass(t          x   , cx )
                                                     tx ∈cx                                   (3)

where ⊗ is a standard fuzzy conjunction operator which is equivalent to a minimum
function. The above formula states that the degree of subsumption (specificity) of cx to
cy is based on the ratio of the sum of the minimal association weights of the common
features of the two concepts to the sum of the feature weights of the concept cx. For
instance, if every property term of cx is also a property term of cy , a high specificity
value will be derived. In general, Spec(cx, cy) takes its values from the unit interval [0, 1]
and it is an asymmetric relation. Since it is a fuzzy rather than a crisp relation, Spec(cy ,
cx) may also be true to certain degree. When the taxonomy graph is developed, we only
select the subsumption relation such that Spec(cx, cy) > Spec(cy , cx) and Spec(cx, cy ) >
λ where λ is a threshold to distinguish significant subsumption relations. If Spec(cx, cy) =
Spec(cy , cx) and Spec(cx, cy) > λ is established, the equivalent relation between cx and
cy will be extracted.


                                                      EVALUATION
A small scale experiment of testing the functionality of the prototype system and the
accuracy of the aforementioned domain ontology discovery method has been conducted
under a e-Learning environment. A group of ten undergraduate students were recruited to
try the prototype system; all of these subjects have attended a course in knowledge
management before. At the beginning of the experiment, they attended a briefing session
of fifteen minutes to learn the objective of this experiment and were instructed to write
the most important concepts in knowledge management using concise and precise
statements on an on-line discussion board. They were given thirty minutes to write their
messages. After the message generation session, the coordinator of the experiment would
execute the ontology discovery method to discover the domain knowledge from the on-
line discussion messages which are about the writers' perception of knowledge


                                                                      5
management. For this experiment, only the "Noun Noun" linguistic pattern was specified
and the parameter α = 0.5 was set. Each subject could then view the system generated
ontology on-line. Following the viewing session, a questionnaire was distributed to each
subject to let them assess if the ontology could really reflect their understanding about the
chosen topic. Our questionnaire was developed based on the instrument employed by [3].
It
I included the assessment of the following factors:




  Accuracy - Whether the concepts and relationships shown at the taxonomy are correct;

  Cohesiveness - Whether each concept at the taxonomy is unique and not overlapped
  with one another;
  w




  Isolation - Whether the concepts at the same level are distinguishable and not subsume
  one another;
  o


  Hierarchy - Whether the taxonomy is traversed from broader concepts at the higher
  levels to narrow concepts at the lower level;
  l




  Readability - Whether the concepts at all levels are easy to be comprehended by
  human;


A five point semantic differential scale from very good (5), good (4), average (3), bad
(2), to very poor (1) is used to measure these variables. In general, a score close to 5
indicates that the automatically generated ontology is with good quality and it correctly
reflects the mental state of the subjects. The average scores pertaining to various factors
are shown in Table 1. The overall mean score is 4.3 with a standard deviation of 0.61; the
overall mean score is close to the maximum 5. There are 26 key concepts and 97
relationships spreading at 3 levels. From this initial experiment, it is shown that our
domain ontology discovery method is promising since it can automatically discover the
prominent concepts and their relationships from a set of messages written in natural
language.




                                               Mean STD

                               Accuracy        4.4     0.66
Cohesiveness 4.3         0.46
                               Isolation          4.2   0.60
                               Hierarchy          4.0   0.63
                               Readability        4.5   0.67

                               Overall            4.3   0.61


         Table 1: Qualitative Evaluation of Automatically Discovered Ontology


                                     CONCLUSIONS
The manipulation and exchange of semantically enriched business intelligence (e.g.,
products, services, markets, etc.) can enhance the quality of an eCommerce system and
offer a high level of interoperability among different enterprise systems. Ontology
certainly plays an important role in the formalization of business knowledge. However,
the biggest challenge for the wide spread applications of ontologies is on the construction
of these ontologies because it is a very labor intensive and time consuming process. This
paper illustrates a novel automatic ontology extraction method to facilitate the ontology
engineering process. In particular, contextual information of a domain is exploited so that
more reliable domain ontology can be extracted. The proposed extraction method
combines lexico-syntactic and statistical learning approaches so as to reduce the chance
of generating noisy relations and to improve the computational efficiency. Empirical
studies have been performed to evaluate the quality of the domain ontology extracted by
the proposed ontology discovery method. Our preliminary experiment shows that the
extracted domain ontology can accurately reflect the domain knowledge embedded in on-
line discussion messages. Future work involves comparing the accuracy and the
computational efficiency of our extraction method with that of the other approaches. In
addition, larger scale of quantitative evaluation of our ontology extraction method will be
conducted.



                                     REFERENCES
 [1] T. BernersLee, J. Hendler, and O. Lassila. The semantic web. Scientific American,
    284(5):34–43, 2001.
 [2] Shan Chen, Damminda Alahakoon, and Maria Indrawan. Background knowledge driven
     ontology discovery. In Proceedings of the 2005 IEEE International Conference on e-
     Technology, eCommerce and eService, pages 202–207, 2005.
 [3] S. Chuang and L. Chien. Taxonomy generation for text segments: A practical webbased
     approach. ACM Transactions on Information Systems, 23(4):363–396, 2005.
 [4] P. Cimiano, A. Hotho, and S. Staab. Learning concept hierarchies from text corpora using


                                              7
formal concept analysis. Journal of Artificial Intelligence Research, 24:305–339, 2005.
 [5] The World Wide Web Consortium. Web Ontology Language, 2004. Available from
    http://www.w3.org/2004/OWL/.
 [6] Michael Dittenbach, Helmut Berger, and Dieter Merkl. Improving domain ontologies by
     mining semantics from text. In Proceedings of the First AsiaPacific Conference on
     Conceptual Modelling (APCCM2004), pages 91–100, 2004.
 [7] T. R. Gruber. A translation approach to portable ontology specifications. Knowledge
     Acquisition, 5(2):199–220, 1993.
 [8] Hongyan Jing and Evelyne Tzoukermann.        Information retrieval based on context
     distance and morphology. In Proceedings of the 22nd Annual International ACM SIGIR
     Conference on Research and Development in Information Retrieval, Language Analysis,
     pages 90–96, 1999.
 [9] R.Y.K. Lau. ContextSensitive Text Mining and Belief Revision for Intelligent Information
     Retrieval on the Web. Web Intelligence and Agent Systems An International Journal, 1(3-
     4):1–22, 2003.
[10] Taehee Lee, Ig hoon Lee, Suekyung Lee, Sang goo Lee, Dongkyu Kim, Jonghoon Chun,
     Hyunja Lee, and Junho Shim. Building an operational product ontology system. Electronic
     Commerce Research and Applications, 5(1):16–28, 2006.
[11] Alexander Maedche and Steffen Staab. Ontology learning for the semantic web. IEEE
    Intelligent Systems, 16(2):72–79, 2001.
[12] Alexander Maedche and Steffen Staab. Ontology learning. In Handbook on Ontologies,
     pages 173– 190. 2004.
[13] G. A. Miller, Beckwith R., C. Fellbaum, D. Gross, and K. J. Miller. Introduction to wordnet:
     An online lexical database.     Journal of Lexicography, 3(4):234–244, 1990.
[14] Roberto Navigli, Paola Velardi, and Aldo Gangemi.     Ontology     learning   and     its
     application to automated terminology translation. IEEE Intelligent Systems, 18(1):22–31,
     2003.
[15] I. Nonaka. A dynamic theory of organizational knowledge creation. Organization Science,
     5(1):14–37, 1994.
[16] I. Nonaka and H. Takeuchi.   The Knowledge Creating Company: How Japanese
     Companies Create the Dynamics of Innovation. Oxford University Press, New York,
     1995Oxford University Press.
[17] Patrick Perrin and Frederick Petry.     Extraction and representation of contextual
     information for knowledge discovery in texts. Information Sciences, 151:125–152, 2003.
[18] H. S. Pinto and J. P. Martins. Ontologies: How can they be built? Knowledge and
    Information Systems, 6(4):441–464, 2004.
[19] M. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.
[20] G. Salton. Full text information processing using the smart system. Database Engineering
     Bulletin, 13(1):2–9, March 1990.
[21] G. Salton and M.J. McGill. Introduction to Modern Information Retrieval. McGrawHill,
     New York, New York, 1983.
[22] M. Sanderson and B. Croft. Deriving concept hierarchies from text. In Proceedings of the
     22nd Annual International ACM SIGIR Conference on Research and Development in
Information Retrieval, pages 206–213. ACM, 1999.
[23] C. Shannon. A mathematical theory of communication. Bell System Technology Journal,
     27:379–423, 1948.
[24] Mark A. Stairmand. Textual context analysis for information retrieval. In Proceedings of the
     20th Annual International ACM SIGIR Conference on Research and Development in
     Information Retrieval, pages 140–147, 1997.
[25] Christopher A. Welty. Ontology research. AI Magazine, 24(3):11–12, 2003.
[26] Rudolf Wille. Formal concept analysis as mathematical theory of concepts and concept
     hierarchies. In Bernhard Ganter, Gerd Stumme, and Rudolf Wille, editors, Formal Concept
     Analysis, Foundations and Applications, volume 3626, pages 1–33. Springer, 2005.
[27] J. Xu and W.B. Croft. Query expansion using local and global document analysis. In Hans-
     Peter Frei, Donna Harman, Peter Schauble, and Ross Wilkinson, editors, Proceedings of the
     19th Annual International ACM SIGIR Conference on Research and Development in
     Information Retrieval, pages 4–11, Zurich, Switzerland, 1996.




                                                9

Más contenido relacionado

La actualidad más candente

Dr31564567
Dr31564567Dr31564567
Dr31564567
IJMER
 
AUTOMATED INFORMATION RETRIEVAL MODEL USING FP GROWTH BASED FUZZY PARTICLE SW...
AUTOMATED INFORMATION RETRIEVAL MODEL USING FP GROWTH BASED FUZZY PARTICLE SW...AUTOMATED INFORMATION RETRIEVAL MODEL USING FP GROWTH BASED FUZZY PARTICLE SW...
AUTOMATED INFORMATION RETRIEVAL MODEL USING FP GROWTH BASED FUZZY PARTICLE SW...
ijcseit
 
Enriching search results using ontology
Enriching search results using ontologyEnriching search results using ontology
Enriching search results using ontology
IAEME Publication
 

La actualidad más candente (20)

Improved method for pattern discovery in text mining
Improved method for pattern discovery in text miningImproved method for pattern discovery in text mining
Improved method for pattern discovery in text mining
 
Improved method for pattern discovery in text mining
Improved method for pattern discovery in text miningImproved method for pattern discovery in text mining
Improved method for pattern discovery in text mining
 
IRJET- Semantic based Automatic Text Summarization based on Soft Computing
IRJET- Semantic based Automatic Text Summarization based on Soft ComputingIRJET- Semantic based Automatic Text Summarization based on Soft Computing
IRJET- Semantic based Automatic Text Summarization based on Soft Computing
 
A Review on Text Mining in Data Mining
A Review on Text Mining in Data MiningA Review on Text Mining in Data Mining
A Review on Text Mining in Data Mining
 
USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO...
USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO...USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO...
USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO...
 
The impact of standardized terminologies and domain-ontologies in multilingua...
The impact of standardized terminologies and domain-ontologies in multilingua...The impact of standardized terminologies and domain-ontologies in multilingua...
The impact of standardized terminologies and domain-ontologies in multilingua...
 
A Mathematical Approach to Ontology Authoring and Documentation
A Mathematical Approach to Ontology Authoring and DocumentationA Mathematical Approach to Ontology Authoring and Documentation
A Mathematical Approach to Ontology Authoring and Documentation
 
Linked Open (Geo)Data and the Distributed Ontology Language – a perfect match
Linked Open (Geo)Data and the Distributed Ontology Language – a perfect matchLinked Open (Geo)Data and the Distributed Ontology Language – a perfect match
Linked Open (Geo)Data and the Distributed Ontology Language – a perfect match
 
G04124041046
G04124041046G04124041046
G04124041046
 
Dr31564567
Dr31564567Dr31564567
Dr31564567
 
Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...
Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...
Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...
 
A N E XTENSION OF P ROTÉGÉ FOR AN AUTOMA TIC F UZZY - O NTOLOGY BUILDING U...
A N  E XTENSION OF  P ROTÉGÉ FOR AN AUTOMA TIC  F UZZY - O NTOLOGY BUILDING U...A N  E XTENSION OF  P ROTÉGÉ FOR AN AUTOMA TIC  F UZZY - O NTOLOGY BUILDING U...
A N E XTENSION OF P ROTÉGÉ FOR AN AUTOMA TIC F UZZY - O NTOLOGY BUILDING U...
 
AUTOMATED INFORMATION RETRIEVAL MODEL USING FP GROWTH BASED FUZZY PARTICLE SW...
AUTOMATED INFORMATION RETRIEVAL MODEL USING FP GROWTH BASED FUZZY PARTICLE SW...AUTOMATED INFORMATION RETRIEVAL MODEL USING FP GROWTH BASED FUZZY PARTICLE SW...
AUTOMATED INFORMATION RETRIEVAL MODEL USING FP GROWTH BASED FUZZY PARTICLE SW...
 
Enriching search results using ontology
Enriching search results using ontologyEnriching search results using ontology
Enriching search results using ontology
 
Data Integration Ontology Mapping
Data Integration Ontology MappingData Integration Ontology Mapping
Data Integration Ontology Mapping
 
Text Mining at Feature Level: A Review
Text Mining at Feature Level: A ReviewText Mining at Feature Level: A Review
Text Mining at Feature Level: A Review
 
IRJET - Document Comparison based on TF-IDF Metric
IRJET - Document Comparison based on TF-IDF MetricIRJET - Document Comparison based on TF-IDF Metric
IRJET - Document Comparison based on TF-IDF Metric
 
Ontology mapping for the semantic web
Ontology mapping for the semantic webOntology mapping for the semantic web
Ontology mapping for the semantic web
 
Visualizing stemming techniques on online news articles text analytics
Visualizing stemming techniques on online news articles text analyticsVisualizing stemming techniques on online news articles text analytics
Visualizing stemming techniques on online news articles text analytics
 
A novel approach for text extraction using effective pattern matching technique
A novel approach for text extraction using effective pattern matching techniqueA novel approach for text extraction using effective pattern matching technique
A novel approach for text extraction using effective pattern matching technique
 

Destacado

Motivated Machine Learning for Water Resource Management
Motivated Machine Learning for Water Resource ManagementMotivated Machine Learning for Water Resource Management
Motivated Machine Learning for Water Resource Management
butest
 
Applying Support Vector Learning to Stem Cells Classification
Applying Support Vector Learning to Stem Cells ClassificationApplying Support Vector Learning to Stem Cells Classification
Applying Support Vector Learning to Stem Cells Classification
butest
 
Resume(short)
Resume(short)Resume(short)
Resume(short)
butest
 
SHFpublicReportfinal_WP2.doc
SHFpublicReportfinal_WP2.docSHFpublicReportfinal_WP2.doc
SHFpublicReportfinal_WP2.doc
butest
 
LECTURE8.PPT
LECTURE8.PPTLECTURE8.PPT
LECTURE8.PPT
butest
 
STEFANO CARRINO
STEFANO CARRINOSTEFANO CARRINO
STEFANO CARRINO
butest
 
KM.doc
KM.docKM.doc
KM.doc
butest
 
Word accessible - .:: NIB | National Industries for the Blind ::.
Word accessible - .:: NIB | National Industries for the Blind ::.Word accessible - .:: NIB | National Industries for the Blind ::.
Word accessible - .:: NIB | National Industries for the Blind ::.
butest
 
Introduction
IntroductionIntroduction
Introduction
butest
 

Destacado (9)

Motivated Machine Learning for Water Resource Management
Motivated Machine Learning for Water Resource ManagementMotivated Machine Learning for Water Resource Management
Motivated Machine Learning for Water Resource Management
 
Applying Support Vector Learning to Stem Cells Classification
Applying Support Vector Learning to Stem Cells ClassificationApplying Support Vector Learning to Stem Cells Classification
Applying Support Vector Learning to Stem Cells Classification
 
Resume(short)
Resume(short)Resume(short)
Resume(short)
 
SHFpublicReportfinal_WP2.doc
SHFpublicReportfinal_WP2.docSHFpublicReportfinal_WP2.doc
SHFpublicReportfinal_WP2.doc
 
LECTURE8.PPT
LECTURE8.PPTLECTURE8.PPT
LECTURE8.PPT
 
STEFANO CARRINO
STEFANO CARRINOSTEFANO CARRINO
STEFANO CARRINO
 
KM.doc
KM.docKM.doc
KM.doc
 
Word accessible - .:: NIB | National Industries for the Blind ::.
Word accessible - .:: NIB | National Industries for the Blind ::.Word accessible - .:: NIB | National Industries for the Blind ::.
Word accessible - .:: NIB | National Industries for the Blind ::.
 
Introduction
IntroductionIntroduction
Introduction
 

Similar a 2007bai7604.doc.doc

Concept integration using edit distance and n gram match
Concept integration using edit distance and n gram match Concept integration using edit distance and n gram match
Concept integration using edit distance and n gram match
ijdms
 
Ck32985989
Ck32985989Ck32985989
Ck32985989
IJMER
 
Adaptive information extraction
Adaptive information extractionAdaptive information extraction
Adaptive information extraction
unyil96
 
Association Rule Mining Based Extraction of Semantic Relations Using Markov L...
Association Rule Mining Based Extraction of Semantic Relations Using Markov L...Association Rule Mining Based Extraction of Semantic Relations Using Markov L...
Association Rule Mining Based Extraction of Semantic Relations Using Markov L...
IJwest
 
Text mining open source tokenization
Text mining open source tokenizationText mining open source tokenization
Text mining open source tokenization
aciijournal
 
SEMANTIC INTEGRATION FOR AUTOMATIC ONTOLOGY MAPPING
SEMANTIC INTEGRATION FOR AUTOMATIC ONTOLOGY MAPPING SEMANTIC INTEGRATION FOR AUTOMATIC ONTOLOGY MAPPING
SEMANTIC INTEGRATION FOR AUTOMATIC ONTOLOGY MAPPING
cscpconf
 
Towards From Manual to Automatic Semantic Annotation: Based on Ontology Eleme...
Towards From Manual to Automatic Semantic Annotation: Based on Ontology Eleme...Towards From Manual to Automatic Semantic Annotation: Based on Ontology Eleme...
Towards From Manual to Automatic Semantic Annotation: Based on Ontology Eleme...
IJwest
 
Text Mining: open Source Tokenization Tools � An Analysis
Text Mining: open Source Tokenization Tools � An AnalysisText Mining: open Source Tokenization Tools � An Analysis
Text Mining: open Source Tokenization Tools � An Analysis
aciijournal
 

Similar a 2007bai7604.doc.doc (20)

Concept integration using edit distance and n gram match
Concept integration using edit distance and n gram match Concept integration using edit distance and n gram match
Concept integration using edit distance and n gram match
 
Ck32985989
Ck32985989Ck32985989
Ck32985989
 
Adaptive information extraction
Adaptive information extractionAdaptive information extraction
Adaptive information extraction
 
Mining knowledge graphs to map heterogeneous relations between the internet o...
Mining knowledge graphs to map heterogeneous relations between the internet o...Mining knowledge graphs to map heterogeneous relations between the internet o...
Mining knowledge graphs to map heterogeneous relations between the internet o...
 
Tools for Ontology Building from Texts: Analysis and Improvement of the Resul...
Tools for Ontology Building from Texts: Analysis and Improvement of the Resul...Tools for Ontology Building from Texts: Analysis and Improvement of the Resul...
Tools for Ontology Building from Texts: Analysis and Improvement of the Resul...
 
AN ONTOLOGY-BASED DATA WAREHOUSE FOR THE GRAIN TRADE DOMAIN
AN ONTOLOGY-BASED DATA WAREHOUSE FOR THE GRAIN TRADE DOMAINAN ONTOLOGY-BASED DATA WAREHOUSE FOR THE GRAIN TRADE DOMAIN
AN ONTOLOGY-BASED DATA WAREHOUSE FOR THE GRAIN TRADE DOMAIN
 
A Model For Semantic Annotation Of Environmental Resources The Tatoo Semanti...
A Model For Semantic Annotation Of Environmental Resources  The Tatoo Semanti...A Model For Semantic Annotation Of Environmental Resources  The Tatoo Semanti...
A Model For Semantic Annotation Of Environmental Resources The Tatoo Semanti...
 
Association Rule Mining Based Extraction of Semantic Relations Using Markov L...
Association Rule Mining Based Extraction of Semantic Relations Using Markov L...Association Rule Mining Based Extraction of Semantic Relations Using Markov L...
Association Rule Mining Based Extraction of Semantic Relations Using Markov L...
 
An adaptation of Text2Onto for supporting the French language
An adaptation of Text2Onto for supporting  the French language An adaptation of Text2Onto for supporting  the French language
An adaptation of Text2Onto for supporting the French language
 
SWSN UNIT-3.pptx we can information about swsn professional
SWSN UNIT-3.pptx we can information about swsn professionalSWSN UNIT-3.pptx we can information about swsn professional
SWSN UNIT-3.pptx we can information about swsn professional
 
Recruitment Based On Ontology with Enhanced Security Features
Recruitment Based On Ontology with Enhanced Security FeaturesRecruitment Based On Ontology with Enhanced Security Features
Recruitment Based On Ontology with Enhanced Security Features
 
Association Rule Mining Based Extraction of Semantic Relations Using Markov ...
Association Rule Mining Based Extraction of  Semantic Relations Using Markov ...Association Rule Mining Based Extraction of  Semantic Relations Using Markov ...
Association Rule Mining Based Extraction of Semantic Relations Using Markov ...
 
Ontology Construction from Text: Challenges and Trends
Ontology Construction from Text: Challenges and TrendsOntology Construction from Text: Challenges and Trends
Ontology Construction from Text: Challenges and Trends
 
Ontology learning techniques and applications computer science thesis writing...
Ontology learning techniques and applications computer science thesis writing...Ontology learning techniques and applications computer science thesis writing...
Ontology learning techniques and applications computer science thesis writing...
 
Text mining open source tokenization
Text mining open source tokenizationText mining open source tokenization
Text mining open source tokenization
 
TEXT MINING: OPEN SOURCE TOKENIZATION TOOLS – AN ANALYSIS
TEXT MINING: OPEN SOURCE TOKENIZATION TOOLS – AN ANALYSISTEXT MINING: OPEN SOURCE TOKENIZATION TOOLS – AN ANALYSIS
TEXT MINING: OPEN SOURCE TOKENIZATION TOOLS – AN ANALYSIS
 
SEMANTIC INTEGRATION FOR AUTOMATIC ONTOLOGY MAPPING
SEMANTIC INTEGRATION FOR AUTOMATIC ONTOLOGY MAPPING SEMANTIC INTEGRATION FOR AUTOMATIC ONTOLOGY MAPPING
SEMANTIC INTEGRATION FOR AUTOMATIC ONTOLOGY MAPPING
 
Towards From Manual to Automatic Semantic Annotation: Based on Ontology Eleme...
Towards From Manual to Automatic Semantic Annotation: Based on Ontology Eleme...Towards From Manual to Automatic Semantic Annotation: Based on Ontology Eleme...
Towards From Manual to Automatic Semantic Annotation: Based on Ontology Eleme...
 
Text Mining: open Source Tokenization Tools � An Analysis
Text Mining: open Source Tokenization Tools � An AnalysisText Mining: open Source Tokenization Tools � An Analysis
Text Mining: open Source Tokenization Tools � An Analysis
 
IDENTIFYING THE SEMANTIC RELATIONS ON UNSTRUCTURED DATA
IDENTIFYING THE SEMANTIC RELATIONS ON UNSTRUCTURED DATAIDENTIFYING THE SEMANTIC RELATIONS ON UNSTRUCTURED DATA
IDENTIFYING THE SEMANTIC RELATIONS ON UNSTRUCTURED DATA
 

Más de butest

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBE
butest
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同
butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
butest
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jackson
butest
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
butest
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer II
butest
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
butest
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.doc
butest
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1
butest
 
Facebook
Facebook Facebook
Facebook
butest
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...
butest
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...
butest
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENT
butest
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.doc
butest
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.doc
butest
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.doc
butest
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!
butest
 

Más de butest (20)

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBE
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jackson
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer II
 
PPT
PPTPPT
PPT
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.doc
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1
 
Facebook
Facebook Facebook
Facebook
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENT
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.doc
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.doc
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.doc
 
hier
hierhier
hier
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!
 

2007bai7604.doc.doc

  • 1. ONTOLOGY EXTRACTION FOR BUSINESS KNOWLEDGE MANAGEMENT Dr. Raymond Y.K. Lau Department of Information Systems City University of Hong Kong Tat Chee Avenue, Kowloon Hong Kong Email: raylau@cityu.edu.hk Phone: +852 2788 8495 FAX: +852 2788 8694 ABSTRACT Ontology plays an important role in capturing and disseminating business information (e.g., products, services, relationships of businesses) for effective human computer interactions. However, engineering of domain ontologies is very labor intensive and time consuming. Some machine learning methods have been explored for automatic or semi-automatic discovery of domain ontologies. Nevertheless, both the accuracy and the computational efficiency of these methods need to be improved to support large scale ontology construction for real-world business applications. This paper illustrates a novel domain ontology discovery method which exploits contextual information of the knowledge sources to construct domain ontology with better quality. The proposed ontology discovery method has been empirically tested in an e-Learning environment and the experimental results are encouraging. Keywords: Domain Ontology, Ontology Discovery, Statistical Learning, Knowledge Management. INTRODUCTION Knowledge has been recognized as the most important corporate asset and it is the key for organizations to achieve sustainable competitive advantage. Knowledge management is a collection of processes that govern the creation, dissemination, and utilization of knowledge [15, 16]. To be able to effectively manage the intellectual capital, businesses need an effective approach to identify and capture information and knowledge about business processes, products, services, markets, customers, suppliers, and competitors, and to share this knowledge to improve the organizations’ goal achievement. Ontologies allow domain knowledge such as products, services, markets, etc. to be captured in an explicit and formal way such that it can be shared among human and computer systems. The notion of ontology is becoming very useful in various fields such as intelligent information extraction and retrieval, cooperative information systems, electronic 1
  • 2. commerce, and knowledge management [25]. Since Tim BernersLee, the inventor of the World Wide Web (Web), coined the vision of a Semantic Web [1] in which background information of Web resources is stored in the form of machine processable metadata, the proliferation of ontologies has under tremendous growth. The success of Semantic Web relies heavily on formal ontologies to structure data for comprehensive and transportable machine understanding [11]. Although there is not a universal consensus on the definition of ontology, it is generally accepted that ontology is a specification of conceptualization [7]. In other words, ontology is a formal representation of concepts and their interrelationships. It provides a view of the world that we wish to represent for some purposes [18]. Ontology can take the simple form of a taxonomy (i.e., knowledge encoded in a minimal hierarchical structure) or a vocabulary with standardized machine interpretable terminology supplemented with natural language definitions. On the other hand, the notion of ontology can also be used to describe a logical domain theory with very expressive, complex, and meaningful information. Ontology is often specified in a declarative form by using semantic markup languages such as RDF and OWL [5]. Although ontologies are useful in many areas, the engineering of ontologies turns out to be very expensive and time consuming. Therefore, many automatic or semiautomatic ontology engineering techniques have been proposed. Automated ontology discovery is vital for the success of ontology engineering because it deals with the knowledge acquisition bottleneck which is a classical knowledge engineering problem. Although fully automatic construction of perfect domain ontology is beyond the current stateof- the-art, we believe that the automatic ontology extraction method illustrated in this paper can assist ontology engineers to build domain ontology quicker and more accurately. Some learning techniques have been applied to the extraction of domain ontology [2, 6, 22]. Nevertheless, these methods are still subject to further enhancement in terms of computational efficiency and accuracy. One of the ways to improve domain ontology extraction is to exploit contextual information from the knowledge sources such as the Internet news about business products, services, and markets. As domain ontology captures domain (context) dependent information, an effective extraction method should exploit contextual information in order to build relevant ontologies. AN OVERVIEW OF THE ONTOLOGY DISCOVERY METHODOLOGY Figure 1 depicts the proposed methodology of context-sensitive domain ontology extraction. A text corpus is parsed to analyze the lexico-syntactic elements. For instance, stop words such as “a, an, the” are removed from the source documents since these words appear in any contexts and they cannot provide useful information to describe a domain concept. For our implementation, a stop word file is constructed based on the standard stop word file used in the SMART retrieval system [20]. Lexical pattern is identified by applying Part of Speech (POS) tagging to the source documents and then followed by token stemming based on the Porter stemming algorithm [19]. We refer to the WordNet lexicon [13] to tag each word during this process. During the linguistic pattern filtering
  • 3. stage, certain linguistic patterns are extracted based on the specific requirements specified by the ontology engineers. For example, the ontology engineers may only focus on the “Noun Noun” and “Adjective Noun” patterns instead of all the linguistic patterns. This is in fact a good way to gain computational efficiency by reducing the number of patterns for further statistical analysis. In addition, to extract relevant domain specific concepts, the appearances of concepts across different domains should be taken into account. The basic intuition is that a concept frequently appears in a specific domain (corpus) rather than many different domains is more likely to be a relevant domain concept. The statistical Token Analysis step employs the information theoretic measure to compute the cooccurrence statistics of the targeting linguistic patterns. Finally, taxonomy of domain concepts is developed according to the subsumption based fuzzy computational method. The details of the proposed ontology extraction method will be discussed in Section 4. 0100090000037c00000002001c00000000000400000003010800050000000b0200000000 050000000c02f806170e040000002e0118001c000000fb021000070000000000bc0200000 0860102022253797374656d0006170e0000f8bf1100fce2d430481a310a0c020000170e00 00040000002d0100001c000000fb02a8ff0000000000009001000000000440001254696d6 573204e657720526f6d616e0000000000000000000000000000000000040000002d01010 00400000002010100040000002d010100050000000902000000020d000000320a6000000 00100040000000000100ef50620b64b00040000002d010000030000000000 Figure 1: Context Sensitive Ontology Extraction Process LEXICO-SYNTACTIC AND STATISTICAL ANALYSIS After standard document preprocessing such as stop word removal, POS tagging, and word stemming [21], a windowing process is conducted over the collection of documents. This makes our method quite different from the approach developed by Sanderson and Croft [22] which does not take into account the proximity between tokens. The proximity factor is a key to reduce the number of noisy term relationships. For each document (e.g., Net news, Web page, email, etc.), a virtual window of δ words is moved from left to right one word at a time until the end of a sentence is reached. Within each window, the statistical information among tokens is collected to develop collocational expressions. Such a windowing process has successfully been applied to text mining before [9]. The windowing process is repeated for each document until the entire collection has been processed. According to previous studies, a text window of 5 to 10 terms is effective [8, 17], and so we adopt this range as the basis to perform our windowing process. To improve computational efficiency and filter noisy relations, only the specific linguistic pattern (e.g., Noun Noun, and Adjective Noun) defined by an ontology engineer will be analyzed. For statistical token analysis, Mutual Information (MI) is adopted as the basic computational method. Mutual Information has been applied to collocational analysis 3
  • 4. [17, 24] in previous research. Mutual Information is an information theoretic method to compute the dependency between two entities and is defined by [23]: Pr( ti , t j ) MI ( ti , t j ) = log 2 Pr( ti ) Pr(t j ) (1) MI (t i , t j ) Pr(ti , t j ) where is the mutual information between term ti and term tj. is the joint probability that both terms appear in a text window, and Pr(t i ) is the probability wt that a term ti appears in a text window. The probability Pr(t i ) is estimated based on w where wt is the number of windows containing the term t and |w| the total number of Pr(ti , t j ) windows constructed from a textual database (i.e., a collection). Similarly, is the fraction of the number of windows containing both terms out of the total number of windows. We propose Balanced Mutual Information (BMI) to compute the association weights among tokens. This method considers both term presence and term absence as the evidence of the implicit term relationships. Ass(t i , t j ) ≈ BMI (t i , t j ) Pr(t i , t j ) = α × Pr(t i , t j ) log 2 ( + 1) + Pr(t i ) Pr(t j ) Pr( ¬t i , ¬t j ) (1 − α ) × Pr( ¬t i , ¬t j ) log 2 ( + 1) Pr( ¬t i ) Pr( ¬t j ) (2) where Ass(ti, tj) is the association weight between term ti and term tj. Such an association Pr(ti , t j ) value is approximated by the BMI score. is the joint probability that both terms Pr( ¬t i , ¬t j ) appear in a text window, and is the joint probability that both terms are absent in a text window. The factor α > 0.5 is a weight assigned to the positively associated mutual information. As it is counterintuitive to have a zero BMI value if two Pr(t i , t j ) Pr(t i ) Pr(t j ) terms always appear Pr(ti,tj ) together in every text window, the faction is adjusted by adding the constant 1 before applying the logarithm. Since our text mining method is applied after removing stop words, such an adjustment is reasonable to capture the intuition of significant term co-occurrence. In Eq.(2), each MI value is then
  • 5. normalized by the corresponding joint probabilities. Only a feature with an association weight greater than a threshold µ (i.e., Ass(ti, tj) > µ) will be considered a significant feature for representing a concept in a context vector. After computing all the BMI values in a collection, these values are subject to linear scaling such that each term association ∀ Ass(t i , t j ) ∈[0,1] weight is within the unit interval ti ,tj . In should be noted that the constituent terms of a concept are always implicitly included in the underlying context vector with a default association weight of 1. The final stage towards our ontology extraction method is taxonomy generation based on subsumption relations among extracted concepts. Let Spec(cx, cy) denotes that concept cx is a specialization (subclass) of another concept cy . The degree of such a specialization is derived by: ∑ Ass(t tx ∈cx ,ty ∈cy ,tx = ty x , c x ) ⊗ Ass(t y , c y ) Spec( c x , c y ) = ∑ Ass(t x , cx ) tx ∈cx (3) where ⊗ is a standard fuzzy conjunction operator which is equivalent to a minimum function. The above formula states that the degree of subsumption (specificity) of cx to cy is based on the ratio of the sum of the minimal association weights of the common features of the two concepts to the sum of the feature weights of the concept cx. For instance, if every property term of cx is also a property term of cy , a high specificity value will be derived. In general, Spec(cx, cy) takes its values from the unit interval [0, 1] and it is an asymmetric relation. Since it is a fuzzy rather than a crisp relation, Spec(cy , cx) may also be true to certain degree. When the taxonomy graph is developed, we only select the subsumption relation such that Spec(cx, cy) > Spec(cy , cx) and Spec(cx, cy ) > λ where λ is a threshold to distinguish significant subsumption relations. If Spec(cx, cy) = Spec(cy , cx) and Spec(cx, cy) > λ is established, the equivalent relation between cx and cy will be extracted. EVALUATION A small scale experiment of testing the functionality of the prototype system and the accuracy of the aforementioned domain ontology discovery method has been conducted under a e-Learning environment. A group of ten undergraduate students were recruited to try the prototype system; all of these subjects have attended a course in knowledge management before. At the beginning of the experiment, they attended a briefing session of fifteen minutes to learn the objective of this experiment and were instructed to write the most important concepts in knowledge management using concise and precise statements on an on-line discussion board. They were given thirty minutes to write their messages. After the message generation session, the coordinator of the experiment would execute the ontology discovery method to discover the domain knowledge from the on- line discussion messages which are about the writers' perception of knowledge 5
  • 6. management. For this experiment, only the "Noun Noun" linguistic pattern was specified and the parameter α = 0.5 was set. Each subject could then view the system generated ontology on-line. Following the viewing session, a questionnaire was distributed to each subject to let them assess if the ontology could really reflect their understanding about the chosen topic. Our questionnaire was developed based on the instrument employed by [3]. It I included the assessment of the following factors: Accuracy - Whether the concepts and relationships shown at the taxonomy are correct; Cohesiveness - Whether each concept at the taxonomy is unique and not overlapped with one another; w Isolation - Whether the concepts at the same level are distinguishable and not subsume one another; o Hierarchy - Whether the taxonomy is traversed from broader concepts at the higher levels to narrow concepts at the lower level; l Readability - Whether the concepts at all levels are easy to be comprehended by human; A five point semantic differential scale from very good (5), good (4), average (3), bad (2), to very poor (1) is used to measure these variables. In general, a score close to 5 indicates that the automatically generated ontology is with good quality and it correctly reflects the mental state of the subjects. The average scores pertaining to various factors are shown in Table 1. The overall mean score is 4.3 with a standard deviation of 0.61; the overall mean score is close to the maximum 5. There are 26 key concepts and 97 relationships spreading at 3 levels. From this initial experiment, it is shown that our domain ontology discovery method is promising since it can automatically discover the prominent concepts and their relationships from a set of messages written in natural language. Mean STD Accuracy 4.4 0.66
  • 7. Cohesiveness 4.3 0.46 Isolation 4.2 0.60 Hierarchy 4.0 0.63 Readability 4.5 0.67 Overall 4.3 0.61 Table 1: Qualitative Evaluation of Automatically Discovered Ontology CONCLUSIONS The manipulation and exchange of semantically enriched business intelligence (e.g., products, services, markets, etc.) can enhance the quality of an eCommerce system and offer a high level of interoperability among different enterprise systems. Ontology certainly plays an important role in the formalization of business knowledge. However, the biggest challenge for the wide spread applications of ontologies is on the construction of these ontologies because it is a very labor intensive and time consuming process. This paper illustrates a novel automatic ontology extraction method to facilitate the ontology engineering process. In particular, contextual information of a domain is exploited so that more reliable domain ontology can be extracted. The proposed extraction method combines lexico-syntactic and statistical learning approaches so as to reduce the chance of generating noisy relations and to improve the computational efficiency. Empirical studies have been performed to evaluate the quality of the domain ontology extracted by the proposed ontology discovery method. Our preliminary experiment shows that the extracted domain ontology can accurately reflect the domain knowledge embedded in on- line discussion messages. Future work involves comparing the accuracy and the computational efficiency of our extraction method with that of the other approaches. In addition, larger scale of quantitative evaluation of our ontology extraction method will be conducted. REFERENCES [1] T. BernersLee, J. Hendler, and O. Lassila. The semantic web. Scientific American, 284(5):34–43, 2001. [2] Shan Chen, Damminda Alahakoon, and Maria Indrawan. Background knowledge driven ontology discovery. In Proceedings of the 2005 IEEE International Conference on e- Technology, eCommerce and eService, pages 202–207, 2005. [3] S. Chuang and L. Chien. Taxonomy generation for text segments: A practical webbased approach. ACM Transactions on Information Systems, 23(4):363–396, 2005. [4] P. Cimiano, A. Hotho, and S. Staab. Learning concept hierarchies from text corpora using 7
  • 8. formal concept analysis. Journal of Artificial Intelligence Research, 24:305–339, 2005. [5] The World Wide Web Consortium. Web Ontology Language, 2004. Available from http://www.w3.org/2004/OWL/. [6] Michael Dittenbach, Helmut Berger, and Dieter Merkl. Improving domain ontologies by mining semantics from text. In Proceedings of the First AsiaPacific Conference on Conceptual Modelling (APCCM2004), pages 91–100, 2004. [7] T. R. Gruber. A translation approach to portable ontology specifications. Knowledge Acquisition, 5(2):199–220, 1993. [8] Hongyan Jing and Evelyne Tzoukermann. Information retrieval based on context distance and morphology. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Language Analysis, pages 90–96, 1999. [9] R.Y.K. Lau. ContextSensitive Text Mining and Belief Revision for Intelligent Information Retrieval on the Web. Web Intelligence and Agent Systems An International Journal, 1(3- 4):1–22, 2003. [10] Taehee Lee, Ig hoon Lee, Suekyung Lee, Sang goo Lee, Dongkyu Kim, Jonghoon Chun, Hyunja Lee, and Junho Shim. Building an operational product ontology system. Electronic Commerce Research and Applications, 5(1):16–28, 2006. [11] Alexander Maedche and Steffen Staab. Ontology learning for the semantic web. IEEE Intelligent Systems, 16(2):72–79, 2001. [12] Alexander Maedche and Steffen Staab. Ontology learning. In Handbook on Ontologies, pages 173– 190. 2004. [13] G. A. Miller, Beckwith R., C. Fellbaum, D. Gross, and K. J. Miller. Introduction to wordnet: An online lexical database. Journal of Lexicography, 3(4):234–244, 1990. [14] Roberto Navigli, Paola Velardi, and Aldo Gangemi. Ontology learning and its application to automated terminology translation. IEEE Intelligent Systems, 18(1):22–31, 2003. [15] I. Nonaka. A dynamic theory of organizational knowledge creation. Organization Science, 5(1):14–37, 1994. [16] I. Nonaka and H. Takeuchi. The Knowledge Creating Company: How Japanese Companies Create the Dynamics of Innovation. Oxford University Press, New York, 1995Oxford University Press. [17] Patrick Perrin and Frederick Petry. Extraction and representation of contextual information for knowledge discovery in texts. Information Sciences, 151:125–152, 2003. [18] H. S. Pinto and J. P. Martins. Ontologies: How can they be built? Knowledge and Information Systems, 6(4):441–464, 2004. [19] M. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980. [20] G. Salton. Full text information processing using the smart system. Database Engineering Bulletin, 13(1):2–9, March 1990. [21] G. Salton and M.J. McGill. Introduction to Modern Information Retrieval. McGrawHill, New York, New York, 1983. [22] M. Sanderson and B. Croft. Deriving concept hierarchies from text. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in
  • 9. Information Retrieval, pages 206–213. ACM, 1999. [23] C. Shannon. A mathematical theory of communication. Bell System Technology Journal, 27:379–423, 1948. [24] Mark A. Stairmand. Textual context analysis for information retrieval. In Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 140–147, 1997. [25] Christopher A. Welty. Ontology research. AI Magazine, 24(3):11–12, 2003. [26] Rudolf Wille. Formal concept analysis as mathematical theory of concepts and concept hierarchies. In Bernhard Ganter, Gerd Stumme, and Rudolf Wille, editors, Formal Concept Analysis, Foundations and Applications, volume 3626, pages 1–33. Springer, 2005. [27] J. Xu and W.B. Croft. Query expansion using local and global document analysis. In Hans- Peter Frei, Donna Harman, Peter Schauble, and Ross Wilkinson, editors, Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 4–11, Zurich, Switzerland, 1996. 9