INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET)

ISSN 0976 – 6367 (Print)
ISSN 0976 – 6375 (Online)
Volume 4, Issue 6, November - December (2013), pp. 355-366
© IAEME: www.iaeme.com/ijcet.asp
Journal Impact Factor (2013): 6.1302 (Calculated by GISI)
www.jifactor.com

USING CLUSTERING MODULE ONTOLOGY CONSTRUCTION FROM
TEXT DOCUMENTS BY SINGULAR VALUE DECOMPOSITION
Shailendra Kumar Singh1, Gunjan Verma2, Devesh Som3, Mohd. Shamsul Haq4

1, 3 Meerut Institute of Engineering & Technology, Meerut, Computer Science Department
2, 4 Meerut Institute of Engineering & Technology, Meerut, Master of Computer Applications

ABSTRACT
Given the abundance of unstructured data available today, it is worthwhile to research
automated ways to retrieve information, or to respond to a structured or an unstructured
query. Domain-specific information is useful in natural language processing, information retrieval
and systems reasoning. This paper presents a document retrieval approach based on a combination of
latent semantic indexing (LSI) and a clustering algorithm. The idea is to first retrieve papers and create
initial clusters based on LSI. Then, we use a flat clustering method to further group similar documents
in clusters. The clustering algorithm used in this paper is the ROCK algorithm, and we try to show
that it gives better results. The main advantage
of our method is that it forces the centroid vector towards the extremities, and consequently gets a
completely different starting point compared to the standard algorithm.
Keywords: Latent Semantic Index (LSI), Clustering Algorithm, ROCK Algorithm, SVD.
1. INTRODUCTION
1.1 Semantic Search
Semantic Search attempts to augment and improve traditional search results (based on
Information Retrieval technology) by using data from the Semantic Web.
1.2 Ontologies in Semantic Web
An ontology is a data model that can be used to describe a set of concepts and the
relationships between those concepts within a domain. Ontologies are the main component of
knowledge representation for the Semantic Web. Research groups in both America and Europe
have developed ontology modeling languages such as the DARPA Agent Markup Language (DAML) and
the Ontology Inference Layer (OIL).
1.3. Clustering
A large dataset can be divided into a few smaller ones, each containing data that are close in
some sense. This procedure is called clustering, and it is a common operation in data mining. In
information and text retrieval, clustering is useful for the organization and search of large text
collections. Clustering interfaces employ a fairly new class of algorithms called post-retrieval
document clustering algorithms.
1.4. Goodness Measure
The criterion function can be used to estimate the “goodness” of clusters: the best
clusterings of points are those that result in the highest values of the criterion function.
Since our goal is to find a clustering that maximizes the criterion function, we use a measure similar
to the criterion function to determine the best pair of clusters to merge at each step of
ROCK's hierarchical clustering algorithm. For a pair of clusters $C_i$, $C_j$, let $link[C_i, C_j]$ store the
number of cross links between clusters $C_i$ and $C_j$, that is, $\sum_{p_q \in C_i,\, p_r \in C_j} link(p_q, p_r)$. Then, we define
the goodness measure $g(C_i, C_j)$ for merging clusters $C_i$, $C_j$ as follows.

Fig 1.4 Goodness Measure Formula
A natural choice would be to merge the pair of clusters with the largest number of cross links.
This naive approach may work well for well-separated clusters, but it tends to favour large
clusters, which have more cross links simply because they contain more points. To remedy this
problem, we divide the number of cross links between clusters by the expected number of cross links
between them. Thus, if every point in $C_i$ has $n_i^{f(\theta)}$ neighbours, where $n_i$ is the number of points
in $C_i$, then the expected number of links involving only points in the cluster is approximately $n_i^{1+2f(\theta)}$.
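As an illustration, the goodness measure can be sketched in Python. The formula of Fig 1.4 is not reproduced in this text, so the sketch assumes the form given in the original ROCK paper by Guha, Rastogi and Shim, $g(C_i, C_j) = link[C_i, C_j] / ((n_i + n_j)^{1+2f(\theta)} - n_i^{1+2f(\theta)} - n_j^{1+2f(\theta)})$. The choice $f(\theta) = (1 - \theta)/(1 + \theta)$ and the function names are likewise assumptions, not taken from this paper.

```python
def f(theta: float) -> float:
    """Assumed exponent function f(theta) = (1 - theta) / (1 + theta)."""
    return (1.0 - theta) / (1.0 + theta)


def goodness(cross_links: int, n_i: int, n_j: int, theta: float) -> float:
    """Cross links between two clusters divided by their expected number.

    n_i and n_j are the cluster sizes; theta must be below 1, otherwise
    the expected number of cross links is zero.
    """
    exponent = 1.0 + 2.0 * f(theta)
    expected = (n_i + n_j) ** exponent - n_i ** exponent - n_j ** exponent
    return cross_links / expected


if __name__ == "__main__":
    # Two singleton clusters sharing three links, with theta = 0.5.
    print(goodness(cross_links=3, n_i=1, n_j=1, theta=0.5))
```

With the same number of cross links, a pair of larger clusters receives a lower score than a pair of small ones, which is the intended effect of dividing by the expected count.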
2. CLUSTERING ALGORITHM
ROCK's hierarchical clustering algorithm, presented below, belongs to the class
of agglomerative hierarchical clustering algorithms. It accepts as input the set S of n sampled points
to be clustered and the number of desired clusters k. The procedure begins by computing the number
of links between pairs of points in Step 1.
procedure cluster(S, k)
begin
1. link := compute_links(S)
2. for each s ∈ S do
3.     q[s] := build_local_heap(link, s)
4. Q := build_global_heap(S, q)
5. while size(Q) > k do {
6.     u := extract_max(Q)
7.     v := max(q[u])
8.     delete(Q, v)
9.     w := merge(u, v)
10.    for each x ∈ q[u] ∪ q[v] do {
11.        link[x, w] := link[x, u] + link[x, v]
12.        delete(q[x], u); delete(q[x], v)
13.        insert(q[x], w, g(x, w)); insert(q[w], x, g(x, w))
14.        update(Q, x, q[x])
15.    }
16.    insert(Q, w, q[w])
17.    deallocate(q[u]); deallocate(q[v])
18. }
end
Initially, each point is a separate cluster. For each cluster i, we build a local heap q[i] and maintain the
heap during the execution of the algorithm. q[i] contains every cluster j such that link[i, j] is non-zero.
The clusters j in q[i] are ordered in decreasing order of the goodness measure with respect
to i, g(i, j). In addition to the local heaps q[i] for each cluster i, the algorithm also maintains a
global heap Q that contains all the clusters. The clusters in Q are ordered in
decreasing order of their best goodness measures. Thus, g(j, max(q[j])) is used to order the
various clusters j in Q, where max(q[j]), the max element in q[j], is the best cluster to merge with
cluster j. At each step, the max cluster j in Q and the max cluster in q[j] are the best pair of clusters to
be merged.
The while-loop in Step 5 iterates until only k clusters remain in the global heap Q. In addition,
it also stops clustering if the number of links between every pair of the remaining clusters becomes
zero. In each step of the while-loop, the max cluster u is extracted from Q by extract_max and q[u] is
used to determine the best cluster v for it. Since clusters u and v will be merged, entries for u and v
are no longer required and can be deleted from Q. Clusters u and v are then merged in Step 9 to
create a cluster w containing |u|+|v| points. There are two tasks that need to be carried out once
clusters u and v are merged:
(1) for every cluster that contains u or v in its local heap, the elements u and v need to be replaced
with the new merged cluster w and the local heap needs to be updated, and
(2) a new local heap for w needs to be created. Both these tasks are carried out in the for-loop of Steps
10-15. The number of links between x and w is simply the sum of the number of links between x and u, and
x and v. This is used to compute g(x, w), the new goodness measure for the pair of clusters x and w,
and the two clusters are inserted into each other's local heaps.
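The merge loop described above can be sketched in Python as follows. This is a simplified illustration, not the procedure as published: it replaces the local and global heaps with a plain search over all cluster pairs, and it assumes $f(\theta) = (1 - \theta)/(1 + \theta)$ in the goodness measure. The function name `rock_cluster` and its parameters are hypothetical.

```python
from itertools import combinations


def rock_cluster(link, k, theta=0.5):
    """Greedy ROCK-style merging on a point-level link matrix.

    link[i][j] is the number of links between points i and j.
    theta must be below 1. Returns a sorted list of clusters,
    each a sorted list of point indices.
    """
    exponent = 1.0 + 2.0 * (1.0 - theta) / (1.0 + theta)  # assumed f(theta)
    n = len(link)
    clusters = {i: [i] for i in range(n)}
    # Cross-link counts between clusters, keyed by (smaller id, larger id).
    cross = {(i, j): link[i][j] for i, j in combinations(range(n), 2)}

    def g(a, b):
        n_a, n_b = len(clusters[a]), len(clusters[b])
        expected = (n_a + n_b) ** exponent - n_a ** exponent - n_b ** exponent
        return cross[(a, b)] / expected

    next_id = n
    while len(clusters) > k:
        pairs = [p for p in combinations(sorted(clusters), 2) if cross[p] > 0]
        if not pairs:  # no links left between any pair of clusters
            break
        u, v = max(pairs, key=lambda p: g(*p))
        w = next_id
        next_id += 1
        # link[x, w] := link[x, u] + link[x, v] for every other cluster x
        for x in clusters:
            if x not in (u, v):
                cross[(x, w)] = (cross[tuple(sorted((x, u)))]
                                 + cross[tuple(sorted((x, v)))])
        merged = sorted(clusters[u] + clusters[v])
        del clusters[u], clusters[v]
        clusters[w] = merged
    return sorted(clusters.values())
```

The pairwise search costs more per step than the heap-based version, but it keeps the two stopping conditions (k clusters reached, or no remaining links) and the link update of Step 11 easy to see.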
2.1.Computation of Links
One way of viewing the problem of computing links between every pair of points is to
consider an n x n adjacency matrix A in which entry A[i, j] is 1 or 0 depending on whether or not
points i and j are neighbours. The number of links between a pair of points i and j can
be obtained by multiplying row i with column j (that is, $\sum_{l=1}^{n} A[i,l] \cdot A[l,j]$). Thus, the problem
of computing the number of links for all pairs of points is simply that of multiplying the adjacency
matrix A with itself. The time complexity of the naive algorithm to compute the square of a matrix is
$O(n^3)$. However, calculating the square of a matrix is a well-studied problem, and well-known
algorithms such as Strassen's algorithm run in time $O(n^{2.81})$. The best complexity currently
possible is $O(n^{2.37})$, due to the algorithm by Coppersmith and Winograd.

Fig 2.1.1. Computing Links
We expect that, on average, the number of neighbours of each point will be small
compared to the number of input points n, causing the adjacency matrix A to be sparse. For such
sparse matrices, the algorithm in Fig 2.1.1 provides a more efficient way of computing links. For
every point, after computing a list of its neighbours, the algorithm considers all pairs of its
neighbours; each such pair has the point as a common neighbour and therefore gains one link.
Thus, the complexity of the algorithm is $\sum_i m_i^2$, where $m_i$ is the number of neighbours of point i,
which is $O(n\, m_m m_a)$, where $m_a$ and $m_m$
are the average and maximum number of neighbours for a point, respectively. In the worst case, the
value of $m_m$ can be n, in which case the complexity of the algorithm becomes $O(m_a n^2)$.
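The two ways of computing links described in this section can be compared with a short Python sketch. It is written from the description in the text, since Fig 2.1.1 is not reproduced here, and the function names are hypothetical. It assumes a symmetric adjacency matrix; under that assumption both functions agree on every off-diagonal entry.

```python
def compute_links(neighbours):
    """Count links from neighbour lists.

    neighbours[i] is the list of points that are neighbours of point i.
    Every pair of points in the same list gains one link.
    """
    n = len(neighbours)
    link = [[0] * n for _ in range(n)]
    for nbrs in neighbours:
        for a in range(len(nbrs) - 1):
            for b in range(a + 1, len(nbrs)):
                p, q = nbrs[a], nbrs[b]
                link[p][q] += 1
                link[q][p] += 1
    return link


def links_by_squaring(adj):
    """Reference version: multiply the 0/1 adjacency matrix with itself."""
    n = len(adj)
    return [[sum(adj[i][l] * adj[l][j] for l in range(n)) for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    # Illustrative 4-point example (not taken from the paper's tables).
    adj = [[1, 1, 1, 0],
           [1, 1, 1, 0],
           [1, 1, 1, 1],
           [0, 0, 1, 1]]
    nbrs = [[j for j, a in enumerate(row) if a] for row in adj]
    print(compute_links(nbrs))
    print(links_by_squaring(adj))
```

The neighbour-list version only visits pairs that actually share a neighbour, which is where the saving for sparse matrices comes from. It leaves the diagonal at zero, whereas squaring the matrix also fills in the diagonal.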
3. PROBLEM DEFINITION
The main problem in modeling an efficient information retrieval process is that a user cannot
express his information need straightforwardly in a query posted to an information
repository. A user’s query represents just an approximation of his information need. Consequently, a
user query should be refined in order to ensure that it retrieves as many relevant
results as possible. Unfortunately, most information retrieval systems do not provide cooperative
support in the query refinement process, so that a user is “forced” to change his query on his own in
order to find the most suitable results. Indeed, although in an interactive query refinement process a
user is provided with a list of terms that appear frequently in retrieved documents, an explanation of
their impact on the information retrieval process is completely missing.
4. TASKS TO PERFORM
To accomplish this goal we should perform the following tasks:
1. Pre-processing
2. Normalization
3. Latent Semantic Indexing based on SVD.
4. Clustering (ROCK algorithm)
4.1 Pre-processing
Pre-processing involves the reformatting, or filtering, of a text document, to facilitate
meaningful statistical analysis.
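A minimal Python sketch of such a pre-processing step is given below. The paper does not list its exact filtering rules, so the lower-casing, the alphabetic tokenization and the tiny stop-word list are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately small stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "of", "the", "to", "in", "is"}


def preprocess(text):
    """Lower-case the text, split it into alphabetic tokens,
    drop stop words and count how often each remaining term occurs."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)


if __name__ == "__main__":
    sample = "A www and Tim Berners Lee www and Tim Berners Lee internet CERN web"
    for word, frequency in preprocess(sample).items():
        print(f"word = {word}  frequency = {frequency}")
```

The result is a term-frequency mapping per document, which is the kind of output listed in the tables of this section.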

Table 4.1.1 (a)
Document = vc5.txt
Tokens or Terms          Frequency
word 0 = www             4
word 1 = tim             4
word 2 = berners         4
word 3 = lee             4
word 4 = internet        1
word 5 = cern            1
word 6 = web             1

Table 4.1.2 (b)
Document = vc6.txt
Tokens or Terms          Frequency
word 0 = internet        1
word 1 = role            1
word 2 = tim             1
word 3 = berners         1
word 4 = lee             1
word 5 = cern            4
word 6 = server          4
word 7 = web             5

Table 4.1.3 (c)
Document = vc7.txt
Tokens or Terms          Frequency
word 0 = network         6
word 1 = security        6
word 2 = internet        4
word 3 = web             4
word 4 = applications    4

4.2 Normalization
Normalization is a process in which we calculate the normalized weight of each word
obtained from pre-processing. We first calculate the weight of each term by the following
formula:
Fig 4.2.1
where $W_{i,k}$ is the weight of the i-th term in the k-th document and $n_k$ is the total number of terms in that
document. This weight is the term weight within a single document. However, the term weight should be
normalized for the entire set of documents, not for only one document. This is called normalizing
the term weight. In the next step of normalization, the weights for each individual document are
combined into a normalized analysis of the whole collection of documents. To do this we must take
into account the fact that a term may have a large weight simply because the document in which it
occurs is small, rather than because it occurs frequently throughout the document collection. To
eliminate this problem, the normalized weight of a term is calculated as

Fig 4.2.2
This process is a fairly standard normalization for document length, as explained by
Greengrass [5]. The normalized output for the words occurring more than twice in the tested
Washington Post document is shown in the tables below (word and weight).

Table 4.2.1(a)
Row       Document1   Document2   Document3   Terms
Row 0     0.211       0.000       0.000       word = www
Row 1     0.211       0.056       0.000       word = tim
Row 2     0.211       0.056       0.000       word = berners
Row 3     0.211       0.056       0.000       word = lee
Row 4     0.053       0.056       0.167       word = internet
Row 5     0.053       0.222       0.000       word = cern
Row 6     0.053       0.278       0.167       word = web
Row 7     0.000       0.056       0.000       word = role
Row 8     0.000       0.222       0.000       word = server
Row 9     0.000       0.000       0.250       word = network
Row 10    0.000       0.000       0.250       word = security
Row 11    0.000       0.000       0.167       word = applications
Table 4.2.2(b)
Row       Document1   Document2   Document3   Terms
Row 0     0.489       0.000       0.000       word = www
Row 1     0.489       0.127       0.000       word = tim
Row 2     0.489       0.127       0.000       word = berners
Row 3     0.489       0.127       0.000       word = lee
Row 4     0.122       0.127       0.365       word = internet
Row 5     0.122       0.508       0.000       word = cern
Row 6     0.122       0.635       0.365       word = web
Row 7     0.000       0.127       0.000       word = role
Row 8     0.000       0.508       0.000       word = server
Row 9     0.000       0.000       0.548       word = network
Row 10    0.000       0.000       0.548       word = security
Row 11    0.000       0.000       0.365       word = applications
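
The two normalization steps of Section 4.2 can be sketched in Python as follows. The formulas of Fig 4.2.1 and Fig 4.2.2 are not reproduced in this text, so the sketch rests on two assumptions: the term weight is the term frequency divided by the total term count $n_k$ of the document, and the normalized weight divides each weight by the Euclidean length of the document's weight vector. For the frequencies of vc5.txt these assumptions give the Document1 values shown in Tables 4.2.1(a) and 4.2.2(b). The function names are hypothetical.

```python
import math


def term_weights(freqs):
    """W[i,k] = frequency of term i in document k / total term count n_k."""
    n_k = sum(freqs.values())
    return {term: count / n_k for term, count in freqs.items()}


def normalize(weights):
    """Divide each weight by the Euclidean length of the document's weight vector."""
    length = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / length for term, w in weights.items()}


# Term frequencies of vc5.txt as listed in Table 4.1.1 (a).
doc1 = {"www": 4, "tim": 4, "berners": 4, "lee": 4,
        "internet": 1, "cern": 1, "web": 1}
w = term_weights(doc1)
nw = normalize(w)
print(round(w["www"], 3), round(nw["www"], 3))  # prints 0.211 0.489
```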

4.3. Creating Term-Document matrix
At this stage we can represent a textual document as a set of meaningful normalized terms.
The normalized term weights collectively form the matrix W, where $W_{i,k}$ is the normalized weight
of term i in document k. This matrix is referred to as the term-document matrix: it contains terms as
rows and documents as columns. In this step, we create the term-document matrix that describes the
occurrence of meaningful terms in each document of the collection. To create such a matrix, we keep
track of the different meaningful terms that occur in the different documents. While reading a new text
document, new terms are added to the matrix as rows, and their frequencies are added in the respective
columns. The system should also check whether these new terms are
grammatical forms of previously seen terms (i.e. apply stemming), so that such variants are not added
as separate rows. The output of this
process is a term-document matrix containing the distinct, meaningful terms in the entire collection as
rows, and the documents in the collection as columns.
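A minimal sketch of this construction in Python is shown below. It assumes each document has already been reduced to a mapping from term to weight, and it omits the stemming check. The function name and the sample values in the usage example are illustrative.

```python
def build_term_document_matrix(docs):
    """Build a term-document matrix from per-document weight mappings.

    docs is a list of {term: weight} dictionaries, one per document.
    Returns (terms, matrix) where matrix[i][k] is the weight of terms[i]
    in document k. Terms are listed in order of first appearance and a
    term that is absent from a document gets weight 0.0.
    """
    terms = []
    seen = set()
    for doc in docs:
        for term in doc:
            if term not in seen:
                seen.add(term)
                terms.append(term)
    matrix = [[doc.get(term, 0.0) for doc in docs] for term in terms]
    return terms, matrix


if __name__ == "__main__":
    docs = [{"www": 0.489, "tim": 0.489}, {"tim": 0.127, "web": 0.635}]
    terms, matrix = build_term_document_matrix(docs)
    for term, row in zip(terms, matrix):
        print(term, row)
```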
4.4. Latent Semantic Indexing
At this stage we wish to determine a set of concepts from the term-document matrix, where a
concept is defined as a set of related terms. We accomplish this by using a method called Latent
Semantic Indexing (LSI), which primarily involves decomposing the matrix W using Singular Value
Decomposition (SVD).
4.4.1. Singular Value Decomposition
SVD is a well-known matrix decomposition method. It decomposes the matrix A, i.e. the m x n
term-document matrix with m terms and n documents, as $A = U S V^T$, where U is an m x r matrix,
called the term matrix, V is an n x r matrix, called the document matrix (so that $V^T$ is r x n), and S is
an r x r diagonal matrix containing the singular values of A on its diagonal in descending order. In this
decomposition, the singular value $\sigma_i$ corresponds to the vector $u_i$, the i-th column of U, and $v_i$,
the i-th column of V (the i-th row of $V^T$). Without
loss of generality we can assume that the columns of U, the rows of $V^T$, and the diagonal values
of S have been arranged so that the singular values are in descending order, moving down the
diagonal. The decomposition of matrix A is shown in Fig 4.4.1.

Fig 4.4.1
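Using SVD, LSI forms a new matrix $A_s = U_s S_s V_s^T$ that keeps only the s largest singular values
and the corresponding singular vectors. Detailed information about SVD can
be obtained from the SVD package of Berry et al. [1] [2] and from Spotting Topics with the SVD by
Charles Nicholas and Randall Dahlberg [6].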
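To make the truncation concrete, the sketch below computes the s largest singular triplets of a small matrix in plain Python and rebuilds $A_s$ from them. It uses power iteration with deflation, which is a swapped-in technique chosen only to keep the example dependency-free; it is not the method of the SVD package cited above, and a numerical library would be preferable in practice. All function names are hypothetical.

```python
import math
import random


def matvec(a, x):
    """Multiply matrix a (a list of rows) by vector x."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def top_singular_triplets(a, s, iters=500, seed=0):
    """Return up to s triplets (sigma, u, v), largest singular value first.

    Power iteration on A^T A finds the dominant right singular vector v;
    the rank-one term sigma * u * v^T is then subtracted (deflation).
    """
    rng = random.Random(seed)
    a = [list(row) for row in a]  # copy, because deflation overwrites it
    m, n = len(a), len(a[0])
    triplets = []
    for _ in range(s):
        at = [list(col) for col in zip(*a)]
        v = [rng.random() for _ in range(n)]  # random start vector
        for _step in range(iters):
            v = matvec(at, matvec(a, v))
            norm = math.sqrt(sum(x * x for x in v))
            if norm == 0.0:
                break
            v = [x / norm for x in v]
        av = matvec(a, v)
        sigma = math.sqrt(sum(x * x for x in av))
        if sigma < 1e-12:  # nothing left to extract
            break
        u = [x / sigma for x in av]
        triplets.append((sigma, u, v))
        a = [[a[i][j] - sigma * u[i] * v[j] for j in range(n)] for i in range(m)]
    return triplets


def rank_s_approximation(triplets, m, n):
    """Rebuild A_s as the sum of sigma * u * v^T over the kept triplets."""
    return [[sum(sig * u[i] * v[j] for sig, u, v in triplets) for j in range(n)]
            for i in range(m)]


if __name__ == "__main__":
    a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # illustrative 3 x 2 matrix
    kept = top_singular_triplets(a, s=1)
    print(kept[0][0])                          # largest singular value
    print(rank_s_approximation(kept, 3, 2))    # rank-1 matrix A_1
```

Keeping all non-zero singular values reproduces A up to rounding; keeping fewer gives the reduced matrix that LSI works with.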
5. CONSTRUCTING DOCUMENT ONTOLOGY
Constructing a document ontology is essentially building concept nodes and term nodes from the
term matrix (U) and the document matrix (V), which we have obtained from SVD. A concept node
represents a concept and contains information about its concept name, the terms that belong to that
concept, and their weights in that concept. The name of a concept is generated automatically and is a
hyphenated string of the five most frequent terms in that concept. Each column in the term matrix
(U) corresponds to a concept node.
A term node represents a term and contains information about its term name, the concepts that it
belongs to, and its weight in the different concepts. The name of a term is generated automatically and is
simply the term itself. Each row in the term matrix (U) corresponds to a term node.
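The construction of concept nodes and term nodes can be sketched in Python as follows. The data layout (plain dictionaries) and the function names are assumptions, and the terms used for a concept name are ranked here by absolute weight in the concept's column of U, which stands in for "most frequent".

```python
def build_concept_nodes(terms, u, top_n=5):
    """Build one concept node per column of the term matrix u.

    terms[i] labels row i of u. Each node stores the weight of every
    term in the concept and a name made by hyphenating the top_n terms
    with the largest absolute weight (an assumed ranking).
    """
    nodes = []
    for c in range(len(u[0])):
        weights = {terms[i]: u[i][c] for i in range(len(terms))}
        ranked = sorted(weights, key=lambda t: abs(weights[t]), reverse=True)
        nodes.append({"name": "-".join(ranked[:top_n]), "weights": weights})
    return nodes


def build_term_nodes(terms, u):
    """Build one term node per row of u: the term name and its weight in each concept."""
    return [{"name": terms[i], "weights": list(u[i])} for i in range(len(terms))]


if __name__ == "__main__":
    terms = ["www", "tim", "web"]
    u = [[0.9, 0.1], [0.4, -0.2], [0.1, 0.8]]  # illustrative values only
    for node in build_concept_nodes(terms, u, top_n=2):
        print(node["name"])
```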
6. CLUSTERING
The following steps are carried out to implement the clustering algorithm.
Step-1 Create the initial clusters for the given documents.
Table 6 (a)
Cluster-Name : CLuster1 and No-of Documents Ni:1
Keyword in The Cluster Are : [, www, tim, berners, lee, lee, internet, CERN, web]
Documents in the cluster are : i = 0 doc1

Table 6 (b)
Cluster-Name : CLuster2 and No-Of-Documents Ni : 1
Keyword in The Cluster Are : [, internet, role, tim, berners, lee, CERN, CERN, web, server, server]
Documents in the cluster are : i = 0 doc2

Table 6 (c)
Keyword in The Cluster Are : [network, security, security, internet, web, applications, applications, ]
Documents in the cluster are : i = 0 doc3

Table 6 (d)
Keyword in The Cluster Are : [, GSM, technology, related, mobile, communication, cellular, systems, growing, for, satellite]
Documents in the cluster are : i = 0 doc 4

Table 6 (e)
Keyword in The Cluster Are : [, cellular, network, or, mobile, radio, distributed, over, land, areas, called, cells, radio, network, cellular]
Documents in the cluster are : i=0 doc 5

Table 6 (f)
Keyword in The Cluster Are : [frequencies, reused, other, cells, provided, that, same, not, adjacent, neighboring, as, would, cause, co, channel, interference, interference, ]
Documents in the cluster are : i = 0 doc 6

Step-2 After creating the clusters, calculate the adjacency matrix for the given documents, as
defined in the ROCK algorithm.
Table 6 (g)
Initial adjacency_matrix Is
1 1 1 0 0 0
1 1 1 0 0 0
1 1 1 0 1 0
0 0 0 1 0 0
0 0 1 0 1 0
0 0 0 0 0 1
Step-3 Next, calculate the initial links matrix for the given documents, as defined in the
ROCK algorithm.
Table 6 (h)
Initial link_matrix Is
3 3 0 1 0 0
3 4 0 2 0 0
0 0 1 0 0 0
1 2 0 2 0 0
0 0 0 0 1 0
0 0 0 0 0 0
Step-4 Next, calculate the goodness measure from the initial links matrix for the given
documents and arrange the values in ascending order.
Table 6 (i)
For Cluster 0
Goodness Between (C0,C1) = 0.6161829393253857

Table 6 (j)
For Cluster 1

Table 6 (k)
For Cluster 2

Table 6 (l)
For Cluster 3

Table 6 (m)
For Cluster 4

Table 6 (n)
For Cluster 5

Step-5 The next step is merging the clusters
Table 6 (o)
index1 = 0 index2 = 1
Merging clusters
cluster 1 =CLuster1 index = 0
cluster 2 =CLuster2 index = 1
Cluster Removed CLuster2

Step-6 Fire the query and retrieve the appropriate cluster
Table 6 (p)
Enter your query or keywords:
Tim
Cluster Matched With Query Is

Table 6 (q)
CLUSTER-NAME : CLuster1 and No-Of-Documents Ni : 2
Keyword in The Cluster Are : [, tim, berners, lee, lee, internet, CERN,
CERN, web]
Documents in the cluster are : doc1- a www and timberners lee www
and timberners lee www and timberners lee www and timberners lee
internet CERN web a a
doc2 - a internet and role of timberners lee . CERNCERNCERN web
server web server web server web server web server
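
The query step can be illustrated with a short Python sketch. The paper does not describe how a query is matched against the clusters, so the rule used here (a cluster matches when its keyword list contains every query term, ignoring case) is an assumption, as are the function name and the data layout.

```python
def match_clusters(query, clusters):
    """Return the names of clusters whose keywords contain every query term.

    clusters maps a cluster name to its list of keywords.
    Matching is case-insensitive; an empty query matches nothing.
    """
    wanted = {term.lower() for term in query.split()}
    hits = []
    for name, keywords in clusters.items():
        if wanted and wanted <= {k.lower() for k in keywords}:
            hits.append(name)
    return hits


if __name__ == "__main__":
    clusters = {
        "CLuster1": ["tim", "berners", "lee", "internet", "CERN", "web"],
        "CLuster3": ["network", "security", "internet", "web", "applications"],
    }
    print(match_clusters("Tim", clusters))
```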

7. CONCLUSION
In this paper, a new concept is proposed that makes use of links to measure the similarity /
proximity between pairs of data points with categorical attributes. A robust hierarchical clustering
algorithm, ROCK, is used along with SVD; it employs links and node distances for merging clusters.
This method extends to metric similarity measures that are relevant in situations where a domain expert /
similarity table is the only source of knowledge. In future work, other clustering algorithms
could be used to further improve the performance of the information retrieval system.
8. ACKNOWLEDGEMENT
We thank the experts and the authors of the referenced works, who contributed to the
development of this paper and helped us make the concepts clear.
9. REFERENCES
1. Berry, M. W., Dumais, S. T., O'Brien, G. W. (December 1995). Using Linear Algebra for
Intelligent Information Retrieval.
2. Berry, M. W., Do, T., O'Brien, G., Krishna, V., Varadhan, S., SVDPACKC: Version 1.0 User’s
Guide, Tech. Report CS-93-194, University of Tennessee, Knoxville, TN, October 1993.
3. Dr. Sadanand Srivastava, Dr. James Gill de Lamadrid, Yuri Karakshyan, Document Ontology
Extractor, CADIP 00.
4. Dr. Sadanand Srivastava, Dr. James Gill de Lamadrid, Yuri Karakshyan, Document
Ontology from a document using SVD.
5. Greengrass, E. (February 1997). Information Retrieval: An Overview.
6. Nicholas, C., Dahlberg, R. (March 1998). Spotting Topics with the Singular Value
Decomposition, Principles of Digital Document Processing, FIARO, St. Malo.
7. A. Maedche, S. Staab, Ontology Learning for the Semantic Web, IEEE Intelligent Systems
16(2) (2001) 72-79.
8. Golub, G., Luk, F., Overton, M., A Block Lanczos Method for Computing the Singular Values
and Corresponding Singular Vectors of a Matrix, ACM Transactions on Mathematical
Software, 7(2), pp. 149-169, 1981.
9. SVD and LSI tutorial 1 http://www.miislita.com/information-retrieval-tutorial/svd-lsitutorial-1-lsi-how -to-calculations.html.
10. Vinu P.V., Sherimon P.C. and Reshmy Krishnan, “Development of Seafood Ontology for
Semantically Enhanced Information Retrieval”, International Journal of Computer
Engineering & Technology (IJCET), Volume 3, Issue 1, 2012, pp. 154 - 162, ISSN Print:
0976 – 6367, ISSN Online: 0976 – 6375.
11. Meghana. N.Ingole, M.S.Bewoor and S.H.Patil, “Context Sensitive Text Summarization
using Hierarchical Clustering Algorithm”, International Journal of Computer Engineering &
Technology (IJCET), Volume 3, Issue 1, 2012, pp. 322 - 329, ISSN Print: 0976 – 6367,
ISSN Online: 0976 – 6375.
12. Prakasha S, Shashidhar Hr and Dr. G T Raju, “A Survey on Various Architectures, Models
and Methodologies for Information Retrieval”, International Journal of Computer
Engineering & Technology (IJCET), Volume 4, Issue 1, 2013, pp. 182 - 194, ISSN Print:
0976 – 6367, ISSN Online: 0976 – 6375.
