Kalyani Desikan et al., Int. Journal of Engineering Research and Application, ISSN: 2248-9622, Vol. 3, Issue 5, Sep-Oct 2013, pp. 1492-1495

RESEARCH ARTICLE | OPEN ACCESS | www.ijera.com

Optimal Clustering Scheme For Repeated Bisection Partitional
Algorithm
Kalyani Desikan *, G. Hannah Grace **
*(Department of Mathematics, VIT University, Chennai)
** (Department of Mathematics, VIT University, Chennai)

ABSTRACT
Text clustering divides a set of texts into clusters such that texts within each cluster are similar in content. It
may be used to uncover the structure and content of unknown text sets as well as to give new perspectives on
familiar ones. The focus of this paper is to experimentally evaluate the quality of clusters obtained using
partitional clustering algorithms that employ different clustering schemes. The optimal clustering scheme that
gives clusters of better quality is identified for three standard data sets. Also, the ideal clustering scheme that
optimizes the I2 criterion function is experimentally identified.
Keywords – Cluster quality, Criterion Functions, Entropy and Purity

I. INTRODUCTION

Partitional clustering algorithms are well suited for clustering large document datasets due to their relatively low computational requirements, according to a study conducted by [1]. A report by [2] investigated the effect of criterion functions on the problem of partitional clustering of documents, and the results showed that different criterion functions lead to substantially different results. Another study, reported by [3], examined the effect of criterion functions on clustering document datasets using partitional and agglomerative clustering algorithms. Their results showed that partitional algorithms always led to better clustering results than agglomerative algorithms.
The main focus of this paper is an experimental evaluation of various criterion functions in the context of the partitional approach, namely the repeated bisection clustering algorithm.

II. PRELIMINARIES

1. Cluster Quality
The quality of the clusters produced is measured using two external measures, namely, entropy [3][4][5] and purity. Entropy measures how the various classes of documents are distributed within each cluster; the smaller the entropy value, the better the clustering solution. Given a particular cluster $S_r$ of size $n_r$, the entropy of this cluster is defined in [6] as

$$E(S_r) = -\frac{1}{\log q} \sum_{i=1}^{q} \frac{n_r^i}{n_r} \log \frac{n_r^i}{n_r}$$

where $q$ is the number of classes in the dataset and $n_r^i$ is the number of documents of the $i$th class that are assigned to the $r$th cluster. The entropy of the entire clustering is then the sum of the individual cluster entropies weighted according to the cluster size; for $k$ clusters, we have

$$\mathrm{Entropy} = \sum_{r=1}^{k} \frac{n_r}{n} E(S_r).$$

Smaller entropy values indicate better clustering solutions. Using the same mathematical notation, the purity of a cluster is defined in [7] as

$$Pu(S_r) = \frac{1}{n_r} \max_i n_r^i.$$

The purity of the clustering solution is again the weighted sum of the individual cluster purities,

$$\mathrm{Purity} = \sum_{r=1}^{k} \frac{n_r}{n} Pu(S_r).$$

A larger purity value indicates a better clustering solution.
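
To make these definitions concrete, here is a minimal Python sketch (our illustration; the function name and the toy contingency matrix are not from the paper) that computes the weighted entropy and purity of a clustering solution from the counts $n_r^i$:

```python
import numpy as np

def entropy_purity(counts):
    """Weighted entropy and purity of a clustering solution.

    counts[r][i] = number of documents of class i assigned to cluster r
    (the n_r^i of the text): rows are clusters, columns are classes.
    """
    counts = np.asarray(counts, dtype=float)
    q = counts.shape[1]                        # number of classes
    n_r = counts.sum(axis=1)                   # cluster sizes
    n = counts.sum()                           # total number of documents

    p = counts / n_r[:, None]                  # within-cluster class proportions
    logp = np.log(np.where(p > 0, p, 1.0))     # treat 0 * log(0) as 0
    e_r = -(p * logp).sum(axis=1) / np.log(q)  # E(S_r), normalized by log q
    pu_r = counts.max(axis=1) / n_r            # Pu(S_r)

    entropy = (n_r / n) @ e_r                  # size-weighted sums over clusters
    purity = (n_r / n) @ pu_r
    return entropy, purity

# Toy check: two clusters of 10 documents; each holds 9 docs of one class
# and 1 of the other.
print(entropy_purity([[9, 1], [1, 9]]))        # approximately (0.469, 0.9)
```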
2. Clustering Criterion Functions
Seven criterion functions, I1, I2, E1, G1, G1p, H1 and H2, are described in [2]. Theoretical analysis of the criterion functions by [8] shows that their relative performance depends on (i) the degree to which they can correctly operate when the clusters are of different tightness, and (ii) the degree to which they can lead to reasonably balanced clusters. The main role of the different clustering criterion functions is to determine which cluster to bisect next, as discussed in [9].
In our experimental study, we have restricted our analysis to three criterion functions, viz. I2, E1 and H2. We did not consider I1 since the only difference between I1 and I2 is that, while calculating I2, we take the square root of the similarity function. We also ignore G1 and G1p, since G1 is similar to E1 except that there is no square root in the denominator, and G1p is similar to E1 except that it uses $n_i^2$ and has no square root in the denominator. Also, since H1 is a hybrid function that maximizes I1/E1, we ignore it as we have not taken I1 into consideration.
I2 is an example of an internal criterion
function that maximizes the similarity between each
document and the centroid of the cluster to which it is
assigned. E1 is an external criterion function that
focuses on optimizing a function that depends on the
dissimilarity of the clusters. E1 tries to minimize the
similarity between the centroid vector of each cluster
and the centroid vector of the entire collection. The
contribution of each cluster is weighted based on the
cluster size. Combinations of different clustering criterion functions provide a set of hybrid criterion functions that simultaneously optimize multiple individual criterion functions; for example, H2 is obtained by combining I2 with E1. Detailed descriptions of these criterion functions can be found in [2]. Table 1 gives the mathematical formulae for the criterion functions I2, E1 and H2.
Table 1
Criterion function | Formula
I2                 | maximize $\sum_{i=1}^{k} \sqrt{\sum_{u,v \in S_i} \mathrm{sim}(u,v)}$
E1                 | minimize $\sum_{i=1}^{k} n_i \dfrac{\sum_{u \in S_i,\, v \in S} \mathrm{sim}(u,v)}{\sqrt{\sum_{u,v \in S_i} \mathrm{sim}(u,v)}}$
H2                 | maximize $\dfrac{I2}{E1}$
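
As a concrete reading of Table 1, the following sketch (our illustration, not CLUTO's implementation) evaluates the three objectives for a given partition. It assumes $\mathrm{sim}(u,v)$ is the cosine similarity of L2-normalized document vectors, so each inner sum over a cluster collapses to the squared norm of that cluster's composite (sum) vector:

```python
import numpy as np

def i2_e1_h2(X, labels):
    """Evaluate the Table 1 objectives for a partition of documents.

    X: (n, d) matrix of L2-normalized document vectors; labels[j] is the
    cluster of document j. Uses sum_{u,v in S_i} sim(u,v) = ||D_i||^2,
    where D_i is the composite (sum) vector of cluster S_i.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    D = X.sum(axis=0)                     # composite vector of the whole set S
    i2 = e1 = 0.0
    for c in np.unique(labels):
        Xi = X[labels == c]
        Di = Xi.sum(axis=0)               # composite vector of cluster S_i
        within = Di @ Di                  # total pairwise similarity within S_i
        i2 += np.sqrt(within)             # I2 term (square root, per the text)
        e1 += len(Xi) * (Di @ D) / np.sqrt(within)  # E1 term
    return i2, e1, i2 / e1                # H2 = I2 / E1

# Example: four unit vectors split into two pure clusters.
X = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
print(i2_e1_h2(X, [0, 0, 1, 1]))          # (4.0, 8.0, 0.5)
```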

3. Clustering Schemes for Cluster Split
For the repeated bisection partitional clustering method, there are a number of clustering schemes to choose from. The clustering scheme determines the cluster to be bisected next, as in [4]. The available schemes are "large", which selects the largest cluster; "best", which selects the cluster that leads to the best cut; and "largess", which chooses the cluster that leads to the best reduction in subspace size. In this paper we investigate the three schemes experimentally and summarize the results; the detailed results of these experiments are omitted due to space limitations. A sketch of how the scheme drives the bisection loop is given below.
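
For illustration, a minimal repeated bisection loop might look as follows. This is a sketch under stated assumptions, not CLUTO's implementation: the two-way split is a plain 2-means (via scikit-learn) rather than a criterion-driven bisection, the "best" scheme is scored by its gain on the I2 objective, and "largess" is omitted since it requires subspace-size bookkeeping:

```python
import numpy as np
from sklearn.cluster import KMeans

def sqrt_within(X):
    """sqrt of the total pairwise cosine similarity within X (an I2 term)."""
    D = X.sum(axis=0)
    return np.sqrt(D @ D)

def bisect(X, members):
    """2-means split of one cluster; returns the two halves."""
    lab = KMeans(n_clusters=2, n_init=5, random_state=0).fit_predict(X[members])
    return members[lab == 0], members[lab == 1]

def repeated_bisection(X, k, scheme="best"):
    clusters = [np.arange(len(X))]        # start with one cluster of all docs
    while len(clusters) < k:
        splittable = [j for j, m in enumerate(clusters) if len(m) >= 2]
        if not splittable:
            break                         # nothing left to bisect
        if scheme == "large":             # bisect the largest cluster
            j = max(splittable, key=lambda j: len(clusters[j]))
            halves = bisect(X, clusters[j])
        else:                             # "best": largest I2 gain from the split
            trials = {j: bisect(X, clusters[j]) for j in splittable}
            gain = lambda j: (sqrt_within(X[trials[j][0]])
                              + sqrt_within(X[trials[j][1]])
                              - sqrt_within(X[clusters[j]]))
            j = max(splittable, key=gain)
            halves = trials[j]
        clusters.pop(j)
        clusters.extend(halves)
    return clusters
```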

III. DOCUMENT DATASETS AND EXPERIMENTAL METHODOLOGY

In this paper we have considered three datasets whose general characteristics are summarized in Table 2.
Table 2
Data Set | Number of Rows | Number of Columns (terms) | Number of Non-Zeros | Number of Classes
Classic  | 7094           | 41681                     | 223839              | 4
Hitech   | 2301           | 126373                    | 346881              | 6
mm       | 2521           | 126373                    | 490062              | 2

The clustering tool we have used for our experiments is CLUTO. CLUTO is used for clustering low and high dimensional datasets and for analyzing the characteristics of the various clusters. CLUTO operates on very large sets of documents as well as large numbers of dimensions.
CLUTO can be used to cluster documents based on similarity/distance measures like cosine similarity, correlation coefficient, Euclidean distance and the extended Jaccard coefficient. CLUTO also reports cluster quality measures such as entropy and purity.
We have experimentally analyzed cluster quality, based on entropy and purity, for the three data sets classic, hitech and mm under the three criterion functions I2, E1 and H2. We have applied all three clustering schemes (large, best and largess) to all three datasets to find the scheme that gives the best cluster split.
1. Optimal Scheme for Cluster Split
The first set of experiments focused on evaluating the quality of clusters, based on entropy and purity, for the three schemes in order to ascertain the scheme that yields the best cluster split.
2. Optimal Scheme for Optimizing the I2 Criterion Function
In the second set of experiments, our aim was to identify the clustering scheme that optimizes the I2 criterion function. We carried out our analysis for the three datasets by increasing the number of clusters from 2 to 100. We chose I2 as the criterion function for our study because, through an experimental study of criterion functions in our previous paper [10], we have shown that the I2 function performs best with respect to clustering time. A sketch of this sweep is given below.
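
For concreteness, this sweep can be scripted around CLUTO's vcluster program as sketched below. The flag names follow our reading of the CLUTO manual [4] (-clmethod=rb for repeated bisection, -crfun for the criterion function, -cstype for the cluster-selection scheme, -rclassfile for the class labels that entropy and purity require) and should be verified against the installed version; the matrix and class-label file names are hypothetical:

```python
import subprocess

# Hypothetical input files: CLUTO sparse matrices and per-document class
# labels for the classic, hitech and mm data sets.
datasets = [("classic.mat", "classic.rclass"),
            ("hitech.mat", "hitech.rclass"),
            ("mm.mat", "mm.rclass")]
schemes = ["large", "best", "largess"]

for mat, rclass in datasets:
    for cstype in schemes:
        for k in [2, 10, 20, 30, 40, 60, 80, 100]:
            report = subprocess.run(
                ["vcluster", "-clmethod=rb", "-crfun=i2",
                 f"-cstype={cstype}", f"-rclassfile={rclass}",
                 mat, str(k)],
                capture_output=True, text=True, check=True).stdout
            # The I2 value, entropy and purity are read off vcluster's
            # printed summary; its exact layout varies across versions.
            print(mat, cstype, k)
```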

IV. RESULTS AND ANALYSIS

The results for entropy and purity obtained using the repeated bisection method for the various criterion functions are given for the three datasets in the figures below.
[Fig.1: Classic data set with I2 criterion. Entropy and purity vs. number of clusters (5 to 50) for the large, best and largeSS schemes.]

[Fig.2: Hitech data set with I2 criterion. Entropy and purity vs. number of clusters (5 to 50) for the large, best and largeSS schemes.]
[Fig.3: MM data set with I2 criterion. Entropy and purity vs. number of clusters (5 to 50) for the large, best and largeSS schemes.]
[Fig.4: Classic data set with E1 criterion. Entropy and purity vs. number of clusters (5 to 50) for the large, best and largeSS schemes.]
[Fig.5: Hitech data set with E1 criterion. Entropy and purity vs. number of clusters (5 to 50) for the large, best and largeSS schemes.]
[Fig.6: MM data set with E1 criterion. Entropy and purity vs. number of clusters (5 to 50) for the large, best and largeSS schemes.]
[Fig.7: Classic data set with H2 criterion. Entropy and purity vs. number of clusters (5 to 50) for the large, best and largeSS schemes.]
[Fig.8: Hitech data set with H2 criterion. Entropy and purity vs. number of clusters (5 to 50) for the large, best and largeSS schemes.]
[Fig.9: MM data set with H2 criterion. Entropy and purity vs. number of clusters (5 to 50) for the large, best and largeSS schemes.]

Fig.1 through Fig.9 show the behavior of entropy and purity as we increase the number of clusters for the three data sets classic, hitech and mm. We have studied the behavior of entropy and purity by varying the clustering scheme used for the cluster split, analyzing all three clustering schemes: large, best and largess.
Our results show that in most cases there is no significant difference between the three clustering schemes in terms of the quality of the clusters obtained. Nevertheless, the "best" scheme is seen to be marginally better than the other two schemes. Fig.10 to Fig.12 show the behaviour of the I2 criterion function for the three schemes (best, large and largess) as we increase the number of clusters, for the three datasets classic, hitech and mm respectively. The number of clusters is taken along the x-axis and the I2 criterion function value along the y-axis.
[Fig.10: I2 criterion value vs. number of clusters (2 to 100) for the classic data set under the best, large and largess schemes.]
[Fig.11: I2 criterion value vs. number of clusters (2 to 100) for the hitech data set under the best, large and largess schemes.]
[Fig.12: I2 criterion value vs. number of clusters (2 to 100) for the mm data set under the best, large and largess schemes.]
It can be seen that the I2 criterion function value continues to increase as we increase the number of clusters. Also, across all three datasets, the I2 criterion function attains its maximum value only when we use the 'best' clustering scheme for the cluster split.

CONCLUSION

We have experimentally shown that we can
get better quality clusters, evaluated in terms of
entropy and purity, by applying the ‘best’ clustering
scheme. But from the graphs it can also be seen that
the cluster quality obtained by using this scheme is
only marginally better than that obtained by applying
the other two schemes.
We have also shown that the I2 criterion
function is maximized, irrespective of the number of
clusters, when the ‘best’ clustering scheme is used.

V. REFERENCES
[1] C. Charu, C. Stephen, S. Philip. On the Merits of Building Categorization Systems by Supervised Clustering. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 352-356, 1999.
[2] Y. Zhao, G. Karypis. "Criterion functions for document clustering: Experiments and analysis". Technical Report TR #01-40, Department of Computer Science, University of Minnesota, Minneapolis, MN, Feb 2002.
[3] Y. Zhao, G. Karypis. "Comparison of Agglomerative and Partitional Document Clustering Algorithms". The SIAM Workshop on Clustering High-dimensional Data and Its Applications, Washington, DC, April 2002.
[4] George Karypis. CLUTO: A Clustering Toolkit. Technical Report #02-017, Department of Computer Science, University of Minnesota, November 28, 2003.
[5] Michael Steinbach, George Karypis, Vipin Kumar. A Comparison of Document Clustering Techniques. Technical Report #00-034, Department of Computer Science and Engineering, University of Minnesota, USA.
[6] Anna Huang, University of Waikato, New Zealand. Similarity Measures for Text Document Clustering. In Proceedings of the New Zealand Computer Science Research Student Conference, 2008.
[7] Tim Van de Cruys. "Mining for meaning: the extraction of lexico-semantic knowledge from text". Dissertation, Chapter 6 (Evaluation of cluster quality), University of Groningen, 2010.
[8] Y. Zhao, G. Karypis. "Soft Clustering Criterion Functions for Partitional Document Clustering". Technical Report #04-022, Department of Computer Science, University of Minnesota, Minneapolis, MN, 2002. http://www.cs.umn.edu/~karypis
[9] Fathi H. Saad, B. de la Iglesia and G. D. Bell. A Comparison of Two Document Clustering Approaches for Clustering Medical Documents. Conference on Data Mining (DMIN 2006).
[10] G. Hannah Grace, Kalyani Desikan. Estimation of the number of clusters based on cluster quality. Submitted to journal, April 2013.
