ISSN: 2277 – 9043
International Journal of Advanced Research in Computer Science and Electronics Engineering
Volume 1, Issue 5, July 2012
Improved K-Means Clustering Technique Using Distance Determination Approach
Bhanu Sukhija, Sukhvir Singh
Abstract— Powerful systems for collecting data and managing it in large databases are in place in all large and mid-range organizations. The value of raw data (collected over a long time) lies in the ability to extract high-level information: information useful for decision support, for exploration, and for a better understanding of the phenomena generating the data. Traditionally, this task of extracting information was done through analysis, where one or more analysts used statistical techniques to provide summaries and generate reports. Such an approach fails as the volume and dimensionality of the data increase; hence, tools to aid the automation of analysis tasks are becoming a necessity. Thus data mining evolved as the "automatic" extraction of patterns of information from data. Clustering is the grouping of the objects of a database into meaningful subclasses such that the similarity of objects in the same group is maximized and the similarity of objects in different groups is minimized. Our focus is on partitioning methods only, and on proposing a new clustering algorithm for automatic discovery of data clusters.

Index Terms— data mining, data sets, clustering techniques, k-means.

Manuscript received July 20, 2012.
Bhanu Sukhija, Computer Science, Kurukshetra University / N.C. College / N.C. and S.D. Group of Institutions, Panipat, Haryana, India, 07206089757 (e-mail: bhanu_sukhija57@yahoo.com).
Sukhvir Singh, Computer Science Department, Kurukshetra University / N.C. College / N.C. and S.D. Group of Institutions, Panipat, Haryana, India (e-mail: boora_s@yahoomail.com).

I. INTRODUCTION

The importance of collecting data that reflect business or scientific activities to achieve competitive advantage is now widely recognized. Powerful systems for collecting data and managing it in large databases are in place in all large and mid-range organizations. The value of raw data (collected over a long time) lies in the ability to extract high-level information: information useful for decision support, for exploration, and for a better understanding of the phenomena generating the data. Traditionally, this task of extracting information was done through analysis, where one or more analysts used statistical techniques to provide summaries and generate reports. Such an approach fails as the volume and dimensionality of the data increase; hence, tools to aid the automation of analysis tasks are becoming a necessity. Thus data mining evolved as the "automatic" extraction of patterns of information from data.

Data mining is the analysis of large observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner. Data mining also refers to extracting or mining knowledge from large amounts of data; this is also sometimes referred to as Knowledge Discovery from Data (KDD). Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions, and can answer business questions that traditionally were time-consuming to resolve.

Clustering is the grouping of the objects of a database into meaningful subclasses such that the similarity of objects in the same group is maximized and the similarity of objects in different groups is minimized. Thus the objects are clustered or grouped based on the principle of "maximizing the intra-class similarity and minimizing the inter-class similarity". Clustering has its roots in many areas, including data mining, statistics, biology, and machine learning. Clustering methods can be divided into various types: partitioning methods, hierarchical methods, density-based methods, grid-based methods, model-based methods, probabilistic techniques, graph-theoretic methods, and fuzzy methods. The focus of this thesis is on partitioning methods only, and on proposing a new clustering algorithm for automatic discovery of data clusters.

A. Key Management and LAN

Many international organizations produce more information in a week than many people could read in a lifetime. The situation is even more alarming in worldwide networks like the Internet: every day hundreds of megabytes of data are distributed around the world, and it is no longer possible to monitor this increasingly rapid development; the growth is nearly exponential. The basic problems with the management of the data can be summarized as follows:

- Analysis is required to unearth the hidden relationships within the data, i.e. for decision support.
- A DBMS gives access to the stored data but provides no analysis of it.
- The size of databases has increased, and automated techniques for analysis are needed because databases have grown beyond manual extraction.

II. METHODOLOGY

As already stated, the dissertation is based on the methodology of the Iterative Relocation Technique of partitional clustering, i.e. k-means and k-medoids. All the algorithms are implemented in the C++ language and the results are studied in the same environment.
All Rights Reserved © 2012 IJARCSEE
Before going through the study of these algorithms, their implementation view, and their results, the following terms used in the implementations are introduced.

A. Distance

Both of these algorithms are based on the distance among objects; hence it is necessary to introduce the distance measure used in this dissertation. Let X = {xi | i = 1, 2, …, n} be a data set with n objects, let k be the number of clusters, and let mj be the centroid of cluster cj, where j = 1, 2, …, k. The algorithm finds the distance between a data object and a centroid using the Euclidean distance formula [1, 9, 10]:

d(xi, mj) = √|xi − mj|²

Starting from an initial distribution of cluster centers in data space, each object is assigned to the cluster with the closest center, after which each center itself is updated as the center of mass of all objects belonging to that particular cluster.

B. Data Set

The data set is the group of n data points, of dimension m, on which the partitioning approach is to be applied. Multidimensional data can include any number of measurable attributes; each independent characteristic, or measurement, is one dimension. The consolidation of large, multidimensional data sets is the main purpose of the field of cluster analysis [7]. In this dissertation, the data set (Dn) is two-dimensional and contains twenty data points. Each data point is represented by a pixel (x, y), where x represents the data point's first dimension and y its second dimension. Visualizations of the data set are also produced as a set of twenty pixels.

C. K-Means Step

The basic step of k-means clustering is to give the number of clusters k and to take the first k objects of data set Dn as the initial clusters and their centroids. The k-means algorithm then iterates over the three steps below until it is stable (no object moves between groups):

1. Determine the centroid coordinates.
2. Determine the distance of each object to the centroids.
3. Group the objects based on minimum distance.

D. K-Medoids or Partition Around Medoids (PAM) Step

The basic step of k-medoids or PAM clustering is to give the number of clusters k and to take the first k objects of data set Dn as the initial clusters and their medoids. The k-medoids or PAM algorithm then iterates over the steps below until it is stable (no object changes group):

1. Design the clusters and compute the distance over all the clusters.
2. Randomly select non-medoid objects as new medoids.
3. Compute the new distance with respect to the new medoids.

III. COMPARISON OF ALGORITHMS

A. Comparison of algorithms' running time

Name of algorithm    Computational complexity
k-means              O(nkt)
k-medoids            O(k(n − k)²)
D-M                  O(ni)

B. Comparison of SSE (sum of square error) of the algorithms

Name of algorithm    Two clusters    Four clusters
k-means              1053.6022       681.4046
k-medoids            1082.002        464.5469
D-M                  1053.6022       392.6812

For each object in each cluster, the distance from the object to its cluster center is squared and the distances are summed. In other words, the square error is the sum of the distances of all objects in the data set with respect to the centroid or medoid of each object's cluster. It can also be defined as:

E = Σ (j = 1 to k) Σ (i = 1 to n) |pij − mj|²

where E is the sum of square error for all objects in the data set, pij is the point representing the ith object in the jth cluster, and mj is the mean of the jth cluster (both pij and mj are multidimensional). This criterion tries to make the resulting k clusters as compact and as separate as possible.

IV. VALIDATION

Validation is the measurement of the similarity between the actual classes or groups and the cluster results of any clustering technique. It can be measured with different measures such as Purity, Entropy, F-measure, and Coefficient of Variation, of which the Coefficient of Variation is the best measure of validation.
The Coefficient of Variation is a measure of data dispersion. The CV is defined as the ratio of the standard deviation to the mean. It is a dimensionless number that allows comparison of the variation of populations that have significantly different mean values. In general, the larger the CV value, the greater the variability in the data. The Coefficient of Variation can be calculated as follows:

1. Find the mean of each class individually in the data set:

   x̄ = (Σ i=1..n xi) / n

   where x̄ is the mean of a class having n objects.

2. Calculate the global mean of all the classes from their means:

   x̄g = (Σ j=1..N x̄j) / N

   where x̄g is the global mean of the N classes and x̄j is the mean of the jth class.

3. Find the standard deviation of the class means:

   s = √( (Σ j=1..N (x̄j − x̄g)²) / N )

4. Find the Coefficient of Variation (CV):

   CV = s / x̄g

   where CV is the Coefficient of Variation, s is the standard deviation, and x̄g is the global mean.

V. RESULTS

The following are the results of implementing the algorithms above, compiled with a C++ compiler, for different values of k (the number of clusters) on the same two-dimensional data set with n = 20.

A. Implementation view of K-Means

As shown in the figure, the value k = 2 gives two clusters, represented by centroids CD:1 and CD:2 respectively. The deviation of the clusters is shown on the right-hand side of the figure, i.e. 526.8011 in each cluster, totaling 1053.602. The number of data points is ten in each cluster.

Fig. Output of K-Means with k=2

B. Implementation view of K-Medoids

As shown in the figure, the value k = 2 gives two clusters, represented by medoids CD:1 and CD:2 respectively, which are also data points of the data set. The deviation of the clusters is shown on the right-hand side of the figure, i.e. 465.2147 in the 1st cluster and 616.7882 in the 2nd cluster, totaling 1082.002. There are 11 data points in the 1st cluster and 9 in the 2nd. It is also important to mention here that the total distance, i.e. the SSE, of k-medoids is slightly more than that of k-means, i.e. 1053.602.

With k = 3, three clusters are obtained, represented by medoids CD:1, CD:2, and CD:3 respectively, which are also data points of the data set. The deviation of the clusters is shown on the right-hand side of the figure, i.e. 118.6493 in the 1st cluster, 522.7472 in the 2nd, and 145.4263 in the 3rd, totaling 786.8229. There are 5 data points in the 1st cluster, 10 in the 2nd, and 5 in the 3rd. Again, the total distance, i.e. the SSE, of k-medoids is slightly more than that of k-means, i.e. 723.1417.
Fig. Output of K-Medoids with k=2

VI. CONCLUSION

In partitioning-based clustering algorithms, the number of final clusters (k) needs to be determined beforehand. These algorithms also suffer from problems such as susceptibility to local optima, sensitivity to outliers, memory requirements, validity, and an unknown number of iteration steps required to cluster. An algorithm that discovers the number of clusters automatically while addressing all these problems is therefore an active research area, and most research has focused on the effectiveness and scalability of such algorithms. In this paper, a partitioning-based algorithm, D-M (Density Means) clustering, which generates clusters automatically, has been considered.

ACKNOWLEDGMENT

The authors are quite thankful to the Computer Science Department at N.C. College of Engineering for supporting our work.

REFERENCES

[1] J. Han and M. Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000.
[2] P. S. Bradley and U. M. Fayyad, "Initial Points for K-Means Clustering," in Advances in Knowledge Discovery and Data Mining. MIT Press.
[3] Two Crows Corporation, Introduction to Data Mining and Knowledge Discovery, 3rd ed.
[4] K. Teknomo, "K-Means Clustering Tutorial."
[5] D. Hand et al., Principles of Data Mining.
[6] P. Berkhin, "Survey of Clustering Data Mining Techniques."
[7] V. Faber, "Clustering and the Continuous k-Means Algorithm."
[8] Introduction of Clustering.
[9] Y. Shang and G.-J. Li, "New Crossover Operators in Genetic Algorithms," National Research Center for Intelligent Computing Systems (NCIC), P. R. China, 1991.
[10] V. Singh and S. Choudhary, "Genetic Algorithm for Traveling Salesman Problem: Using Modified Partially-Mapped Crossover Operator," Department of Computer Science & Engineering, Faculty of Engineering & Technology, Mody Institute of Technology & Science, Lakshmangarh, Sikar, Rajasthan, India, 2009.
[11] F. Su et al., "New Crossover Operator of Genetic Algorithms for the TSP," Computer School of Wuhan University, Wuhan, P. R. China, 2009.

Ms. Bhanu Sukhija obtained the B.Tech degree in Computer Science and Engineering from Kurukshetra University, Kurukshetra, India in 2009. She is presently working as a Lecturer in the Department of Computer Science and Engineering at N.C. College of Engineering, Israna, Panipat, and is pursuing her M.Tech from the same institute. She has guided several B.Tech projects. Her research interests include data mining.

Mr. Sukhvir Singh obtained the M.Tech (Integrated) degree in Software Engineering and System Analysis in 1996. He is presently pursuing a PhD from M.D. University, Rohtak. He has worked at P.D.M. Polytechnic, Sarai Aurangabad, Bahadurgarh, and at P.D.M. College of Engineering, Bahadurgarh. He presently heads the Department of Computer Science at N.C. College of Engineering, Israna (Panipat), and has received a grant of Rs. 4 lakh from AICTE under the MODROBS scheme for an Advanced Computer Network Lab. He is also a member of the Computer Society of India and a lifetime member of ISTE.