International Journal of Computer Applications Technology and Research
Volume 2 – Issue 1, 1-5, 2013

New Approach for K-mean and K-medoids Algorithm

Abhishek Patel
Department of Information & Technology,
Parul Institute of Engineering & Technology,
Vadodara, Gujarat, India

Purnima Singh
Department of Computer Science & Engineering,
Parul Institute of Engineering & Technology,
Vadodara, Gujarat, India

Abstract: K-means and K-medoids clustering algorithms are widely used for many practical applications. The original k-means and k-medoids algorithms select the initial centroids and medoids randomly, which affects the quality of the resulting clusters and sometimes generates unstable and empty clusters that are meaningless. The original algorithms are also computationally expensive, requiring time proportional to the product of the number of data items, the number of clusters and the number of iterations. The new approach for the k-means algorithm eliminates this deficiency of the existing k-means: it first calculates the k initial centroids as per the requirements of the users and then gives better, effective and stable clusters. It also takes less execution time because it eliminates unnecessary distance computations by reusing results from the previous iteration. The new approach for k-medoids selects the initial k medoids systematically, based on the initial centroids, and generates stable clusters to improve accuracy.

Keywords: K-means; K-medoids; centroids; clusters
1. INTRODUCTION
Technology advances have made data collection easier
and faster, resulting in larger, more complex datasets with
many objects and dimensions. Important information is
hidden in this data. Data Mining has become an intriguing and
interesting topic for the information extraction from such data
collection since the past decade. Furthermore, there are so
many subtopics related to it that research has become a
fascination for data miners. Cluster analysis or primitive
exploration with little or no prior knowledge, consists of
research developed across a wide variety of communities.
The goal of a clustering algorithm is to group similar data
points in the same cluster while putting dissimilar data points
in different clusters. It groups the data in such a way that
intra-cluster similarity is maximized and inter-cluster
similarity is minimized. Clustering is an unsupervised
learning technique of machine learning whose purpose is to
give the machine the ability to find some hidden structure
within data.
K-means and K-medoids are among the simplest and most
widely used partition-based unsupervised learning algorithms
that solve the well-known clustering problem. The procedure
follows a simple and easy way to classify a given data set
through a certain number of clusters (assume k clusters) fixed
a priori.
In this direction, we have tried to put our efforts into
enhancing the partition-based clustering algorithms to improve
accuracy, generate better and more stable clusters, reduce their
time complexity and efficiently scale to large data sets. For
this we have used the concept of "initial centroids", which has
proved to be a better option.
2. ANALYSIS OF EXISTING SYSTEM
K-means Clustering
K-means is one of the simplest unsupervised learning
algorithms that solve the well-known clustering problem. The
procedure follows a simple and easy way to classify a given
data set through a certain number of clusters (assume k
clusters) fixed a priori. The main idea is to define k centroids,
one for each cluster. These centroids should be placed in a
cunning way, because different locations cause different
results. So, the better choice is to place them as far away
from each other as possible. The next step is to take
each point belonging to the given data set and associate it to the
nearest centroid. When no point is pending, the first step is
completed and an early grouping is done. At this point, the
method needs to re-calculate k new centroids as barycenters
of the clusters resulting from the previous step. After these k
new centroids are obtained, a new binding has to be done
between the same data set points and the nearest new centroid,
so a loop has been generated. As a result of this loop we may
notice that the k centroids change their location step by step
until no more changes are done; in other words, the centroids
do not move any more. Finally, this algorithm aims at
minimizing an objective function, in this case a squared error
function. The objective function is

J = Σ_{j=1}^{k} Σ_{i=1}^{n} || x_i^(j) − c_j ||^2,

where || x_i^(j) − c_j ||^2 is a chosen distance measure between a
data point x_i^(j) and the cluster centre c_j, and is an indicator of
the distance of the n data points from their respective cluster
centres.
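As an illustration, this objective can be computed directly; the following helper is our own sketch (not from the paper), assuming the squared Euclidean distance:

```python
def sse(clusters, centroids):
    # Squared-error objective J: for each cluster, sum the squared
    # distances from its points to its cluster centre.
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, c))
        for cl, c in zip(clusters, centroids)
        for p in cl
    )

# Two points at distance 1 on either side of their centre each contribute 1:
print(sse([[(0, 0), (2, 0)]], [(1, 0)]))  # → 2
```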
The algorithm is composed of the following steps:
1. Place K points into the space represented by the
objects that are being clustered. These points
represent initial group centroids.
2. Assign each object to the group that has the closest
centroid.
3. When all objects have been assigned, recalculate the
positions of the K centroids.
4. Repeat Steps 2 and 3 until the centroids no longer
move. This produces a separation of the objects into
groups from which the metric to be minimized can
be calculated.
Time Complexity & Space Complexity:
Let n = number of objects,
k = number of clusters,
t = number of iterations.
The time complexity of the k-means algorithm is O(nkt) and its
space complexity is O(n+k).
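The k-means procedure above can be sketched as follows; this is a minimal illustration of the standard algorithm with our own function and variable names, not the paper's code:

```python
import random

def kmeans(points, k, max_iter=100):
    # Step 1: place k initial centroids (here: a random sample of the objects).
    centroids = random.sample(points, k)
    for _ in range(max_iter):
        # Step 2: assign each object to the group with the closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            clusters[j].append(p)
        # Step 3: recalculate each centroid as the mean of its cluster.
        new_centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
        # Step 4: repeat until the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters
```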
Observation:
The k-means algorithm is a popular clustering algorithm
applied widely, but the standard algorithm, which selects k
objects randomly from the population as initial centroids,
cannot always give a good and stable clustering. Selecting the
centroids by our algorithm can lead to a better clustering.
K-medoids Algorithm
The k-medoids algorithm is a clustering algorithm related to
the k-means algorithm and the medoidshift algorithm. Both
the k-means and k-medoids algorithms are partitional
(breaking the data set up into groups) and both attempt to
minimize squared error, the distance between points labeled to
be in a cluster and a point designated as the centre of that
cluster. In contrast to the k-means algorithm, k-medoids
chooses data points as centres.
K-medoid is a classical partitioning technique of clustering
that clusters the data set of n objects into k clusters known a
priori. A useful tool for determining k is the silhouette. It is
more robust to noise and outliers as compared to k-means.
A medoid can be defined as the object of a cluster whose
average dissimilarity to all the objects in the cluster is
minimal, i.e. it is the most centrally located point in the given
data set.
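This definition translates directly into code; a small sketch of our own (any dissimilarity function can be plugged in):

```python
def medoid(cluster, dist):
    # The medoid is the object whose total (equivalently, average)
    # dissimilarity to all the objects in the cluster is minimal.
    return min(cluster, key=lambda x: sum(dist(x, y) for y in cluster))

manhattan = lambda a, b: sum(abs(p - q) for p, q in zip(a, b))
print(medoid([(0, 0), (1, 0), (10, 0)], manhattan))  # → (1, 0)
```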
The most common realization of k-medoid clustering is the
Partitioning Around Medoids (PAM) algorithm, which is as
follows:
1. Initialize: randomly select k of the n data points as
the medoids.
2. Associate each data point with the closest medoid
("closest" here is defined using any valid distance
metric, most commonly Euclidean distance,
Manhattan distance or Minkowski distance).
3. For each medoid m:
For each non-medoid data point o: swap m and o
and compute the total cost of the configuration.
4. Select the configuration with the lowest cost.
5. Repeat steps 2 to 4 until there is no change in the
medoids.
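A direct (naive) sketch of these PAM steps, using the Manhattan distance; `pam` and `total_cost` are our own illustrative names:

```python
import random

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def total_cost(points, medoids):
    # Each point contributes its distance to the nearest medoid (step 2).
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

def pam(points, k):
    # Step 1: randomly select k of the n data points as the medoids.
    medoids = random.sample(points, k)
    while True:
        best, best_cost = list(medoids), total_cost(points, medoids)
        # Steps 3-4: try every medoid/non-medoid swap, keep the cheapest configuration.
        for m in medoids:
            for o in points:
                if o in medoids:
                    continue
                candidate = [o if x == m else x for x in medoids]
                cost = total_cost(points, candidate)
                if cost < best_cost:
                    best, best_cost = candidate, cost
        # Step 5: repeat until there is no change in the medoids.
        if best == medoids:
            return medoids
        medoids = best
```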
Demonstration of PAM
Cluster the following data set of ten objects into two clusters,
i.e. k = 2. Consider a data set of ten objects as follows:
Table 1 – Data points for distribution of the data

Object   x   y
X1       2   6
X2       3   4
X3       3   8
X4       4   7
X5       6   2
X6       6   4
X7       7   3
X8       7   4
X9       8   5
X10      7   6
Figure 1 – Distribution of the data
Step 1
Initialize k centres. Let us assume c1 = (3,4) and c2 = (7,4), so
here c1 and c2 are selected as medoids. Calculate the distances
so as to associate each data object with its nearest medoid.
Cost is calculated using the Minkowski distance metric with
r = 1 (i.e. the Manhattan distance).
Table 2 – Cost calculation for distribution of the data

c1 = (3, 4)                    c2 = (7, 4)
Data object (Xi)   Cost        Data object (Xi)   Cost
(2, 6)             3           (2, 6)             7
(3, 8)             4           (3, 8)             8
(4, 7)             4           (4, 7)             6
(6, 2)             5           (6, 2)             3
(6, 4)             3           (6, 4)             1
(7, 3)             5           (7, 3)             1
(8, 5)             6           (8, 5)             2
(7, 6)             6           (7, 6)             2
So the clusters become:
Cluster1 = {(3,4), (2,6), (3,8), (4,7)}
Cluster2 = {(7,4), (6,2), (6,4), (7,3), (8,5), (7,6)}
Since the points (2,6), (3,8) and (4,7) are closer to c1, they
form one cluster, whilst the remaining points form the other
cluster. The total cost involved is 20.
The cost between any two points is found using the formula

cost(x, c) = Σ_{i=1}^{d} | x_i − c_i |,

where x is any data object, c is the medoid, and d is the
dimension of the object, which in this case is 2. The total cost
is the summation of the costs of the data objects from the
medoids of their clusters, so here:
total cost = {cost((3,4),(2,6)) + cost((3,4),(3,8)) + cost((3,4),(4,7))}
+ {cost((7,4),(6,2)) + cost((7,4),(6,4)) + cost((7,4),(7,3))}
+ {cost((7,4),(8,5)) + cost((7,4),(7,6))}
= (3+4+4) + (3+1+1+2+2)
= 20
Figure 2 – clusters after step 1
Step 2
Select a non-medoid O′ randomly. Let us assume O′ = (7,3),
so now the medoids are c1 = (3,4) and O′ = (7,3). If c1 and O′
are the new medoids, calculate the total cost involved using
the formula in step 1.
Table 3 – Cost calculation for distribution of the data

c1 = (3, 4)                    O′ = (7, 3)
Data object (Xi)   Cost        Data object (Xi)   Cost
(2, 6)             3           (2, 6)             8
(3, 8)             4           (3, 8)             9
(4, 7)             4           (4, 7)             7
(6, 2)             5           (6, 2)             2
(6, 4)             3           (6, 4)             2
(7, 4)             4           (7, 4)             1
(8, 5)             6           (8, 5)             3
(7, 6)             6           (7, 6)             3

Figure 3 – Clusters after step 2
Total cost = 3+4+4+2+2+1+3+3
= 22
So the cost of swapping the medoid from c2 to O′ is
S = current total cost – past total cost
= 22 – 20
= 2 > 0.
Since S > 0, moving to O′ would be a bad idea, so the previous
choice was good and the algorithm terminates here (i.e. there
is no change in the medoids).
It may happen that some data points shift from one cluster to
another cluster depending upon their closeness to the medoids.
For large values of n and k, such computations become very
costly.
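The arithmetic of this worked example can be checked mechanically; a short sketch of our own, using the Minkowski r = 1 (Manhattan) metric from step 1:

```python
def manhattan(a, b):
    # Minkowski distance with r = 1 (city-block distance).
    return sum(abs(x - y) for x, y in zip(a, b))

points = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
          (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]

def config_cost(medoids):
    # Associate each point with its nearest medoid and sum the distances.
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

cost1 = config_cost([(3, 4), (7, 4)])  # step 1: medoids c1, c2
cost2 = config_cost([(3, 4), (7, 3)])  # step 2: medoids c1, O'
print(cost1, cost2, cost2 - cost1)     # → 20 22 2
```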
3. PROPOSED PARTITION
METHOD ALGORITHM FOR NEW K-
MEANS AND K-MEDOIDS CLUSTERS
The proposed classical partition-method algorithm is based on
the initial centroids: how to assign a new data point to the
appropriate cluster and how to calculate the new centroid. The
proposed k-medoids algorithm addresses how the cluster seeds
are initialized, i.e. how the initial medoids are chosen. The
proposed algorithm is accurate and efficient because, with this
approach, significant computation is avoided by eliminating
all of the distance comparisons among points that do not fall
within a common cluster.
Also, so that the sizes of all the clusters are similar, it
calculates a threshold value which controls the size of each
cluster. Here we have to provide only the number of clusters
and the data set as input parameters.
3.1 Process in proposed algorithm.
The key idea behind calculating the initial centroids is that it
greatly improves accuracy and makes the algorithm more
stable. Based on the initial cluster centroids and medoids, the
data points are partitioned into the required number of
clusters. Calculating the initial centroids and medoids requires
only the input data points and the number of clusters k.
In the second stage the algorithm also stores the previous
cluster centroids and the Euclidean distance between two data
points. When a new centroid is calculated, whether a data
point is assigned to a new cluster or stays in its previous
cluster is decided by comparing the present Euclidean distance
between the two points with the previous Euclidean distance.
This greatly reduces the number of distance calculations
required for clustering.
3.2 Proposed Algorithm for New k-means
Instead of selecting the initial centroids randomly, for a stable
cluster the initial centroids are determined systematically. The
algorithm calculates the Euclidean distance between each pair
of data points, selects the two data points between which the
distance is the shortest, forms a data-point set which contains
these two data points, and then deletes them from the
population. It then finds the data point nearest to this set and
puts it into the set. The number of elements in each set is
decided by the initial population and the number of clusters.
In this way the different sets of data points are found; the
number of sets depends on the value of k. Finally, the mean
value of each set is calculated and becomes an initial centroid
of the proposed k-means algorithm.
After finding the initial centroids, it starts by forming the
initial clusters based on the relative distance of each data-
point from the initial centroids. These clusters are
subsequently fine-tuned by using a heuristic approach, thereby
improving the efficiency.
Input:
D = {d1, d2,......,dn} // set of n data items
k // Number of desired clusters
Output:
A set of k clusters
Steps:
1. Set p = 1.
2. Compute the distance between each data point and
all the other data points in the set D.
3. Find the closest pair of data points in the set D
and form a data-point set Ap (1 <= p <= k) which
contains these two data points; delete these two
data points from the set D.
4. Find the data point in D that is closest to the data-
point set Ap; add it to Ap and delete it from D.
5. Repeat step 4 until the number of data points in Ap
reaches 0.75*(n/k).
6. If p < k, then p = p + 1; find another pair of data
points in D between which the distance is the
shortest, form another data-point set Ap and delete
them from D; go to step 4.
7. For each data-point set Ap (1 <= p <= k), find the
arithmetic mean Cp of the vectors of the data points
in Ap; these means will be the initial centroids.
8. Compute the distance of each data point di
(1 <= i <= n) to all the centroids cj (1 <= j <= k) as
d(di, cj).
9. For each data point di, find the closest centroid cj
and assign di to cluster j.
10. Set ClusterId[i] = j. // j: id of the closest cluster
11. Set Nearest_Dist[i] = d(di, cj).
12. For each cluster j (1 <= j <= k), recalculate the
centroids.
13. Repeat
14. For each data point di:
14.1 Compute its distance from the centroid
of its present nearest cluster.
14.2 If this distance is less than or equal to
the present nearest distance, the data
point stays in the cluster;
Else
14.2.1 For every centroid cj
(1 <= j <= k), compute the distance
d(di, cj);
Endfor
14.2.2 Assign the data point di to the
cluster with the nearest centroid cj.
14.2.3 Set ClusterId[i] = j.
14.2.4 Set Nearest_Dist[i] = d(di, cj);
Endfor
15. For each cluster j (1 <= j <= k), recalculate the centroids;
Until the convergence criteria are met.
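Steps 1-7 of the initial-centroid selection can be sketched as follows; this is an illustrative reading of the pseudocode with our own function names, and ties and leftover points are handled in the obvious way:

```python
import math

def initial_centroids(data, k):
    # Steps 1-7: grow k data-point sets of size roughly 0.75*(n/k)
    # and return the mean of each set as an initial centroid.
    D = list(data)
    target = max(2, int(0.75 * (len(D) / k)))
    centroids = []
    for p in range(k):
        # Steps 3/6: find the closest pair remaining in D and seed a set Ap with it.
        a, b = min(((x, y) for i, x in enumerate(D) for y in D[i + 1:]),
                   key=lambda pair: math.dist(*pair))
        Ap = [a, b]
        D.remove(a); D.remove(b)
        # Steps 4-5: repeatedly add the point of D closest to the set Ap.
        while len(Ap) < target and D:
            nxt = min(D, key=lambda q: min(math.dist(q, r) for r in Ap))
            Ap.append(nxt)
            D.remove(nxt)
        # Step 7: the arithmetic mean of Ap becomes an initial centroid.
        centroids.append(tuple(sum(c) / len(Ap) for c in zip(*Ap)))
    return centroids
```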
Complexity Of Algorithm
The enhanced algorithm requires a time complexity of O(n²)
for finding the initial centroids, as the maximum time required
here is for computing the distances between each data point
and all the other data points in the set D.
In the original k-means algorithm, before the algorithm
converges the centroids are calculated many times and the
data points are assigned to their nearest centroids. Since
complete redistribution of the data points takes place
according to the new centroids, this takes O(nkl), where n
is the number of data points, k is the number of clusters and l
is the number of iterations.
To obtain the initial clusters, it requires O(nk). Here, some
data points remain in their cluster while the others move to
other clusters depending on their relative distance from the
new centroid and the old centroid. This requires O(1) if a data
point stays in its cluster and O(k) otherwise. As the algorithm
converges, the number of data points moving away from their
clusters decreases with each iteration. Assuming that half the
data points move from their clusters, this requires O(nk/2).
Hence the total cost of this phase of the algorithm is O(nk),
not O(nkl). Thus the overall time complexity of the improved
algorithm becomes O(n²), since k is much less than n.
3.3 Proposed Algorithm for New K-medoids
Unfortunately, K-means clustering is sensitive to outliers, and
a set of objects closest to a centroid may be empty, in which
case the centroid cannot be updated. For this reason, K-
medoids clustering is sometimes used, where representative
objects called medoids are considered instead of centroids.
Because it uses the most centrally located object in a cluster, it
is less sensitive to outliers compared with the K-means
clustering. Among many algorithms for K-medoids clustering,
Partitioning Around Medoids (PAM) proposed by Kaufman
and Rousseeuw (1990) is known to be most powerful.
However, PAM also has a drawback that it works inefficiently
for large data sets due to its complexity. We are interested in
developing a new K-medoids clustering method that should be
fast and efficient.
Algorithm for selecting the initial medoids
Input:
D = {d1, d2, ......, dn} // set of n data items
k // number of desired clusters
Output:
A set of k initial centroids Cm = {C1, C2, ....., Ck}
Steps:
1. Set p = 1.
2. Compute the distance between each data point and
all the other data points in the set D.
3. Find the closest pair of data points in the set D
and form a data-point set Ap (1 <= p <= k) which
contains these two data points; delete these two
data points from the set D.
4. Find the data point in D that is closest to the data-
point set Ap; add it to Ap and delete it from D.
5. Repeat step 4 until the number of data points in Ap
reaches 0.75*(n/k).
6. If p < k, then p = p + 1; find another pair of data
points in D between which the distance is the
shortest, form another data-point set Ap and delete
them from D; go to step 4.
7. For each data-point set Ap (1 <= p <= k), find the
arithmetic mean Cp of the vectors of the data points
in Ap; these means will be the initial centroids.
Algorithm for k-medoids
Input:
(1) A database of n objects
(2) A set of k initial centroids Cm = {C1, C2, ....., Ck}
Output:
A set of k clusters
Steps:
1. Initialize the initial medoids as the data points
(among the n data points) closest to the centroids
{C1, C2, ....., Ck}.
2. Associate each data point with the closest medoid
("closest" here is defined using any valid distance
metric, most commonly Euclidean distance,
Manhattan distance or Minkowski distance).
3. For each medoid m:
For each non-medoid data point o: swap m and o
and compute the total cost of the configuration.
4. Select the configuration with the lowest cost.
5. Repeat steps 2 to 4 until there is no change in the
medoids.
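Step 1 above, picking as medoids the data points closest to the pre-computed centroids, can be sketched as follows (our own helper name; the centroid values in the example are purely illustrative):

```python
import math

def initial_medoids(points, centroids):
    # Step 1: for each initial centroid, the medoid seed is the actual
    # data point closest to it.
    return [min(points, key=lambda p: math.dist(p, c)) for c in centroids]

points = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
          (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]
print(initial_medoids(points, [(6.67, 3.67), (3.0, 7.0)]))  # → [(7, 4), (3, 8)]
```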
Complexity Of Algorithm
The enhanced algorithm requires a time complexity of O(n²)
for finding the initial centroids, as the maximum time required
here is for computing the distances between each data point
and all the other data points in the set D. The complexity of
the remaining part of the algorithm is O(k(n−k)²) because it is
just like the PAM algorithm. So the overall complexity of the
algorithm is O(n²), since k is much less than n.
4. CONCLUSION
The k-means algorithm is widely used for clustering large sets
of data. But the standard algorithm does not always guarantee
good results, as the accuracy of the final clusters depends on
the selection of the initial centroids. Moreover, the
computational complexity of the standard algorithm is
objectionably high owing to the need to reassign the data
points a number of times during every iteration of the loop.
We have presented an enhanced k-means algorithm which
combines a systematic method for finding the initial centroids
with an efficient way of assigning data points to clusters. This
method ensures that the entire process of clustering runs in
O(n²) time without sacrificing the accuracy of the clusters.
Previous improvements of the k-means algorithm compromise
on either accuracy or efficiency.
The proposed k-medoids algorithm runs just like the K-means
clustering algorithm. The proposed algorithm uses a
systematic method of choosing the initial medoids. The
performance of the algorithm may vary according to the
method of selecting the initial medoids. It is more efficient
than the existing k-medoids algorithm. The time complexity
of clustering is O(n²), again without sacrificing the accuracy
of the clusters.
5. FUTURE WORK
In the new approach to classical partition-based clustering
algorithms, the value of k (the number of desired clusters) is
given as an input parameter, regardless of the distribution of
the data points. It would be better to develop some statistical
methods to compute the value of k, depending on the data
distribution.

Integrated System for Vehicle Clearance and Registration
Integrated System for Vehicle Clearance and RegistrationIntegrated System for Vehicle Clearance and Registration
Integrated System for Vehicle Clearance and Registration
Editor IJCATR
 
Energy-Aware Routing in Wireless Sensor Network Using Modified Bi-Directional A*
Energy-Aware Routing in Wireless Sensor Network Using Modified Bi-Directional A*Energy-Aware Routing in Wireless Sensor Network Using Modified Bi-Directional A*
Energy-Aware Routing in Wireless Sensor Network Using Modified Bi-Directional A*
Editor IJCATR
 
Hangul Recognition Using Support Vector Machine
Hangul Recognition Using Support Vector MachineHangul Recognition Using Support Vector Machine
Hangul Recognition Using Support Vector Machine
Editor IJCATR
 

More from Editor IJCATR (20)

Text Mining in Digital Libraries using OKAPI BM25 Model
 Text Mining in Digital Libraries using OKAPI BM25 Model Text Mining in Digital Libraries using OKAPI BM25 Model
Text Mining in Digital Libraries using OKAPI BM25 Model
 
Green Computing, eco trends, climate change, e-waste and eco-friendly
Green Computing, eco trends, climate change, e-waste and eco-friendlyGreen Computing, eco trends, climate change, e-waste and eco-friendly
Green Computing, eco trends, climate change, e-waste and eco-friendly
 
Policies for Green Computing and E-Waste in Nigeria
 Policies for Green Computing and E-Waste in Nigeria Policies for Green Computing and E-Waste in Nigeria
Policies for Green Computing and E-Waste in Nigeria
 
Performance Evaluation of VANETs for Evaluating Node Stability in Dynamic Sce...
Performance Evaluation of VANETs for Evaluating Node Stability in Dynamic Sce...Performance Evaluation of VANETs for Evaluating Node Stability in Dynamic Sce...
Performance Evaluation of VANETs for Evaluating Node Stability in Dynamic Sce...
 
Optimum Location of DG Units Considering Operation Conditions
Optimum Location of DG Units Considering Operation ConditionsOptimum Location of DG Units Considering Operation Conditions
Optimum Location of DG Units Considering Operation Conditions
 
Analysis of Comparison of Fuzzy Knn, C4.5 Algorithm, and Naïve Bayes Classifi...
Analysis of Comparison of Fuzzy Knn, C4.5 Algorithm, and Naïve Bayes Classifi...Analysis of Comparison of Fuzzy Knn, C4.5 Algorithm, and Naïve Bayes Classifi...
Analysis of Comparison of Fuzzy Knn, C4.5 Algorithm, and Naïve Bayes Classifi...
 
Web Scraping for Estimating new Record from Source Site
Web Scraping for Estimating new Record from Source SiteWeb Scraping for Estimating new Record from Source Site
Web Scraping for Estimating new Record from Source Site
 
Evaluating Semantic Similarity between Biomedical Concepts/Classes through S...
 Evaluating Semantic Similarity between Biomedical Concepts/Classes through S... Evaluating Semantic Similarity between Biomedical Concepts/Classes through S...
Evaluating Semantic Similarity between Biomedical Concepts/Classes through S...
 
Semantic Similarity Measures between Terms in the Biomedical Domain within f...
 Semantic Similarity Measures between Terms in the Biomedical Domain within f... Semantic Similarity Measures between Terms in the Biomedical Domain within f...
Semantic Similarity Measures between Terms in the Biomedical Domain within f...
 
A Strategy for Improving the Performance of Small Files in Openstack Swift
 A Strategy for Improving the Performance of Small Files in Openstack Swift  A Strategy for Improving the Performance of Small Files in Openstack Swift
A Strategy for Improving the Performance of Small Files in Openstack Swift
 
Integrated System for Vehicle Clearance and Registration
Integrated System for Vehicle Clearance and RegistrationIntegrated System for Vehicle Clearance and Registration
Integrated System for Vehicle Clearance and Registration
 
Assessment of the Efficiency of Customer Order Management System: A Case Stu...
 Assessment of the Efficiency of Customer Order Management System: A Case Stu... Assessment of the Efficiency of Customer Order Management System: A Case Stu...
Assessment of the Efficiency of Customer Order Management System: A Case Stu...
 
Energy-Aware Routing in Wireless Sensor Network Using Modified Bi-Directional A*
Energy-Aware Routing in Wireless Sensor Network Using Modified Bi-Directional A*Energy-Aware Routing in Wireless Sensor Network Using Modified Bi-Directional A*
Energy-Aware Routing in Wireless Sensor Network Using Modified Bi-Directional A*
 
Security in Software Defined Networks (SDN): Challenges and Research Opportun...
Security in Software Defined Networks (SDN): Challenges and Research Opportun...Security in Software Defined Networks (SDN): Challenges and Research Opportun...
Security in Software Defined Networks (SDN): Challenges and Research Opportun...
 
Measure the Similarity of Complaint Document Using Cosine Similarity Based on...
Measure the Similarity of Complaint Document Using Cosine Similarity Based on...Measure the Similarity of Complaint Document Using Cosine Similarity Based on...
Measure the Similarity of Complaint Document Using Cosine Similarity Based on...
 
Hangul Recognition Using Support Vector Machine
Hangul Recognition Using Support Vector MachineHangul Recognition Using Support Vector Machine
Hangul Recognition Using Support Vector Machine
 
Application of 3D Printing in Education
Application of 3D Printing in EducationApplication of 3D Printing in Education
Application of 3D Printing in Education
 
Survey on Energy-Efficient Routing Algorithms for Underwater Wireless Sensor ...
Survey on Energy-Efficient Routing Algorithms for Underwater Wireless Sensor ...Survey on Energy-Efficient Routing Algorithms for Underwater Wireless Sensor ...
Survey on Energy-Efficient Routing Algorithms for Underwater Wireless Sensor ...
 
Comparative analysis on Void Node Removal Routing algorithms for Underwater W...
Comparative analysis on Void Node Removal Routing algorithms for Underwater W...Comparative analysis on Void Node Removal Routing algorithms for Underwater W...
Comparative analysis on Void Node Removal Routing algorithms for Underwater W...
 
Decay Property for Solutions to Plate Type Equations with Variable Coefficients
Decay Property for Solutions to Plate Type Equations with Variable CoefficientsDecay Property for Solutions to Plate Type Equations with Variable Coefficients
Decay Property for Solutions to Plate Type Equations with Variable Coefficients
 

Recently uploaded

Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
ciinovamais
 

Recently uploaded (20)

Role Of Transgenic Animal In Target Validation-1.pptx
Role Of Transgenic Animal In Target Validation-1.pptxRole Of Transgenic Animal In Target Validation-1.pptx
Role Of Transgenic Animal In Target Validation-1.pptx
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
psychiatric nursing HISTORY COLLECTION .docx
psychiatric  nursing HISTORY  COLLECTION  .docxpsychiatric  nursing HISTORY  COLLECTION  .docx
psychiatric nursing HISTORY COLLECTION .docx
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
Energy Resources. ( B. Pharmacy, 1st Year, Sem-II) Natural Resources
Energy Resources. ( B. Pharmacy, 1st Year, Sem-II) Natural ResourcesEnergy Resources. ( B. Pharmacy, 1st Year, Sem-II) Natural Resources
Energy Resources. ( B. Pharmacy, 1st Year, Sem-II) Natural Resources
 
ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdf
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Asian American Pacific Islander Month DDSD 2024.pptx
Asian American Pacific Islander Month DDSD 2024.pptxAsian American Pacific Islander Month DDSD 2024.pptx
Asian American Pacific Islander Month DDSD 2024.pptx
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 

New Approach for K-mean and K-medoids Algorithm

International Journal of Computer Applications Technology and Research, Volume 2, Issue 1, 1-5, 2013 (www.ijcat.com)

Abhishek Patel
Department of Information & Technology, Parul Institute of Engineering & Technology, Vadodara, Gujarat, India

Purnima Singh
Department of Computer Science & Engineering, Parul Institute of Engineering & Technology, Vadodara, Gujarat, India

Abstract: K-means and K-medoids clustering algorithms are widely used for many practical applications. The original k-means and k-medoids algorithms select the initial centroids and medoids randomly, which affects the quality of the resulting clusters and sometimes generates unstable and empty clusters that are meaningless. The original algorithms are also computationally expensive, requiring time proportional to the product of the number of data items, the number of clusters, and the number of iterations. The new approach for the k-means algorithm eliminates these deficiencies: it first calculates the k initial centroids as per the requirements of the users and then produces better, more effective, and stable clusters. It also takes less execution time because it eliminates unnecessary distance computations by reusing results from the previous iteration. The new approach for k-medoids selects the initial k medoids systematically, based on the initial centroids, and generates stable clusters to improve accuracy.

Keywords: K-means; K-medoids; centroids; clusters

1. INTRODUCTION

Technology advances have made data collection easier and faster, resulting in larger, more complex datasets with many objects and dimensions. Important information is hidden in this data. Over the past decade, data mining has become an intriguing and interesting topic for extracting information from such data collections, and its many subtopics have made this research a fascination for data miners. Cluster analysis, or primitive exploration with little or no prior knowledge, consists of research developed across a wide variety of communities.

The goal of a clustering algorithm is to group similar data points in the same cluster while putting dissimilar data points in different clusters. It groups the data in such a way that intra-cluster similarity is maximized and inter-cluster similarity is minimized. Clustering is an unsupervised learning technique of machine learning whose purpose is to give the machine the ability to find hidden structure within data. K-means and K-medoids are widely used, simple partition-based unsupervised learning algorithms that solve the well-known clustering problem: the procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori.

In this direction, we have tried to put our efforts into enhancing partition-based clustering algorithms to improve accuracy, generate better and more stable clusters, reduce time complexity, and efficiently scale to large data sets. For this we have used the concept of "initial centroids", which has proved to be a better option.

2. ANALYSIS OF EXISTING SYSTEM

K-means Clustering

K-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centroids, one for each cluster. These centroids should be placed in a cunning way, because different locations cause different results; the better choice is therefore to place them as far away from each other as possible. The next step is to take each point belonging to the data set and associate it with the nearest centroid. When no point is pending, the first step is completed and an early grouping is done. At this point the method re-calculates k new centroids as barycenters of the clusters resulting from the previous step, and a new binding is done between the data set points and the nearest new centroid. A loop has been generated: as a result of this loop, the k centroids change their location step by step until no more changes are done, i.e. the centroids do not move any more.

Finally, this algorithm aims at minimizing an objective function, in this case a squared error function:

J = sum_{j=1..k} sum_{i=1..n} || x_i^(j) - c_j ||^2

where || x_i^(j) - c_j ||^2 is a chosen distance measure between a data point x_i^(j) and the cluster centre c_j; it is an indicator of the distance of the n data points from their respective cluster centres.

The algorithm is composed of the following steps:

1. Place K points into the space represented by the objects that are being clustered. These points represent the initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.

Time Complexity & Space Complexity

Let n = number of objects, k = number of clusters, and t = number of iterations. The time complexity of the k-means algorithm is O(nkt) and its space complexity is O(n + k).
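The four steps above can be sketched in plain Python. This is a minimal illustration, not the paper's proposed method: all function and variable names are ours, the initial centroids are chosen randomly (as in the standard algorithm), and an empty cluster simply keeps its old centroid.

```python
import random

def euclidean(a, b):
    """Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def kmeans(points, k, max_iter=100, seed=0):
    """Standard k-means: random initial centroids (step 1), then alternate
    assignment (step 2) and centroid recalculation (step 3) until the
    centroids no longer move (step 4)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # step 1: pick K initial centroids
    clusters = []
    for _ in range(max_iter):
        # step 2: assign each point to the group with the closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: euclidean(p, centroids[i]))
            clusters[j].append(p)
        # step 3: recalculate centroid positions (keep the old one if empty)
        new_centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:          # step 4: stop when nothing moves
            break
        centroids = new_centroids
    return centroids, clusters

pts = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
       (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]
centroids, clusters = kmeans(pts, 2)
```

With t iterations, the two nested passes over n points and k centroids give exactly the O(nkt) time noted above.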
Observation: the k-means algorithm is a popular clustering algorithm applied widely, but the standard algorithm, which selects k objects randomly from the population as initial centroids, cannot always give a good and stable clustering. Selecting the centroids by our algorithm can lead to a better clustering.

K-medoids Algorithm

The k-medoids algorithm is a clustering algorithm related to the k-means algorithm and the medoidshift algorithm. Both the k-means and k-medoids algorithms are partitional (they break the data set up into groups), and both attempt to minimize squared error, the distance between points labeled to be in a cluster and a point designated as the center of that cluster. In contrast to the k-means algorithm, k-medoids chooses data points as centers. K-medoids is a classical partitioning technique of clustering that clusters the data set of n objects into k clusters known a priori (a useful tool for determining k is the silhouette). It is more robust to noise and outliers than k-means. A medoid can be defined as the object of a cluster whose average dissimilarity to all the objects in the cluster is minimal, i.e. it is the most centrally located point in the given data set.

The most common realization of k-medoids clustering is the Partitioning Around Medoids (PAM) algorithm, which is as follows:

1. Initialize: randomly select k of the n data points as the medoids.
2. Associate each data point with the closest medoid ("closest" here is defined using any valid distance metric, most commonly Euclidean distance, Manhattan distance or Minkowski distance).
3. For each medoid m and each non-medoid data point o, swap m and o and compute the total cost of the configuration.
4. Select the configuration with the lowest cost.
5. Repeat steps 2 to 4 until there is no change in the medoids.
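The PAM steps above can be sketched in a few lines of Python. This is a minimal greedy variant, with illustrative names of our own choosing: it accepts any single medoid/non-medoid swap that lowers the total cost and stops when none remains. It uses the Manhattan metric and the ten-point data set of the demonstration that follows.

```python
import random

def manhattan(a, b):
    """Minkowski distance with r = 1 (Manhattan distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def total_cost(points, medoids):
    """Configuration cost of step 3: each point contributes its
    distance to the nearest medoid."""
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

def pam(points, k, seed=0):
    """Steps 1-5: random initial medoids, then keep any swap of a medoid
    with a non-medoid that lowers the total cost, until no change."""
    rng = random.Random(seed)
    medoids = rng.sample(points, k)            # step 1: random initial medoids
    improved = True
    while improved:                            # step 5: stop when nothing changes
        improved = False
        for m in list(medoids):                # step 3: try every swap (m, o)
            for o in points:
                if o in medoids:
                    continue
                candidate = [o if x == m else x for x in medoids]
                if total_cost(points, candidate) < total_cost(points, medoids):
                    medoids = candidate        # step 4: keep the cheaper configuration
                    improved = True
    return medoids

# the ten-point data set used in the demonstration below
pts = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
       (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]
meds = pam(pts, 2)
```

As a sanity check, `total_cost(pts, [(3, 4), (7, 4)])` reproduces the cost of 20 computed in the demonstration, and `total_cost(pts, [(3, 4), (7, 3)])` the cost of 22 for the rejected swap.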
Demonstration of PAM

Cluster the following data set of ten objects into two clusters, i.e. k = 2.

Table 1 – Data points

X1 = (2, 6), X2 = (3, 4), X3 = (3, 8), X4 = (4, 7), X5 = (6, 2),
X6 = (6, 4), X7 = (7, 3), X8 = (7, 4), X9 = (8, 5), X10 = (7, 6)

Figure 1 – Distribution of the data

Step 1: Initialize the k centres. Let us assume c1 = (3,4) and c2 = (7,4), so c1 and c2 are selected as the medoids. Calculate the distances so as to associate each data object with its nearest medoid. Cost is calculated using the Minkowski distance metric with r = 1 (Manhattan distance).

Table 2 – Cost calculation for the distribution of the data

From c1 = (3,4):  (2,6) → 3   (3,8) → 4   (4,7) → 4   (6,2) → 5   (6,4) → 3   (7,3) → 5   (8,5) → 6   (7,6) → 6
From c2 = (7,4):  (2,6) → 7   (3,8) → 8   (4,7) → 6   (6,2) → 3   (6,4) → 1   (7,3) → 1   (8,5) → 2   (7,6) → 2

So the clusters become:

Cluster1 = {(3,4), (2,6), (3,8), (4,7)}
Cluster2 = {(7,4), (6,2), (6,4), (7,3), (8,5), (7,6)}

Since the points (2,6), (3,8) and (4,7) are closer to c1, they form one cluster, while the remaining points form the other. The total cost involved is 20, where the cost between any two points is found using

cost(x, c) = sum_{i=1..d} | x_i - c_i |

where x is any data object, c is the medoid, and d is the dimension of the object, which in this case is 2. The total cost is the summation of the costs of the data objects from the medoids of their clusters, so here:

total cost = {cost((3,4),(2,6)) + cost((3,4),(3,8)) + cost((3,4),(4,7))}
           + {cost((7,4),(6,2)) + cost((7,4),(6,4)) + cost((7,4),(7,3))}
           + {cost((7,4),(8,5)) + cost((7,4),(7,6))}
= (3 + 4 + 4) + (3 + 1 + 1 + 2 + 2) = 20

Figure 2 – Clusters after step 1

Step 2: Select a non-medoid O′ randomly. Let us assume O′ = (7,3), so now the candidate medoids are c1 = (3,4) and O′ = (7,3). Calculate the total cost involved for these new medoids, using the formula from step 1.

Table 3 – Cost calculation for the distribution of the data

From c1 = (3,4):  (2,6) → 3   (3,8) → 4   (4,7) → 4   (6,2) → 5   (6,4) → 3   (7,4) → 4   (8,5) → 6   (7,6) → 6
From O′ = (7,3):  (2,6) → 8   (3,8) → 9   (4,7) → 7   (6,2) → 2   (6,4) → 2   (7,4) → 1   (8,5) → 3   (7,6) → 3

Figure 3 – Clusters after step 2

Total cost = 3 + 4 + 4 + 2 + 2 + 1 + 3 + 3 = 22

So the cost of swapping the medoid from c2 to O′ is

S = current total cost − past total cost = 22 − 20 = 2 > 0.

Since S > 0, moving to O′ would be a bad idea, so the previous choice was good and the algorithm terminates here (i.e. there is no change in the medoids). Some data points may still shift from one cluster to another depending on their closeness to the medoids. For large values of n and k, such computations become very costly.

3. PROPOSED PARTITION METHOD ALGORITHM FOR NEW K-MEAN AND K-MEDOIDS CLUSTERS

The proposed algorithm for the classical partition method is based on the initial centroids, on how to assign a new data point to the appropriate cluster, and on how to calculate the new centroids. The proposed k-medoids algorithm changes how the cluster seeds are initialized, i.e. how the initial medoids are chosen. The proposed algorithm is accurate and efficient because this approach reduces significant computation by eliminating all of the distance comparisons among points that do not fall within a common cluster. Also, the sizes of all clusters are kept similar: the algorithm calculates a threshold value which controls the size of each cluster. Only the number of clusters and the data set have to be provided as input parameters.

3.1 Process in the proposed algorithm
The key idea behind calculating the initial centroids is that it greatly improves accuracy and makes the algorithm more stable. Based on the initial cluster centroids and medoids, the data points are partitioned into the required number of clusters. Calculating the initial centroids and medoids requires only the input data points and the number of clusters k. In the second stage, the algorithm also stores the previous cluster centroids and the Euclidean distance between each data point and its centroid. When a new centroid is calculated, whether a data point is assigned to a new cluster or stays in its previous cluster is decided by comparing the present Euclidean distance between the two points with the previous Euclidean distance. This greatly reduces the number of distance calculations required for clustering.

3.2 Proposed Algorithm for the New k-means

Instead of selecting the initial centroids randomly, for stable clusters the initial centroids are determined systematically. The algorithm calculates the Euclidean distance between each pair of data points, selects the two data points between which the distance is the shortest, forms a data-point set which contains these two points, and deletes them from the population. It then finds the data point nearest to this set and puts it into the set. The number of elements in each set is decided systematically by the initial population and the number of clusters. In this way different sets of data points are found; the number of sets depends on the value of k. The mean value of each set then becomes an initial centroid of the proposed k-means algorithm.

After finding the initial centroids, the algorithm forms the initial clusters based on the relative distance of each data point from the initial centroids. These clusters are
subsequently fine-tuned by using a heuristic approach, thereby improving the efficiency.

Input: D = {d1, d2, ..., dn} // set of n data items
       k // number of desired clusters
Output: a set of k clusters

Steps:

1. Set p = 1.
2. Compute the distance between each data point and all other data points in the set D.
3. Find the closest pair of data points from the set D and form a data-point set Ap (1 <= p <= k) which contains these two data points; delete these two data points from the set D.
4. Find the data point in D that is closest to the data-point set Ap; add it to Ap and delete it from D.
5. Repeat step 4 until the number of data points in Ap reaches 0.75*(n/k).
6. If p < k, then set p = p + 1; find another pair of data points from D between which the distance is the shortest, form another data-point set Ap, delete them from D, and go to step 4.
7. For each data-point set Ap (1 <= p <= k), find the arithmetic mean Cp of the vectors of the data points in Ap; these means will be the initial centroids.
8. Compute the distance of each data point di (1 <= i <= n) to all the centroids cj (1 <= j <= k) as d(di, cj).
9. For each data point di, find the closest centroid cj and assign di to cluster j.
10. Set ClusterId[i] = j. // j: id of the closest cluster
11. Set Nearest_Dist[i] = d(di, cj).
12. For each cluster j (1 <= j <= k), recalculate the centroids.
13. Repeat, for each data point di:
    13.1 Compute its distance from the centroid of the present nearest cluster.
    13.2 If this distance is less than or equal to the present nearest distance, the data point stays in the cluster; else:
         13.2.1 For every centroid cj (1 <= j <= k), compute the distance d(di, cj).
         13.2.2 Assign the data point di to the cluster with the nearest centroid cj.
         13.2.3 Set ClusterId[i] = j.
         13.2.4 Set Nearest_Dist[i] = d(di, cj).
    Then, for each cluster j (1 <= j <= k), recalculate the centroids, until the convergence criterion is met.

Complexity of the Algorithm

The enhanced algorithm requires a time complexity of O(n^2) for finding the initial centroids, as the maximum time required here is for computing the distances between each data point and all other data points in the set D. In the original k-means algorithm, before the algorithm converges the centroids are calculated many times and the data points are assigned to their nearest centroids. Since complete redistribution of the data points takes place according to the new centroids, this takes O(n*k*l), where n is the number of data points, k is the number of clusters and l is the number of iterations. Obtaining the initial clusters requires O(nk). Thereafter, some data points remain in their cluster while the others move to other clusters depending on their relative distances from the new centroid and the old centroid. This requires O(1) if a data point stays in its cluster and O(k) otherwise. As the algorithm converges, the number of data points moving away from their cluster decreases with each iteration. Assuming that half the data points move from their clusters, this phase requires O(nk/2). Hence the total cost of this phase of the algorithm is O(nk), not O(nkl). Thus the overall time complexity of the improved algorithm becomes O(n^2), since k is much less than n.
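The seeding phase (steps 1-7) can be sketched in Python as follows. This is our own illustrative reading of the steps, with hypothetical names: ties between equally close pairs are broken by input order, and a small guard (not in the paper) stops early if fewer than two points remain.

```python
def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def initial_centroids(data, k):
    """Steps 1-7: grow one data-point set Ap per cluster from the closest
    remaining pair, up to the 0.75*(n/k) threshold, then return the means."""
    d = list(data)
    limit = max(2, round(0.75 * len(data) / k))   # step 5 threshold: 0.75*(n/k)
    centroids = []
    for _ in range(k):
        if len(d) < 2:
            break                                 # guard for tiny inputs (not in the paper)
        # steps 2-3: the closest pair in the remaining population starts Ap
        pair = min(((a, b) for i, a in enumerate(d) for b in d[i + 1:]),
                   key=lambda ab: euclidean(*ab))
        ap = list(pair)
        d.remove(pair[0]); d.remove(pair[1])
        # steps 4-5: grow Ap with the nearest remaining point up to the threshold
        while len(ap) < limit and d:
            nearest = min(d, key=lambda q: min(euclidean(q, s) for s in ap))
            ap.append(nearest)
            d.remove(nearest)
        # step 7: the arithmetic mean of each set is an initial centroid
        centroids.append(tuple(sum(c) / len(ap) for c in zip(*ap)))
    return centroids

pts = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
       (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]
seeds = initial_centroids(pts, 2)
```

On the ten-point demonstration data with k = 2, each set grows to 0.75 * 10 / 2 ≈ 4 points, so every run of this deterministic procedure yields the same two seed centroids, which is the stability property the section argues for.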
3.3 Proposed Algorithm for New K-medoids
Unfortunately, k-means clustering is sensitive to outliers, and the set of objects closest to a centroid may be empty, in which case that centroid cannot be updated. For this reason, k-medoids clustering is sometimes used, where representative objects called medoids are considered instead of centroids. Because it uses the most centrally located object in a cluster, it is less sensitive to outliers than k-means clustering. Among the many algorithms for k-medoids clustering, Partitioning Around Medoids (PAM), proposed by Kaufman and Rousseeuw (1990), is known to be the most powerful. However, PAM has the drawback that it works inefficiently for large data sets due to its complexity. We are interested in developing a new k-medoids clustering method that is fast and efficient.

Algorithm for selecting the initial medoids
Input: D = {d1, d2, ..., dn} // set of n data items
       k                     // number of desired clusters
Output: A set of k initial centroids Cm = {C1, C2, ..., Ck}
Steps:
1. Set p = 1
2. Compute the distance between each data point and all other data points in the set D
3. Find the closest pair of data points in D and form a data-point set Ap (1 <= p <= k) containing these two points; delete them from D
4. Find the data point in D that is closest to the set Ap; add it to Ap and delete it from D
5. Repeat step 4 until the number of data points in Ap reaches 0.75*(n/k)
6. If p < k, set p = p + 1, find the pair of data points in D with the shortest distance between them, form another data-point set Ap, delete them from D, and go to step 4
7. For each data-point set Ap (1 <= p <= k), find the arithmetic mean Cp of the vectors of the data points in Ap; these means will be the initial centroids.

Algorithm for k-medoids
Input: (1) A database of n objects
       (2) A set of k initial centroids Cm = {C1, C2, ..., Ck}
Output: A set of k clusters
Steps:
1. Initialize the initial medoids as the data points closest to the centroids {C1, C2, ..., Ck}
2. Associate each data point with the closest medoid ("closest" here is defined using any valid distance metric, most commonly the Euclidean, Manhattan or Minkowski distance)
3. For each medoid m:
       For each non-medoid data point o:
           Swap m and o and compute the total cost of the configuration
4. Select the configuration with the lowest cost
5. Repeat steps 2 to 4 until there is no change in the medoids

Complexity of the Algorithm
The enhanced algorithm requires O(n²) time for finding the initial centroids, since the dominant cost is computing the distances between each data point and all other data points in the set D. The complexity of the remaining part of the algorithm is O(k(n-k)²), because it works just like the PAM algorithm. So the overall complexity of the algorithm is O(n²), since k is much less than n.

4. CONCLUSION
The k-means algorithm is widely used for clustering large sets of data.
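The medoid phase above can be sketched in Python as follows, assuming Euclidean distance and the swap-based cost search of steps 2-5. The names `medoids_near_centroids` and `k_medoids` are hypothetical, chosen for illustration; a real implementation would avoid recomputing the full cost for every candidate swap, as PAM's incremental cost update does.

```python
import numpy as np

def medoids_near_centroids(D, centroids):
    """Step 1: for each initial centroid, pick the data point closest to it."""
    D = np.asarray(D, dtype=float)
    return [int(np.linalg.norm(D - c, axis=1).argmin()) for c in centroids]

def k_medoids(D, medoid_idx, max_iter=100):
    """PAM-style refinement: greedily swap a medoid with a non-medoid
    while the total cost (sum of distances to nearest medoid) decreases."""
    D = np.asarray(D, dtype=float)
    # full pairwise Euclidean distance matrix, shape (n, n)
    dist = np.linalg.norm(D[:, None, :] - D[None, :, :], axis=2)
    medoids = list(medoid_idx)

    def total_cost(ms):
        # each point contributes its distance to the closest medoid
        return dist[:, ms].min(axis=1).sum()

    cost = total_cost(medoids)
    for _ in range(max_iter):
        best = (cost, None, None)
        for mi in range(len(medoids)):
            for o in range(len(D)):
                if o in medoids:
                    continue
                trial = medoids[:mi] + [o] + medoids[mi + 1:]
                c = total_cost(trial)
                if c < best[0]:
                    best = (c, mi, o)
        if best[1] is None:        # no improving swap: medoids are stable
            break
        cost, mi, o = best
        medoids[mi] = o
    labels = dist[:, medoids].argmin(axis=1)
    return medoids, labels
```

Each outer iteration tries all k*(n-k) swaps, which is where the O(k(n-k)²) term in the complexity analysis comes from.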
However, the standard algorithm does not always guarantee good results, as the accuracy of the final clusters depends on the selection of the initial centroids. Moreover, the computational complexity of the standard algorithm is objectionably high owing to the need to reassign the data points a number of times during every iteration of the loop. This paper has presented an enhanced k-means algorithm that combines a systematic method for finding the initial centroids with an efficient way of assigning data points to clusters. This method completes the entire clustering process in O(n²) time without sacrificing the accuracy of the clusters, whereas previous improvements of the k-means algorithm compromise on either accuracy or efficiency. The proposed k-medoids algorithm runs much like the k-means clustering algorithm, but uses a systematic method for choosing the initial medoids. The performance of the algorithm may vary according to the method of selecting the initial medoids. It is more efficient than the existing k-medoids algorithm, and its clustering time complexity is likewise O(n²) without sacrificing the accuracy of the clusters.

5. FUTURE WORK
In the new approach to the classical partition-based clustering algorithms, the value of k (the number of desired clusters) is given as an input parameter, regardless of the distribution of the data points. It would be better to develop statistical methods to compute the value of k, depending on the data distribution.

6. REFERENCES
[1] Dechang Pi, Xiaolin Qin and Qiang Wang, "Fuzzy Clustering Algorithm Based on Tree for Association Rules", International Journal of Information Technology, Vol. 12, No. 3, 2006.
[2] Fahim A.M., Salem A.M., "Efficient enhanced k-means clustering algorithm", Journal of Zhejiang University Science, 1626-1633, 2006.
[3] Fang Yuan, Zeng Hui Meng, "A New Algorithm to Get the Initial Centroids", Third International Conference on Machine Learning and Cybernetics, Shanghai, 26-29 August, 1191-1193, 2004.
[4] Friedrich Leisch and Bettina Grün, "Extending Standard Cluster Algorithms to Allow for Group Constraints".
[5] Maria Camila N. Barioni, Humberto L. Razente, Agma J. M. Traina, "An Efficient Approach to Scale Up k-medoid Based Algorithms in Large Databases", 265-279.
[6] Michael Steinbach, Levent Ertöz and Vipin Kumar, "Challenges in High Dimensional Data Sets".
[7] Zhexue Huang, "A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining".
[8] Rui Xu, Donald Wunsch, "Survey of Clustering Algorithms", IEEE Transactions on Neural Networks, Vol. 16, No. 3, May 2005.
[9] Vance Faber, "Clustering and the Continuous k-Means Algorithm", Los Alamos Science, No. 22, 138-144.