SlideShare a Scribd company logo
1 of 4
Download to read offline
ISSN: 2277 – 9043
                                                 International Journal of Advanced Research in Computer Science and Electronics Engineering
                                                                                                                Volume 1, Issue 5, July 2012



        Improved K-Means clustering technique using
              distance determination approach
                                                    Bhanu Sukhija, Sukhvir Singh


                                                                          data owner. OR Data Mining also refers to extracting or
Abstract— Powerful systems for collecting data and managing                mining from large amount of data. This is also sometimes
it in large databases are in place in all large and mid-range              referred to as Knowledge Discovery from Data (KDD). Data
organizations. The value of raw data (collected over a long time)          mining tools predict future trends and behaviors, allowing
is on the ability to extract high-level information: information           businesses to make proactive, knowledge-driven decisions.
useful for decision support, for exploration, and for better               Data mining tools can answer business questions that
understanding of the phenomena generating the data.
                                                                           traditionally were time consuming to resolve.
Traditionally this task of extracting information was done with
the help of analysis where one or more analysts with the help of
statistical techniques provide summaries and generate reports.
                                                                           Clustering, the grouping of the objects of a database into
Such an approach fails as the volume and dimensionality of the             meaningful subclasses such that similarity of objects in the
data increase. Hence tools to aid the automation of analysis               same group is maximized and similarity of objects in
tasks are becoming a necessity. Thus, data mining was evolved              different groups is minimized, is called clustering. Thus the
which is “automatic”: extraction of patterns of information                objects are clustered or grouped based on the principle of
from data. Clustering, the grouping of the objects of a database           ―maximizing the intra class similarity and minimizing the
into meaningful subclasses such that similarity of objects in the          interclass similarity‖. Clustering has its roots in many areas,
same group is maximized and similarity of objects in different             including data mining, statistics, biology, and machine
groups is minimized, is called clustering But my focus is on
                                                                           learning. Clustering methods can be divided into various
partitioning methods only , and proposing a new clustering
algorithm for automatic discovery of data clusters
                                                                           types: Partitioning methods, Hierarchical methods, Density
                                                                           based methods, Grid-based methods, Model based methods,
  Index Terms — data mining, data sets, clustering techniques              Probabilistic techniques, Graph theoretic and Fuzzy methods.
k-Mean.                                                                    But focus of thesis is on partitioning methods only, and
                                                                           proposing a new clustering algorithm for automatic
                                                                           discovery of data clusters.
                         I. INTRODUCTION                                   A. Key Management and LAN
Importance of collecting the data that reflect business or                 Many international organizations produce a huge quantity of
scientific activities to achieve competitive advantage is                  information in a week than many other persons could read in
widely recognized now. Powerful systems for collecting data                a lifetime. The situation is even more alarming in worldwide
and managing it in large databases are in place in all large and           networks like Internet. Everyday hundreds of megabytes of
mid-range organizations. The value of raw data (collected                  data are distributed around the world, but it is no longer
over a long time) is on the ability to extract high-level                  possible to monitor this increasingly rapid development – the
information: information useful for decision support, for                  growth is nearly exponential. So the basic problems with the
exploration, and for better understanding of the phenomena                 management of the data can be summarized below:
generating the data. Traditionally this task of extracting
information was done with the help of analysis where one or                              Analysis required unearthing the hidden
more analysts with the help of statistical techniques provide                             relationships within the data i.e. for decision
summaries and generate reports. Such an approach fails as                                 support.
the volume and dimensionality of the data increase. Hence
tools to aid the automation of analysis tasks are becoming a                             DBMS gave access to the data stored but no
necessity. Thus, data mining was evolved which is                                         analysis of data.
―automatic‖: extraction of patterns of information from data.              Size of databases has increased and needs automated
                                                                           techniques for analysis as they have grown beyond manual
Data mining is the analysis of large observational data sets to
                                                                           extraction.
find unsuspected relationships and to summarize the data in
novel ways that are both understandable and useful to the

Manuscript received july 20, 2012.                                                            II. METHODOLOGY
Bhanu Sukhija, computer science, kurukshetra university y/ N.C College/
N.C and S.D group of institutions, (e-mail:bhanu_sukhija57@yahoo.com).
                                                                           As already stated, dissertation is based on methodology of
Panipat, Haryana, India, 07206089757                                       Iterative Relocating Technique of Partitional Clustering i.e.
Sukhvir Singh, computer science department , Kurukshetra university/ N.C   k-means and k-medoids. All the algorithms are implemented
College/ N.C and S.D group of institutions, Panipat,, Haryana, India, ,    in C++ language and results are studied in same environment.
(e-mail: boora_s@yahoomail.com).
                                                                           But, before going through the study of these algorithms, their

                                                                                                                                               31
                                                   All Rights Reserved © 2012 IJARCSEE
ISSN: 2277 – 9043
                                                 International Journal of Advanced Research in Computer Science and Electronics Engineering
                                                                                                                Volume 1, Issue 5, July 2012


implementation view and results there are following terms            Let X= {xi | i =1, 2, ……n} be a data set with n objects, k is
used in the implementations of these algorithms                      the number of clusters, mj is the centroid of cluster cj where
                                                                     j=1, 2,…k. Then the algorithm finds the distance between a
                                                                     data object and a centroid by using the following Euclidean
 A. Distance
                                                                     distance formula [1,9,10]
Both of these algorithms are based on Distance among                 Euclidean distance formula= √| xi – mj |2
objects, hence it is necessary to introduce with the distance           .
which we have taken in this dissertation.                                       III. COMPARISION OF ALGORITHMS
Let X= {xi | i =1, 2, ……, n } be a data set with n objects, k is
                                                                     A. Comparison of algorithm’s running time
the number of clusters, mj is the centroid of cluster cj where
j=1, 2,…k. Then the algorithm finds the distance between a
data object and a centroid by using the following Euclidean
distance formula [1, 9, 10]                                                   Name of algorithm                     Computational
Euclidean distance formula= √| xi – mj |  2
                                                                                                                    Complexity
Starting from an initial distribution of cluster centers in data
space, each object is assigned to the cluster with closest                    k-means                               O(nkt)
center, after which each center itself is updated as the center
of mass of all objects belonging to that particular cluster.
                                                                              k-medoids                             O(k(n-k)2)
 B.    Data Set
Data set is the group of n data points on which the                           D-M                                   O(ni)
partitioning approach is to be applied which is of m
dimension. Multidimensional data can include any number
of measurable attributes. Each independent characteristic, or           B. Comparison of SSE (sum of square error)of Algorithm
measurement, is one dimension.The consolidation of large,
multidimensional data sets is the main purpose of the field of
cluster analysis.[7] In this dissertation, data set (Dn) is to be     Name of            Two                            Four
used of two dimensional and having twenty data points. Each           Algorithm
data points is represented by a pixel (x, y) where x is
represented data point’s first dimension and y is represented         k-means            1053.6022                      681.4046
by data point’s second dimension respectively. Visualization
of data set are also produced in the form of set of twenty
pixels.                                                               k-medoids          1082.002                       464.5469

  C. K-Means Step                                                     D-M                1053.6022                      392.6812
The basic step of k-means clustering is to give the number of
clusters k and consider first k objects from data set D as
clusters & their centroid.                                           Each object in each cluster, the distance from object to its
Then the K-means algorithm will do the three steps below             cluster center is squared and the distances are summed. In
until convergence
Iterate until stable (= no object move group):                       other words, square error is the sum of distance for all objects
                                                                     in the data set with respect to the centroid or medoid of
1.    Determine the centroid coordinate
2.    Determine the distance of each object to the centroid          belonging to that object’s cluster. It can also be defined as:
3.    Group the object based on minimum Distance.
                                                                              E = j=1 k ( i=1  n |pij – m j |2)

                                                                     where E is the sum of square error for all objects in the data
D. K-Medoids or partition around medoid (PAM) step:                  set; pij is the point in ith object in jth space/cluster representing
The basic step of K-Medoid or PAM clustering is to give the          a given object, mj is the mean of jth cluster (both pij and mj are
number of clusters k and consider first k objects from data set      multidimensional). This criterion tries to make the resulting
Dn as cluster & their medoid.                                        k clusters as compact and as separate as possible
Then the k-medoid or PAM algorithm will do the following
five steps below until convergence                                                        IV. VALIDATION
Iterative until stable (=no objects change group)
                                                                     Validation is the measurement of similarity between the
1. Design clusters and Compute the Distance over all the             actual classes or groups and clusters results of any clustering
clusters                                                             technique. It can be measured by way of different measures
2. Randomly select non-medoid objects as                             like Purity, Entropy, F-measures and Coefficient of
3. Compute the New Distance w.r.t. new                               Variation. In which the Coefficient of Variation is the best
                                                                     measurement of Validation

                                                                                                                                         32
                                              All Rights Reserved © 2012 IJARCSEE
ISSN: 2277 – 9043
                                                 International Journal of Advanced Research in Computer Science and Electronics Engineering
                                                                                                                Volume 1, Issue 5, July 2012
Coefficient of Variation is a measure of the data dispersion.
The CV is defined as the ratio of the Standard deviation to the
mean. CV is a dimensionless number that allows comparison
of the variation of populations that have significantly different
mean values. In general the large the CV value is , the greater
the variability is in the data. Coefficient of Variation can be
calculated as follows:
1.        Find the mean of all Classes individually in the data set:

                                 x = i=1n x i /n

 where x is the mean of each class having n number of
 objects.

 2.        Next calculate the Global Mean of All classes by using
           their means :

                                 xg = j=1N xj /N

          where xg is the Global Mean of all N number of classes
          and xj is the mean of that classes.

     3.   Find the Standard Deviation of all the Classes :

                   s=√ j=1N (xj – xg )2 /N

     4.   Find the Coefficient of Variation (CV) :                                        Fig. Output of K-means with k=2
                                                                            B. Implementation view of K-Medoids
            CV = s/xg
                                                                          As shown in Figure here the value of k=2 shows the Two
          Where CV is the Coefficient of Variation, s is the
                                                                          Clusters representing by medoids CD:1 and CD:2
          Standard Deviation and xg is the Global Mean.
                                                                          respectively which are also the data points of the data set.
                                                                          The Deviation in clusters is shown in rights hand side of the
                                                                          figure i.e. 465.2147 in 1st cluster and 616.7882 in 2nd clusters
                         V. RESULTS
                                                                          totaling 1082.02. There are 11 data points in 1st clusters and
                                                                          9 data points in 2nd cluster. It is also important to mention
 The following are the result of the implementation of above              here that the total distance i.e. SSE of k-medoids is trivially
 said algorithms which has been implemented on compiler of                more than k-means i.e. 1053.602.
 C++ with different values of k i.e. the numbers of clusters on
 same set of two dimensional data set having n=20.                        As shown in Figure 4.9 here the value of k=3 shows the
                                                                          Three Clusters representing by medoids CD:1, CD:2 and
     A. Implementation view of K-Means                                    CD:3 respectively which are also the data points of the data
                                                                          set. The Deviation in clusters is shown in rights hand side of
 As shown in Figure here the value of k=2 shows the Two                   the figure i.e. 118.6493 in 1st cluster, 522.7472 in 2nd clusters
 Clusters representing by centroids CD:1 and CD:2                         and 145.4263 in 3rd cluster totaling 786.8229. There are 5
 respectively. The Deviation in clusters is shown in rights               data points in 1st clusters, 10 data points in 2nd cluster and 5
 hand side of the figure i.e. 526.8011 in each cluster totaling           data points in 3rd clusters. It is again considerable that the
 1053.602. The numbers of data points is ten in each cluster.             total distance i.e. SSE of k-medoids is trivially more than
                                                                          k-means i.e. 723.1417.




                                                                                                                                               33
                                                      All Rights Reserved © 2012 IJARCSEE
ISSN: 2277 – 9043
                                               International Journal of Advanced Research in Computer Science and Electronics Engineering
                                                                                                              Volume 1, Issue 5, July 2012


                                                                    [5] Hand & Others: Principal of Data Mining
                                                                   [6] Pavel Brakhin: ―Survey of Clustering Data Mining
                                                                   Techniques‖ .
                                                                   [7] Vance Faber, ―Clustering and the Continuous k-Means
                                                                   Algorithm‖.
                                                                   [8] Introduction of Clustering

                                                                   [9]. Shang ,Yi &Li, Guo-Jie(1991) New Crossover Operators
                                                                   In Genetic Algorithms, , P. R. China: National Research
                                                                   Center for Intelligent Computing Systems (NCIC)

                                                                   [10]. Singh ,Vijendra & Choudhary ,Simran (2009) Genetic
                                                                   Algorithm for Traveling Salesman Problem: Using Modified
                                                                   Partially-Mapped CrossoverOperator,sikar,Rajasthan,India:
                                                                   Department of Computer Science & Engineering, Faculty of
                                                                   Engineering & Technology, Mody Institute of Technology &
                                                                   Science, Lakshmangarh

                                                                   [11]. Su, Fanchen et al(2009) New Crossover Operator of
                                                                   Genetic Algorithms for the TSP, P.R. China: Computer
                                                                   School of Wuhan University Wuhan.

                                                                   Ms. Bhanu Sukhija obtained the B.Tech degree in
                                                                   Computer Science and Engineering from Kurukshetra
                                                                                      University, Kurukshetra, India in 2009.
                                                                                      She is presently working as Lecturer in
                                                                                      Department of Computer Science and
                                                                                      Engineering at N. C. College of
           Fig. Output of K-Medoids with k=2                                          Engineering, Israna, Panipat. She is
                                                                                      presently pursuing her M.Tech from the
                                                                                      same institute. She has guided several
                     VI. CONCLUSION                                                   B.Tech projects. Her research interests
                                                                   include data mining.
In partitioning based clustering algorithms, the number of
final clusters (k) needs to be determined beforehand. Also,
algorithms have problems like susceptible to local optima,         Mr. Sukhvir Singh obtained M.Tech (Integrated) degree in
sensitive to outliers, memory space, validity and unknown                              Software Engineering and System
number of iteration steps required to cluster. So an algorithm                         Analysis in 1996. He is presently
which explores number of clusters automatically considering                            pursuing PhD from M.D University,
all these problems is an active research area and most of the                          Rohtak. He has worked at P.D.M
research has focused on effectiveness and scalability of the                           Polytechnic,      Sarai    Aurangabad,
algorithm. In this paper, a partitioning based algorithm, D-M                          Bahadurgarh, P.D.M College of
(Density Means), clustering has been considered which              Engineering, Bahadurgarh.
generate clusters automatically.                                   He is presently heading department of computer science at
   .                                                               N.C College of engineering, Israna (Panipat) and also He has
                                                                   received Grant of Rs. 4 Lac from AICTE for the project of
                                                                   MODROBS for Advance Computer Network Lab. He is also
                    ACKNOWLEDGMENT
                                                                    Member of Computer Society of India and Life Time
The authors are quite thankful to the Computer science             Member of ISTE.
department at NC College of Engineering for supporting our
work.



                    REFERENCES

[1] J. Han and M. Kamber K. Data Mining: Concepts and
Techniques. Morgan Kaufman, 2000.
[2] P.S Bradley, Usama M. Fayyad., ―Initial Points for
K-Means Clustering.‖, Advances in Knowledge Discovery
and Data Mining. MIT Press .
[3] ―Introduction to Data Mining and Knowledge
Discovery” 3rd Edition by Two Crows Corporation.
[4] Teknomo, kardi. ―K-Means clustering tutorial.”

                                                                                                                                       34
                                            All Rights Reserved © 2012 IJARCSEE

More Related Content

What's hot

Distributed Digital Artifacts on the Semantic Web
Distributed Digital Artifacts on the Semantic WebDistributed Digital Artifacts on the Semantic Web
Distributed Digital Artifacts on the Semantic WebEditor IJCATR
 
A scenario based approach for dealing with
A scenario based approach for dealing withA scenario based approach for dealing with
A scenario based approach for dealing withijcsa
 
Analysis of Bayes, Neural Network and Tree Classifier of Classification Techn...
Analysis of Bayes, Neural Network and Tree Classifier of Classification Techn...Analysis of Bayes, Neural Network and Tree Classifier of Classification Techn...
Analysis of Bayes, Neural Network and Tree Classifier of Classification Techn...cscpconf
 
Multispectral image analysis using random
Multispectral image analysis using randomMultispectral image analysis using random
Multispectral image analysis using randomijsc
 
An Extensive Review on Generative Adversarial Networks GAN’s
An Extensive Review on Generative Adversarial Networks GAN’sAn Extensive Review on Generative Adversarial Networks GAN’s
An Extensive Review on Generative Adversarial Networks GAN’sijtsrd
 
Volume 2-issue-6-1930-1932
Volume 2-issue-6-1930-1932Volume 2-issue-6-1930-1932
Volume 2-issue-6-1930-1932Editor IJARCET
 
Bs31267274
Bs31267274Bs31267274
Bs31267274IJMER
 
On distributed fuzzy decision trees for big data
On distributed fuzzy decision trees for big dataOn distributed fuzzy decision trees for big data
On distributed fuzzy decision trees for big datanexgentechnology
 
Evaluating the efficiency of rule techniques for file classification
Evaluating the efficiency of rule techniques for file classificationEvaluating the efficiency of rule techniques for file classification
Evaluating the efficiency of rule techniques for file classificationeSAT Journals
 
Ppback propagation-bansal-zhong-2010
Ppback propagation-bansal-zhong-2010Ppback propagation-bansal-zhong-2010
Ppback propagation-bansal-zhong-2010girivaishali
 
Evaluating the efficiency of rule techniques for file
Evaluating the efficiency of rule techniques for fileEvaluating the efficiency of rule techniques for file
Evaluating the efficiency of rule techniques for fileeSAT Publishing House
 
A Stacked Generalization Ensemble Approach for Improved Intrusion Detection
A Stacked Generalization Ensemble Approach for Improved Intrusion DetectionA Stacked Generalization Ensemble Approach for Improved Intrusion Detection
A Stacked Generalization Ensemble Approach for Improved Intrusion DetectionIJCSIS Research Publications
 
A Hierarchical and Grid Based Clustering Method for Distributed Systems (Hgd ...
A Hierarchical and Grid Based Clustering Method for Distributed Systems (Hgd ...A Hierarchical and Grid Based Clustering Method for Distributed Systems (Hgd ...
A Hierarchical and Grid Based Clustering Method for Distributed Systems (Hgd ...iosrjce
 
Brief bibliography of interestingness measure, bayesian belief network and ca...
Brief bibliography of interestingness measure, bayesian belief network and ca...Brief bibliography of interestingness measure, bayesian belief network and ca...
Brief bibliography of interestingness measure, bayesian belief network and ca...Adnan Masood
 
Comprehensive Survey of Data Classification & Prediction Techniques
Comprehensive Survey of Data Classification & Prediction TechniquesComprehensive Survey of Data Classification & Prediction Techniques
Comprehensive Survey of Data Classification & Prediction Techniquesijsrd.com
 
Frequent Item set Mining of Big Data for Social Media
Frequent Item set Mining of Big Data for Social MediaFrequent Item set Mining of Big Data for Social Media
Frequent Item set Mining of Big Data for Social MediaIJERA Editor
 

What's hot (18)

Distributed Digital Artifacts on the Semantic Web
Distributed Digital Artifacts on the Semantic WebDistributed Digital Artifacts on the Semantic Web
Distributed Digital Artifacts on the Semantic Web
 
A scenario based approach for dealing with
A scenario based approach for dealing withA scenario based approach for dealing with
A scenario based approach for dealing with
 
Analysis of Bayes, Neural Network and Tree Classifier of Classification Techn...
Analysis of Bayes, Neural Network and Tree Classifier of Classification Techn...Analysis of Bayes, Neural Network and Tree Classifier of Classification Techn...
Analysis of Bayes, Neural Network and Tree Classifier of Classification Techn...
 
Multispectral image analysis using random
Multispectral image analysis using randomMultispectral image analysis using random
Multispectral image analysis using random
 
An Extensive Review on Generative Adversarial Networks GAN’s
An Extensive Review on Generative Adversarial Networks GAN’sAn Extensive Review on Generative Adversarial Networks GAN’s
An Extensive Review on Generative Adversarial Networks GAN’s
 
Volume 2-issue-6-1930-1932
Volume 2-issue-6-1930-1932Volume 2-issue-6-1930-1932
Volume 2-issue-6-1930-1932
 
Bs31267274
Bs31267274Bs31267274
Bs31267274
 
On distributed fuzzy decision trees for big data
On distributed fuzzy decision trees for big dataOn distributed fuzzy decision trees for big data
On distributed fuzzy decision trees for big data
 
Evaluating the efficiency of rule techniques for file classification
Evaluating the efficiency of rule techniques for file classificationEvaluating the efficiency of rule techniques for file classification
Evaluating the efficiency of rule techniques for file classification
 
Ppback propagation-bansal-zhong-2010
Ppback propagation-bansal-zhong-2010Ppback propagation-bansal-zhong-2010
Ppback propagation-bansal-zhong-2010
 
Evaluating the efficiency of rule techniques for file
Evaluating the efficiency of rule techniques for fileEvaluating the efficiency of rule techniques for file
Evaluating the efficiency of rule techniques for file
 
A Stacked Generalization Ensemble Approach for Improved Intrusion Detection
A Stacked Generalization Ensemble Approach for Improved Intrusion DetectionA Stacked Generalization Ensemble Approach for Improved Intrusion Detection
A Stacked Generalization Ensemble Approach for Improved Intrusion Detection
 
1 5
1 51 5
1 5
 
169 s170
169 s170169 s170
169 s170
 
A Hierarchical and Grid Based Clustering Method for Distributed Systems (Hgd ...
A Hierarchical and Grid Based Clustering Method for Distributed Systems (Hgd ...A Hierarchical and Grid Based Clustering Method for Distributed Systems (Hgd ...
A Hierarchical and Grid Based Clustering Method for Distributed Systems (Hgd ...
 
Brief bibliography of interestingness measure, bayesian belief network and ca...
Brief bibliography of interestingness measure, bayesian belief network and ca...Brief bibliography of interestingness measure, bayesian belief network and ca...
Brief bibliography of interestingness measure, bayesian belief network and ca...
 
Comprehensive Survey of Data Classification & Prediction Techniques
Comprehensive Survey of Data Classification & Prediction TechniquesComprehensive Survey of Data Classification & Prediction Techniques
Comprehensive Survey of Data Classification & Prediction Techniques
 
Frequent Item set Mining of Big Data for Social Media
Frequent Item set Mining of Big Data for Social MediaFrequent Item set Mining of Big Data for Social Media
Frequent Item set Mining of Big Data for Social Media
 

Viewers also liked (8)

114 120
114 120114 120
114 120
 
46 51
46 5146 51
46 51
 
99 103
99 10399 103
99 103
 
ภารกิจระดับครูผู้ช่วย
ภารกิจระดับครูผู้ช่วยภารกิจระดับครูผู้ช่วย
ภารกิจระดับครูผู้ช่วย
 
12 15
12 1512 15
12 15
 
24 27
24 2724 27
24 27
 
7 13
7 137 13
7 13
 
122 129
122 129122 129
122 129
 

Similar to 31 34

Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...
Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...
Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...theijes
 
A Comparative Study of Various Data Mining Techniques: Statistics, Decision T...
A Comparative Study of Various Data Mining Techniques: Statistics, Decision T...A Comparative Study of Various Data Mining Techniques: Statistics, Decision T...
A Comparative Study of Various Data Mining Techniques: Statistics, Decision T...Editor IJCATR
 
How Partitioning Clustering Technique For Implementing...
How Partitioning Clustering Technique For Implementing...How Partitioning Clustering Technique For Implementing...
How Partitioning Clustering Technique For Implementing...Nicolle Dammann
 
A Comprehensive Study on Outlier Detection in Data Mining
A Comprehensive Study on Outlier Detection in Data MiningA Comprehensive Study on Outlier Detection in Data Mining
A Comprehensive Study on Outlier Detection in Data MiningBRNSSPublicationHubI
 
Data Mining System and Applications: A Review
Data Mining System and Applications: A ReviewData Mining System and Applications: A Review
Data Mining System and Applications: A Reviewijdpsjournal
 
Ci2004-10.doc
Ci2004-10.docCi2004-10.doc
Ci2004-10.docbutest
 
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...IRJET Journal
 
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...IRJET Journal
 
A STUDY OF TRADITIONAL DATA ANALYSIS AND SENSOR DATA ANALYTICS
A STUDY OF TRADITIONAL DATA ANALYSIS AND SENSOR DATA ANALYTICSA STUDY OF TRADITIONAL DATA ANALYSIS AND SENSOR DATA ANALYTICS
A STUDY OF TRADITIONAL DATA ANALYSIS AND SENSOR DATA ANALYTICSijistjournal
 
Effective data mining for proper
Effective data mining for properEffective data mining for proper
Effective data mining for properIJDKP
 
A SURVEY ON DATA MINING IN STEEL INDUSTRIES
A SURVEY ON DATA MINING IN STEEL INDUSTRIESA SURVEY ON DATA MINING IN STEEL INDUSTRIES
A SURVEY ON DATA MINING IN STEEL INDUSTRIESIJCSES Journal
 
Analysis on Student Admission Enquiry System
Analysis on Student Admission Enquiry SystemAnalysis on Student Admission Enquiry System
Analysis on Student Admission Enquiry SystemIJSRD
 
Analysis on Student Admission Enquiry System
Analysis on Student Admission Enquiry SystemAnalysis on Student Admission Enquiry System
Analysis on Student Admission Enquiry SystemIJSRD
 
Data Mining Framework for Network Intrusion Detection using Efficient Techniques
Data Mining Framework for Network Intrusion Detection using Efficient TechniquesData Mining Framework for Network Intrusion Detection using Efficient Techniques
Data Mining Framework for Network Intrusion Detection using Efficient TechniquesIJAEMSJORNAL
 
Hypothesis on Different Data Mining Algorithms
Hypothesis on Different Data Mining AlgorithmsHypothesis on Different Data Mining Algorithms
Hypothesis on Different Data Mining AlgorithmsIJERA Editor
 
Clustering Algorithm with a Novel Similarity Measure
Clustering Algorithm with a Novel Similarity MeasureClustering Algorithm with a Novel Similarity Measure
Clustering Algorithm with a Novel Similarity MeasureIOSR Journals
 

Similar to 31 34 (20)

Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...
Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...
Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...
 
Ijariie1184
Ijariie1184Ijariie1184
Ijariie1184
 
Ijariie1184
Ijariie1184Ijariie1184
Ijariie1184
 
A Comparative Study of Various Data Mining Techniques: Statistics, Decision T...
A Comparative Study of Various Data Mining Techniques: Statistics, Decision T...A Comparative Study of Various Data Mining Techniques: Statistics, Decision T...
A Comparative Study of Various Data Mining Techniques: Statistics, Decision T...
 
How Partitioning Clustering Technique For Implementing...
How Partitioning Clustering Technique For Implementing...How Partitioning Clustering Technique For Implementing...
How Partitioning Clustering Technique For Implementing...
 
A Comprehensive Study on Outlier Detection in Data Mining
A Comprehensive Study on Outlier Detection in Data MiningA Comprehensive Study on Outlier Detection in Data Mining
A Comprehensive Study on Outlier Detection in Data Mining
 
Data Mining System and Applications: A Review
Data Mining System and Applications: A ReviewData Mining System and Applications: A Review
Data Mining System and Applications: A Review
 
H017625354
H017625354H017625354
H017625354
 
Ci2004-10.doc
Ci2004-10.docCi2004-10.doc
Ci2004-10.doc
 
15 19
15 1915 19
15 19
 
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
 
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
 
A STUDY OF TRADITIONAL DATA ANALYSIS AND SENSOR DATA ANALYTICS
A STUDY OF TRADITIONAL DATA ANALYSIS AND SENSOR DATA ANALYTICSA STUDY OF TRADITIONAL DATA ANALYSIS AND SENSOR DATA ANALYTICS
A STUDY OF TRADITIONAL DATA ANALYSIS AND SENSOR DATA ANALYTICS
 
Effective data mining for proper
Effective data mining for properEffective data mining for proper
Effective data mining for proper
 
A SURVEY ON DATA MINING IN STEEL INDUSTRIES
A SURVEY ON DATA MINING IN STEEL INDUSTRIESA SURVEY ON DATA MINING IN STEEL INDUSTRIES
A SURVEY ON DATA MINING IN STEEL INDUSTRIES
 
Analysis on Student Admission Enquiry System
Analysis on Student Admission Enquiry SystemAnalysis on Student Admission Enquiry System
Analysis on Student Admission Enquiry System
 
Analysis on Student Admission Enquiry System
Analysis on Student Admission Enquiry SystemAnalysis on Student Admission Enquiry System
Analysis on Student Admission Enquiry System
 
Data Mining Framework for Network Intrusion Detection using Efficient Techniques
Data Mining Framework for Network Intrusion Detection using Efficient TechniquesData Mining Framework for Network Intrusion Detection using Efficient Techniques
Data Mining Framework for Network Intrusion Detection using Efficient Techniques
 
Hypothesis on Different Data Mining Algorithms
Hypothesis on Different Data Mining AlgorithmsHypothesis on Different Data Mining Algorithms
Hypothesis on Different Data Mining Algorithms
 
Clustering Algorithm with a Novel Similarity Measure
Clustering Algorithm with a Novel Similarity MeasureClustering Algorithm with a Novel Similarity Measure
Clustering Algorithm with a Novel Similarity Measure
 

More from Ijarcsee Journal (20)

130 133
130 133130 133
130 133
 
116 121
116 121116 121
116 121
 
109 115
109 115109 115
109 115
 
104 108
104 108104 108
104 108
 
93 98
93 9893 98
93 98
 
88 92
88 9288 92
88 92
 
82 87
82 8782 87
82 87
 
78 81
78 8178 81
78 81
 
73 77
73 7773 77
73 77
 
65 72
65 7265 72
65 72
 
58 64
58 6458 64
58 64
 
52 57
52 5752 57
52 57
 
41 45
41 4541 45
41 45
 
36 40
36 4036 40
36 40
 
28 35
28 3528 35
28 35
 
19 23
19 2319 23
19 23
 
16 18
16 1816 18
16 18
 
6 11
6 116 11
6 11
 
134 138
134 138134 138
134 138
 
125 131
125 131125 131
125 131
 

Recently uploaded

Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 

Recently uploaded (20)

Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 

31 34

  • 1. ISSN: 2277 – 9043 International Journal of Advanced Research in Computer Science and Electronics Engineering Volume 1, Issue 5, July 2012 Improved K-Means clustering technique using distance determination approach Bhanu Sukhija, Sukhvir Singh  data owner. OR Data Mining also refers to extracting or Abstract— Powerful systems for collecting data and managing mining from large amount of data. This is also sometimes it in large databases are in place in all large and mid-range referred to as Knowledge Discovery from Data (KDD). Data organizations. The value of raw data (collected over a long time) mining tools predict future trends and behaviors, allowing is on the ability to extract high-level information: information businesses to make proactive, knowledge-driven decisions. useful for decision support, for exploration, and for better Data mining tools can answer business questions that understanding of the phenomena generating the data. traditionally were time consuming to resolve. Traditionally this task of extracting information was done with the help of analysis where one or more analysts with the help of statistical techniques provide summaries and generate reports. Clustering, the grouping of the objects of a database into Such an approach fails as the volume and dimensionality of the meaningful subclasses such that similarity of objects in the data increase. Hence tools to aid the automation of analysis same group is maximized and similarity of objects in tasks are becoming a necessity. Thus, data mining was evolved different groups is minimized, is called clustering. Thus the which is “automatic”: extraction of patterns of information objects are clustered or grouped based on the principle of from data. Clustering, the grouping of the objects of a database ―maximizing the intra class similarity and minimizing the into meaningful subclasses such that similarity of objects in the interclass similarity‖. Clustering has its roots in many areas, same group is maximized and similarity of objects in different including data mining, statistics, biology, and machine groups is minimized, is called clustering But my focus is on learning. Clustering methods can be divided into various partitioning methods only , and proposing a new clustering algorithm for automatic discovery of data clusters types: Partitioning methods, Hierarchical methods, Density based methods, Grid-based methods, Model based methods, Index Terms — data mining, data sets, clustering techniques Probabilistic techniques, Graph theoretic and Fuzzy methods. k-Mean. But focus of thesis is on partitioning methods only, and proposing a new clustering algorithm for automatic discovery of data clusters. I. INTRODUCTION A. Key Management and LAN Importance of collecting the data that reflect business or Many international organizations produce a huge quantity of scientific activities to achieve competitive advantage is information in a week than many other persons could read in widely recognized now. Powerful systems for collecting data a lifetime. The situation is even more alarming in worldwide and managing it in large databases are in place in all large and networks like Internet. Everyday hundreds of megabytes of mid-range organizations. The value of raw data (collected data are distributed around the world, but it is no longer over a long time) is on the ability to extract high-level possible to monitor this increasingly rapid development – the information: information useful for decision support, for growth is nearly exponential. So the basic problems with the exploration, and for better understanding of the phenomena management of the data can be summarized below: generating the data. Traditionally this task of extracting information was done with the help of analysis where one or  Analysis required unearthing the hidden more analysts with the help of statistical techniques provide relationships within the data i.e. for decision summaries and generate reports. Such an approach fails as support. the volume and dimensionality of the data increase. Hence tools to aid the automation of analysis tasks are becoming a  DBMS gave access to the data stored but no necessity. Thus, data mining was evolved which is analysis of data. ―automatic‖: extraction of patterns of information from data. Size of databases has increased and needs automated techniques for analysis as they have grown beyond manual Data mining is the analysis of large observational data sets to extraction. find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the Manuscript received july 20, 2012. II. METHODOLOGY Bhanu Sukhija, computer science, kurukshetra university y/ N.C College/ N.C and S.D group of institutions, (e-mail:bhanu_sukhija57@yahoo.com). As already stated, dissertation is based on methodology of Panipat, Haryana, India, 07206089757 Iterative Relocating Technique of Partitional Clustering i.e. Sukhvir Singh, computer science department , Kurukshetra university/ N.C k-means and k-medoids. All the algorithms are implemented College/ N.C and S.D group of institutions, Panipat,, Haryana, India, , in C++ language and results are studied in same environment. (e-mail: boora_s@yahoomail.com). But, before going through the study of these algorithms, their 31 All Rights Reserved © 2012 IJARCSEE
  • 2. ISSN: 2277 – 9043 International Journal of Advanced Research in Computer Science and Electronics Engineering Volume 1, Issue 5, July 2012 implementation view and results there are following terms Let X= {xi | i =1, 2, ……n} be a data set with n objects, k is used in the implementations of these algorithms the number of clusters, mj is the centroid of cluster cj where j=1, 2,…k. Then the algorithm finds the distance between a data object and a centroid by using the following Euclidean A. Distance distance formula [1,9,10] Both of these algorithms are based on Distance among Euclidean distance formula= √| xi – mj |2 objects, hence it is necessary to introduce with the distance . which we have taken in this dissertation. III. COMPARISION OF ALGORITHMS Let X= {xi | i =1, 2, ……, n } be a data set with n objects, k is A. Comparison of algorithm’s running time the number of clusters, mj is the centroid of cluster cj where j=1, 2,…k. Then the algorithm finds the distance between a data object and a centroid by using the following Euclidean distance formula [1, 9, 10] Name of algorithm Computational Euclidean distance formula= √| xi – mj | 2 Complexity Starting from an initial distribution of cluster centers in data space, each object is assigned to the cluster with closest k-means O(nkt) center, after which each center itself is updated as the center of mass of all objects belonging to that particular cluster. k-medoids O(k(n-k)2) B. Data Set Data set is the group of n data points on which the D-M O(ni) partitioning approach is to be applied which is of m dimension. Multidimensional data can include any number of measurable attributes. Each independent characteristic, or B. Comparison of SSE (sum of square error)of Algorithm measurement, is one dimension.The consolidation of large, multidimensional data sets is the main purpose of the field of cluster analysis.[7] In this dissertation, data set (Dn) is to be Name of Two Four used of two dimensional and having twenty data points. Each Algorithm data points is represented by a pixel (x, y) where x is represented data point’s first dimension and y is represented k-means 1053.6022 681.4046 by data point’s second dimension respectively. Visualization of data set are also produced in the form of set of twenty pixels. k-medoids 1082.002 464.5469 C. K-Means Step D-M 1053.6022 392.6812 The basic step of k-means clustering is to give the number of clusters k and consider first k objects from data set D as clusters & their centroid. Each object in each cluster, the distance from object to its Then the K-means algorithm will do the three steps below cluster center is squared and the distances are summed. In until convergence Iterate until stable (= no object move group): other words, square error is the sum of distance for all objects in the data set with respect to the centroid or medoid of 1. Determine the centroid coordinate 2. Determine the distance of each object to the centroid belonging to that object’s cluster. It can also be defined as: 3. Group the object based on minimum Distance. E = j=1 k ( i=1  n |pij – m j |2) where E is the sum of square error for all objects in the data D. K-Medoids or partition around medoid (PAM) step: set; pij is the point in ith object in jth space/cluster representing The basic step of K-Medoid or PAM clustering is to give the a given object, mj is the mean of jth cluster (both pij and mj are number of clusters k and consider first k objects from data set multidimensional). This criterion tries to make the resulting Dn as cluster & their medoid. k clusters as compact and as separate as possible Then the k-medoid or PAM algorithm will do the following five steps below until convergence IV. VALIDATION Iterative until stable (=no objects change group) Validation is the measurement of similarity between the 1. Design clusters and Compute the Distance over all the actual classes or groups and clusters results of any clustering clusters technique. It can be measured by way of different measures 2. Randomly select non-medoid objects as like Purity, Entropy, F-measures and Coefficient of 3. Compute the New Distance w.r.t. new Variation. In which the Coefficient of Variation is the best measurement of Validation 32 All Rights Reserved © 2012 IJARCSEE
  • 3. ISSN: 2277 – 9043 International Journal of Advanced Research in Computer Science and Electronics Engineering Volume 1, Issue 5, July 2012 Coefficient of Variation is a measure of the data dispersion. The CV is defined as the ratio of the Standard deviation to the mean. CV is a dimensionless number that allows comparison of the variation of populations that have significantly different mean values. In general the large the CV value is , the greater the variability is in the data. Coefficient of Variation can be calculated as follows: 1. Find the mean of all Classes individually in the data set: x = i=1n x i /n where x is the mean of each class having n number of objects. 2. Next calculate the Global Mean of All classes by using their means : xg = j=1N xj /N where xg is the Global Mean of all N number of classes and xj is the mean of that classes. 3. Find the Standard Deviation of all the Classes : s=√ j=1N (xj – xg )2 /N 4. Find the Coefficient of Variation (CV) : Fig. Output of K-means with k=2 B. Implementation view of K-Medoids CV = s/xg As shown in Figure here the value of k=2 shows the Two Where CV is the Coefficient of Variation, s is the Clusters representing by medoids CD:1 and CD:2 Standard Deviation and xg is the Global Mean. respectively which are also the data points of the data set. The Deviation in clusters is shown in rights hand side of the figure i.e. 465.2147 in 1st cluster and 616.7882 in 2nd clusters V. RESULTS totaling 1082.02. There are 11 data points in 1st clusters and 9 data points in 2nd cluster. It is also important to mention The following are the result of the implementation of above here that the total distance i.e. SSE of k-medoids is trivially said algorithms which has been implemented on compiler of more than k-means i.e. 1053.602. C++ with different values of k i.e. the numbers of clusters on same set of two dimensional data set having n=20. As shown in Figure 4.9 here the value of k=3 shows the Three Clusters representing by medoids CD:1, CD:2 and A. Implementation view of K-Means CD:3 respectively which are also the data points of the data set. The Deviation in clusters is shown in rights hand side of As shown in Figure here the value of k=2 shows the Two the figure i.e. 118.6493 in 1st cluster, 522.7472 in 2nd clusters Clusters representing by centroids CD:1 and CD:2 and 145.4263 in 3rd cluster totaling 786.8229. There are 5 respectively. The Deviation in clusters is shown in rights data points in 1st clusters, 10 data points in 2nd cluster and 5 hand side of the figure i.e. 526.8011 in each cluster totaling data points in 3rd clusters. It is again considerable that the 1053.602. The numbers of data points is ten in each cluster. total distance i.e. SSE of k-medoids is trivially more than k-means i.e. 723.1417. 33 All Rights Reserved © 2012 IJARCSEE
  • 4. ISSN: 2277 – 9043 International Journal of Advanced Research in Computer Science and Electronics Engineering Volume 1, Issue 5, July 2012 [5] Hand & Others: Principal of Data Mining [6] Pavel Brakhin: ―Survey of Clustering Data Mining Techniques‖ . [7] Vance Faber, ―Clustering and the Continuous k-Means Algorithm‖. [8] Introduction of Clustering [9]. Shang ,Yi &Li, Guo-Jie(1991) New Crossover Operators In Genetic Algorithms, , P. R. China: National Research Center for Intelligent Computing Systems (NCIC) [10]. Singh ,Vijendra & Choudhary ,Simran (2009) Genetic Algorithm for Traveling Salesman Problem: Using Modified Partially-Mapped CrossoverOperator,sikar,Rajasthan,India: Department of Computer Science & Engineering, Faculty of Engineering & Technology, Mody Institute of Technology & Science, Lakshmangarh [11]. Su, Fanchen et al(2009) New Crossover Operator of Genetic Algorithms for the TSP, P.R. China: Computer School of Wuhan University Wuhan. Ms. Bhanu Sukhija obtained the B.Tech degree in Computer Science and Engineering from Kurukshetra University, Kurukshetra, India in 2009. She is presently working as Lecturer in Department of Computer Science and Engineering at N. C. College of Fig. Output of K-Medoids with k=2 Engineering, Israna, Panipat. She is presently pursuing her M.Tech from the same institute. She has guided several VI. CONCLUSION B.Tech projects. Her research interests include data mining. In partitioning based clustering algorithms, the number of final clusters (k) needs to be determined beforehand. Also, algorithms have problems like susceptible to local optima, Mr. Sukhvir Singh obtained M.Tech (Integrated) degree in sensitive to outliers, memory space, validity and unknown Software Engineering and System number of iteration steps required to cluster. So an algorithm Analysis in 1996. He is presently which explores number of clusters automatically considering pursuing PhD from M.D University, all these problems is an active research area and most of the Rohtak. He has worked at P.D.M research has focused on effectiveness and scalability of the Polytechnic, Sarai Aurangabad, algorithm. In this paper, a partitioning based algorithm, D-M Bahadurgarh, P.D.M College of (Density Means), clustering has been considered which Engineering, Bahadurgarh. generate clusters automatically. He is presently heading department of computer science at . N.C College of engineering, Israna (Panipat) and also He has received Grant of Rs. 4 Lac from AICTE for the project of MODROBS for Advance Computer Network Lab. He is also ACKNOWLEDGMENT Member of Computer Society of India and Life Time The authors are quite thankful to the Computer science Member of ISTE. department at NC College of Engineering for supporting our work. REFERENCES [1] J. Han and M. Kamber K. Data Mining: Concepts and Techniques. Morgan Kaufman, 2000. [2] P.S Bradley, Usama M. Fayyad., ―Initial Points for K-Means Clustering.‖, Advances in Knowledge Discovery and Data Mining. MIT Press . [3] ―Introduction to Data Mining and Knowledge Discovery” 3rd Edition by Two Crows Corporation. [4] Teknomo, kardi. ―K-Means clustering tutorial.” 34 All Rights Reserved © 2012 IJARCSEE