1. Implementation of Integrated Approach of
K-means Clustering Algorithm for
Prediction Analysis
In Partial Fulfillment of the Requirements for the Degree to be
awarded by
Gujarat Technological University
Presented by
Manisha Goyal(160130702006)
Carried Out at
Government Engineering College, Gandhinagar
Under the Supervision of
Prof. M.B. Chaudhari
Dissertation Phase-II Presentation
On
2. Layout
Motivation and Objective of research work
Theoretical Background
Literature Review and Comparative Study
Problem Identification
Existing v/s Proposed Methodology
DP-1 and MSR Comments with Solutions
Implementation of Proposed Work
Results Analysis
Conclusion
Bibliography
Paper Publication Certificate
3. Motivation of Work
K-means is a very old concept that is nonetheless growing in popularity because of its simplicity and
linear time complexity. However, it has two main disadvantages: 1) it is highly sensitive to outliers, and 2)
it is highly dependent on initialization parameters (the random choice of k and the positions of the initial cluster
centroids). Many improved variants of the K-means method are detailed in the literature, but it remains an open field
of research because of its extensive application in medicine, business and marketing, social media sentiment analysis,
and other fields.
Overlapping K-means is an extended version of K-means. It is a fairly new concept and is widely used
in various fields where overlapping clusters are required. As an extension of K-means, it also needs
improvement, and there is considerable scope for improvement as far as its accuracy and dependability are
concerned.
4. Objective
“The goal of this research work is to improve the accuracy of existing overlapping K-means
Clustering by removing its dependency on initialization parameters (random choice of k clusters
and placement of initial cluster centroids) and to evaluate the results using different measures for
different applications”.
To achieve this objective, the proposed algorithm performs the following steps:
1) Preprocess the raw dataset.
2) Calculate the optimum value of K (derived entirely from the dataset, NOT taken as input from the user).
3) Find the positions of the initial centroids (using the proposed Harmonic Means method, not random input
from the user) and then apply OKM using the above results.
6. Clustering
• Objective: to find natural groupings among objects.
• Clustering is an unsupervised learning problem which deals with finding structure in a collection of
unlabelled data.
• It organizes the data so that objects within a cluster are highly similar to one another and dissimilar to
objects in other clusters.
7. Clustering categories based on the generated clusters
1. Exclusive (Non-overlapping) Clustering
2. Overlapping Clustering
8. Why Overlapping Clustering??
Most existing clustering methods assume that each data observation belongs to one and only
one cluster, leading to k disjoint clusters explaining the data. However, in many applications the
data being modeled have a much richer and more complex hidden representation in which
observations actually belong to multiple clusters.
• In social network analysis, community extraction algorithms need to detect overlapping clusters,
since an actor can belong to multiple communities.
• In text clustering, learning methods should be able to group documents that discuss more than
one topic into several groups.
• In the medical domain, various diseases share common overlapping symptoms; for example, fever is a
common symptom of typhoid, malaria, viral infection, and many other diseases.
9. K-means clustering
• Partitioning method for clustering.
• Objective: take an input parameter k and partition a set of n objects into k clusters.
• Dissimilarity measures used by K-means:
Euclidean distance
Manhattan distance
• The cluster mean is used to update the centroid of each cluster.
• The aim of K-means is to minimize the objective function (the square-error criterion), defined as:

$$E = \sum_{i=1}^{k} \sum_{p \in C_i} \lVert p - m_i \rVert^2$$

where E is the sum of the square error over all objects in the data set, p is the point in space representing a
given object, and $m_i$ is the mean of cluster $C_i$ (both p and $m_i$ are multidimensional).
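As an illustration of this criterion, the following is a minimal R sketch that computes E for a given assignment; `X`, `cluster`, and `centers` are illustrative names, not part of the thesis code.

```r
# Square-error criterion E: sum of squared distances of every object to the
# mean of its own cluster (X: n x d data matrix, cluster: assignment vector,
# centers: k x d matrix of cluster means).
sse <- function(X, cluster, centers) {
  sum(sapply(seq_len(nrow(X)),
             function(i) sum((X[i, ] - centers[cluster[i], ])^2)))
}
```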
10. How does K-means Work??
1. Initialization:
• Randomly choose the cluster centroids for K = 2
11. 2. Cluster Assignment:
• Compute the distance between the data points and the cluster centroids using the dissimilarity
measures.
• Depending on the minimum distance, the data points are divided into 2 clusters.
12. 3. Move Centroid:
• Compute the mean of blue dots and reposition blue centroid to this mean
• Compute the mean of orange dots and reposition orange centroid to this mean
13. 4. Optimization and Convergence:
• Repeat the previous two steps iteratively until the cluster centroids stop changing position.
• The point at which the clusters no longer change with further computation is the point at which the
algorithm converges.
• Below are the final clusters (a small R sketch of the full loop follows).
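The following is a minimal R sketch of this assign-and-update loop using base R's kmeans() on synthetic two-dimensional data; the data and the choice K = 2 are illustrative, not the thesis dataset.

```r
# Two well-separated 2-D blobs, then K-means with K = 2 and random initial centroids.
set.seed(42)
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))
fit <- kmeans(X, centers = 2)    # assignment + move-centroid until convergence
fit$centers                      # final centroid positions
fit$iter                         # number of iterations until convergence
table(fit$cluster)               # resulting cluster sizes
```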
14. Advantages and Disadvantages
Advantages:
• Easy to implement and robust.
• Relatively scalable and efficient in processing large data sets, with linear time complexity.
• Produces tighter clusters than hierarchical clustering.
Disadvantages:
• Applicable only when the mean of a cluster is defined.
• Cannot be applied to categorical attributes.
• Sensitive to the selection of the number of clusters k and the initial cluster centers.
• Not suitable for discovering clusters with non-convex shapes or clusters of very different sizes.
• Sensitive to noise and outlier data points.
15. Overlapping K-means (OKM)
• The OKM method extends the objective function used in K-means to consider the possibility of
overlapping clusters.
• The K-means algorithm aims at clustering $X = \{x_1, \ldots, x_n\}$ into k clusters by minimizing the
following objective function:

$$Q(\pi) = \sum_{j=1}^{k} \sum_{x_i \in \pi_j} \lVert x_i - z_j \rVert^2$$

where each $x_i$ is a v-dimensional observation, $\pi = \{\pi_1, \ldots, \pi_k\}$ is the set of k clusters (with $\pi_i \cap \pi_j = \emptyset$ for $i \neq j$), and $Z = \{z_1, \ldots, z_k\}$ is the set of cluster centroids.
16. Cont…
• The OKM approach relaxes the objective function of K-means to allow overlapping by removing
the constraint $\pi_i \cap \pi_j = \emptyset$ for $i \neq j$.
• The objective function of OKM is defined as:

$$Q'(\pi) = \sum_{i=1}^{n} \lVert x_i - \phi(x_i) \rVert^2$$

• Here $\phi(x_i)$ is the representation of $x_i$, also called the 'image' or 'barycenter of clusters' of $x_i$, defined as the combination of the centroids $z_j$ of the clusters $\pi_j$ to which $x_i$ belongs, computed as:

$$\phi(x_i) = \frac{1}{|\pi(x_i)|} \sum_{z_j \in \pi(x_i)} z_j$$
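As a small illustration, here is a hedged R sketch of this image computation, assuming `centers` is a k × v centroid matrix and `memberships_i` holds the indices of the clusters containing x_i; both names are illustrative.

```r
# Image (barycenter) of x_i in OKM: the mean of the centroids of all clusters
# that x_i currently belongs to.
image_of <- function(centers, memberships_i) {
  colMeans(centers[memberships_i, , drop = FALSE])
}
```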
17. Cont…
Here the centroid $z_j \in \pi(x_i)$, where $\pi(x_i)$ is the list of all clusters that $x_i$ belongs to. The
centroid $z_j$ is updated using the following equation:

$$z_j^{*} = \frac{1}{\sum_{x_i \in \pi_j} \frac{1}{\delta_i^2}} \; \sum_{x_i \in \pi_j} \frac{1}{\delta_i^2} \left( \delta_i \, x_i - \sum_{z_l \in \pi(x_i) \setminus \{z_j\}} z_l \right)$$

where $\delta_i$ is the total number of clusters that $x_i$ belongs to (in this case $\delta_i = |\pi(x_i)|$).
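The following is a minimal R sketch of this update for a single cluster j, under the assumption that `X` is the n × v data matrix, `membership` is a list giving, for each observation, the indices of the clusters it belongs to, and `centers` holds the current centroids; all names are illustrative, not the thesis implementation.

```r
# OKM centroid update for cluster j, following the formula above.
update_centroid <- function(X, membership, centers, j) {
  in_j <- which(sapply(membership, function(m) j %in% m))   # points assigned to cluster j
  num <- rep(0, ncol(X)); den <- 0
  for (i in in_j) {
    delta  <- length(membership[[i]])                        # delta_i = |pi(x_i)|
    others <- setdiff(membership[[i]], j)                    # clusters of x_i other than j
    x_hat  <- delta * X[i, ] - colSums(centers[others, , drop = FALSE])
    num <- num + x_hat / delta^2                             # weighted by 1/delta_i^2
    den <- den + 1 / delta^2
  }
  num / den
}
```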
18. Evaluation metrics
1. Sum of Square Error (SSE)
2. between_ss/total_ss Ratio
3. Number of Iterations
4. F-Measures and FBCubed Measures
5. Rand Index
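For illustration, here is a minimal R sketch of the Rand Index for two flat labelings; the overlapping-aware measures used in the thesis (F-measure, FBCubed) are not reproduced here.

```r
# Rand Index: fraction of point pairs on which two labelings agree
# (same cluster in both, or different clusters in both).
rand_index <- function(labels_a, labels_b) {
  pairs  <- combn(length(labels_a), 2)                     # all index pairs
  same_a <- labels_a[pairs[1, ]] == labels_a[pairs[2, ]]
  same_b <- labels_b[pairs[1, ]] == labels_b[pairs[2, ]]
  mean(same_a == same_b)
}
rand_index(c(1, 1, 2, 2), c(1, 1, 2, 1))   # example: 0.5 (3 of 6 pairs agree)
```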
20. Literature Review and Comparative Study
Title | Publication
1. Applications of Partition based Clustering Algorithms: A Survey | IEEE 2013
2. Performance Evaluation of a Novel Hybrid Clustering Algorithm using Birch and K-Means | IEEE 2015
3. Disease Prediction using Hybrid K-means and Support Vector Machine | IEEE 2016
4. Sorted K-Means Towards the Enhancement of K-Means to Form Stable Clusters | Springer 2017
5. An enhanced deterministic K-means clustering algorithm for cancer subtype prediction from gene expression data | Elsevier 2017
6. An Improved Overlapping k-Means Clustering Method for Medical Applications | Elsevier 2017
21. Title / Techniques used / Strength / Weakness
1. Performance Evaluation of a Novel Hybrid Clustering Algorithm using Birch and K-Means
   Techniques used: K-means clustering algorithm and BIRCH
   Strength: Better performance than K-means and K-medoid clustering; can handle large datasets more effectively
   Weakness: Results vary with the k value; computation time can be reduced further
2. Disease Prediction using Hybrid K-means and Support Vector Machine
   Techniques used: Hybrid K-means algorithm, which uses silhouette values to find the k value and the initial centroids, combined with the Support Vector Machine algorithm
   Strength: K-means achieved an accuracy of 82% while the hybrid algorithm achieved an accuracy of 92% on the same dataset
   Weakness: Accuracy can be improved further by using an improved K-means algorithm and SVM
3. Sorted K-Means Towards the Enhancement of K-Means to Form Stable Clusters
   Techniques used: Sorted (merge or quick sort) K-means, which determines the initial centroids
   Strength: Effectively and efficiently forms stable clusters with fewer iterations
   Weakness: Space and time complexity for big data can be improved further
22. Title / Techniques used / Strength / Weakness
4. An enhanced deterministic K-means clustering algorithm for cancer subtype prediction from gene expression data
   Techniques used: Density-based version of K-means
   Strength: Better overall performance than the other compared algorithms
   Weakness: Does not deal with outliers
5. An Improved Overlapping k-Means Clustering Method for Medical Applications
   Techniques used: K-harmonic means and overlapping K-means algorithms (KHM-OKM)
   Strength: Better performance than OKM; better minimization of the objective function
   Weakness: Relies on the Euclidean distance; the algorithm depends on the initial selection of the number of clusters k
23. Chapter 3
Problem Identification
Some of the problems identified during the literature review are as follows.
1. The computation time of an integrated approach is a big issue, because integrating two approaches increases the
time complexity, which is not acceptable. [paper 2]
2. Some algorithms do not deal properly with outliers, which decreases the overall accuracy for large
datasets. [paper 5]
3. Some algorithms do not work well with large datasets because of space complexity issues. [paper 3]
4. The integrated KHM-OKM method increases the complexity of the algorithm. [paper 6]
5. Most algorithms rely on Euclidean distance to find the closest centroids, which is not
suitable for all types of datasets.
25. Chapter 5
DP-1 and MSR Comments
Sr. No. | DP-1 Comments by External | Status
1 | Good literature review | -
2 | Detailed algorithm needs to be prepared. | Done in MSR
3 | Evaluate the complexity of the proposed approach. | Done in MSR

Sr. No. | MSR Comments by External | Status
1 | Proposed algorithm needs to be implemented with a sufficiently large dataset. | Done
2 | Implementation and results of the work should be displayed during DP-2. | Done
26. MSR Comments with solution
Comment 1: The proposed algorithm needs to be implemented with a sufficiently large dataset.
Solution: We have taken the following two large datasets (Lung Cancer and Diabetes Disease):
Sr. No. | Dataset Name | Size
1 | Lung Cancer Dataset | 1000 × 25
2 | Diabetes Disease Dataset | 768 × 9
Comment 2: The implementation and results of the work should be displayed during DP-2.
Solution: The implementation of the proposed algorithm and the results are described in the succeeding
sections.
29. Chapter 6
Implementation of Proposed Work
1. Coding and analysis is done in RStudio
2. The integrated OKM is implemented in the Weka4OC toolbox
3. Results are recorded and analysed in Excel 2013
30. Step 1: Import the dataset into RStudio and analyse it (Lung Cancer Dataset).
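A hedged sketch of what this import and preliminary analysis step might look like in R; the file name and the column handling are assumptions, not the exact thesis script.

```r
# Read the raw dataset (file name is illustrative) and inspect it.
lung <- read.csv("lung_cancer.csv", header = TRUE)
dim(lung)                                   # expected to be roughly 1000 x 25
summary(lung)                               # per-attribute ranges and missing values
# Pre-processing: keep numeric attributes and standardize them.
lung_num <- scale(lung[sapply(lung, is.numeric)])
```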
34. Chapter 7
Result Analysis
(Original OKM v/s Integrated OKM)
1. Existing Methodology (original OKM)
Random user inputs
Run the above random inputs in the Weka4OC tool
2. Proposed integrated OKM methodology
Step 1: Determine the appropriate K value through an algorithm
Step 2: Calculate the initial centroid positions through the proposed Harmonic Means method
Step 3: Run the inputs generated by the above algorithm in Weka4OC
35. Proposed OKM Methodology
Step 1: Determine the appropriate K value through an algorithm (an R sketch follows below).
Ratio of between_SS / total_SS:
Method to find K | Lung Cancer Dataset | Diabetes Dataset
Elbow method (K = 3) | 50.70% | 74.90%
Silhouette method (K = 2) | 39.30% | 55.70%
Gap statistic method (K = 1) | 0.00% | 0.00%
Step 2: Calculate the initial centroid positions through the proposed Harmonic Means method
(after pre-processing, if required).
Step 3: Input the best value of K and the initial centroid positions calculated in the above
steps into Weka4OC.
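The following is a minimal R sketch of how the three indices in the table above can be computed with base R and the cluster package; the dataset and the K range are placeholders, and the proposed Harmonic Means initialization of Step 2 is not reproduced here.

```r
library(cluster)                     # silhouette(), clusGap()
X <- scale(iris[, 1:4])              # placeholder data, not the thesis dataset

# Elbow method: total within-cluster SS and between_SS/total_SS ratio per K
wss   <- sapply(1:10, function(k) kmeans(X, k, nstart = 10)$tot.withinss)
ratio <- sapply(1:10, function(k) {
  km <- kmeans(X, k, nstart = 10)
  km$betweenss / km$totss
})

# Average silhouette width per K
avg_sil <- sapply(2:10, function(k) {
  km <- kmeans(X, k, nstart = 10)
  mean(silhouette(km$cluster, dist(X))[, "sil_width"])
})

# Gap statistic
gap <- clusGap(X, FUN = kmeans, nstart = 10, K.max = 10, B = 50)
```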
36. Comparison of Results: Lung Cancer Dataset
'8 different scenarios in which 4 users enter random values in 8 different ways' vs. the 'integrated OKM algorithm'

Random inputs
User | Value of K | Initial centroid position | Value of initial centroid position | Precision | Recall | F-measure | Rand Index | BCubed Precision | # Iterations
User 1_R | 2 | Randomly generated | NA | 0.143 | 0.904 | 0.247 | 0.244 | 0.107 | 11
User 1_U | 2 | Random input value by user | 56,789 | 0.141 | 0.911 | 0.243 | 0.223 | 0.114 | 8
User 2_R | 3 | Randomly generated | NA | 0.159 | 0.777 | 0.264 | 0.352 | 0.116 | 7
User 2_U | 3 | Random input value by user | 45,578,899 | 0.142 | 0.798 | 0.242 | 0.253 | 0.116 | 9
User 3_R | 4 | Randomly generated | NA | 0.169 | 0.852 | 0.283 | 0.293 | 0.1 | 9
User 3_U | 4 | Random input value by user | 23,456,678,890 | 0.169 | 0.801 | 0.278 | 0.322 | 0.113 | 11
User 4_R | 5 | Randomly generated | NA | 0.188 | 0.853 | 0.309 | 0.325 | 0.101 | 9
User 4_U | 5 | Random input value by user | 23,456,658,123,897 | 0.196 | 0.631 | 0.3 | 0.479 | 0.173 | 8
Average (random inputs) | | | | 0.163375 | 0.815875 | 0.27075 | 0.311375 | 0.1175 | 9

Proposed OKM
Integrated OKM algorithm | 3 | Harmonic Mean method | 1,329,685 | 0.167 | 0.92 | 0.283 | 0.438 | 0.175 | 3
41. Comparison of Results: Diabetes Disease Dataset
'8 different scenarios in which 4 users enter random values in 8 different ways' vs. the 'integrated OKM algorithm'

Random inputs
User | Value of K | Initial centroid position | Value of initial centroid position | Precision | Recall | F-measure | Rand Index | BCubed Precision | # Iterations
User 1_R | 2 | Randomly generated | NA | 0.117 | 0.895 | 0.206 | 0.171 | 0.111 | 19
User 1_U | 2 | Random input value by user | 54,678 | 0.117 | 0.895 | 0.206 | 0.171 | 0.111 | 18
User 2_R | 3 | Randomly generated | NA | 0.11 | 0.742 | 0.192 | 0.248 | 0.099 | 14
User 2_U | 3 | Random input value by user | 16,390,481 | 0.116 | 0.815 | 0.203 | 0.229 | 0.095 | 25
User 3_R | 4 | Randomly generated | NA | 0.115 | 0.656 | 0.196 | 0.352 | 0.114 | 25
User 3_U | 4 | Random input value by user | 4,317,590,712 | 0.111 | 0.782 | 0.194 | 0.217 | 0.098 | 29
User 4_R | 5 | Randomly generated | NA | 0.11 | 0.61 | 0.187 | 0.361 | 0.108 | 23
User 4_U | 5 | Random input value by user | 5,18,386,495,600 | 0.113 | 0.721 | 0.195 | 0.283 | 0.105 | 38
Average (random inputs) | | | | 0.113625 | 0.7645 | 0.197375 | 0.254 | 0.105125 | 23.875

Proposed OKM
Integrated OKM method | 3 | Harmonic Mean method | 1,254,524 | 0.131 | 0.802 | 0.226 | 0.337 | 0.099 | 8
43. Graphs for Comparison of Results: Diabetes Disease Data Set (1)
[Bar charts of Precision and Recall for each random-input scenario and for the proposed OKM algorithm; the plotted values are those in the table above.]
44. Graphs for Comparison of Results: Diabetes Disease Data Set (2)
[Bar charts of F-measure and Rand Index for each random-input scenario and for the proposed OKM algorithm; the plotted values are those in the table above.]
45. Graphs for Comparison of Results: Diabetes Disease Data Set (3)
[Bar charts of BCubed Precision and number of iterations for each random-input scenario and for the proposed OKM algorithm; the plotted values are those in the table above.]
46. Conclusion
The proposed work proved robust enough to show positive results: (i) it removes the dependency of the
method on random input parameters, and (ii) it normalizes the outliers.
From the above results we find that, barring one or two accuracy measures, the performance of
the proposed integrated OKM tool is better than that of the usual OKM method.
We can also observe that the integrated OKM helps reduce the time complexity in both
cases, as the number of iterations is greatly reduced.
As far as future work is concerned, this thesis provides a base for further research on effective,
improved clustering, which can create a long-lasting positive impact on the medical field and many
other fields.
47. Bibliography
PAPERS:
1. Argenis A. Aroche-Villarruel, J.A. Carrasco-Ochoa, José Fco. Martínez-Trinidad, J. Arturo Olvera-López, and Airel Pérez-Suárez, “Study of
Overlapping Clustering Algorithms Based on Kmeans through FBcubed Metric”, Springer International Publishing Switzerland 2014
2. A.Dharmarajan, T. Velmurugan, “Applications of Partition based Clustering Algorithms: A Survey” 2013 IEEE
3. Jaskaranjit Kaur and Harpreet Singh, “Performance Evaluation of a Novel Hybrid Clustering Algorithm using Birch and K-Means” 2015 IEEE
4. Sandeep Kaur and Dr. Sheetal Kalra, “Disease Prediction using Hybrid K-means and Support Vector Machine” 2016 IEEE
5. Preeti Arora, Deepali Virmani, Himanshu Jindal and Mritunjaya Sharma, “Sorted K-Means Towards the Enhancement of K-Means to Form Stable Clusters”,
Proceedings of International Conference on Communication and Networks, Springer 2017
6. N. Nidheesh, K.A. Abdul Nazeer, P.M. Ameer, ” An enhanced deterministic K-means clustering algorithm for cancer subtype prediction from gene
expression data”, Computers in Biology and Medicine 2017 Elsevier
7. Sina Khanmohammadi, Naiier Adibeig, Samaneh Shanehbandy, “An Improved Overlapping k-Means Clustering Method for Medical Applications”, Expert
Systems With Applications 2016 Elsevier
8. Hailong Chen, Chunli Liu, “Research and Application of Cluster Analysis Algorithm”, 2nd International Conference on Measurement, Information
and Control, 2013 IEEE
9. Shraddha Shukla and Naganna S, “A Review ON K-means DATA Clustering APPROACH” International Journal of Information & Computation Technology
2014
10. L.V. Bijuraj, “Clustering and its applications”. Proceedings of National Conference on New Horizons in IT - NCNHIT 2013
48. Bibliography
1. Pankaj Saxena and Sushma Lehri, “Analysis of various clustering algorithms of data mining on Health informatics”. International Journal of Computer &
Communication Technology 2013
2. K. Rajalakshmi, S.S. Dhenakaran, N. Roobini, “Comparative Analysis of K-Means Algorithm in Disease Prediction”, International Journal of Science,
Engineering and Technology Research (IJSETR), July 2015
3. Amit Saxena, Mukesh Prasad, Akshansh Gupta, Neha Bharill, Om Prakash Patel, Aruna Tiwari, Meng Joo Er, Weiping Ding, Chin-Teng Lin, “A Review
of Clustering Techniques and Developments”, 2017 Elsevier
4. Guillaume Cleuziou, “An extended version of the k-means method for overlapping clustering” 2008 IEEE
WEBSITES
1. https://en.wikipedia.org/wiki/Cluster_analysis#Applications
2. http://stp.lingfil.uu.se/~santinim/ml/2016/Lect_10/10c_UnsupervisedMethods.pdf
3. https://en.wikipedia.org/wiki/K-means_clustering
4. https://www.jstatsoft.org/article/view/v050i10
5. https://en.wikipedia.org/wiki/Silhouette_(clustering)
6. https://en.wikipedia.org/wiki/Correlation_clustering
7. http://www.francescobonchi.com/CCtuto_kdd14.pdf