Implementation of Integrated Approach of
K-means Clustering Algorithm for
Prediction Analysis
In Partial Fulfillment of the Degree to be
awarded by
Gujarat Technological University
Presented by
Manisha Goyal(160130702006)
Carried Out at
Government Engineering College, Gandhinagar
Under the Supervision of
Prof. M.B. Chaudhari
Dissertation Phase-II Presentation
On
Layout
 Motivation and Objective of research work
 Theoretical Background
 Literature Review and Comparative Study
 Problem Identification
 Existing v/s Proposed Methodology
 DP-1 and MSR Comments with Solutions
 Implementation of Proposed Work
 Results Analysis
 Conclusion
 Bibliography
 Paper Publication Certificate
Motivation of Work
K-means is a very old concept that keeps growing in popularity because of its simplicity and linear time
complexity. However, it has two main disadvantages: 1) it is highly sensitive to outliers, and 2) it is highly
dependent on initialization parameters (the random choice of k and the positions of the initial cluster
centroids). Many improved variants of the K-means method are detailed in the literature, but it remains an
open field of research because of its extensive applications in medicine, business and marketing,
social-media sentiment analysis, etc.
Overlapping K-means is an extended version of K-means. It is a fairly new concept and is widely used in
various fields where overlapping clusters are required. As an extension of K-means, it also needs
improvement: there is considerable scope for improving its accuracy and dependability.
Objective
“The goal of this research work is to improve the accuracy of the existing overlapping K-means
clustering by removing its dependency on initialization parameters (the random choice of k and
the placement of the initial cluster centroids) and to evaluate the results using different measures
for different applications”.
To achieve this objective, the proposed algorithm performs the following steps:
1) Preprocess the raw dataset.
2) Calculate the optimum value of k (derived entirely from the dataset, NOT taken as input from the user).
3) Find the positions of the initial centroids (using the proposed harmonic-means method, not as random
input from the user), and then apply OKM using the above results.
Chapter 1
Theoretical Background
Clustering
• Objective: to find natural groupings among objects.
• An unsupervised learning problem: finding structure in a collection of unlabelled data.
• Organize data so that objects within a cluster are similar to each other and dissimilar to objects in other
clusters.
Clustering categories, based on the generated clusters:
1. Exclusive (non-overlapping) clustering
2. Overlapping clustering
Why Overlapping Clustering?
Most existing clustering methods assume that each data observation belongs to one and only one
cluster, yielding k disjoint clusters that explain the data. However, in many applications the data being
modeled can have a much richer and more complex hidden representation in which observations
actually belong to multiple clusters.
• In social network analysis, community-extraction algorithms need to detect overlapping clusters,
since an actor can belong to multiple communities.
• In text clustering, learning methods should be able to assign a document that discusses more than
one topic to several groups.
• In the medical domain, various diseases share common overlapping symptoms; fever, for example,
is a common symptom of typhoid, malaria, viral infection and many others.
K-means clustering
• A partitioning method for clustering.
• Objective: takes an input parameter, k, and partitions a set of n objects into k clusters.
• Dissimilarity measures used by K-means:
Euclidean distance
Manhattan distance
• The cluster mean is used to update the centroid of that cluster.
• The aim of K-means is to minimize the objective function, i.e. the square-error criterion, defined as

E = Σ_{i=1}^{k} Σ_{p ∈ C_i} ‖p − m_i‖²

where E is the sum of the squared error over all objects in the data set, p is the point in space representing a
given object, and m_i is the mean of cluster C_i (both p and m_i are multidimensional).
How does K-means work?
1. Initialization:
• Randomly choose the cluster centroids (here K = 2).
2. Cluster assignment:
• Compute the distance between each data point and the cluster centroids using a dissimilarity
measure.
• Based on the minimum distance, the data points are divided into 2 clusters.
3. Move centroids:
• Compute the mean of the blue dots and reposition the blue centroid to this mean.
• Compute the mean of the orange dots and reposition the orange centroid to this mean.
4. Optimization and convergence:
• Repeat the previous two steps iteratively until the cluster centroids stop changing position.
• When further iterations no longer change the clusters, the algorithm has converged.
• The figure shows the final clusters.
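The four steps above can be sketched in code. The thesis implementation was done in RStudio; the following is a minimal illustrative Python version of standard K-means (not the author's code, and the toy data are hypothetical):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain K-means: random initialization, nearest-centroid assignment,
    centroid update by cluster mean, repeated until the centroids stop moving."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # 1. Initialization (random)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]        # 2. Cluster assignment
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        new = []                                 # 3. Move centroids to cluster means
        for j, cl in enumerate(clusters):
            if cl:
                new.append(tuple(sum(vals) / len(cl) for vals in zip(*cl)))
            else:
                new.append(centroids[j])         # keep an empty cluster's centroid
        if new == centroids:                     # 4. Convergence: centroids unchanged
            break
        centroids = new
    return centroids, clusters
```

This makes the sensitivity to initialization visible: changing `seed` changes which points become the starting centroids, which is exactly the dependency the proposed work aims to remove.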
Advantages and Disadvantages
Advantages:
• Easy to implement and robust.
• Relatively scalable and efficient on large data sets, with linear time complexity.
• Produces tighter clusters than hierarchical clustering.
Disadvantages:
• Applicable only when the mean of a cluster is defined.
• Cannot be applied to categorical attributes.
• Sensitive to the choice of the number of clusters k and the initial cluster centers.
• Not suitable for discovering clusters with nonconvex shapes or of very different sizes.
• Sensitive to noise and outlier data points.
Overlapping K-means (OKM)
• The OKM method extends the objective function used in K-means to consider the possibility of
overlapping clusters.
• The K-means algorithm aims at clustering X = {x_1, …, x_n} into k clusters by minimizing the
following objective function:

Q(π) = Σ_{j=1}^{k} Σ_{x_i ∈ π_j} ‖x_i − z_j‖²

where each x_i is a v-dimensional observation, π = {π_1, …, π_k} is the set of k clusters (with
π_i ∩ π_j = ∅ for i ≠ j), and Z = {z_1, …, z_k} is the set of cluster centroids.
Cont…
• The OKM approach relaxes the objective function of K-means to allow overlapping by removing
the constraint 𝜋𝑖 ∩ 𝜋𝑗 = ∅, for 𝑖 ≠ 𝑗.
• The objective function of OKM is defined as

Q′(π) = Σ_{i=1}^{n} ‖x_i − φ(x_i)‖²

• Here φ(x_i) is the representation of x_i, also called its ‘image’ or ‘barycenter of clusters’, defined
as the combination of the centroids z_j of the clusters π_j to which x_i belongs, computed
component-wise as

φ(x_i) = ( Σ_{z_j ∈ π(x_i)} z_j ) / |π(x_i)|
Cont…
Here the centroid z_j ∈ π(x_i), where π(x_i) is the list of all clusters that x_i belongs to. The
centroid z_j is updated using the following equation:

z_j* = ( Σ_{x_i ∈ π_j} (1/δ_i²) · ( δ_i · x_i − Σ_{z_h ∈ π(x_i), h ≠ j} z_h ) ) / ( Σ_{x_i ∈ π_j} 1/δ_i² )

where δ_i is the total number of clusters that x_i belongs to (in this case δ_i = |π(x_i)|).
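To make the image/barycenter definition concrete, here is a small Python sketch (illustrative only; the memberships and centroids below are hypothetical toy values) computing φ(x_i) and the OKM objective Q′(π):

```python
def image(memberships, centroids):
    """phi(x_i): the mean (barycenter) of the centroids of every cluster x_i belongs to.
    `memberships` is the list of cluster indices pi(x_i)."""
    cs = [centroids[j] for j in memberships]
    return tuple(sum(vals) / len(cs) for vals in zip(*cs))

def okm_objective(points, assignments, centroids):
    """Q'(pi) = sum over all observations of || x_i - phi(x_i) ||^2."""
    total = 0.0
    for x, mem in zip(points, assignments):
        phi = image(mem, centroids)
        total += sum((a - b) ** 2 for a, b in zip(x, phi))
    return total
```

A point that belongs to a single cluster contributes exactly its K-means error; a point in several clusters is compared against the average of those clusters' centroids, which is how OKM rewards sensible overlap.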
Evaluation metrics
1. Sum of Square Error (SSE)
2. between_ss/total_ss Ratio
3. Number of Iterations
4. F-Measures and FBCubed Measures
5. Rand Index
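The first two metrics can be stated precisely in a few lines. A sketch (illustrative Python; the toy clusters in the usage are hypothetical) of SSE and the between_SS/total_SS ratio as used by R's kmeans output:

```python
def sse(clusters, centroids):
    """Sum of squared errors of each point to its own cluster centroid (within_SS)."""
    return sum(sum((a - b) ** 2 for a, b in zip(p, z))
               for c, z in zip(clusters, centroids) for p in c)

def ss_ratio(clusters):
    """between_SS / total_SS: the share of total variance explained by the clustering.
    total_SS is measured around the grand mean; between_SS = total_SS - within_SS."""
    pts = [p for c in clusters for p in c]
    grand = tuple(sum(vals) / len(pts) for vals in zip(*pts))
    means = [tuple(sum(vals) / len(c) for vals in zip(*c)) for c in clusters]
    total = sum(sum((a - b) ** 2 for a, b in zip(p, grand)) for p in pts)
    within = sse(clusters, means)
    return (total - within) / total
```

A ratio near 1 means the clusters are compact and well separated, which is why it is used below to compare candidate values of k.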
Chapter 2
Literature Survey
Literature Review and Comparative Study
Title | Publication
1. Applications of Partition based Clustering Algorithms: A Survey | IEEE 2013
2. Performance Evaluation of a Novel Hybrid Clustering Algorithm using Birch and K-Means | IEEE 2015
3. Disease Prediction using Hybrid K-means and Support Vector Machine | IEEE 2016
4. Sorted K-Means Towards the Enhancement of K-Means to Form Stable Clusters | Springer 2017
5. An enhanced deterministic K-means clustering algorithm for cancer subtype prediction from gene expression data | Elsevier 2017
6. An Improved Overlapping k-Means Clustering Method for Medical Applications | Elsevier 2017
Title | Techniques used | Strengths | Weaknesses

1. Performance Evaluation of a Novel Hybrid Clustering Algorithm using Birch and K-Means
Techniques: K-means clustering algorithm and BIRCH.
Strengths: better performance than K-means and K-medoid clustering; handles large datasets more effectively.
Weaknesses: results vary with the value of k; computation time can be reduced further.

2. Disease Prediction using Hybrid K-means and Support Vector Machine
Techniques: a hybrid K-means algorithm that uses silhouette values to find k and the initial centroids, combined with a Support Vector Machine.
Strengths: K-means alone achieved 82% accuracy, while the hybrid algorithm achieved 92% on the same dataset.
Weaknesses: accuracy could be further improved by using an improved K-means algorithm and SVM.

3. Sorted K-Means Towards the Enhancement of K-Means to Form Stable Clusters
Techniques: sorted (merge or quick sort) K-means, which determines the initial centroids.
Strengths: effectively and efficiently forms stable clusters with fewer iterations.
Weaknesses: space and time complexity for big data can be improved further.
4. An enhanced deterministic K-means clustering algorithm for cancer subtype prediction from gene expression data
Techniques: a density-based version of K-means.
Strengths: better overall performance than the compared algorithms.
Weaknesses: does not deal with outliers.

5. An Improved Overlapping k-Means Clustering Method for Medical Applications
Techniques: k-harmonic means combined with overlapping k-means (KHMOKM).
Strengths: better performance than OKM; better minimization of the objective function.
Weaknesses: relies on Euclidean distance; depends on the initial selection of the number of clusters k.
Chapter 3
Problem Identification
The following problems were identified during the literature review:
1. Computation time is a big issue for integrated approaches, because integrating two approaches increases
time complexity, which is not acceptable. [paper 2]
2. Some algorithms do not deal properly with outliers, which decreases overall accuracy on large
datasets. [paper 5]
3. Some algorithms do not work well with large datasets because of space-complexity issues. [paper 3]
4. The integrated KHM-OKM method increases the complexity of the algorithm. [paper 6]
5. Most algorithms rely on Euclidean distance to find the closest centroids, which is not
suitable for all types of datasets.
Chapter 4
Existing v/s Proposed Methodology
Chapter 5
DP-1 and MSR Comments
Sr. No. | DP-1 Comments by External | Status
1 | Good literature review |
2 | A detailed algorithm needs to be prepared | Done in MSR
3 | Evaluate the complexity of the proposed approach | Done in MSR

Sr. No. | MSR Comments by External | Status
1 | The proposed algorithm needs to be implemented with a sufficiently large dataset | Done
2 | The implementation and results of the work should be displayed during DP-2 | Done
MSR Comments with Solutions
Comment 1: The proposed algorithm needs to be implemented with a sufficiently large dataset.
Solution: We used the following two large datasets (Lung Cancer and Diabetes Disease):
Sr. No. | Dataset Name | Size
1 | Lung Cancer Dataset | 1000 × 25
2 | Diabetes Disease Dataset | 768 × 9
Comment 2: The implementation and results of the work should be displayed during DP-2.
Solution: The implementation of the proposed algorithm and its results are described in the succeeding
sections.
Lung Cancer Dataset
Diabetes Disease Dataset
Chapter 6
Implementation of Proposed Work
1. Coding and analysis were done in RStudio.
2. The integrated OKM was implemented in the Weka4OC toolbox.
3. Results were recorded and analyzed in Excel 2013.
Step 1: Import the dataset into RStudio and analyze it (Lung Cancer dataset).
Step 2: Apply methods to find the value of k.
Step 3: Apply the proposed harmonic-mean method to find the initial centroids.
Cont…(step 3)
Chapter 7
Result Analysis
(Original OKM v/s Integrated OKM)
1. Existing methodology (original OKM)
Random user inputs
Run the above random inputs in the Weka4OC tool
2. Proposed integrated OKM methodology
Step 1: Determine the appropriate k value through an algorithm
Step 2: Calculate the initial centroid positions through the proposed harmonic-means method
Step 3: Run the inputs generated by the above algorithm in Weka4OC
Proposed OKM Methodology
Step 1: Determine appropriate K Value through algorithm
Ratio of between_SS/total_SS
Methods to find K | Lung Cancer Dataset | Diabetes Dataset
Elbow method (K = 3) | 50.70% | 74.90%
Silhouette method (K = 2) | 39.30% | 55.70%
Gap statistic method (K = 1) | 0.00% | 0.00%
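The elbow method picks the k after which adding clusters stops paying off. One common way to automate that choice, sketched here in Python (illustrative only; the WSS values in the usage are hypothetical, not the thesis measurements), is to take the k with the largest second difference, i.e. the sharpest bend, of the within-cluster sum of squares curve:

```python
def elbow_k(wss_by_k):
    """Pick k at the 'elbow' of the WSS curve: the k where the curve bends most,
    measured by the discrete second difference WSS(k-1) - 2*WSS(k) + WSS(k+1)."""
    ks = sorted(wss_by_k)
    best_k, best_bend = None, float("-inf")
    for k in ks[1:-1]:                       # interior k values only
        bend = wss_by_k[k - 1] - 2 * wss_by_k[k] + wss_by_k[k + 1]
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k
```

In practice one would run K-means for each candidate k to obtain the WSS values, then apply this selection; it removes the need for the user to supply k, in the spirit of the proposed approach.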
Step 2: Calculate the initial centroid positions through the proposed harmonic-means method
(after pre-processing, if required)
Step 3: Input the best value of K and the initial centroid positions calculated in the above
step into Weka4OC
Comparison of Results:
Lung Cancer Dataset
‘8 scenarios in which 4 users enter random values in 8 different ways’ vs. the integrated OKM algorithm

User | K | Initial centroid position | Centroid values | Precision | Recall | F-measure | Rand Index | BCubed Precision | # Iterations
Random inputs:
User 1_R | 2 | Randomly generated | NA | 0.143 | 0.904 | 0.247 | 0.244 | 0.107 | 11
User 1_U | 2 | Random input by user | 56, 789 | 0.141 | 0.911 | 0.243 | 0.223 | 0.114 | 8
User 2_R | 3 | Randomly generated | NA | 0.159 | 0.777 | 0.264 | 0.352 | 0.116 | 7
User 2_U | 3 | Random input by user | 45, 578, 899 | 0.142 | 0.798 | 0.242 | 0.253 | 0.116 | 9
User 3_R | 4 | Randomly generated | NA | 0.169 | 0.852 | 0.283 | 0.293 | 0.100 | 9
User 3_U | 4 | Random input by user | 23, 456, 678, 890 | 0.169 | 0.801 | 0.278 | 0.322 | 0.113 | 11
User 4_R | 5 | Randomly generated | NA | 0.188 | 0.853 | 0.309 | 0.325 | 0.101 | 9
User 4_U | 5 | Random input by user | 23, 456, 658, 123, 897 | 0.196 | 0.631 | 0.300 | 0.479 | 0.173 | 8
Average (random inputs) | | | | 0.163375 | 0.815875 | 0.27075 | 0.311375 | 0.1175 | 9
Proposed OKM:
Integrated OKM algorithm | 3 | Harmonic-mean method | 1, 329, 685 | 0.167 | 0.920 | 0.283 | 0.438 | 0.175 | 3
Graphs for Comparison of Results: Lung Cancer Dataset (1)-(3)
[Bar charts comparing the metrics in the table above across the user scenarios and the proposed OKM algorithm.]
Comparison of Results:
Diabetes Disease Dataset
‘8 scenarios in which 4 users enter random values in 8 different ways’ vs. the integrated OKM algorithm

User | K | Initial centroid position | Centroid values | Precision | Recall | F-measure | Rand Index | BCubed Precision | # Iterations
Random inputs:
User 1_R | 2 | Randomly generated | NA | 0.117 | 0.895 | 0.206 | 0.171 | 0.111 | 19
User 1_U | 2 | Random input by user | 54, 678 | 0.117 | 0.895 | 0.206 | 0.171 | 0.111 | 18
User 2_R | 3 | Randomly generated | NA | 0.110 | 0.742 | 0.192 | 0.248 | 0.099 | 14
User 2_U | 3 | Random input by user | 16, 390, 481 | 0.116 | 0.815 | 0.203 | 0.229 | 0.095 | 25
User 3_R | 4 | Randomly generated | NA | 0.115 | 0.656 | 0.196 | 0.352 | 0.114 | 25
User 3_U | 4 | Random input by user | 4, 317, 590, 712 | 0.111 | 0.782 | 0.194 | 0.217 | 0.098 | 29
User 4_R | 5 | Randomly generated | NA | 0.110 | 0.610 | 0.187 | 0.361 | 0.108 | 23
User 4_U | 5 | Random input by user | 5, 18, 386, 495, 600 | 0.113 | 0.721 | 0.195 | 0.283 | 0.105 | 38
Average (random inputs) | | | | 0.113625 | 0.7645 | 0.197375 | 0.254 | 0.105125 | 23.875
Proposed OKM:
Integrated OKM method | 3 | Harmonic-mean method | 1, 254, 524 | 0.131 | 0.802 | 0.226 | 0.337 | 0.099 | 8
Graphs for Comparison of Results: Diabetes Disease Dataset (1)-(3)
[Bar charts of Precision, Recall, F-measure, Rand Index, BCubed Precision, and number of iterations for each user scenario versus the proposed OKM algorithm; values as in the table above.]
Conclusion
The proposed work showed positive results: it (i) removes the method's dependency on random
input parameters and (ii) normalizes the outliers.
From the above results we found that, barring one or two accuracy measures, the performance of
the proposed integrated OKM tool is better than that of the usual OKM method.
We can also observe that the integrated OKM reduces time complexity in both cases, since the
number of iterations is greatly reduced.
As far as future work is concerned, this thesis provides a base for further research on effective,
improved clustering, which can have a long-lasting positive impact on the medical field and many
other fields.
Bibliography
PAPERS:
1. Argenis A. Aroche-Villarruel, J.A. Carrasco-Ochoa, José Fco. Martínez-Trinidad, J. Arturo Olvera-López and Airel Pérez-Suárez, “Study of Overlapping Clustering Algorithms Based on K-means through FBCubed Metric”, Springer International Publishing Switzerland, 2014.
2. A. Dharmarajan and T. Velmurugan, “Applications of Partition based Clustering Algorithms: A Survey”, IEEE, 2013.
3. Jaskaranjit Kaur and Harpreet Singh, “Performance Evaluation of a Novel Hybrid Clustering Algorithm using Birch and K-Means”, IEEE, 2015.
4. Sandeep Kaur and Sheetal Kalra, “Disease Prediction using Hybrid K-means and Support Vector Machine”, IEEE, 2016.
5. Preeti Arora, Deepali Virmani, Himanshu Jindal and Mritunjaya Sharma, “Sorted K-Means Towards the Enhancement of K-Means to Form Stable Clusters”, Proceedings of the International Conference on Communication and Networks, Springer, 2017.
6. N. Nidheesh, K.A. Abdul Nazeer and P.M. Ameer, “An enhanced deterministic K-means clustering algorithm for cancer subtype prediction from gene expression data”, Computers in Biology and Medicine, Elsevier, 2017.
7. Sina Khanmohammadi, Naiier Adibeig and Samaneh Shanehbandy, “An Improved Overlapping k-Means Clustering Method for Medical Applications”, Expert Systems With Applications, Elsevier, 2016.
8. Hailong Chen and Chunli Liu, “Research and Application of Cluster Analysis Algorithm”, 2nd International Conference on Measurement, Information and Control, IEEE, 2013.
9. Shraddha Shukla and Naganna S., “A Review on K-means Data Clustering Approach”, International Journal of Information & Computation Technology, 2014.
10. L.V. Bijuraj, “Clustering and its applications”, Proceedings of the National Conference on New Horizons in IT (NCNHIT), 2013.
Bibliography
1. Pankaj Saxena and Sushma Lehri, “Analysis of various clustering algorithms of data mining on health informatics”, International Journal of Computer & Communication Technology, 2013.
2. K. Rajalakshmi, S.S. Dhenakaran and N. Roobini, “Comparative Analysis of K-Means Algorithm in Disease Prediction”, International Journal of Science, Engineering and Technology Research (IJSETR), July 2015.
3. Amit Saxena, Mukesh Prasad, Akshansh Gupta, Neha Bharill, Om Prakash Patel, Aruna Tiwari, Meng Joo Er, Weiping Ding and Chin-Teng Lin, “A Review of Clustering Techniques and Developments”, Elsevier, 2017.
4. Guillaume Cleuziou, “An extended version of the k-means method for overlapping clustering”, IEEE, 2008.
WEBSITES
1. https://en.wikipedia.org/wiki/Cluster_analysis#Applications
2. http://stp.lingfil.uu.se/~santinim/ml/2016/Lect_10/10c_UnsupervisedMethods.pdf
3. https://en.wikipedia.org/wiki/K-means_clustering
4. https://www.jstatsoft.org/article/view/v050i10
5. https://en.wikipedia.org/wiki/Silhouette_(clustering)
6. https://en.wikipedia.org/wiki/Correlation_clustering
7. http://www.francescobonchi.com/CCtuto_kdd14.pdf
Paper Publication Certificate
  • 3. Motivation of Work K-means is a very old concept that keeps growing in popularity because of its simplicity and linear time complexity. However, it has two main disadvantages: 1) it is highly sensitive to outliers, and 2) it is highly dependent on initialization parameters (the random choice of k and the positions of the initial cluster centroids). Many improved variants of the K-means method are described in the literature, but it remains an open field of research because of its extensive applications in medicine, business and marketing, social-media sentiment analysis, etc. Overlapping K-means is an extended version of K-means; it is a fairly new concept and is widely used in fields where overlapping clusters are required. As an extension of K-means, it also needs improvement: there is considerable scope for improvement as far as its accuracy and dependability are concerned.
  • 4. Objective "The goal of this research work is to improve the accuracy of the existing overlapping K-means clustering by removing its dependency on initialization parameters (the random choice of k and the placement of the initial cluster centroids) and to evaluate the results using different measures for different applications." To achieve this objective, the proposed algorithm performs the following steps: 1) Preprocess the raw dataset. 2) Calculate the optimum value of K (derived entirely from the dataset, NOT taken as input from the user). 3) Find the positions of the initial centroids (using the proposed Harmonic Means method, not as random input from the user), and then apply OKM using the above results.
  • 6. Clustering • Objective: to find natural groupings among objects. • It is an unsupervised learning problem that deals with finding structure in a collection of unlabelled data. • Organize the data in such a way that objects within a cluster are highly similar to one another and dissimilar to objects in other clusters.
  • 7. Clustering category based on generated clusters 1. Exclusive (Non-overlapping) Clustering 2. Overlapping Clustering
  • 8. Why Overlapping Clustering? Most existing clustering methods assume that each data observation belongs to one and only one cluster, leading to k disjoint clusters explaining the data. However, in many applications the data being modeled has a much richer and more complex hidden representation in which observations actually belong to multiple clusters. • In social network analysis, community-extraction algorithms need to detect overlapping clusters, since an actor can belong to multiple communities. • In text clustering, learning methods should be able to group documents that discuss more than one topic into several groups. • In the medical domain, various diseases share common overlapping symptoms; for example, fever is a common symptom of typhoid, malaria, viral infections and many others.
  • 9. K-means clustering • Partitioning method for clustering. • Objective: takes the input parameter k and partitions a set of n objects into k clusters. • Dissimilarity measures for K-means are Euclidean distance and Manhattan distance. • The cluster mean is used to update the centroid of that cluster. • The aim of K-means is to minimize the objective function, or square-error criterion, defined as $E = \sum_{i=1}^{k} \sum_{p \in C_i} \lVert p - m_i \rVert^2$, where E is the sum of the squared error over all objects in the data set, p is the point in space representing a given object, and $m_i$ is the mean of cluster $C_i$ (both p and $m_i$ are multidimensional).
  • 10. How does K-means work? 1. Initialization: • Randomly choose cluster centroids for K = 2.
  • 11. 2. Cluster Assignment: • Compute the distance between the data points and the cluster centroid by using dissimilarity measures. • Depending upon the minimum distance, data points are divided into 2 clusters.
  • 12. 3. Move Centroid: • Compute the mean of blue dots and reposition blue centroid to this mean • Compute the mean of orange dots and reposition orange centroid to this mean
  • 13. 4. Optimization and Convergence: • Repeat the previous two steps iteratively until the cluster centroids stop changing position. • When further computation no longer changes the clusters, the algorithm has converged. • Below is the final clustering.
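The four steps above (initialize, assign, move centroids, converge) can be sketched in plain Python. This is a minimal illustration using Euclidean distance; the toy points, K = 2, and the seeded random initialization are assumptions for the example, not values from the thesis:

```python
import math
import random

def euclidean(p, q):
    # Dissimilarity measure: straight-line distance between two points
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeans(points, k, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # 1. Initialization: random centroids
    while True:
        clusters = [[] for _ in range(k)]        # 2. Cluster assignment by minimum distance
        for p in points:
            j = min(range(k), key=lambda j: euclidean(p, centroids[j]))
            clusters[j].append(p)
        # 3. Move each centroid to the mean of its cluster (empty cluster keeps its centroid)
        new = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[j]
               for j, cl in enumerate(clusters)]
        if new == centroids:                     # 4. Convergence: centroids stopped moving
            return new, clusters
        centroids = new

pts = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
cents, cls = kmeans(pts, 2)                      # two well-separated groups of two points
```

On well-separated data like this toy set, the loop converges in a few iterations regardless of which points are drawn as the initial centroids; the sensitivity to initialization that the thesis targets shows up on less separated data.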
  • 14. Advantages and Disadvantages Advantages: • Easy to implement and robust. • Relatively scalable and efficient in processing large data sets, with linear time complexity. • Produces tighter clusters than hierarchical clustering. Disadvantages: • Applicable only when the mean of a cluster is defined. • Cannot be applied to categorical attributes. • Sensitive to the selection of the number of clusters k and the initial cluster centers. • Not suitable for discovering clusters with nonconvex shapes or clusters of very different sizes. • Sensitive to noise and outlier data points.
  • 15. Overlapping K-means (OKM) • The OKM method extends the objective function used in K-means to consider the possibility of overlapping clusters. • The K-means algorithm aims at clustering $X = \{x_1, \dots, x_n\}$ into k clusters by minimizing the following objective function: $Q(\pi) = \sum_{j=1}^{k} \sum_{x_i \in \pi_j} \lVert x_i - z_j \rVert^2$, where $x_i$ is a v-dimensional observation, $\pi = \{\pi_1, \dots, \pi_k\}$ is the set of k clusters ($\pi_i \cap \pi_j = \emptyset$), and $Z = \{z_1, \dots, z_k\}$ is the set of cluster centroids.
  • 16. Cont… • The OKM approach relaxes the objective function of K-means to allow overlapping by removing the constraint $\pi_i \cap \pi_j = \emptyset$ for $i \neq j$. • The objective function of OKM is defined as $Q'(\pi) = \sum_{i=1}^{n} \lVert x_i - \phi(x_i) \rVert^2$. • Here $\phi(x_i)$, the representation of $x_i$ (also called its 'image' or 'barycenter of clusters'), is defined as a combination of the centroids $z_j$ of the clusters $\pi_j$ to which $x_i$ belongs, computed as $\phi(x_i) = \frac{\sum_{z_j \in \pi(x_i)} z_j}{|\pi(x_i)|}$.
  • 17. Cont… Here the centroid $z_j \in \pi(x_i)$, where $\pi(x_i)$ is the list of all clusters that $x_i$ belongs to. The centroid $z_j$ is updated using the following equation: $z_j^{*} = \frac{1}{\sum_{x_i \in \pi_j} \frac{1}{\delta_i^2}} \sum_{x_i \in \pi_j} \frac{1}{\delta_i^2} \Big( \delta_i \, x_i - \sum_{z_l \in \pi(x_i),\, l \neq j} z_l \Big)$, where $\delta_i$ is the total number of clusters that $x_i$ belongs to (in this case $\delta_i = |\pi(x_i)|$).
  • 18. Evaluation metrics 1. Sum of Square Error (SSE) 2. between_ss/total_ss Ratio 3. Number of Iterations 4. F-Measures and FBCubed Measures 5. Rand Index
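Of the metrics listed, the Rand Index has a simple concrete form for non-overlapping labelings: the fraction of point pairs on which two clusterings agree, i.e. pairs placed together in both or separated in both. A stdlib sketch (the label vectors are toy assumptions):

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    # Agreement over all point pairs: same-cluster in both labelings, or split in both
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                for i, j in pairs)
    return agree / len(pairs)

print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # same partition, renamed labels -> 1.0
```

Because it compares pairs rather than label names, the Rand Index is invariant to relabeling of clusters, which is why it suits clustering evaluation.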
  • 20. Literature Review and Comparative Study — Title (Publication): 1. Applications of Partition based Clustering Algorithms: A Survey (IEEE 2013). 2. Performance Evaluation of a Novel Hybrid Clustering Algorithm using Birch and K-Means (IEEE 2015). 3. Disease Prediction using Hybrid K-means and Support Vector Machine (IEEE 2016). 4. Sorted K-Means Towards the Enhancement of K-Means to Form Stable Clusters (Springer 2017). 5. An enhanced deterministic K-means clustering algorithm for cancer subtype prediction from gene expression data (Elsevier 2017). 6. An Improved Overlapping k-Means Clustering Method for Medical Applications (Elsevier 2017).
  • 21. 1. Performance Evaluation of a Novel Hybrid Clustering Algorithm using Birch and K-Means — Techniques: K-means clustering algorithm and BIRCH. Strengths: better performance than K-Means and K-Medoid clustering; handles large datasets more effectively. Weaknesses: results vary with k values; computation time can be reduced further. 2. Disease Prediction using Hybrid K-means and Support Vector Machine — Techniques: hybrid K-means algorithm that uses silhouette values to find k and the initial centroids, plus a Support Vector Machine. Strengths: K-means achieved 82% accuracy and the hybrid algorithm 92% on the same dataset. Weaknesses: accuracy can be improved further by using an improved K-means algorithm and SVM. 3. Sorted K-Means Towards the Enhancement of K-Means to Form Stable Clusters — Techniques: sorted (merge or quick sort) K-Means that determines the initial centroids. Strengths: effectively and efficiently forms stable clusters with fewer iterations. Weaknesses: space and time complexity for big data can be improved further.
  • 22. 4. An enhanced deterministic K-means clustering algorithm for cancer subtype prediction from gene expression data — Techniques: density-based version of K-Means. Strengths: better overall performance than the compared algorithms. Weaknesses: does not deal with outliers. 5. An Improved Overlapping k-Means Clustering Method for Medical Applications — Techniques: k-harmonic means and overlapping k-means algorithms (KHMOKM). Strengths: better performance than OKM; better minimization of the objective function. Weaknesses: relies on the Euclidean distance; depends on the initial selection of the number of clusters k.
  • 23. Chapter 3 Problem Identification Some problems identified during the literature review are as follows. 1. The computation time of an integrated approach is a big issue, because integrating two approaches increases the time complexity, which is not acceptable. [paper 2] 2. Some algorithms do not deal properly with outliers, which decreases overall accuracy on large datasets. [paper 5] 3. Some algorithms do not work well with large datasets because of space-complexity issues. [paper 3] 4. The integrated KHM-OKM method increases the complexity of the algorithm. [paper 6] 5. Most algorithms rely on Euclidean distance to find the closest centroids, which is not suitable for all types of datasets.
  • 24. Chapter 4 Existing v/s Proposed Methodology
  • 25. Chapter 5 DP-1 and MSR Comments Sr. NO. DP1 Comments by External Status 1. Good Literature Review 2. Detailed Algorithm needs to be prepared. Done in MSR 3. Evaluate Complexity of your proposed approach Done in MSR Sr. NO. MSR Comments by External Status 1. Proposed Algorithm needs to be implemented with sufficiently large dataset Done 2. Implementation and results of work should be displayed during DP-2 Done
  • 26. MSR Comments with solution Comment 1: Proposed Algorithm needs to be implemented with sufficiently large dataset. Solution- We have taken following two large datasets ( Lung Cancer and Diabetes Disease): Sr. No. Dataset Name Size 1. Lung Cancer Dataset 1000*25 2. Diabetes Disease Dataset 768*9 Comment 2: Implementation and results of work should be displayed during DP-2 Solution: Implementation of Proposed algorithm and Results are described in succeeding sections.
  • 29. Chapter 6 Implementation of Proposed Work 1. Coding and analysis are done in RStudio. 2. Implementation of the integrated OKM in the Weka4OC toolbox. 3. Recording and analyzing results in Excel 2013.
  • 30. Step 1: Import Dataset in R Studio and Analysis on it (Lung Cancer Dataset).
  • 31. Step 2: Apply methods to find K values.
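The elbow method used in this step can be sketched as follows: compute the total within-cluster sum of squares (WSS) for increasing k and pick the k where the decrease levels off. The toy data, the basic k-means loop, and the fixed iteration budget below are illustrative assumptions, not the thesis's R code:

```python
import random

def wss(points, k, seed=0):
    # Total within-cluster sum of squares after a basic k-means run
    rng = random.Random(seed)
    cents = rng.sample(points, k)
    for _ in range(50):  # fixed iteration budget instead of a convergence test
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, cents[j])))
            clusters[j].append(p)
        cents = [tuple(sum(v) / len(cl) for v in zip(*cl)) if cl else cents[j]
                 for j, cl in enumerate(clusters)]
    return sum(sum((a - b) ** 2 for a, b in zip(p, cents[j]))
               for j, cl in enumerate(clusters) for p in cl)

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
curve = [wss(pts, k) for k in (1, 2, 3)]
# WSS drops sharply from k=1 to k=2, then flattens: the "elbow" is at k=2
```

For this toy set the drop from k = 1 to k = 2 dwarfs the drop from k = 2 to k = 3, which is exactly the pattern the elbow criterion looks for.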
  • 32. Step 3: Apply Proposed Harmonic Mean Method to find Initial Centroids.
  • 34. Chapter 7 Result Analysis (Original OKM v/s Integrated OKM) 1. Existing methodology (original OKM): take random user inputs and run them in the Weka4OC tool. 2. Proposed integrated OKM methodology: Step 1: Determine the appropriate k-value through the algorithm. Step 2: Calculate the initial centroid positions through the proposed harmonic means method. Step 3: Run the inputs generated by the above algorithm in Weka4OC.
  • 35. Proposed OKM Methodology Step 1: Determine the appropriate K value through the algorithm. Ratio of between_SS/Total_SS (method | Lung Cancer Dataset | Diabetes Dataset): K=3 Elbow Method | 50.70% | 74.90%; K=2 Silhouette Method | 39.30% | 55.70%; K=1 Gap Statistic Method | 0.00% | 0.00%. Step 2: Calculate the initial centroid positions through the proposed Harmonic Means method (after pre-processing, if required). Step 3: Input the best value of K and the initial centroid positions calculated above into Weka4OC.
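The between_SS/total_SS ratio used to compare the K-selection methods is the share of total variance explained by the clustering: total_SS is the scatter around the grand mean, within_SS the scatter around each cluster's own mean, and between_SS their difference. A sketch with assumed toy clusters:

```python
def ss_ratio(points, clusters):
    # between_SS / total_SS: fraction of total variance explained by the clustering
    n = len(points)
    dims = len(points[0])
    grand = tuple(sum(p[d] for p in points) / n for d in range(dims))
    total_ss = sum(sum((a - g) ** 2 for a, g in zip(p, grand)) for p in points)
    within_ss = 0.0
    for cl in clusters:
        c = tuple(sum(p[d] for p in cl) / len(cl) for d in range(dims))
        within_ss += sum(sum((a - b) ** 2 for a, b in zip(p, c)) for p in cl)
    return (total_ss - within_ss) / total_ss   # between_SS = total_SS - within_SS

clusters = [[(1.0, 1.0), (1.0, 2.0)], [(8.0, 8.0), (9.0, 8.0)]]
pts = [p for cl in clusters for p in cl]
print(round(ss_ratio(pts, clusters), 3))   # near 1.0 for well-separated clusters
```

A ratio near 1 means the clusters account for almost all of the variance, which is why higher values in the table above indicate a better choice of K.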
  • 36. Comparison of Results: Lung Cancer Dataset — '8 different scenarios where 4 users enter random values in 8 different ways' vs the 'integrated OKM algorithm'. Columns: User | K | Initial centroid position | Centroid values | Precision | Recall | F-measure | Rand Index | BCubed Precision | #Iterations.
User 1_R | 2 | randomly generated | NA | 0.143 | 0.904 | 0.247 | 0.244 | 0.107 | 11
User 1_U | 2 | random user input | 56, 789 | 0.141 | 0.911 | 0.243 | 0.223 | 0.114 | 8
User 2_R | 3 | randomly generated | NA | 0.159 | 0.777 | 0.264 | 0.352 | 0.116 | 7
User 2_U | 3 | random user input | 45, 578, 899 | 0.142 | 0.798 | 0.242 | 0.253 | 0.116 | 9
User 3_R | 4 | randomly generated | NA | 0.169 | 0.852 | 0.283 | 0.293 | 0.1 | 9
User 3_U | 4 | random user input | 23, 456, 678, 890 | 0.169 | 0.801 | 0.278 | 0.322 | 0.113 | 11
User 4_R | 5 | randomly generated | NA | 0.188 | 0.853 | 0.309 | 0.325 | 0.101 | 9
User 4_U | 5 | random user input | 23, 456, 658, 123, 897 | 0.196 | 0.631 | 0.3 | 0.479 | 0.173 | 8
Average (random inputs) | | | | 0.163375 | 0.815875 | 0.27075 | 0.311375 | 0.1175 | 9
Integrated OKM algorithm | 3 | Harmonic Mean method | 1, 329, 685 | 0.167 | 0.92 | 0.283 | 0.438 | 0.175 | 3
  • 37.
  • 38. Graphs for Comparison of Results: Lung Cancer Data Set (1)
  • 39. Graphs for Comparison of Results: Lung Cancer Data Set (2)
  • 40. Graphs for Comparison of Results: Lung Cancer Data Set (3)
  • 41. Comparison of Results: Diabetes Disease Dataset — '8 different scenarios where 4 users enter random values in 8 different ways' vs the 'integrated OKM algorithm'. Columns: User | K | Initial centroid position | Centroid values | Precision | Recall | F-measure | Rand Index | BCubed Precision | #Iterations.
User 1_R | 2 | randomly generated | NA | 0.117 | 0.895 | 0.206 | 0.171 | 0.111 | 19
User 1_U | 2 | random user input | 54, 678 | 0.117 | 0.895 | 0.206 | 0.171 | 0.111 | 18
User 2_R | 3 | randomly generated | NA | 0.11 | 0.742 | 0.192 | 0.248 | 0.099 | 14
User 2_U | 3 | random user input | 16, 390, 481 | 0.116 | 0.815 | 0.203 | 0.229 | 0.095 | 25
User 3_R | 4 | randomly generated | NA | 0.115 | 0.656 | 0.196 | 0.352 | 0.114 | 25
User 3_U | 4 | random user input | 4, 317, 590, 712 | 0.111 | 0.782 | 0.194 | 0.217 | 0.098 | 29
User 4_R | 5 | randomly generated | NA | 0.11 | 0.61 | 0.187 | 0.361 | 0.108 | 23
User 4_U | 5 | random user input | 5, 18, 386, 495, 600 | 0.113 | 0.721 | 0.195 | 0.283 | 0.105 | 38
Average (random inputs) | | | | 0.113625 | 0.7645 | 0.197375 | 0.254 | 0.105125 | 23.875
Integrated OKM method | 3 | Harmonic Mean method | 1, 254, 524 | 0.131 | 0.802 | 0.226 | 0.337 | 0.099 | 8
  • 42.
  • 43. Graphs for Comparison of Results: Diabetes Disease Data Set (1) — bar charts of Precision and Recall for each user scenario versus the proposed OKM algorithm (values as in the table on slide 41).
  • 44. Graphs for Comparison of Results: Diabetes Disease Data Set (2) — bar charts of F-measure and Rand Index for each user scenario versus the proposed OKM algorithm (values as in the table on slide 41).
  • 45. Graphs for Comparison of Results: Diabetes Disease Data Set (3) — bar charts of BCubed Precision and number of iterations for each user scenario versus the proposed OKM algorithm (values as in the table on slide 41).
  • 46. Conclusion The thesis was robust enough to show positive results, as it: (i) removes the method's dependency on random input parameters, and (ii) normalizes the outliers. From the above results we found that, barring one or two accuracy measures, the performance of the proposed integrated OKM is better than that of the usual OKM method. We can also observe that the integrated OKM helps reduce the time complexity in both cases, as the number of iterations is greatly reduced. As far as future work is concerned, this thesis provides a base for further research on effective improved clustering, which can create a long-lasting positive impact on the medical field and many others.
  • 47. Bibliography PAPERS: 1. Argenis A. Aroche-Villarruel1, J.A. Carrasco-Ochoa1, José Fco. Martínez-Trinidad1,J. Arturo Olvera-López2, and Airel Pérez-Suárez3, “Study of Overlapping Clustering Algorithms Based on Kmeans through FBcubed Metric”, Springer International Publishing Switzerland 2014 2. A.Dharmarajan, T. Velmurugan, “Applications of Partition based Clustering Algorithms: A Survey” 2013 IEEE 3. Jaskaranjit Kaur and Harpreet Singh, “Performance Evaluation of a Novel Hybrid Clustering Algorithm using Birch and K-Means” 2015 IEEE 4. Sandeep Kaur and Dr. Sheetal Kalra, “Disease Prediction using Hybrid K-means and Support Vector Machine” 2016 IEEE 5. Preeti Arora, Deepali Virmani, Himanshu Jindal and Mritunjaya Sharma, “Sorted K-Means Towards the Enhancement of K-Means to Form Stable Clusters”, Proceedings of International Conference on Communication and Networks, Springer 2017 6. N. Nidheesh, K.A. Abdul Nazeer, P.M. Ameer, ” An enhanced deterministic K-means clustering algorithm for cancer subtype prediction from gene expression data”, Computers in Biology and Medicine 2017 Elsevier 7. Sina Khanmohammadi, Naiier Adibeig, Samaneh Shanehbandy, “An Improved Overlapping k-Means Clustering Method for Medical Applications”, Expert Systems With Applications 2016 Elsevier 8. Hailong Chen, Chunli LiuZahid, “Research and Application of Cluster Analysis Algorithm”. 2nd International Conference on Measurement, Information and Control, 2013 IEEE 9. Shraddha Shukla and Naganna S, “A Review ON K-means DATA Clustering APPROACH” International Journal of Information & Computation Technology 2014 10. L.V. Bijuraj, “Clustering and its applications”. Proceedings of National Conference on New Horizons in IT - NCNHIT 2013
  • 48. Bibliography 1. Pankaj Saxena and Sushma Lehri, “Analysis of various clustering algorithms of data mining on Health informatics”. International Journal of Computer & Communication Technology 2013 2. K.Rajalakshmi1, Dr.S.S.Dhenakaran2, N.Roobini, “Comparative Analysis of K-Means Algorithm in Disease Prediction” International Journal of Science, Engineering and Technology Research (IJSETR), July 2015 3. Amit Saxena , Mukesh Prasad , Akshansh Gupta , Neha Bharill ,Om Prakash Patel , Aruna Tiwari , Meng Joo Er , Weiping Ding ,Chin-Teng Lin, ” A Review of Clustering Techniques and Developments”. 2017 Elsevier 4. Guillaume Cleuziou, “An extended version of the k-means method for overlapping clustering” 2008 IEEE WEBSITES 1. https://en.wikipedia.org/wiki/Cluster_analysis#Applications 2. http://stp.lingfil.uu.se/~santinim/ml/2016/Lect_10/10c_UnsupervisedMethods.pdf 3. https://en.wikipedia.org/wiki/K-means_clustering 4. https://www.jstatsoft.org/article/view/v050i10 5. https://en.wikipedia.org/wiki/Silhouette_(clustering) 6. https://en.wikipedia.org/wiki/Correlation_clustering 7. http://www.francescobonchi.com/CCtuto_kdd14.pdf