IOSR Journal of Computer Engineering (IOSRJCE)
ISSN: 2278-0661, ISBN: 2278-8727 Volume 6, Issue 5 (Nov. - Dec. 2012), PP 12-15
www.iosrjournals.org
www.iosrjournals.org 12 | Page
K-NN Classifier Performs Better Than K-Means Clustering in
Missing Value Imputation
Ms. R. Malarvizhi¹, Dr. Antony Selvadoss Thanamani²
¹ Research Scholar, Research Department of Computer Science, NGM College, 90, Palghat Road, Pollachi - 642 001, Coimbatore District, Tamilnadu, India
² Professor and Head, Research Department of Computer Science, NGM College, 90, Palghat Road, Pollachi - 642 001, Coimbatore District, Tamilnadu, India
Abstract: Missing data is a widespread problem that can affect the ability to use data to construct effective prediction systems. We analyze predictive performance by comparing K-Means Clustering with the k-NN Classifier for imputing missing values. In our experiments we simulate five missing-data percentages, and we find that k-NN performs better than K-Means Clustering in terms of accuracy.
Keywords: K-Means clustering, k-NN Classifier, Missing Data, Percentage, Predictive Performance
I. Introduction
Data can be missing for several reasons. Researchers increasingly concentrate on imputing missing data with various data mining algorithms. The most traditional missing value imputation techniques are case deletion, mean value imputation, maximum likelihood and other statistical methods. In recent years, research has explored the use of machine learning techniques for missing value imputation. Machine learning methods such as MLP, SOM, k-NN and decision trees have been found to perform better than the traditional statistical methods [1].
In this paper we compare two techniques, K-Nearest Neighbour and K-Means Clustering, each combined with mean substitution. Both techniques partition the dataset into several groups/clusters, and mean substitution is then applied separately within each group/cluster. When the results are compared, k-NN shows a higher percentage of accuracy than K-Means Clustering.
II. Missing Data Mechanism
Missing data can be divided into 'Missing at Random' (MAR), 'Missing Completely at Random' (MCAR) and 'Missing Not at Random' (MNAR).
1. Missing at Random
Missing at Random requires that the cause of missing data be unrelated to the missing values
themselves. However, the cause may be related to other observed variables. MAR is also known as the ignorable
case (Schafer, 1997) and occurs when cases with missing data are different from the observed cases but the
pattern of missing data is predictable from other observed variables. Put differently, the cause of the missing data is an external influence, not the variable itself.
2. Missing Completely at Random
Missing Completely At Random refers to a condition where the probability of data missing is unrelated
to the values of any other variables, whether missing or observed. In this mechanism, cases with complete data
are indistinguishable from cases with incomplete data.
3. Missing Not at Random
Missing Not at Random (MNAR) implies that the missing data mechanism is related to the missing values themselves [4].
III. Missing Data Imputation Methods
Imputation methods involve replacing missing values with estimated ones based on some information
available in the data set. There are many options varying from naïve methods like mean imputation to some
more robust methods based on relationships among attributes.
1. Case substitution
This method is typically used in sample surveys. An instance with missing data (for example, a person who cannot be contacted) is replaced by another, non-sampled instance;
2. Mean and mode
This method consists of replacing the missing data for a given attribute by the mean (quantitative
attribute) or mode (qualitative attribute) of all known values of that attribute;
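As a minimal sketch (ours, not the paper's), mean/mode substitution for a single attribute column can be written as:

```python
from statistics import mean, mode

def impute_column(values, qualitative=False):
    """Replace None entries with the mode (qualitative attribute)
    or the mean (quantitative attribute) of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mode(observed) if qualitative else mean(observed)
    return [fill if v is None else v for v in values]
```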
3. Hot deck and cold deck
In the hot deck method, a missing attribute value is filled in with a value from a distribution estimated from the current data. Hot deck is typically implemented in two stages. In the first stage, the data are partitioned into clusters; in the second stage, each instance with missing data is associated with one cluster. The complete cases in a cluster are used to fill in the missing values, for example by taking the mean or mode of the attribute within the cluster. Cold deck imputation is similar to hot deck, but the data source must be other than the current data source;
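A sketch of the two-stage hot deck idea, assuming records are dictionaries and the clusters are given by an existing key attribute (both assumptions are ours, not the paper's):

```python
from collections import defaultdict

def hot_deck_impute(records, cluster_key, target):
    """Stage 1: partition records into clusters by cluster_key.
    Stage 2: fill each missing target value with the mean of the
    complete cases in the same cluster."""
    clusters = defaultdict(list)
    for r in records:
        clusters[r[cluster_key]].append(r)
    for members in clusters.values():
        observed = [r[target] for r in members if r[target] is not None]
        fill = sum(observed) / len(observed)
        for r in members:
            if r[target] is None:
                r[target] = fill
    return records
```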
4. Prediction model
Prediction models are sophisticated procedures for handling missing data. These methods consist of
creating a predictive model to estimate values that will substitute the missing data. The attribute with missing
data is used as class-attribute, and the remaining attributes are used as input for the predictive model. An
important argument in favour of this approach is that, frequently, attributes have relationships (correlations)
among themselves. In this way, those correlations could be used to create a predictive model for classification or
regression (depending on the attribute type with missing data, being, respectively, nominal or continuous). Some
of these relationships among the attributes may be maintained if they were captured by the predictive model.
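For a continuous attribute, the prediction-model idea can be sketched with a simple least-squares fit on one observed predictor (a deliberate simplification of the general multi-attribute case):

```python
def regression_impute(xs, ys):
    """Fit y = a + b*x on the complete (x, y) pairs, then use the
    fitted model to predict the missing y values."""
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    mx = sum(x for x, _ in pairs) / len(pairs)
    my = sum(y for _, y in pairs) / len(pairs)
    b = (sum((x - mx) * (y - my) for x, y in pairs)
         / sum((x - mx) ** 2 for x, _ in pairs))
    a = my - b * mx
    return [a + b * x if y is None else y for x, y in zip(xs, ys)]
```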
IV. Imputation with K-Means Clustering Algorithm
K-Means is an algorithm that groups objects into K clusters based on their attributes/features, where K is a positive integer. The grouping is done by minimizing the sum of squared distances between the data points and the corresponding cluster centroids. The purpose of K-Means clustering is thus to partition the data.
V. Imputation with K-Nearest Neighbour Algorithm
Nearest Neighbour Analysis is a method for classifying cases based on their similarity to other cases. In
machine learning, it was developed as a way to recognize patterns of data without requiring an exact match to
any stored patterns, or cases. Similar cases are near each other and dissimilar cases are distant from each other.
Thus, the distance between two cases is a measure of their dissimilarity. Cases that are near each other are said
to be “neighbours.” When a new case (holdout) is presented, its distance from each of the cases in the model is
computed. The classifications of the most similar cases – the nearest neighbours – are tallied and the new case is
placed into the category that contains the greatest number of nearest neighbours.
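The tallying scheme described above can be sketched as follows (Euclidean distance and a simple majority vote are assumed here; the paper does not fix these choices explicitly):

```python
def knn_classify(cases, labels, holdout, k=3):
    """Compute the holdout's distance to every stored case, take the
    k nearest neighbours, and return the label with the most votes."""
    by_distance = sorted(
        (sum((a - b) ** 2 for a, b in zip(case, holdout)), label)
        for case, label in zip(cases, labels)
    )
    votes = [label for _, label in by_distance[:k]]
    return max(set(votes), key=votes.count)
```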
VI. Framework And Experimental Analysis
The following framework illustrates the comparison made in this paper. The experimental results show the improvement in performance.
ALGORITHM:
1. Determine the centroid coordinate
2. Determine the distance of each object to the centroids
3. Group the object based on minimum distance (find the closest centroid)
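In one dimension, the steps above can be sketched as follows (the iteration count and the centroid-update step are standard K-Means details assumed here, not stated in the listing):

```python
def kmeans_1d(data, centroids, iterations=10):
    """Repeat: measure each point's distance to every centroid, group
    it with the closest one, then move each centroid to the mean of
    its group."""
    groups = {}
    for _ in range(iterations):
        groups = {i: [] for i in range(len(centroids))}
        for point in data:
            closest = min(range(len(centroids)),
                          key=lambda i: (point - centroids[i]) ** 2)
            groups[closest].append(point)
        centroids = [sum(g) / len(g) if g else centroids[i]
                     for i, g in sorted(groups.items())]
    return centroids, groups
```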
Initially, a test dataset is made by replacing some original values with missing values (NaN, Not a Number). The original dataset and the test dataset are then partitioned into clusters (in the case of K-Means) or groups (in the case of k-NN). The missing values in each group/cluster are filled with the mean value, and the test dataset is compared with the original dataset to find the accuracy. This process is repeated for the missing percentages 2, 5, 10, 15 and 20.
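A sketch of this evaluation loop for a single attribute, where accuracy is taken to be the fraction of imputed entries that match the original values (our reading of the paper's procedure; the imputation function is pluggable):

```python
import random

def accuracy_of_imputation(original, missing_pct, impute, seed=0):
    """Blank out missing_pct percent of the entries, run the given
    imputation function, and score how many of the imputed entries
    match the original dataset."""
    rng = random.Random(seed)
    n_missing = max(1, int(len(original) * missing_pct / 100))
    missing_idx = rng.sample(range(len(original)), n_missing)
    test = [None if i in missing_idx else v for i, v in enumerate(original)]
    filled = impute(test)
    return sum(filled[i] == original[i] for i in missing_idx) / n_missing
```

On a column where mean substitution recovers the blanked values exactly (e.g. a constant column), the score is 1.0; on real data it falls as the missing percentage grows, which is the pattern the table below reflects.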
Missing Percentage   Mean Substitution   K-Means Clustering   k-Nearest Neighbor
2%                   67                  62                   69
5%                   63                  65                   69
10%                  59                  63                   68
15%                  59                  64                   67
20%                  52                  56                   60
Average              60%                 62%                  67%
Comparing the results in this way, k-NN shows some improvement in the percentage of accuracy.
VII. Conclusion And Future Enhancement
K-Means and k-NN methods provide fast and accurate ways of estimating missing values. k-NN-based imputation provides a robust and sensitive approach to estimating missing data, so we recommend the k-NN-based method for imputation of missing values. We also observe that when the missing percentage is high, accuracy decreases regardless of the method used.
This proposed method can be enhanced by comparing further machine learning techniques such as SOM and MLP. Mean substitution can be replaced by the mode, median or standard deviation, or by applying Expectation-Maximization or regression-based methods.
[Framework diagram: The original dataset without missing values is used to create test datasets with missing values. In the k-NN branch, the test data are divided into groups; in the K-Means branch, they are partitioned into clusters. In both branches the missing values are imputed by mean substitution and the result is compared with the original dataset. k-NN performs better in terms of accuracy.]
References
Journal Papers:
[1] J. L. Peugh and C. K. Enders, "Missing Data in Educational Research: A Review of Reporting Practices and Suggestions for Improvement," Review of Educational Research, vol. 74, pp. 525-556, 2004.
[2] E.-L. Silva-Ramírez, M. Pino-Mejías, M.-D. López-Coello, and M.-D. Cubiles-de-la-Vega, "Missing Value Imputation on Missing Completely at Random Data Using Multilayer Perceptrons," Neural Networks, no. 1, 2011.
[3] B. Mehala, P. Ranjit Jeba Thangaiah, and K. Vivekanandan, "Selecting Scalable Algorithms to Deal with Missing Values," International Journal of Recent Trends in Engineering, vol. 1, no. 2, May 2009.
[4] G. E. A. P. A. Batista and M. C. Monard, "A Study of K-Nearest Neighbour as an Imputation Method."
[5] P. D. Allison, Missing Data. Thousand Oaks, CA: Sage, 2001.
[6] D. A. Bennett, "How Can I Deal with Missing Data in My Study?" Australian and New Zealand Journal of Public Health, vol. 25, pp. 464-469, 2001.
[7] K. Wagstaff, "Clustering with Missing Values: No Imputation Required," NSF grant IIS-0325329, pp. 1-10.
[8] S. Zhang, J. Zhang, X. Zhu, Y. Qin, and C. Zhang, "Missing Value Imputation Based on Data Clustering," Springer-Verlag, Berlin, Heidelberg, 2008.
[9] R. J. Hathaway, J. C. Bezdek, and J. M. Huband, "Scalable Visual Assessment of Cluster Tendency for Large Data Sets," Pattern Recognition, vol. 39, no. 7, pp. 1315-1324, Feb 2006.
[10] Q. Song and M. Shepperd, "A New Imputation Method for Small Software Project Data Sets," The Journal of Systems and Software, vol. 80, pp. 51-62, 2007.
[11] G. L. Schlomer, S. Bauman, and N. A. Card, "Best Practices for Missing Data Management in Counseling Psychology," Journal of Counseling Psychology, vol. 57, no. 1, pp. 1-10, 2010.
[12] R. Kavitha Kumar and R. M. Chandrasekar, "Missing Data Imputation in Cardiac Data Set," International Journal on Computer Science and Engineering, vol. 02, no. 05, pp. 1836-1840, 2010.
[13] J. Ma, N. Akhtar-Danesh, L. Dolovich, and L. Thabane, "Imputation Strategies for Missing Binary Outcomes in Cluster Randomized Trials," BMC Medical Research Methodology, vol. 11, p. 18, 2011.
[14] R. S. Somasundaram and R. Nedunchezhian, "Evaluation of Three Simple Imputation Methods for Enhancing Preprocessing of Data with Missing Values," International Journal of Computer Applications, vol. 21, no. 10, pp. 14-19, May 2011.
[15] K. Raja, G. Tholkappia Arasu, and Chitra S. Nair, "Imputation Framework for Missing Value," International Journal of Computer Trends and Technology, vol. 3, issue 2, 2012.
[16] B. L. Wall and J. K. Elser, "Imputation of Missing Data for Input to Support Vector Machines,"
Theses:
[17] B. M. Marlin, "Missing Data Problems in Machine Learning," 2008.

Más contenido relacionado

La actualidad más candente

WisconsinBreastCancerDiagnosticClassificationusingKNNandRandomForest
WisconsinBreastCancerDiagnosticClassificationusingKNNandRandomForestWisconsinBreastCancerDiagnosticClassificationusingKNNandRandomForest
WisconsinBreastCancerDiagnosticClassificationusingKNNandRandomForest
Sheing Jing Ng
 
48 modified paper id 0051 edit septian
48 modified paper id 0051 edit septian48 modified paper id 0051 edit septian
48 modified paper id 0051 edit septian
IAESIJEECS
 
Improving K-NN Internet Traffic Classification Using Clustering and Principle...
Improving K-NN Internet Traffic Classification Using Clustering and Principle...Improving K-NN Internet Traffic Classification Using Clustering and Principle...
Improving K-NN Internet Traffic Classification Using Clustering and Principle...
journalBEEI
 
Performance analysis of neural network models for oxazolines and oxazoles der...
Performance analysis of neural network models for oxazolines and oxazoles der...Performance analysis of neural network models for oxazolines and oxazoles der...
Performance analysis of neural network models for oxazolines and oxazoles der...
ijistjournal
 

La actualidad más candente (20)

IRJET- Classification of Chemical Medicine or Drug using K Nearest Neighb...
IRJET-  	  Classification of Chemical Medicine or Drug using K Nearest Neighb...IRJET-  	  Classification of Chemical Medicine or Drug using K Nearest Neighb...
IRJET- Classification of Chemical Medicine or Drug using K Nearest Neighb...
 
Influence over the Dimensionality Reduction and Clustering for Air Quality Me...
Influence over the Dimensionality Reduction and Clustering for Air Quality Me...Influence over the Dimensionality Reduction and Clustering for Air Quality Me...
Influence over the Dimensionality Reduction and Clustering for Air Quality Me...
 
20 26 jan17 walter latex
20 26 jan17 walter latex20 26 jan17 walter latex
20 26 jan17 walter latex
 
02 Related Concepts
02 Related Concepts02 Related Concepts
02 Related Concepts
 
WisconsinBreastCancerDiagnosticClassificationusingKNNandRandomForest
WisconsinBreastCancerDiagnosticClassificationusingKNNandRandomForestWisconsinBreastCancerDiagnosticClassificationusingKNNandRandomForest
WisconsinBreastCancerDiagnosticClassificationusingKNNandRandomForest
 
B0930610
B0930610B0930610
B0930610
 
Cancer data partitioning with data structure and difficulty independent clust...
Cancer data partitioning with data structure and difficulty independent clust...Cancer data partitioning with data structure and difficulty independent clust...
Cancer data partitioning with data structure and difficulty independent clust...
 
COMPARISION OF PERCENTAGE ERROR BY USING IMPUTATION METHOD ON MID TERM EXAMIN...
COMPARISION OF PERCENTAGE ERROR BY USING IMPUTATION METHOD ON MID TERM EXAMIN...COMPARISION OF PERCENTAGE ERROR BY USING IMPUTATION METHOD ON MID TERM EXAMIN...
COMPARISION OF PERCENTAGE ERROR BY USING IMPUTATION METHOD ON MID TERM EXAMIN...
 
48 modified paper id 0051 edit septian
48 modified paper id 0051 edit septian48 modified paper id 0051 edit septian
48 modified paper id 0051 edit septian
 
An Empirical Study for Defect Prediction using Clustering
An Empirical Study for Defect Prediction using ClusteringAn Empirical Study for Defect Prediction using Clustering
An Empirical Study for Defect Prediction using Clustering
 
A MIXTURE MODEL OF HUBNESS AND PCA FOR DETECTION OF PROJECTED OUTLIERS
A MIXTURE MODEL OF HUBNESS AND PCA FOR DETECTION OF PROJECTED OUTLIERSA MIXTURE MODEL OF HUBNESS AND PCA FOR DETECTION OF PROJECTED OUTLIERS
A MIXTURE MODEL OF HUBNESS AND PCA FOR DETECTION OF PROJECTED OUTLIERS
 
A PROCESS OF LINK MINING
A PROCESS OF LINK MININGA PROCESS OF LINK MINING
A PROCESS OF LINK MINING
 
Survey paper on Big Data Imputation and Privacy Algorithms
Survey paper on Big Data Imputation and Privacy AlgorithmsSurvey paper on Big Data Imputation and Privacy Algorithms
Survey paper on Big Data Imputation and Privacy Algorithms
 
Terminology Machine Learning
Terminology Machine LearningTerminology Machine Learning
Terminology Machine Learning
 
Improving K-NN Internet Traffic Classification Using Clustering and Principle...
Improving K-NN Internet Traffic Classification Using Clustering and Principle...Improving K-NN Internet Traffic Classification Using Clustering and Principle...
Improving K-NN Internet Traffic Classification Using Clustering and Principle...
 
Link prediction in networks with core-fringe structure
Link prediction in networks with core-fringe structureLink prediction in networks with core-fringe structure
Link prediction in networks with core-fringe structure
 
A statistical data fusion technique in virtual data integration environment
A statistical data fusion technique in virtual data integration environmentA statistical data fusion technique in virtual data integration environment
A statistical data fusion technique in virtual data integration environment
 
15-088-pub
15-088-pub15-088-pub
15-088-pub
 
Az36311316
Az36311316Az36311316
Az36311316
 
Performance analysis of neural network models for oxazolines and oxazoles der...
Performance analysis of neural network models for oxazolines and oxazoles der...Performance analysis of neural network models for oxazolines and oxazoles der...
Performance analysis of neural network models for oxazolines and oxazoles der...
 

Destacado

Effect of Replacement of Sweet Orange (Citrus Sinensis) Peel Meal with Maize ...
Effect of Replacement of Sweet Orange (Citrus Sinensis) Peel Meal with Maize ...Effect of Replacement of Sweet Orange (Citrus Sinensis) Peel Meal with Maize ...
Effect of Replacement of Sweet Orange (Citrus Sinensis) Peel Meal with Maize ...
IOSR Journals
 
Zinc Sulfate As a Growth Disruptor for Spodoptera littoralis With Reference t...
Zinc Sulfate As a Growth Disruptor for Spodoptera littoralis With Reference t...Zinc Sulfate As a Growth Disruptor for Spodoptera littoralis With Reference t...
Zinc Sulfate As a Growth Disruptor for Spodoptera littoralis With Reference t...
IOSR Journals
 

Destacado (20)

I0355561
I0355561I0355561
I0355561
 
An Improvement in Power Management in green Computing using Neural Networks
An Improvement in Power Management in green Computing using Neural NetworksAn Improvement in Power Management in green Computing using Neural Networks
An Improvement in Power Management in green Computing using Neural Networks
 
Performing of the MPPSO Optimization Algorithm to Minimize Line Voltage THD o...
Performing of the MPPSO Optimization Algorithm to Minimize Line Voltage THD o...Performing of the MPPSO Optimization Algorithm to Minimize Line Voltage THD o...
Performing of the MPPSO Optimization Algorithm to Minimize Line Voltage THD o...
 
Comparison Study of Lossless Data Compression Algorithms for Text Data
Comparison Study of Lossless Data Compression Algorithms for Text DataComparison Study of Lossless Data Compression Algorithms for Text Data
Comparison Study of Lossless Data Compression Algorithms for Text Data
 
Capacitive voltage and current induction phenomena in GIS substation
Capacitive voltage and current induction phenomena in GIS substationCapacitive voltage and current induction phenomena in GIS substation
Capacitive voltage and current induction phenomena in GIS substation
 
A0710113
A0710113A0710113
A0710113
 
Performance evaluation of Hard and Soft Wimax by using PGP and PKM protocols ...
Performance evaluation of Hard and Soft Wimax by using PGP and PKM protocols ...Performance evaluation of Hard and Soft Wimax by using PGP and PKM protocols ...
Performance evaluation of Hard and Soft Wimax by using PGP and PKM protocols ...
 
Effect of Replacement of Sweet Orange (Citrus Sinensis) Peel Meal with Maize ...
Effect of Replacement of Sweet Orange (Citrus Sinensis) Peel Meal with Maize ...Effect of Replacement of Sweet Orange (Citrus Sinensis) Peel Meal with Maize ...
Effect of Replacement of Sweet Orange (Citrus Sinensis) Peel Meal with Maize ...
 
Modeling Of Carbon Deposit From Methane Gas On Zeolite Y Catalyst Activity In...
Modeling Of Carbon Deposit From Methane Gas On Zeolite Y Catalyst Activity In...Modeling Of Carbon Deposit From Methane Gas On Zeolite Y Catalyst Activity In...
Modeling Of Carbon Deposit From Methane Gas On Zeolite Y Catalyst Activity In...
 
Поэма А.С. Пушкина "Руслан и Людмила"
Поэма А.С. Пушкина "Руслан и Людмила"Поэма А.С. Пушкина "Руслан и Людмила"
Поэма А.С. Пушкина "Руслан и Людмила"
 
Zinc Sulfate As a Growth Disruptor for Spodoptera littoralis With Reference t...
Zinc Sulfate As a Growth Disruptor for Spodoptera littoralis With Reference t...Zinc Sulfate As a Growth Disruptor for Spodoptera littoralis With Reference t...
Zinc Sulfate As a Growth Disruptor for Spodoptera littoralis With Reference t...
 
царскосельский лицей а.с.пушкина
царскосельский лицей а.с.пушкинацарскосельский лицей а.с.пушкина
царскосельский лицей а.с.пушкина
 
Q7
Q7Q7
Q7
 
Power Quality Improvement in Faulty Conditions using Tuned Harmonic Filters
Power Quality Improvement in Faulty Conditions using Tuned Harmonic FiltersPower Quality Improvement in Faulty Conditions using Tuned Harmonic Filters
Power Quality Improvement in Faulty Conditions using Tuned Harmonic Filters
 
Performance Analysis of CSI Based PV system During LL and TPG faults
Performance Analysis of CSI Based PV system During LL and TPG faultsPerformance Analysis of CSI Based PV system During LL and TPG faults
Performance Analysis of CSI Based PV system During LL and TPG faults
 
B0441418
B0441418B0441418
B0441418
 
Planning for Planning
Planning for PlanningPlanning for Planning
Planning for Planning
 
Detection of Malignancy in Digital Mammograms from Segmented Breast Region Us...
Detection of Malignancy in Digital Mammograms from Segmented Breast Region Us...Detection of Malignancy in Digital Mammograms from Segmented Breast Region Us...
Detection of Malignancy in Digital Mammograms from Segmented Breast Region Us...
 
D0931621
D0931621D0931621
D0931621
 
The Wind Powered Generator
The Wind Powered GeneratorThe Wind Powered Generator
The Wind Powered Generator
 

Similar a K-NN Classifier Performs Better Than K-Means Clustering in Missing Value Imputation

Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)
IJERD Editor
 
Clustering heterogeneous categorical data using enhanced mini batch K-means ...
Clustering heterogeneous categorical data using enhanced mini  batch K-means ...Clustering heterogeneous categorical data using enhanced mini  batch K-means ...
Clustering heterogeneous categorical data using enhanced mini batch K-means ...
IJECEIAES
 
Ensemble based Distributed K-Modes Clustering
Ensemble based Distributed K-Modes ClusteringEnsemble based Distributed K-Modes Clustering
Ensemble based Distributed K-Modes Clustering
IJERD Editor
 
Distribution Similarity based Data Partition and Nearest Neighbor Search on U...
Distribution Similarity based Data Partition and Nearest Neighbor Search on U...Distribution Similarity based Data Partition and Nearest Neighbor Search on U...
Distribution Similarity based Data Partition and Nearest Neighbor Search on U...
Editor IJMTER
 

Similar a K-NN Classifier Performs Better Than K-Means Clustering in Missing Value Imputation (20)

Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)
 
Particle Swarm Optimization based K-Prototype Clustering Algorithm
Particle Swarm Optimization based K-Prototype Clustering Algorithm Particle Swarm Optimization based K-Prototype Clustering Algorithm
Particle Swarm Optimization based K-Prototype Clustering Algorithm
 
I017235662
I017235662I017235662
I017235662
 
Clustering heterogeneous categorical data using enhanced mini batch K-means ...
Clustering heterogeneous categorical data using enhanced mini  batch K-means ...Clustering heterogeneous categorical data using enhanced mini  batch K-means ...
Clustering heterogeneous categorical data using enhanced mini batch K-means ...
 
Ensemble based Distributed K-Modes Clustering
Ensemble based Distributed K-Modes ClusteringEnsemble based Distributed K-Modes Clustering
Ensemble based Distributed K-Modes Clustering
 
Presentation on K-Means Clustering
Presentation on K-Means ClusteringPresentation on K-Means Clustering
Presentation on K-Means Clustering
 
Distribution Similarity based Data Partition and Nearest Neighbor Search on U...
Distribution Similarity based Data Partition and Nearest Neighbor Search on U...Distribution Similarity based Data Partition and Nearest Neighbor Search on U...
Distribution Similarity based Data Partition and Nearest Neighbor Search on U...
 
A Comparative Study Of Various Clustering Algorithms In Data Mining
A Comparative Study Of Various Clustering Algorithms In Data MiningA Comparative Study Of Various Clustering Algorithms In Data Mining
A Comparative Study Of Various Clustering Algorithms In Data Mining
 
F04463437
F04463437F04463437
F04463437
 
Di35605610
Di35605610Di35605610
Di35605610
 
Unsupervised Machine Learning Algorithm K-means-Clustering.pptx
Unsupervised Machine Learning Algorithm K-means-Clustering.pptxUnsupervised Machine Learning Algorithm K-means-Clustering.pptx
Unsupervised Machine Learning Algorithm K-means-Clustering.pptx
 
Premeditated Initial Points for K-Means Clustering
Premeditated Initial Points for K-Means ClusteringPremeditated Initial Points for K-Means Clustering
Premeditated Initial Points for K-Means Clustering
 
G0354451
G0354451G0354451
G0354451
 
Experimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithmsExperimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithms
 
Performance Analysis of Different Clustering Algorithm
Performance Analysis of Different Clustering AlgorithmPerformance Analysis of Different Clustering Algorithm
Performance Analysis of Different Clustering Algorithm
 
F017132529
F017132529F017132529
F017132529
 
Enhancing Keyword Query Results Over Database for Improving User Satisfaction
Enhancing Keyword Query Results Over Database for Improving User Satisfaction Enhancing Keyword Query Results Over Database for Improving User Satisfaction
Enhancing Keyword Query Results Over Database for Improving User Satisfaction
 
Ijetr021251
Ijetr021251Ijetr021251
Ijetr021251
 
Survey on classification algorithms for data mining (comparison and evaluation)
Survey on classification algorithms for data mining (comparison and evaluation)Survey on classification algorithms for data mining (comparison and evaluation)
Survey on classification algorithms for data mining (comparison and evaluation)
 
A Novel Clustering Method for Similarity Measuring in Text Documents
A Novel Clustering Method for Similarity Measuring in Text DocumentsA Novel Clustering Method for Similarity Measuring in Text Documents
A Novel Clustering Method for Similarity Measuring in Text Documents
 

Más de IOSR Journals

Más de IOSR Journals (20)

A011140104
A011140104A011140104
A011140104
 
M0111397100
M0111397100M0111397100
M0111397100
 
L011138596
L011138596L011138596
L011138596
 
K011138084
K011138084K011138084
K011138084
 
J011137479
J011137479J011137479
J011137479
 
I011136673
I011136673I011136673
I011136673
 
G011134454
G011134454G011134454
G011134454
 
H011135565
H011135565H011135565
H011135565
 
F011134043
F011134043F011134043
F011134043
 
E011133639
E011133639E011133639
E011133639
 
D011132635
D011132635D011132635
D011132635
 
C011131925
C011131925C011131925
C011131925
 
B011130918
B011130918B011130918
B011130918
 
A011130108
A011130108A011130108
A011130108
 
I011125160
I011125160I011125160
I011125160
 
H011124050
H011124050H011124050
H011124050
 
G011123539
G011123539G011123539
G011123539
 
F011123134
F011123134F011123134
F011123134
 
E011122530
E011122530E011122530
E011122530
 
D011121524
D011121524D011121524
D011121524
 

Último

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 

Último (20)

GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 

When the two results are compared, k-NN shows a higher percentage of accuracy than K-Means Clustering.

II. Missing Data Mechanism
Missing data can be divided into 'Missing at Random' (MAR), 'Missing Completely at Random' (MCAR) and 'Missing Not at Random' (MNAR).

1. Missing at Random
Missing at Random requires that the cause of missing data be unrelated to the missing values themselves, although it may be related to other observed variables. MAR is also known as the ignorable case (Schafer, 1997) and occurs when cases with missing data differ from the observed cases, but the pattern of missingness is predictable from other observed variables. Put differently, the cause of the missing data is an external influence, not the variable itself.

2. Missing Completely at Random
Missing Completely at Random refers to a condition where the probability of data being missing is unrelated to the values of any variable, whether missing or observed. Under this mechanism, cases with complete data are indistinguishable from cases with incomplete data.

3. Missing Not at Random
Missing Not at Random (MNAR) implies that the missing-data mechanism is related to the missing values themselves. [4]

III. Missing Data Imputation Methods
Imputation methods replace missing values with estimates based on information available in the data set. The options range from naive methods such as mean imputation to more robust methods based on relationships among attributes.

1. Case substitution
This method is typically used in sample surveys. An instance with missing data (for example, a person who cannot be contacted) is replaced by another, non-sampled instance.
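Of the three mechanisms, MCAR is the one usually reproduced when building test data: entries are masked uniformly at random, independent of every value. A minimal sketch in NumPy (the function name and interface are our own, for illustration only):

```python
import numpy as np

def mask_mcar(data, fraction, rng=None):
    """Return a copy of `data` with `fraction` of its entries set to NaN,
    chosen uniformly at random: the MCAR mechanism, where missingness is
    unrelated to any observed or unobserved value."""
    rng = np.random.default_rng(rng)
    masked = data.astype(float).copy()
    n_missing = int(round(fraction * masked.size))
    # pick flat positions without replacement and blank them out
    flat_idx = rng.choice(masked.size, size=n_missing, replace=False)
    masked.flat[flat_idx] = np.nan
    return masked

X = np.arange(20, dtype=float).reshape(4, 5)
X_missing = mask_mcar(X, 0.2, rng=0)
print(np.isnan(X_missing).sum())  # 4 of the 20 entries are masked
```

MAR and MNAR masks would instead condition the masking probability on other observed columns, or on the masked values themselves.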
2. Mean and mode
This method replaces the missing data for a given attribute by the mean (for a quantitative attribute) or mode (for a qualitative attribute) of all known values of that attribute.

3. Hot deck and cold deck
In the hot deck method, a missing attribute value is filled in with a value drawn from an estimated distribution for that value in the current data. Hot deck is typically implemented in two stages: first, the data are partitioned into clusters; second, each instance with missing data is associated with one cluster, and the complete cases in that cluster are used to fill in the missing values, for example by taking the mean or mode of the attribute within the cluster. Cold deck imputation is similar to hot deck, but the data source must be other than the current data set.

4. Prediction model
Prediction models are more sophisticated procedures for handling missing data. They build a predictive model to estimate the values that will substitute for the missing data: the attribute with missing data is used as the class attribute, and the remaining attributes are used as input to the model. An important argument in favour of this approach is that attributes frequently have relationships (correlations) among themselves, and those correlations can be used to build a predictive model for classification or regression (depending on whether the attribute with missing data is nominal or continuous). Some of the relationships among the attributes may be preserved if they are captured by the predictive model.

IV. Imputation with the K-Means Clustering Algorithm
K-Means is an algorithm that groups objects, based on their attributes/features, into K groups, where K is a positive integer.
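Mean and mode substitution take only a few lines each; this is an illustrative sketch (names are our own), assuming NumPy for the numeric case and a plain list with None markers for the categorical case:

```python
import numpy as np
from collections import Counter

def impute_mean(column):
    """Replace NaNs in a numeric column by the mean of the observed values."""
    col = np.asarray(column, dtype=float).copy()
    observed = col[~np.isnan(col)]
    col[np.isnan(col)] = observed.mean()
    return col

def impute_mode(column, missing_token=None):
    """Replace missing entries in a categorical column by the most frequent
    observed value (the mode)."""
    observed = [v for v in column if v is not missing_token]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is missing_token else v for v in column]

print(impute_mean([1.0, np.nan, 3.0]))     # [1. 2. 3.]
print(impute_mode(["a", None, "b", "a"]))  # ['a', 'a', 'b', 'a']
```

Hot deck imputation follows the same pattern, but computes the mean or mode within each cluster rather than over the whole column.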
The grouping is done by minimizing the sum of squared distances between each data point and the corresponding cluster centroid; the purpose of K-Means clustering is thus to classify the data into coherent groups.

V. Imputation with the k-Nearest Neighbour Algorithm
Nearest-neighbour analysis is a method for classifying cases based on their similarity to other cases. In machine learning, it was developed as a way to recognize patterns in data without requiring an exact match to any stored pattern or case. Similar cases lie near each other and dissimilar cases lie far apart; the distance between two cases is therefore a measure of their dissimilarity, and cases that are near each other are said to be "neighbours". When a new case (holdout) is presented, its distance from each case in the model is computed. The classifications of the most similar cases, the nearest neighbours, are tallied, and the new case is assigned to the category containing the greatest number of nearest neighbours.

VI. Framework and Experimental Analysis
The following framework describes the comparison made in this paper; the experimental results show the improvement in performance.

ALGORITHM:
1. Determine the centroid coordinates.
2. Determine the distance of each object to the centroids.
3. Group the objects based on minimum distance (find the closest centroid).
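Combined with mean substitution within each cluster, the three steps above can be sketched as follows. This is a minimal illustration rather than the authors' code: NaNs are provisionally filled with column means, Lloyd's iterations form the clusters, and each missing entry is then overwritten with the matching coordinate of its cluster centroid (the within-cluster mean). All names are our own.

```python
import numpy as np

def kmeans_impute(X, k, n_iter=20, rng=0):
    """K-Means-based imputation sketch: cluster the data, then replace each
    missing entry with the within-cluster mean of its attribute."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # provisional fill so that distances can be computed on every row
    col_means = np.nanmean(X, axis=0)
    filled = np.where(missing, col_means, X)

    rng = np.random.default_rng(rng)
    centroids = filled[rng.choice(len(filled), size=k, replace=False)]
    for _ in range(n_iter):
        # steps 2-3 of the algorithm: distances to centroids, closest wins
        dists = np.linalg.norm(filled[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 1 for the next round: recompute centroid coordinates
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = filled[labels == j].mean(axis=0)

    imputed = filled.copy()
    imputed[missing] = centroids[labels][missing]
    return imputed
```

With two well-separated clusters, a row with a missing coordinate inherits that coordinate from the centroid of whichever cluster its observed values place it in.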
Initially, a test dataset is made by replacing some of the original values with missing values (NaN, Not a Number). The original dataset and the test dataset are then partitioned into clusters (for K-Means) or groups (for k-NN), and the missing values in each group or cluster are filled with the mean value. The test dataset is then compared with the original dataset to find the accuracy of the imputation. This process is repeated for missing percentages of 2, 5, 10, 15 and 20.

Missing Percentage   Mean Substitution   K-Means Clustering   k-Nearest Neighbour
2%                   67                  62                   69
5%                   63                  65                   69
10%                  59                  63                   68
15%                  59                  64                   67
20%                  52                  56                   60
Average              60%                 62%                  67%

The compared results show an improvement in the percentage of accuracy for k-NN.

VII. Conclusion and Future Enhancement
K-Means and k-NN methods provide fast and accurate ways of estimating missing values. k-NN-based imputation in particular provides a robust and sensitive approach to estimating missing data, and is therefore the method recommended here for imputing missing values. The analysis also shows that, whatever the method, accuracy decreases as the missing percentage grows. The proposed method can be extended by comparing further machine learning techniques such as SOM and MLP, and mean substitution can be replaced by the mode, median or standard deviation, or by Expectation-Maximization and regression-based methods.

[Figure: framework flowchart. The original data without missing values is turned into a test dataset with missing values; k-NN divides it into groups while K-Means partitions it into clusters; each branch imputes missing values by mean substitution and compares the result with the original dataset, showing that k-NN performs better in terms of accuracy.]
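The framework above can also be sketched in code. The paper does not spell out its exact grouping procedure or accuracy definition, so the following is one plausible reading, with all names our own: each incomplete row is imputed from the mean of its k nearest complete rows, and accuracy is taken as the percentage of originally-missing entries recovered within a tolerance of the true value.

```python
import numpy as np

def knn_impute(X, k=3):
    """k-NN-based imputation sketch: for each incomplete row, find the k
    nearest complete rows (Euclidean distance over the observed
    coordinates) and fill each missing entry with the neighbours' mean."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    complete = X[~np.isnan(X).any(axis=1)]
    for i, row in enumerate(X):
        miss = np.isnan(row)
        if not miss.any():
            continue
        # distance measured only on the coordinates this row actually has
        d = np.linalg.norm(complete[:, ~miss] - row[~miss], axis=1)
        neighbours = complete[np.argsort(d)[:k]]
        out[i, miss] = neighbours[:, miss].mean(axis=0)
    return out

def imputation_accuracy(original, imputed, missing_mask, tol=0.5):
    """Percentage of originally-missing entries recovered within `tol` of
    the true value (one plausible reading of the paper's accuracy)."""
    err = np.abs(original[missing_mask] - imputed[missing_mask])
    return 100.0 * (err <= tol).mean()
```

Running both `knn_impute` and a K-Means-based imputer over the same masked datasets at 2%, 5%, 10%, 15% and 20% missingness, and scoring each with `imputation_accuracy`, reproduces the shape of the comparison in the table above.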
References

Journal Papers:
[1] J. L. Peugh and C. K. Enders, "Missing data in educational research: A review of reporting practices and suggestions for improvement," Review of Educational Research, vol. 74, pp. 525-556, 2004.
[2] E.-L. Silva-Ramírez, R. Pino-Mejías, M. López-Coello and M.-D. Cubiles-de-la-Vega, "Missing value imputation on missing completely at random data using multilayer perceptrons," Neural Networks, no. 1, 2011.
[3] B. Mehala, P. Ranjit Jeba Thangaiah and K. Vivekanandan, "Selecting scalable algorithms to deal with missing values," International Journal of Recent Trends in Engineering, vol. 1, no. 2, May 2009.
[4] Gustavo E. A. P. A. Batista and Maria Carolina Monard, "A study of k-nearest neighbour as an imputation method."
[5] P. D. Allison, Missing Data. Thousand Oaks, CA: Sage, 2001.
[6] D. A. Bennett, "How can I deal with missing data in my study?" Australian and New Zealand Journal of Public Health, vol. 25, pp. 464-469, 2001.
[7] Kiri Wagstaff, "Clustering with missing values: No imputation required," NSF grant IIS-0325329, pp. 1-10.
[8] Shichao Zhang, Jilian Zhang, Xiaofeng Zhu, Yongsong Qin and Chengqi Zhang, "Missing value imputation based on data clustering," Springer-Verlag, Berlin, Heidelberg, 2008.
[9] Richard J. Hathaway, James C. Bezdek and Jacalyn M. Huband, "Scalable visual assessment of cluster tendency for large data sets," Pattern Recognition, vol. 39, no. 7, pp. 1315-1324, 2006.
[10] Qinbao Song and Martin Shepperd, "A new imputation method for small software project data sets," The Journal of Systems and Software, vol. 80, pp. 51-62, 2007.
[11] Gabriel L. Schlomer, Sheri Bauman and Noel A. Card, "Best practices for missing data management in counseling psychology," Journal of Counseling Psychology, vol. 57, no. 1, pp. 1-10, 2010.
[12] R. Kavitha Kumar and R. M. Chandrasekar, "Missing data imputation in cardiac data set," International Journal on Computer Science and Engineering, vol. 02, no. 05, pp. 1836-1840, 2010.
[13] Jinhai Ma, Noori Akhtar-Danesh, Lisa Dolovich and Lehana Thabane, "Imputation strategies for missing binary outcomes in cluster randomized trials," BMC Medical Research Methodology, vol. 11, p. 18, 2011.
[14] R. S. Somasundaram and R. Nedunchezhian, "Evaluation of three simple imputation methods for enhancing preprocessing of data with missing values," International Journal of Computer Applications (0975-8887), vol. 21, no. 10, pp. 14-19, May 2011.
[15] K. Raja, G. Tholkappia Arasu and Chitra S. Nair, "Imputation framework for missing value," International Journal of Computer Trends and Technology, vol. 3, no. 2, 2012.
[16] Bob L. Wall and Jeff K. Elser, "Imputation of missing data for input to support vector machines."

Theses:
[17] Benjamin M. Marlin, Missing Data Problems in Machine Learning, 2008.