A Fast Clustering-Based Feature Subset Selection Algorithm
for High-Dimensional Data
ABSTRACT:
Feature selection involves identifying a subset of the most useful features that
produces results compatible with those of the original entire set of features. A feature
selection algorithm may be evaluated from both the efficiency and effectiveness
points of view. While the efficiency concerns the time required to find a subset of
features, the effectiveness is related to the quality of the subset of features. Based
on these criteria, a fast clustering-based feature selection algorithm (FAST) is
proposed and experimentally evaluated in this paper. The FAST algorithm works
in two steps. In the first step, features are divided into clusters by using graph-
theoretic clustering methods. In the second step, the most representative feature
that is strongly related to target classes is selected from each cluster to form a
subset of features. Because features in different clusters are relatively independent, the
clustering-based strategy of FAST has a high probability of producing a subset of
useful and independent features. To ensure the efficiency of FAST, we adopt the
efficient minimum-spanning tree (MST) clustering method. The efficiency and
effectiveness of the FAST algorithm are evaluated through an empirical study.
Extensive experiments are carried out to compare FAST and several representative
feature selection algorithms, namely, FCBF, ReliefF, CFS, Consist, and FOCUS-SF,
with respect to four types of well-known classifiers, namely, the
probability-based Naive Bayes, the tree-based C4.5, the instance-based IB1, and the
rule-based RIPPER, before and after feature selection. The results, on 35 publicly
available real-world high-dimensional image, microarray, and text data sets,
demonstrate that FAST not only produces smaller subsets of features but also
improves the performance of the four types of classifiers.
EXISTING SYSTEM:
Feature subset selection methods are commonly grouped into four categories:
embedded, wrapper, filter, and hybrid. The embedded methods incorporate feature
selection as a part of the training process and are usually specific to given
learning algorithms, and therefore may be more efficient than the other three
categories. Traditional machine learning algorithms like decision trees or
artificial neural networks are examples of embedded approaches. The wrapper
methods use the predictive accuracy of a predetermined learning algorithm to
determine the goodness of the selected subsets, so the accuracy of the learning
algorithms is usually high. However, the generality of the selected features is
limited and the computational complexity is large. The filter methods are
independent of learning algorithms and have good generality. Their computational
complexity is low, but the accuracy of the learning algorithms is not guaranteed.
The hybrid methods combine filter and wrapper methods, using a filter method to
reduce the search space that will be considered by the subsequent wrapper. They
mainly aim to achieve the best possible performance with a particular learning
algorithm at a time complexity similar to that of the filter methods.
DISADVANTAGES OF EXISTING SYSTEM:
Wrapper methods: the generality of the selected features is limited and the
computational complexity is large.
Filter methods: the computational complexity is low, but the accuracy of the
learning algorithms is not guaranteed.
Hybrid methods: they combine filter and wrapper methods, using a filter
method to reduce the search space considered by the subsequent wrapper, so
their performance is tuned to a particular learning algorithm.
PROPOSED SYSTEM
Feature subset selection can be viewed as the process of identifying and removing
as many irrelevant and redundant features as possible. This is because irrelevant
features do not contribute to the predictive accuracy, and redundant features do not
help to obtain a better predictor because they provide mostly information
that is already present in other features. Of the many feature subset selection
algorithms, some can effectively eliminate irrelevant features but fail to handle
redundant features, while others can eliminate the irrelevant features while also
taking care of the redundant ones. Our proposed FAST algorithm falls into the second
group. Traditionally, feature subset selection research has focused on searching for
relevant features. A well-known example is Relief, which weighs each feature
according to its ability to discriminate instances under different targets based on
a distance-based criterion function. However, Relief is ineffective at removing
redundant features, as two predictive but highly correlated features are likely both
to be highly weighted. ReliefF extends Relief, enabling this method to work with
noisy and incomplete data sets and to deal with multiclass problems, but it still
cannot identify redundant features.
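
The point about Relief and redundancy can be illustrated with a small sketch. The code below is a simplified, deterministic variant of Relief written for illustration only: the original method samples instances at random, whereas this version visits every instance once. The function name relief_weights and the toy data are hypothetical and do not come from the paper.

```python
def relief_weights(X, y):
    """Relief-style weights: reward features that differ on the nearest
    other-class instance (miss) and agree on the nearest same-class one (hit)."""
    n, m = len(X), len(X[0])
    # Range of each feature, used to scale differences into [0, 1].
    span = [(max(r[f] for r in X) - min(r[f] for r in X)) or 1.0 for f in range(m)]

    def diff(f, a, b):
        return abs(a[f] - b[f]) / span[f]

    def dist(a, b):
        return sum(diff(f, a, b) for f in range(m))

    w = [0.0] * m
    for i in range(n):
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        if not hits or not misses:
            continue  # no same-class or other-class neighbour available
        hit = min(hits, key=lambda j: dist(X[i], X[j]))
        miss = min(misses, key=lambda j: dist(X[i], X[j]))
        for f in range(m):
            w[f] += (diff(f, X[i], X[miss]) - diff(f, X[i], X[hit])) / n
    return w


# Toy data (hypothetical): feature 1 is an exact copy of feature 0,
# feature 2 is unrelated to the class, and the class equals feature 0.
X = [[0, 0, 0], [0, 0, 1], [1, 1, 0], [1, 1, 1]]
y = [0, 0, 1, 1]
weights = relief_weights(X, y)
```

On this toy data the duplicated feature receives exactly the same weight as the feature it copies, and both outrank the unrelated feature, so a ranking by weight alone would keep the redundant pair.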
ADVANTAGES OF PROPOSED SYSTEM:
Good feature subsets contain features highly correlated with (predictive of)
the class, yet uncorrelated with (not predictive of) each other.
FAST efficiently and effectively deals with both irrelevant and redundant
features, and obtains a good feature subset.
Generally, all six algorithms achieve a significant reduction of
dimensionality by selecting only a small portion of the original features.
The null hypothesis of the Friedman test is that all the feature selection
algorithms are equivalent in terms of runtime.
MODULES:
 Distributional clustering
 Subset Selection Algorithm
 Time complexity
 Microarray data
 Data Resource
 Irrelevant feature
MODULE DESCRIPTION
1. Distributional clustering
Distributional clustering has been used to cluster words into groups based
either on their participation in particular grammatical relations with other words, by
Pereira et al., or on the distribution of class labels associated with each word, by
Baker and McCallum. Because distributional clustering of words is agglomerative in
nature and results in suboptimal word clusters and high computational cost,
a new information-theoretic divisive algorithm for word clustering was later proposed
and applied to text classification. Another approach clusters features using a special
distance metric and then uses the resulting cluster hierarchy to choose
the most relevant attributes. Unfortunately, the cluster evaluation measure based on
distance does not identify a feature subset that allows the classifiers to improve
their original performance accuracy. Furthermore, even compared with other
feature selection methods, the obtained accuracy is lower.
2. Subset Selection Algorithm
Irrelevant features, along with redundant features, severely affect the accuracy
of the learning machines. Thus, feature subset selection should be able to identify
and remove as much of the irrelevant and redundant information as possible.
Moreover, “good feature subsets contain features highly correlated with (predictive
of) the class, yet uncorrelated with (not predictive of) each other.” Keeping these in
mind, we develop a novel algorithm that can efficiently and effectively deal with
both irrelevant and redundant features, and obtain a good feature subset.
3. Time complexity
The major amount of work for Algorithm 1 involves the computation of symmetric
uncertainty (SU) values for T-Relevance and F-Correlation, which has linear
complexity in terms of the number of instances in a given data set. The first part
of the algorithm has a linear time complexity in terms of the number of features m.
Assume k features are selected as relevant ones in the first part; when k = 1, only
one feature is selected.
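
As a minimal sketch of the SU computation mentioned above, the code below uses the standard definition of symmetric uncertainty for discrete variables, SU(X, Y) = 2 [H(X) + H(Y) - H(X, Y)] / [H(X) + H(Y)]. The function names are hypothetical. Reading T-Relevance as SU between a feature and the class, and F-Correlation as SU between two features, is an assumption about how the terms are used here. Each call makes a constant number of counting passes over the instances, which is consistent with the linear complexity stated above.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy, in bits, of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * [H(X) + H(Y) - H(X, Y)] / [H(X) + H(Y)], in [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0  # both variables are constant; treated as uninformative
    hxy = entropy(list(zip(x, y)))  # joint entropy over value pairs
    return 2.0 * (hx + hy - hxy) / (hx + hy)


# Hypothetical toy columns: `same` mirrors the class, `indep` is independent of it.
target = [0, 0, 1, 1]
same = [0, 0, 1, 1]
indep = [0, 1, 0, 1]
su_same = symmetric_uncertainty(same, target)
su_indep = symmetric_uncertainty(indep, target)
```

Continuous-valued features would need to be discretized before being passed to these functions.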
4. Microarray data
For microarray data, each of the six algorithms improves on the proportion of
selected features it achieves over the given data sets as a whole. This indicates
that the six algorithms work well with microarray data. FAST again ranks first, with
a proportion of selected features of 0.71 percent. Of the six algorithms, only CFS
cannot choose features for two data sets, whose dimensionalities are 19,994 and
49,152, respectively.
5. Data Resource
For the purposes of evaluating the performance and effectiveness of our proposed
FAST algorithm, verifying whether or not the method is potentially useful in
practice, and allowing other researchers to confirm our results, 35 publicly
available data sets were used. The numbers of features of the 35 data sets vary
from 37 to 49,152, with a mean of 7,874. The dimensionalities of 54.3 percent of the
data sets exceed 5,000, and 28.6 percent of the data sets have more than 10,000
features. The 35 data sets cover a range of application domains such as text, image,
and bio microarray data classification.
Note that for the data sets with continuous-valued features, the well-known
off-the-shelf MDL method was used to discretize the continuous values.
6. Irrelevant feature
Irrelevant feature removal is straightforward once the right relevance measure
is defined or selected, while redundant feature elimination is somewhat more
sophisticated. In our proposed FAST algorithm, it involves (1) the construction of
the minimum spanning tree from a weighted complete graph; (2) the partitioning of
the MST into a forest, with each tree representing a cluster; and (3) the selection of
representative features from the clusters.
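
The three steps above can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' implementation. It takes precomputed relevance scores (feature versus class) and a symmetric matrix of pairwise feature correlations as inputs. The names prim_mst and fast_select, the threshold parameter, and the toy numbers are all hypothetical. The edge-removal rule (drop an MST edge when its weight is smaller than the relevance of both endpoints) is not stated in this summary; it is assumed here from one reading of the published FAST paper.

```python
def prim_mst(weights):
    """Prim's algorithm on a complete graph given as a symmetric weight matrix.
    Returns the MST as a list of (i, j, weight) edges."""
    n = len(weights)
    in_tree = [False] * n
    best = [float("inf")] * n  # cheapest known edge into each vertex
    parent = [-1] * n
    if n:
        best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: best[v])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, weights[parent[u]][u]))
        for v in range(n):
            if not in_tree[v] and weights[u][v] < best[v]:
                best[v], parent[v] = weights[u][v], u
    return edges


def fast_select(relevance, corr, threshold=0.0):
    """Return indices of the representative features, one per cluster."""
    # Irrelevant feature removal: keep features whose relevance exceeds the threshold.
    kept = [i for i, r in enumerate(relevance) if r > threshold]
    # Step 1: MST of the complete graph over the kept features.
    sub = [[corr[a][b] for b in kept] for a in kept]
    mst = prim_mst(sub)
    # Step 2: partition the MST into a forest (union-find over surviving edges).
    root = list(range(len(kept)))

    def find(a):
        while root[a] != a:
            root[a] = root[root[a]]
            a = root[a]
        return a

    for i, j, w in mst:
        # Assumed rule: an edge is removed when it is weaker than both relevances.
        removed = w < relevance[kept[i]] and w < relevance[kept[j]]
        if not removed:
            root[find(i)] = find(j)
    # Step 3: pick the most relevant feature of each tree.
    best = {}
    for i in range(len(kept)):
        c = find(i)
        if c not in best or relevance[kept[i]] > relevance[kept[best[c]]]:
            best[c] = i
    return sorted(kept[i] for i in best.values())


# Hypothetical scores: feature 3 is irrelevant; feature 1 is tied to feature 0
# more strongly (0.6) than it is to the class (0.3).
relevance = [0.8, 0.3, 0.5, 0.0]
corr = [
    [1.0, 0.6, 0.1, 0.0],
    [0.6, 1.0, 0.7, 0.0],
    [0.1, 0.7, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
selected = fast_select(relevance, corr)
```

With these toy scores, features 0 and 1 end up in the same tree and feature 0 represents them, feature 2 forms its own tree, and feature 3 is dropped in the relevance filter.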
SYSTEM FLOW:
Data set
Irrelevant feature removal
Minimum spanning tree construction
Tree partition & representative feature selection
Selected feature
SYSTEM CONFIGURATION:-
HARDWARE CONFIGURATION:-
 Processor - Pentium IV
 Speed - 1.1 GHz
 RAM - 256 MB(min)
 Hard Disk - 20 GB
 Key Board - Standard Windows Keyboard
 Mouse - Two or Three Button Mouse
 Monitor - SVGA
SOFTWARE CONFIGURATION:-
 Operating System : Windows XP
 Programming Language : JAVA
 Java Version : JDK 1.6 & above.
REFERENCE:
Qinbao Song, Jingjie Ni, and Guangtao Wang, “A Fast Clustering-Based Feature
Subset Selection Algorithm for High-Dimensional Data”, IEEE
TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 25,
NO. 1, JANUARY 2013.

On Feature Selection Algorithms and Feature Selection Stability Measures : A ...On Feature Selection Algorithms and Feature Selection Stability Measures : A ...
On Feature Selection Algorithms and Feature Selection Stability Measures : A ...
 
ON FEATURE SELECTION ALGORITHMS AND FEATURE SELECTION STABILITY MEASURES: A C...
ON FEATURE SELECTION ALGORITHMS AND FEATURE SELECTION STABILITY MEASURES: A C...ON FEATURE SELECTION ALGORITHMS AND FEATURE SELECTION STABILITY MEASURES: A C...
ON FEATURE SELECTION ALGORITHMS AND FEATURE SELECTION STABILITY MEASURES: A C...
 

A fast clustering based feature subset selection algorithm for high-dimensional data

ABSTRACT:
Feature selection involves identifying a subset of the most useful features that produces results compatible with those of the original, complete set of features. A feature selection algorithm may be evaluated from both the efficiency and the effectiveness points of view. Efficiency concerns the time required to find a subset of features, while effectiveness relates to the quality of that subset. Based on these criteria, a fast clustering-based feature selection algorithm (FAST) is proposed and experimentally evaluated in this paper. The FAST algorithm works in two steps. In the first step, features are divided into clusters using graph-theoretic clustering methods. In the second step, the most representative feature, the one most strongly related to the target classes, is selected from each cluster to form a subset of features. Because features in different clusters are relatively independent, the clustering-based strategy of FAST has a high probability of producing a subset of useful and independent features. To ensure the efficiency of FAST, we adopt the efficient minimum spanning tree (MST) clustering method. The efficiency and effectiveness of the FAST algorithm are evaluated through an empirical study. Extensive experiments are carried out to compare FAST with several representative feature selection algorithms, namely FCBF, ReliefF, CFS, Consist, and FOCUS-SF, with respect to four types of well-known classifiers, namely the probability-based Naive Bayes, the tree-based C4.5, the instance-based IB1, and the rule-based RIPPER, before and after feature selection. The results, on 35 publicly available real-world high-dimensional image, microarray, and text data sets, demonstrate that FAST not only produces smaller subsets of features but also improves the performance of the four types of classifiers.

EXISTING SYSTEM:
The embedded methods incorporate feature selection as a part of the training process and are usually specific to given learning algorithms, and therefore may be more efficient than the other three categories. Traditional machine learning algorithms such as decision trees or artificial neural networks are examples of embedded approaches.
The wrapper methods use the predictive accuracy of a predetermined learning algorithm to determine the goodness of the selected subsets, so the accuracy of the learning algorithms is usually high. However, the generality of the selected features is limited and the computational complexity is large.
The filter methods are independent of learning algorithms and have good generality. Their computational complexity is low, but the accuracy of the learning algorithms is not guaranteed.
The hybrid methods combine filter and wrapper methods, using a filter method to reduce the search space that will be considered by the subsequent wrapper. They mainly focus on combining filter and wrapper methods to achieve the best possible performance with a particular learning algorithm at a time complexity similar to that of the filter methods.

DISADVANTAGES OF EXISTING SYSTEM:
• Wrapper methods: the generality of the selected features is limited and the computational complexity is large.
• Filter methods: the computational complexity is low, but the accuracy of the learning algorithms is not guaranteed.
• Hybrid methods: they combine filter and wrapper methods, using a filter method to reduce the search space that will be considered by the subsequent wrapper.

PROPOSED SYSTEM:
Feature subset selection can be viewed as the process of identifying and removing as many irrelevant and redundant features as possible. This is because irrelevant features do not contribute to predictive accuracy, and redundant features do not help produce a better predictor because they mostly provide information that is already present in other features. Of the many feature subset selection algorithms, some can effectively eliminate irrelevant features but fail to handle redundant features, while others can eliminate the irrelevant features and also take care of the redundant ones. Our proposed FAST algorithm falls into the second group.
Traditionally, feature subset selection research has focused on searching for relevant features. A well-known example is Relief, which weighs each feature according to its ability to discriminate instances under different targets based on a distance-based criterion function. However, Relief is ineffective at removing redundant features, as two predictive but highly correlated features are both likely to be highly weighted. Relief-F extends Relief, enabling the method to work with noisy and incomplete data sets and to deal with multiclass problems, but it still cannot identify redundant features.

ADVANTAGES OF PROPOSED SYSTEM:
• Good feature subsets contain features highly correlated with (predictive of) the class, yet uncorrelated with (not predictive of) each other.
• FAST deals efficiently and effectively with both irrelevant and redundant features, and obtains a good feature subset.
• Generally, all six algorithms achieve a significant reduction of dimensionality by selecting only a small portion of the original features.
• The null hypothesis of the Friedman test is that all the feature selection algorithms are equivalent in terms of runtime.
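The idea of a feature being "correlated with" the class, or with another feature, can be made concrete with symmetric uncertainty (SU), the measure this document refers to when it mentions SU values for T-Relevance and F-Correlation. The following is a minimal Java sketch, assuming that features and class labels have already been discretized into integer codes. The class and method names are illustrative and do not come from the original project, and returning 0 when both variables are constant is a convention chosen here.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch: symmetric uncertainty between two discrete variables. */
final class SymmetricUncertaintySketch {

    /** Shannon entropy (in bits) of the empirical distribution behind a count table. */
    private static double entropyOfCounts(Map<?, Integer> counts, int total) {
        double h = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / total;
            h -= p * Math.log(p) / Math.log(2.0);
        }
        return h;
    }

    /** H(X) for a variable given as integer codes. */
    static double entropy(int[] x) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int v : x) {
            counts.merge(v, 1, Integer::sum);
        }
        return entropyOfCounts(counts, x.length);
    }

    /** H(X, Y), counting each observed (x, y) pair. */
    static double jointEntropy(int[] x, int[] y) {
        Map<Long, Integer> counts = new HashMap<>();
        for (int i = 0; i < x.length; i++) {
            // Pack the pair into one long so it can serve as a map key.
            long key = ((long) x[i] << 32) | (y[i] & 0xffffffffL);
            counts.merge(key, 1, Integer::sum);
        }
        return entropyOfCounts(counts, x.length);
    }

    /** SU(X, Y) = 2 * (H(X) + H(Y) - H(X, Y)) / (H(X) + H(Y)), a value in [0, 1]. */
    static double symmetricUncertainty(int[] x, int[] y) {
        if (x.length == 0 || x.length != y.length) {
            throw new IllegalArgumentException("x and y must be non-empty and of equal length");
        }
        double hx = entropy(x);
        double hy = entropy(y);
        if (hx + hy == 0.0) {
            return 0.0; // both variables are constant (convention assumed for this sketch)
        }
        double gain = hx + hy - jointEntropy(x, y); // information gain (mutual information)
        return 2.0 * gain / (hx + hy);
    }

    public static void main(String[] args) {
        int[] feature = {0, 0, 1, 1};
        int[] copy = {0, 0, 1, 1};
        int[] unrelated = {0, 1, 0, 1};
        System.out.println("SU(feature, copy)      = " + symmetricUncertainty(feature, copy));
        System.out.println("SU(feature, unrelated) = " + symmetricUncertainty(feature, unrelated));
    }
}
```

Calling `symmetricUncertainty(feature, classLabels)` gives a relevance score for one feature, and calling it on two feature columns gives their pairwise correlation, so the same routine serves both purposes.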
MODULES:
• Distributional clustering
• Subset Selection Algorithm
• Time complexity
• Microarray data
• Data Resource
• Irrelevant feature

MODULE DESCRIPTION:

1. Distributional clustering
Distributional clustering has been used to cluster words into groups based either on their participation in particular grammatical relations with other words, by Pereira et al., or on the distribution of class labels associated with each word, by Baker and McCallum. Because distributional clustering of words is agglomerative in nature and results in suboptimal word clusters and high computational cost, a new information-theoretic divisive algorithm for word clustering was proposed and applied to text classification. Another proposal clusters features using a special distance metric and then uses the resulting cluster hierarchy to choose the most relevant attributes. Unfortunately, the cluster evaluation measure based on distance does not identify a feature subset that allows the classifiers to improve their original performance accuracy. Furthermore, even compared with other feature selection methods, the obtained accuracy is lower.

2. Subset Selection Algorithm
Irrelevant features, along with redundant features, severely affect the accuracy of the learning machines. Thus, feature subset selection should be able to identify and remove as much of the irrelevant and redundant information as possible. Moreover, “good feature subsets contain features highly correlated with (predictive of) the class, yet uncorrelated with (not predictive of) each other.” Keeping these in mind, we develop a novel algorithm which can efficiently and effectively deal with both irrelevant and redundant features, and obtain a good feature subset.

3. Time complexity
The major amount of work for Algorithm 1 involves the computation of SU values for T-Relevance and F-Correlation, which has linear complexity in terms of the number of instances in a given data set. The first part of the algorithm has a linear time complexity in terms of the number of features m. Assuming k features are selected as relevant ones in the first part, when k = 1 only one feature is selected.

4. Microarray data
The proportion of selected features has been improved by each of the six algorithms compared with that on the given data sets. This indicates that the six algorithms work well with microarray data. FAST again ranks first, with a proportion of selected features of 0.71 percent. Of the six algorithms, only CFS cannot choose features for two data sets, whose dimensionalities are 19,994 and 49,152, respectively.

5. Data Resource
For the purposes of evaluating the performance and effectiveness of our proposed FAST algorithm, verifying whether or not the method is potentially useful in practice, and allowing other researchers to confirm our results, 35 publicly available data sets were used. The numbers of features of the 35 data sets vary from 37 to 49,152, with a mean of 7,874. The dimensionalities of 54.3 percent of the data sets exceed 5,000, of which 28.6 percent of the data sets have more than 10,000 features. The 35 data sets cover a range of application domains such as text, image, and bio microarray data classification. Note that for the data sets with continuous-valued features, the well-known off-the-shelf MDL method was used to discretize the continuous values.

6. Irrelevant feature
Irrelevant feature removal is straightforward once the right relevance measure is defined or selected, while redundant feature elimination is somewhat more sophisticated. In our proposed FAST algorithm, it involves (1) the construction of the minimum spanning tree from a weighted complete graph; (2) the partitioning of the MST into a forest, with each tree representing a cluster; and (3) the selection of representative features from the clusters.
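The three steps of redundant feature elimination can be sketched in Java as follows. The text does not say how the edges of the complete graph are weighted or which MST edges are cut, so both choices here are assumptions made for illustration: each edge weight is taken as 1 minus the F-Correlation of its endpoints (so strongly correlated features end up adjacent in the tree), and an edge is cut when its F-Correlation is smaller than the T-Relevance of both endpoints. The representative of each cluster is assumed to be the feature with the highest T-Relevance. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Illustrative sketch: MST construction, partitioning, and representative selection. */
final class MstClusteringSketch {

    /** Prim's algorithm on the complete graph with weight 1 - fCorr[u][v]; returns each node's tree parent. */
    static int[] primMst(double[][] fCorr) {
        int n = fCorr.length;
        int[] parent = new int[n];
        double[] best = new double[n];
        boolean[] inTree = new boolean[n];
        Arrays.fill(parent, -1);
        Arrays.fill(best, Double.POSITIVE_INFINITY);
        if (n > 0) {
            best[0] = 0.0; // start the tree at feature 0
        }
        for (int step = 0; step < n; step++) {
            int u = -1;
            for (int v = 0; v < n; v++) {
                if (!inTree[v] && (u == -1 || best[v] < best[u])) {
                    u = v;
                }
            }
            inTree[u] = true;
            for (int v = 0; v < n; v++) {
                double w = 1.0 - fCorr[u][v]; // assumed edge weight
                if (!inTree[v] && w < best[v]) {
                    best[v] = w;
                    parent[v] = u;
                }
            }
        }
        return parent;
    }

    /** Union-find lookup with path halving. */
    private static int find(int[] root, int i) {
        while (root[i] != i) {
            root[i] = root[root[i]];
            i = root[i];
        }
        return i;
    }

    /** Returns the sorted indices of one representative feature per cluster. */
    static List<Integer> selectRepresentatives(double[][] fCorr, double[] tRel) {
        int n = tRel.length;
        if (fCorr.length != n) {
            throw new IllegalArgumentException("fCorr must be an n-by-n matrix matching tRel");
        }
        int[] parent = primMst(fCorr);
        int[] root = new int[n];
        for (int i = 0; i < n; i++) {
            root[i] = i;
        }
        // Step 2: keep an MST edge unless the assumed cut rule removes it.
        for (int v = 0; v < n; v++) {
            int u = parent[v];
            if (u < 0) {
                continue;
            }
            boolean cut = fCorr[u][v] < tRel[u] && fCorr[u][v] < tRel[v];
            if (!cut) {
                root[find(root, u)] = find(root, v);
            }
        }
        // Step 3: in each remaining tree, keep the feature with the highest T-Relevance.
        int[] bestOf = new int[n];
        Arrays.fill(bestOf, -1);
        for (int i = 0; i < n; i++) {
            int r = find(root, i);
            if (bestOf[r] == -1 || tRel[i] > tRel[bestOf[r]]) {
                bestOf[r] = i;
            }
        }
        List<Integer> representatives = new ArrayList<>();
        for (int r = 0; r < n; r++) {
            if (bestOf[r] != -1) {
                representatives.add(bestOf[r]);
            }
        }
        Collections.sort(representatives);
        return representatives;
    }

    public static void main(String[] args) {
        double[][] fCorr = {
            {1.0, 0.9, 0.1, 0.1},
            {0.9, 1.0, 0.1, 0.1},
            {0.1, 0.1, 1.0, 0.8},
            {0.1, 0.1, 0.8, 1.0}
        };
        double[] tRel = {0.5, 0.6, 0.4, 0.3};
        System.out.println("Representative features: " + selectRepresentatives(fCorr, tRel));
    }
}
```

This version of Prim's algorithm scans all vertices at every step, which is a reasonable fit for a complete graph where every pair of features is connected; a heap-based variant would not reduce the work here because the number of edges is already quadratic in the number of features.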
SYSTEM FLOW:
Data set → Irrelevant feature removal → Minimum spanning tree construction → Tree partition and representative feature selection → Selected features
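The first stage of this flow, irrelevant feature removal, can be illustrated with a short Java sketch. It assumes that a relevance score (for example the T-Relevance of each feature with respect to the class) has already been computed, and that a feature counts as relevant when its score exceeds a threshold. The threshold parameter `theta` and all names are assumptions made for illustration; the text does not specify how the cutoff is chosen.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch: drop features whose relevance score does not exceed a threshold. */
final class IrrelevantFeatureFilterSketch {

    /** Returns the indices of features whose relevance is strictly greater than theta. */
    static List<Integer> relevantFeatures(double[] tRelevance, double theta) {
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < tRelevance.length; i++) {
            if (tRelevance[i] > theta) {
                kept.add(i);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        double[] tRelevance = {0.5, 0.01, 0.2, 0.0};
        System.out.println("Relevant feature indices: " + relevantFeatures(tRelevance, 0.05));
    }
}
```

Only the features returned by this step would be passed on to the spanning tree construction, which keeps the pairwise correlation work proportional to the square of the number of relevant features rather than of all features.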
SYSTEM CONFIGURATION:

HARDWARE CONFIGURATION:
• Processor: Pentium IV
• Speed: 1.1 GHz
• RAM: 256 MB (min)
• Hard Disk: 20 GB
• Keyboard: Standard Windows Keyboard
• Mouse: Two or Three Button Mouse
• Monitor: SVGA

SOFTWARE CONFIGURATION:
• Operating System: Windows XP
• Programming Language: Java
• Java Version: JDK 1.6 and above

REFERENCE:
Qinbao Song, Jingjie Ni, and Guangtao Wang, “A Fast Clustering-Based Feature Subset Selection Algorithm for High-Dimensional Data”, IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 1, January 2013.