American Journal of Applied Sciences 9 (1): 127-131, 2012
ISSN 1546-9239
© 2012 Science Publications

                A Gene Selection Algorithm using Bayesian Classification Approach

                           1,2Alok Sharma and 2Kuldip K. Paliwal
     1School of Engineering and Physics, Faculty of Science, Technology and Environment,
                             University of the South Pacific, Fiji
                      2Signal Processing Lab, School of Engineering,
      Faculty of Engineering and Information Technology, Griffith University, Australia
     Corresponding Author: Alok Sharma, School of Engineering and Physics, Faculty of
        Science, Technology and Environment, University of the South Pacific, Fiji

       Abstract: In this study, we propose a new feature (or gene) selection algorithm using the Bayes
       classification approach. The algorithm can find a gene subset that is crucial for the cancer
       classification problem. Problem statement: Gene identification plays an important role in the human
       cancer classification problem. Several feature selection algorithms have been proposed for analyzing
       and understanding influential genes using gene expression profiles. Approach: The feature selection
       algorithms aim to discover genes that are crucial for accurate cancer classification and that also
       carry biological significance. However, the performance of these algorithms is still limited. In this
       study, we propose a feature selection algorithm using the Bayesian classification approach.
       Results: This approach gives promising results on gene expression datasets and compares favorably
       with several other existing techniques. Conclusion: The proposed gene selection algorithm using the
       Bayes classification approach is shown to find important genes that can provide high classification
       accuracy on DNA microarray gene expression datasets.

       Key words: Bayesian classifier, classification accuracy, feature selection, existing techniques,
                  Bayesian classification, selection algorithms, biological significance, still limited, tissue
                  samples

                     INTRODUCTION

     The classification of tissue samples into one of several classes or subclasses using their gene
expression profiles is an important task and has attracted widespread attention (Sharma and Paliwal,
2008). The gene expression profiles measured through DNA microarray technology provide accurate,
reliable and objective cancer classification. It is also possible to uncover cancer subclasses that are
related to the efficacy of anti-cancer drugs and that are hard to predict by pathological tests. Feature
selection algorithms are considered an important way of identifying crucial genes. Various feature
selection algorithms have been proposed in the literature, each with advantages and disadvantages
(Sharma et al., 2011b; Tan and Gilbert, 2003; Cong et al., 2005; Golub et al., 1999; Wang et al., 2005;
Li and Wong, 2003; Thi et al., 2008; Yan and Zheng, 2007; Sharma et al., 2011a). These methods select
important genes using some objective function. The selected genes are expected to have biological
significance and should provide high classification accuracy. However, on many microarray datasets the
performance is still limited and hence improvements are needed.
     In this study, we propose a feature selection algorithm using the Bayesian classification approach.
The proposed scheme begins with an empty feature subset and includes the feature that provides the
maximum information to the current subset. The process of including features is terminated when no
feature can add information to the current subset. The Bayes classifier is used to judge the merit of
features; it is considered to be the optimum classifier. However, the Bayes classifier under a normal
distribution can suffer from the inverse operation on the sample covariance matrix when training samples
are scarce. This problem can be resolved by regularization techniques or by pseudo-inversing the
covariance matrix. The proposed algorithm is evaluated on several publicly available microarray datasets
and promising results have been obtained when compared with other feature selection algorithms.

Proposed strategy: The purpose of the algorithm is to select a subset of features s = {s1, s2,…,sm}
from the original feature set f = {f1, f2,…,fd}, where d is the
dimension of the feature vectors and m < d is the number of selected features. A feature fk is included
in the subset s if, for this fk, the subset s gives the highest classification accuracy (or the lowest
misclassification error). Let χ = {x1, x2,…,xn} be the training sample set, where each xi is a
d-dimensional vector. Let x̂i ∈ ℝ^m be the corresponding vector having its features defined by subset s.
Let Ω = {ωj; j = 1, 2,…,c} be the finite set of c classes and χ̂j be the set of m-dimensional training
vectors x̂i of class ωj. The Bayesian classification procedure is described as follows. According to the
Bayes rule, the a posteriori probability P(ωj | x̂) can be evaluated using the class-conditional density
function p(x̂ | ωj) and the a priori probability P(ωj). If we assume that the parametric distribution is
normal, then the a posteriori probability can be defined as Eq. 1:

    P(\omega_j \mid \hat{x}) = \frac{1}{(2\pi)^{m/2} |\hat{\Sigma}_j|^{1/2}}
    \exp\left( -\frac{1}{2} (\hat{x} - \hat{\mu}_j)^T \hat{\Sigma}_j^{+} (\hat{x} - \hat{\mu}_j) \right)
    P(\omega_j)                                                                               (1)

where µ̂j is the centroid and Σ̂j the covariance matrix computed from χ̂j, and Σ̂j⁺ is the pseudo-inverse
of Σ̂j (which is applied when Σ̂j is a singular matrix). If m < n then Σ̂j will be a non-singular matrix
and therefore the conventional Σ̂j⁻¹ can be used in Eq. 1 instead of Σ̂j⁺.
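
     To make Eq. 1 concrete, a minimal sketch of the decision rule is given below. The paper's own code
was written in Matlab; this Python/NumPy translation, including all function names, is illustrative
only. The pseudo-inverse stands in for the inverse when Σ̂j is singular; since |Σ̂j| is then zero, the
sketch substitutes the pseudo-determinant, which is an assumption beyond the paper's text:

```python
import numpy as np

def fit_class_params(X_tr, y_tr):
    """Estimate, per class j: centroid mu_j, pseudo-inverse of the
    covariance Sigma_j, its (pseudo-)log-determinant and the log-prior."""
    params = {}
    n = len(y_tr)
    for j in np.unique(y_tr):
        Xj = X_tr[y_tr == j]
        mu = Xj.mean(axis=0)
        sigma = np.atleast_2d(np.cov(Xj, rowvar=False))
        # Pseudo-inverse replaces the conventional inverse when sigma is
        # singular (more selected genes than training samples).
        sigma_pinv = np.linalg.pinv(sigma)
        # Pseudo-determinant (product of non-negligible eigenvalues); an
        # assumption here, since Eq. 1's |sigma| is zero for singular sigma.
        eig = np.linalg.eigvalsh(sigma)
        logdet = np.sum(np.log(eig[eig > 1e-10]))
        params[j] = (mu, sigma_pinv, logdet, np.log(len(Xj) / n))
    return params

def bayes_classify(x, params):
    """Assign x to the class maximising the log of Eq. 1; the common
    (2*pi)^(m/2) factor is dropped as it does not affect the arg max."""
    def log_posterior(j):
        mu, sigma_pinv, logdet, log_prior = params[j]
        d = x - mu
        return -0.5 * logdet - 0.5 * d @ sigma_pinv @ d + log_prior
    return max(params, key=log_posterior)
```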
     The training set χ can be partitioned into a smaller portion of training set χtr and a validation
set χval. The set χtr can be used to estimate the parameters of Eq. 1 (i.e., µ̂j and Σ̂j) and the set
χval can be used to compute the classification accuracy (or misclassification error) for the feature
vectors defined by the subset s. The procedure for finding the feature subset is described in the
following algorithm:

Algorithm:

Step 0: Define the feature set f = {f1, f2,…,fd} and initialize s = {} as the empty set.
Step 1: Given the training feature vectors χ, partition them randomly into two segments χtr and χval
        using a partitioning ratio r (we allocate approximately 60% of the samples to χtr and the
        remainder to the other segment).
Step 2: Take feature subset s ∪ fk (for k = 1, 2,…,d), one at a time, and compute for this feature
        subset the training parameters µ̂j and Σ̂j on the χtr segment.
Step 3: Using Eq. 1, compute the classification accuracy αk for feature subset s ∪ fk on the χval
        segment.
Step 4: Repeat Steps 1-3 N times (we use N = 10) to get an average classification accuracy ᾱk (for
        k = 1, 2,…,d).
Step 5: Allocate to the subset s the feature that gives the highest ᾱk, i.e., p = arg maxk ᾱk, and
        include fp in the subset s; i.e., s ⇐ s ∪ fp. If two or more features give equal average
        classification accuracy, then select the fp that individually gives the highest accuracy.
Step 6: Exclude feature fp from the feature set f and go to Step 1 to find the next best feature, until
        the average classification accuracy reaches its maximum (i.e., maxk ᾱk(q) ≥ maxk ᾱk(q + 1),
        where ᾱk(q) is the average classification accuracy at the q-th iteration).
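
     A compact sketch of the complete forward-selection loop (Steps 0-6) follows, reusing the classifier
sketch above. The 60/40 split and N = 10 repeats come from the text; the helper names and the
tie-breaking shortcut in Step 5 (first maximum rather than the individually best feature) are
illustrative assumptions:

```python
import numpy as np

def accuracy_of(cols, X_tr, y_tr, X_val, y_val):
    """Validation accuracy of the Bayes classifier on the given columns."""
    params = fit_class_params(X_tr[:, cols], y_tr)
    pred = [bayes_classify(x, params) for x in X_val[:, cols]]
    return np.mean(np.asarray(pred) == y_val)

def forward_select(X, y, r=0.6, n_repeats=10, rng=None):
    """Steps 0-6: grow subset s until average validation accuracy stops rising."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    remaining, s = list(range(d)), []           # Step 0
    best_so_far = -np.inf
    while remaining:
        avg = np.zeros(len(remaining))
        for _ in range(n_repeats):              # Steps 1-4: N random splits
            perm = rng.permutation(n)
            n_tr = int(r * n)
            tr, val = perm[:n_tr], perm[n_tr:]
            for i, fk in enumerate(remaining):  # Steps 2-3: try s U {fk}
                avg[i] += accuracy_of(s + [fk], X[tr], y[tr], X[val], y[val])
        avg /= n_repeats
        p = int(np.argmax(avg))                 # Step 5 (ties: first maximum)
        if avg[p] <= best_so_far:               # Step 6: previous max >= new max
            break
        best_so_far = avg[p]
        s.append(remaining.pop(p))              # include f_p, exclude it from f
    return s
```

Note that each outer iteration scans all remaining features, so for d in the thousands this exhaustive
search dominates the runtime; the paper does not discuss any optimization of this step.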
     The above algorithm gives one subset of features. However, if more than one subset of features is
required, then the procedure should be repeated on the remaining features. Next, we describe the
materials and methods.

                     MATERIALS AND METHODS

     Publicly available DNA microarray gene expression datasets are used from the Kent Ridge
Bio-medical repository (http://datam.i2r.a-star.edu.sg/datasets/krbd/). The program code is written in
Matlab on an i7 dual-core Pentium processor in a Linux environment.

                          RESULTS

     In the experimentation, 3 DNA microarray gene expression datasets have been used. The datasets are
described as follows.
     Acute leukaemia dataset (Golub et al., 1999): This dataset consists of DNA microarray gene
expression data of human acute leukaemia for cancer classification. Two types of acute leukaemia data
are provided for classification, namely Acute Lymphoblastic Leukaemia (ALL) and Acute Myeloid Leukaemia
(AML). The dataset is subdivided into 38 training samples and 34 test samples. The training set consists
of 38 bone marrow samples (27 ALL and 11 AML) over 7129 probes. The test set consists of 34 samples with
20 ALL and 14 AML, prepared under different experimental conditions. All the samples have 7129
dimensions and all are numeric.
     Lung dataset (Gordon et al., 2002): This dataset contains gene expression levels of Malignant
Pleural Mesothelioma (MPM) and adenocarcinoma (ADCA) of the lung. There are 181 tissue samples (31 MPM
and 150 ADCA). The training set contains 32 samples: 16 MPM and 16 ADCA. The remaining 149 samples are
used for testing. Each sample is described by 12533 genes.
     Breast cancer dataset (Van't Veer et al., 2002): This is a 2-class problem of relapse and
non-relapse, with 78 training samples (34 relapse and 44 non-relapse) and 19 test samples (12 relapse
and 7 non-relapse). The dimension of the breast cancer dataset is 24481.
     The proposed algorithm is used for feature selection. For classification of the test samples we use
the Bayesian classifier (using Eq. 1) and the Linear Discriminant Analysis (LDA) technique with the
Nearest Centroid Classifier (NCC) under the Euclidean distance measure; a sketch of the latter follows
below. The genes identified from all three datasets are described in Table 1, together with their
corresponding classification accuracies on the TRAIN data. The biological significance of the identified
genes is depicted in the last column of the table under p-value statistics using the Ingenuity Pathways
Analysis (IPA, http://www.ingenuity.com) tool. For the acute leukemia dataset the highest classification
accuracy on the training set, 100%, is obtained at the 2nd iteration; for the lung cancer dataset, 100%
classification accuracy is obtained at the 1st iteration; and for the breast cancer dataset, the highest
classification accuracy, 95%, is obtained at the 7th iteration. The proposed algorithm is compared with
several other existing techniques on the DNA microarray gene expression datasets. The performance (in
terms of classification accuracy) of the various techniques is depicted in Table 2. It can be observed
that the proposed method gives high classification accuracy with a very small number of selected
features.
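
     One possible reading of the LDA + NCC decision rule is sketched below: project the selected
features onto Fisher's discriminant directions and assign each test sample to the class with the nearest
projected centroid under Euclidean distance. The paper does not spell out its LDA variant (a gradient
LDA technique is cited in Sharma and Paliwal, 2008), so this standard Fisher formulation, the small
ridge term eps and the function names are all assumptions:

```python
import numpy as np

def lda_ncc_fit(X_tr, y_tr, eps=1e-6):
    """Fisher LDA directions plus per-class centroids in the projected space."""
    classes = np.unique(y_tr)
    m = X_tr.shape[1]
    mu_all = X_tr.mean(axis=0)
    Sw = np.zeros((m, m))  # within-class scatter
    Sb = np.zeros((m, m))  # between-class scatter
    for j in classes:
        Xj = X_tr[y_tr == j]
        mu_j = Xj.mean(axis=0)
        Sw += (Xj - mu_j).T @ (Xj - mu_j)
        diff = (mu_j - mu_all)[:, None]
        Sb += len(Xj) * (diff @ diff.T)
    # pinv plus a small ridge guards against a singular Sw, in the same
    # spirit as the pseudo-inverse used for Eq. 1's covariance.
    evals, evecs = np.linalg.eig(np.linalg.pinv(Sw + eps * np.eye(m)) @ Sb)
    order = np.argsort(-evals.real)[: len(classes) - 1]
    W = evecs[:, order].real                   # discriminant directions
    centroids = {j: (X_tr[y_tr == j] @ W).mean(axis=0) for j in classes}
    return W, centroids

def lda_ncc_predict(x, W, centroids):
    """Nearest Centroid Classifier with Euclidean distance in LDA space."""
    z = x @ W
    return min(centroids, key=lambda j: np.linalg.norm(z - centroids[j]))
```

For the two-class problems considered here, W reduces to a single Fisher direction.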
Table 1: Genes identified on DNA microarray gene expression datasets
                                                          TRAIN classification
Datasets            Gene number (accession)               accuracy (%)            P-value
Acute leukemia      4847 (X95735-at)                       96.3                   1.38e-4 - 3.20e-2
                    6055 (U37055-rna1-s-at)               100.0
Lung cancer         2549 (32551-at)                       100.0                   6.90e-5 - 1.38e-2
Breast cancer       10889 (AL080059)                       75.3                   6.61e-3 - 4.37e-2
                    4436 (NM-002829)                       83.4
                    2295 (Contig34964-RC)                  86.9
                    12455 (U90911)                         89.4
                    14868 (D38521)                         92.2
                    16795 (Contig54916-RC)                 93.8
                    12817 (L41143)                         95.0

Table 2: A comparison of classification accuracy on (A) acute leukemia, (B) lung cancer and (C) breast cancer datasets
A:
Methods                                                              # Selected genes    Acute leukemia (Classification
(Feature selection + classification)                                                     accuracy on TEST data) (%)
Prediction strength + SVMs (Furey et al., 2000)                      25-1000             88-94
Discretization + decision rules (Tan and Gilbert, 2003)              1038                91
RCBT (Cong et al., 2005)                                             10-40               91
Neighbourhood analysis + weighted voting (Golub et al., 1999)        50                  85
CBF + decision trees (Wang et al., 2005)                             1                   91
Information gain + Bayes classifier                                  2                   91
Information gain + LDA with NCC                                      2                   88
Chi-squared + Bayes classifier                                       2                   91
Chi-squared + LDA with NCC                                           2                   88
Proposed algorithm + Bayesian classifier                             2                   91
Proposed algorithm + LDA with NCC                                    2                   94
B:
Methods                                                              # Selected genes    Lung cancer (Classification
(Feature selection + classification)                                                     accuracy on TEST data) (%)
Discretization + decision trees (Tan and Gilbert, 2003)              5365                93
Boosting (Li and Wong, 2003)                                         unknown             81
Bagging (Li and Wong, 2003)                                          unknown             88
RCBT (Cong et al., 2005)                                             10-40               98
C4.5 (Li and Wong, 2003)                                             1                   81
Information gain + Bayes classifier                                  1                   89
Information gain + LDA with NCC                                      1                   91
Chi-squared + Bayes classifier                                       1                   77
Chi-squared + LDA with NCC                                           1                   58
Proposed algorithm + Bayesian classifier                             1                   89
Proposed algorithm + LDA with NCC                                    1                   98
Table 2: Continued
C:
Methods                                                              # Selected genes    Breast cancer (Classification
(Feature selection + classification)                                                     accuracy on TEST data) (%)
DCA (Thi et al., 2008)                                               19                  74
L1-SVM (Thi et al., 2008)                                            24045               64
p-value of t-test + DLDA (Yan and Zheng, 2007)                       50                  59
Golub + Golub (Yan and Zheng, 2007)                                  50                  62
SAM + DLDA (Yan and Zheng, 2007)                                     50                  58
Corr + Corr (Yan and Zheng, 2007)                                    50                  62
MPAS + Marginal (Yan and Zheng, 2007)                                50                  65
MPAS + MPAS (Yan and Zheng, 2007)                                    50                  69
Information gain + Bayes classifier                                  7                   37
Information gain + LDA with NCC                                      7                   63
Chi-squared + Bayes classifier                                       7                   58
Chi-squared + LDA with NCC                                           7                   58
Proposed algorithm + Bayesian classifier                             7                   74
Proposed algorithm + LDA with NCC                                    7                   74


                       DISCUSSION

     A feature (or gene) selection algorithm using the Bayes classification approach has been presented.
The pseudo-inverse of the covariance matrix is used in place of the inverse covariance matrix in the
class-conditional probability density function (Eq. 1), to cater for any singularities of the matrix
(i.e., when the number of selected genes exceeds the number of training samples). The gene subset is
obtained in a forward-selection manner. It can be observed that, on 3 DNA microarray gene expression
datasets, the proposed algorithm exhibits very promising classification performance when compared with
several other feature selection techniques.

                       CONCLUSION

     A gene selection algorithm using the Bayesian classification approach has been presented. The
algorithm has been evaluated on several DNA microarray gene expression datasets and compared with
several other existing methods. It is observed that the obtained genes exhibit high classification
accuracy and also show biological significance.

                       REFERENCES

Cong, G., K.L. Tan, A.K.H. Tung and X. Xu, 2005. Mining top-k covering rule groups for gene expression
    data. Proceedings of the ACM SIGMOD International Conference on Management of Data, Jun. 13-17,
    Baltimore, MD, USA, pp: 670-681. DOI: 10.1145/1066157.1066234
Furey, T.S., N. Cristianini, N. Duffy, D.W. Bednarski and M. Schummer et al., 2000. Support vector
    machine classification and validation of cancer tissue samples using microarray expression data.
    Bioinformatics, 16: 906-914. DOI: 10.1093/bioinformatics/16.10.906
Golub, T.R., D.K. Slonim, P. Tamayo, C. Huard and M. Gaasenbeek et al., 1999. Molecular classification
    of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286:
    531-537. DOI: 10.1126/science.286.5439.531
Gordon, G.J., R.V. Jensen, L.L. Hsiao, S.R. Gullans and J.E. Blumenstock et al., 2002. Translation of
    microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in
    lung cancer and mesothelioma. Cancer Res., 62: 4963-4967. PMID: 12208747
Li, J. and L. Wong, 2003. Using rules to analyse bio-medical data: A comparison between C4.5 and PCL.
    Adv. Web-Age Inform. Manage., 2762: 254-265. DOI: 10.1007/978-3-540-45160-0_25
Sharma, A. and K.K. Paliwal, 2008. Cancer classification by gradient LDA technique using microarray gene
    expression data. Data Knowl. Eng., 66: 338-347. DOI: 10.1016/j.datak.2008.04.004
Sharma, A., C.H. Koh, S. Imoto and S. Miyano, 2011a. Strategy of finding optimal number of features on
    gene expression data. Elect. Lett., 47: 480-482. DOI: 10.1049/el.2011.0526
Sharma, A., S. Imoto and S. Miyano, 2011b. A top-r feature selection algorithm for microarray gene
    expression data. IEEE/ACM Trans. Comput. Biol. Bioinform. DOI: 10.1109/TCBB.2011.151
Tan, A.C. and D. Gilbert, 2003. Ensemble machine learning on gene expression data for cancer
    classification. Applied Bioinform., 2: S75-S83. PMID: 15130820
Thi, H.A.L., V.V. Nguyen and S. Ouchani, 2008. Gene selection for cancer classification using DCA. Adv.
    Data Min. Appli., 5139: 62-72. DOI: 10.1007/978-3-540-88192-6_8
Van't Veer, L.J., H. Dai, M.J. van de Vijver and Y.D. He et al., 2002. Gene expression profiling
    predicts clinical outcome of breast cancer. Nature, 415: 530-536. DOI: 10.1038/415530a
Wang, Y., I.V. Tetko, M.A. Hall, E. Frank and A. Facius et al., 2005. Gene selection from microarray
    data for cancer classification - a machine learning approach. Comput. Biol. Chem., 29: 37-46. DOI:
    10.1016/j.compbiolchem.2004.11.001
Yan, X. and T. Zheng, 2007. Discriminant analysis using multigene expression profiles in classification
    of breast cancer. Proceedings of the International Conference on Bioinformatics and Computational
    Biology, Jun. 25-28, CSREA Press, Las Vegas, Nevada, USA, pp: 643-649.