
Learn++.MF: A random subspace approach for the missing feature problem
Robi Polikar a,*, Joseph DePasquale a, Hussein Syed Mohammed a, Gavin Brown b, Ludmilla I. Kuncheva c

a Electrical and Computer Eng., Rowan University, 201 Mullica Hill Road, Glassboro, NJ 08028, USA
b University of Manchester, Manchester, England, UK
c University of Bangor, Bangor, Wales, UK




ARTICLE INFO

Article history:
Received 9 November 2009
Received in revised form 16 April 2010
Accepted 21 May 2010

Keywords:
Missing data
Missing features
Ensemble of classifiers
Random subspace method

ABSTRACT

We introduce Learn++.MF, an ensemble-of-classifiers based algorithm that employs random subspace selection to address the missing feature problem in supervised classification. Unlike most established approaches, Learn++.MF does not replace missing values with estimated ones, and hence does not need specific assumptions on the underlying data distribution. Instead, it trains an ensemble of classifiers, each on a random subset of the available features. Instances with missing values are classified by the majority voting of those classifiers whose training data did not include the missing features. We show that Learn++.MF can accommodate a substantial amount of missing data, with only a gradual decline in performance as the amount of missing data increases. We also analyze the effect of the cardinality of the random feature subsets and of the ensemble size on algorithm performance. Finally, we discuss the conditions under which the proposed approach is most effective.

© 2010 Elsevier Ltd. All rights reserved.




1. Introduction

1.1. The missing feature problem

The integrity and completeness of data are essential for any classification algorithm. After all, a trained classifier – unless specifically designed to address this issue – cannot process instances with missing features, as the missing number(s) in the input vectors would make the matrix operations involved in data processing impossible. To obtain a valid classification, the data to be classified should be complete with no missing features (henceforth, we use missing data and missing features interchangeably). Missing data in real world applications is not uncommon: bad sensors, failed pixels, unanswered questions in surveys, malfunctioning equipment, medical tests that cannot be administered under certain conditions, etc. are all familiar scenarios in practice that can result in missing features. Feature values that are beyond the expected dynamic range of the data due to extreme noise, signal saturation, data corruption, etc. can also be treated as missing data. Furthermore, if the entire data are not acquired under identical conditions (time/location, using the same equipment, etc.), different data instances may be missing different features.

Fig. 1 illustrates such a scenario for a handwritten character recognition application: characters are digitized on an 8 × 8 grid, creating 64 features, f1–f64, a random subset (of about 20–30%) of which – indicated in orange (light shading) – are missing in each case. Having such a large proportion of randomly varying features may be viewed as an extreme and unlikely scenario, warranting reacquisition of the entire dataset. However, data reacquisition is often expensive, impractical, or sometimes even impossible, justifying the need for an alternative practical solution.

The classification algorithm described in this paper is designed to provide such a practical solution, accommodating missing features subject to the condition of distributed redundancy (discussed in Section 3), which is satisfied surprisingly often in practice.

* Corresponding author. Tel.: +1 856 256 5372; fax: +1 856 256 5241.
E-mail address: polikar@rowan.edu (R. Polikar).



Fig. 1. Handwritten character recognition as an illustration of the missing feature problem addressed in this study: a large proportion of feature values are missing
(orange/shaded), and the missing features vary randomly from one instance to another (the actual characters are the numbers 9, 7 and 6). For interpretation of the
references to colour in this figure legend, the reader is referred to the web version of this article.



1.2. Current techniques for accommodating missing data

The simplest approach for dealing with missing data is to ignore those instances with missing attributes. Commonly referred to as filtering or list-wise deletion approaches, such techniques are clearly suboptimal when a large portion of the data have missing attributes [1], and of course are infeasible if each instance is missing at least one or more features. A more pragmatic approach commonly used to accommodate missing data is imputation [2–5]: substitute the missing value with a meaningful estimate. Traditional examples of this approach include replacing the missing value with one of the existing data points (most similar in some measure), as in hot-deck imputation [2,6,7], with the mean of that attribute across the data, or with the mean of its k-nearest neighbors [4]. In order for the estimate to be a faithful representative of the missing value, however, the training data must be sufficiently dense, a requirement rarely satisfied for datasets with even a modest number of features. Furthermore, imputation methods are known to produce biased estimates as the proportion of missing data increases. A related, but perhaps more rigorous, approach is to use polynomial regression to estimate substitutes for missing values [8]. However, regression techniques – besides being difficult to implement in high dimensions – assume that the data can reasonably be fit by a particular polynomial, which may not hold for many applications.

Theoretically rigorous methods with provable performance guarantees have also been developed. Many of these methods rely on model based estimation, such as Bayesian estimation [9–11], which calculates posterior probabilities by integrating over the missing portions of the feature space. Such an approach also requires a sufficiently dense data distribution; but more importantly, a prior distribution for all unknown parameters must be specified, which requires prior knowledge. Such knowledge is typically vague or non-existent, and inaccurate choices often lead to inferior performance.

An alternative strategy in model based methods is the Expectation Maximization (EM) algorithm [12–14,10], justly admired for its theoretical elegance. EM consists of two steps, the expectation (E) and the maximization (M) step, which iteratively maximize the expectation of the log-likelihood of the complete data, conditioned on the observed data. Conceptually, these steps are easy to construct, and the range of problems that can be handled by EM is quite broad. However, there are two potential drawbacks: (i) convergence can be slow if large portions of the data are missing; and (ii) the M step may be quite difficult if a closed form of the distribution is unknown, or if different instances are missing different features. In such cases, the theoretical simplicity of EM does not translate into practice [2]. EM also requires prior knowledge that is often unavailable, or estimation of an unknown underlying distribution, which may be computationally prohibitive for large dimensional datasets. Incorrect distribution estimation often leads to inconsistent results, whereas lack of sufficiently dense data typically causes loss of accuracy. Several variations have been proposed, such as using Gaussian mixture models [15,16] or Expected Conditional Maximization, to mitigate some of these difficulties [2].

Neural network based approaches have also been proposed. Gupta and Lam [17] looked at weight decay on datasets with missing values, whereas Yoon and Lee [18] proposed the Training-EStimation-Training (TEST) algorithm, which predicts the actual target value from a reasonably well estimated imputed one. Other approaches use neuro-fuzzy algorithms [19], where unknown values of the data are either estimated (or a classification is made using the existing features) by calculating the fuzzy membership of the data point to its nearest neighbors, clusters or hyperboxes. The parameters of the clusters and hyperboxes are determined from the existing data. Algorithms based on the general fuzzy min–max neural networks [20] or ARTMAP and fuzzy c-means clustering [21] are examples of this approach.

More recently, ensemble based approaches have also been proposed. For example, Melville et al. [22] showed that the algorithm DECORATE, which generates artificial data (with no missing values) from existing data (with missing values), is quite robust to missing data. On the other hand, Juszczak and Duin [23] proposed combining an ensemble of one-class classifiers, each trained on a single feature. This approach is capable of handling any combination of missing features with the fewest number of classifiers possible. The approach can be very effective as long as single features can reasonably estimate the underlying decision boundaries, which is not always plausible.

In this contribution, we propose an alternative strategy, called Learn++.MF. It is inspired in part by our previously introduced ensemble-of-classifiers based incremental learning algorithm, Learn++, and in part by the random subspace method (RSM). In essence, Learn++.MF generates a sufficient number of classifiers, each trained on a random subset of the features. An instance with one or more missing attributes is then classified by majority voting of those classifiers that did not use the missing features during their training. Hence, this approach differs in one fundamental aspect from the other techniques: Learn++.MF does not try to estimate the values of the missing data; instead, it tries to extract the most discriminatory classification information provided by the existing data, by taking advantage of a presumed redundancy in the feature set. Hence, Learn++.MF avoids many of the pitfalls of estimation and imputation based techniques.

As in most missing feature approaches, we assume that the probability of a feature being missing is independent of the value of that or any other feature in the dataset. This model is referred to as missing completely at random (MCAR) [2]. Hence, given the dataset X = (x_ij), where x_ij represents the jth feature of instance x_i, and a missing data indicator matrix M = (M_ij), where M_ij is 1 if x_ij is missing and 0 otherwise, we assume that p(M|X) = p(M). This is the most restrictive mechanism that generates missing data, the one that provides us with no additional information. However, this is also the only case where list-wise deletion or most imputation approaches lead to no bias.



In the rest of this paper, we first provide a review of ensemble systems, followed by the description of the Learn++.MF algorithm, and provide a theoretical analysis to guide the choice of its free parameters: the number of features used to train each classifier, and the ensemble size. We then present results, analyze algorithm performance with respect to its free parameters, and compare performance metrics with theoretically expected values. We also compare the performance of Learn++.MF to that of a single classifier using mean imputation, to naïve Bayes, which can naturally handle missing features, and to an intermediate approach that combines RSM with mean imputation, as well as to the one-class ensemble approach of [23]. We conclude with discussions and remarks.

2. Background: ensemble and random subspace methods

An ensemble-based system is obtained by combining diverse classifiers. The diversity in the classifiers, typically achieved by using different training parameters for each classifier, allows individual classifiers to generate different decision boundaries [24–27]. If proper diversity is achieved, a different error is made by each classifier, a strategic combination of which can then reduce the total error. Starting in the early 1990s, with such seminal works as [28–30,13,31–33], ensemble systems have since become an important research area in machine learning, whose recent reviews can be found in [34–36].

The diversity requirement can be achieved in several ways [24–27]: training individual classifiers using different (i) training data (sub)sets; (ii) parameters of a given classifier architecture; (iii) classifier models; or (iv) subsets of features. The last one, also known as the random subspace method (RSM), was originally proposed by Ho [37] for constructing an ensemble of decision trees (decision forests). In RSM, classifiers are trained using different random subsets of the features, allowing classifiers to err in different sub-domains of the feature space. Skurichina points out that RSM works particularly well when the database provides redundant information that is "dispersed" across all features [38].

The RSM has several desirable attributes: (i) working in a reduced dimensional space also reduces the adverse consequences of the curse of dimensionality; (ii) RSM based ensemble classifiers may provide improved classification performance, because the synergistic classification capacity of a diverse set of classifiers compensates for any discriminatory information lost by choosing a smaller feature subset; (iii) implementing RSM is quite straightforward; and (iv) it provides a stochastic and faster alternative to optimal-feature-subset search algorithms. RSM approaches have been well researched for improving diversity, with well-established benefits for classification applications [39], regression problems [40] and optimal feature selection applications [38,41]. However, the feasibility of an RSM based approach has not been explored for the missing feature problem, and hence constitutes the focus of this paper.

Finally, a word on the algorithm name: Learn++ was developed by reconfiguring AdaBoost [33] to incrementally learn from new data that may later become available. Learn++ generates an ensemble of classifiers for each new database, and combines them through weighted majority voting. Learn++ draws its training data from a distribution, iteratively updated to force the algorithm to learn novel information not yet learned by the ensemble [42,43]. Learn++.MF combines the distribution update concepts of Learn++ with the random feature selection of RSM to provide a novel solution to the missing feature problem.

3. Approach

3.1. Assumptions and targeted applications of the proposed approach

The essence of the proposed approach is to generate a sufficiently large number of classifiers, each trained with a random subset of the features. When an instance x with missing feature(s) needs to be classified, only those classifiers trained with the features that are presently available in x are used for the classification. As such, Learn++.MF follows an alternative paradigm for the solution of the missing feature problem: the algorithm tries to extract most of the information from the available features, instead of trying to estimate, or impute, the values of the missing ones. Hence, Learn++.MF does not require very dense training data (though such data are certainly useful); it does not need to know or estimate the underlying distributions; and hence it does not suffer from the adverse consequences of a potentially incorrect estimate.

Learn++.MF makes two assumptions. First, the feature set must be redundant, that is, the classification problem is solvable with unknown subset(s) of the available features (of course, if the identities of the redundant features were known, they would have already been excluded from the feature set). Second, the redundancy is distributed randomly over the feature set (hence, time series data would not be well suited for this approach). These assumptions are primarily due to the random nature of feature selection, and are shared with all RSM based approaches. We combine these two assumptions under the name distributed redundancy. Such datasets are not uncommon in practice (the scenario in Fig. 1, sensor applications, etc.), and it is these applications for which Learn++.MF is designed and expected to be most effective.

3.2. Algorithm Learn++.MF

A trivial approach is to create a classifier for each of the 2^f − 1 possible non-empty subsets of the f features. Then, for any instance with missing features, use those classifiers whose training feature set did not include the missing ones. Such an approach, perfectly suitable for datasets with few features, has in fact been previously proposed [44]. For large feature sets, this exhaustive approach is infeasible. Besides, the probability of a particular feature combination being missing also diminishes exponentially as the number of features increases. Therefore, trying to account for every possible combination is inefficient. At the other end of the spectrum is Juszczak and Duin's approach [23], which uses f × c one-class classifiers, one for each feature and each of the c classes. This approach requires the fewest number of classifiers that can handle any feature combination, but comes at the cost of a potential performance drop due to disregarding any existing relationships between the features, as well as the information provided by the other existing features.

Learn++.MF offers a strategic compromise: it trains an ensemble of classifiers, each with a subset of the features randomly drawn from a feature distribution, which is iteratively updated to favor the selection of those features that were previously undersampled. This allows Learn++.MF to achieve classification performances with little or no loss (compared to fully intact data) even when large portions of data are missing. Without any loss of generality, we assume that the training dataset is complete. Otherwise, a sentinel can be used to flag the missing features, and the classifiers are then trained on a random selection of the existing features. For brevity, we focus on the more critical case of field data containing missing features.
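To put these three design points in perspective, the short sketch below (our own back-of-the-envelope comparison, not part of the original paper; the values f = 30 and c = 2 mirror the WBC dataset used later, and T = 1000 is the ensemble size listed for it in Table 1) counts the classifiers each strategy would require.

    f, c = 30, 2      # features and classes, as in the WBC dataset
    T = 1000          # ensemble size used for WBC in Table 1

    exhaustive = 2**f - 1    # one classifier per non-empty feature subset [44]
    one_class = f * c        # one-class classifiers of Juszczak and Duin [23]
    learn_mf = T             # Learn++.MF: T random-subspace classifiers

    print(f"exhaustive subsets : {exhaustive:,}")   # 1,073,741,823
    print(f"one-class ensemble : {one_class}")      # 60
    print(f"Learn++.MF         : {learn_mf}")       # 1,000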





                                                                 Fig. 2. Pseudocode of algorithm Learn++.MF.


The pseudocode of Learn++.MF is provided in Fig. 2. The inputs to the algorithm are the training data S = {(x_i, y_i), i = 1,...,N}; the feature subset cardinality, i.e., the number of features, nof, used to train each classifier; a supervised classification algorithm (BaseClassifier); the number of classifiers to be created, T; and a sentinel value sen to designate a missing feature. At each iteration t, the algorithm creates one additional classifier C_t. The features used for training C_t are drawn from an iteratively updated distribution D_t that promotes further diversity with respect to the feature combinations. D_1 is initialized to be uniform, hence each feature is equally likely to be used in training classifier C_1 (step 1):

D_1(j) = 1/f, \quad j = 1, \ldots, f    (1)

where f is the total number of features. For each iteration t = 1,...,T, the distribution D_t ∈ R^f that was updated in the previous iteration is first normalized to ensure that D_t stays a proper distribution:

D_t = D_t \Big/ \sum_{j=1}^{f} D_t(j)    (2)

A random sample of nof features is drawn, without replacement, according to D_t, and the indices of these features are placed in a list, called F_selection(t) ∈ R^nof (step 2). This list keeps track of which features have been used for each classifier, so that the appropriate classifiers can be called during testing. Classifier C_t is trained (step 3) using the features in F_selection(t), and evaluated on S (step 4) to obtain

Perf_t = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\big[C_t(x_i) = y_i\big]    (3)

where \mathbf{1}[\cdot] evaluates to 1 if the predicate is true. We require C_t to perform better than some minimum threshold, such as 1/2 as in AdaBoost, or 1/c (better than a random guess for reasonably well-balanced c-class data), on its training data to ensure that it has a meaningful classification capacity. The distribution D_t is then updated (step 5) according to

D_{t+1}(F_{selection}(t)) = \beta \cdot D_t(F_{selection}(t)), \quad 0 < \beta \le 1    (4)

such that the weights of those features in the current F_selection(t) are reduced by a factor of β. Then, features not in F_selection(t) effectively receive higher weights when D_t is normalized in the next iteration. This gives previously undersampled features a better chance to be selected into the next feature subset. Choosing β = 1 results in a feature distribution that remains uniform throughout the experiment. We recommend β = nof/f, which reduces the likelihood of previously selected features being selected in the next iteration. In our preliminary efforts, we obtained better performances by using β = nof/f over other choices such as β = 1 or β = 1/f, though not always with statistically significant margins [45].

During classification of a new instance z, missing (or known to be corrupt) values are first replaced by the sentinel sen (chosen as a value not expected to occur in the data). Then, the features with the value sen are identified and placed in M(z), the set of missing features:

M(z) = \arg\big(z(j) == sen\big), \quad \forall j, \ j = 1, \ldots, f    (5)

Finally, all classifiers C_t whose feature list F_selection(t) does not include the features listed in M(z) are combined through majority voting to obtain the ensemble classification of z:

E(z) = \arg\max_{y \in Y} \sum_{t:\, C_t(z) = y} \mathbf{1}\big[M(z) \cap F_{selection}(t) = \emptyset\big]    (6)
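To make steps 1–5 and the voting rule of Eqs. (1)–(6) concrete, here is a minimal sketch of the procedure (our own paraphrase of Fig. 2, not the authors' implementation; it assumes a complete training set, uses NaN in place of the sentinel sen, and uses a decision tree as an arbitrary stand-in for BaseClassifier).

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def train_learnpp_mf(X, y, nof, T, beta=None, seed=0):
        """Train T classifiers, each on nof features drawn from distribution D_t."""
        N, f = X.shape
        beta = nof / f if beta is None else beta            # recommended beta = nof/f
        rng = np.random.default_rng(seed)
        D = np.ones(f) / f                                  # Eq. (1): uniform D_1
        ensemble = []
        while len(ensemble) < T:
            D = D / D.sum()                                 # Eq. (2): keep D_t a distribution
            feats = np.sort(rng.choice(f, size=nof, replace=False, p=D))  # step 2
            clf = DecisionTreeClassifier(random_state=seed).fit(X[:, feats], y)
            perf = np.mean(clf.predict(X[:, feats]) == y)   # Eq. (3): training performance
            if perf <= 1.0 / len(np.unique(y)):             # reject classifiers no better than chance
                continue
            ensemble.append((clf, feats))
            D[feats] *= beta                                # Eq. (4): down-weight just-used features
        return ensemble

    def classify_learnpp_mf(ensemble, z):
        """Majority vote of classifiers whose feature subset avoids z's missing entries."""
        missing = set(np.flatnonzero(np.isnan(z)))          # Eq. (5), with NaN as the sentinel
        votes = {}
        for clf, feats in ensemble:
            if missing.intersection(feats):                 # classifier needs a missing feature: skip
                continue
            label = clf.predict(z[feats].reshape(1, -1))[0]
            votes[label] = votes.get(label, 0) + 1
        return max(votes, key=votes.get) if votes else None # Eq. (6); None if no useable classifier

classify_learnpp_mf returns None only when no stored feature subset avoids the instance's missing features; the probability of that event is exactly what the following analysis quantifies.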



3.3. Choosing algorithm parameters

Considering that the feature subsets for each classifier are chosen randomly, and that we wish to use far fewer than the 2^f − 1 classifiers necessary to accommodate every feature subset combination, it is possible that a particular feature combination required to accommodate a specific set of missing features might not have been sampled by the algorithm. Such an instance cannot be processed by Learn++.MF. It is then fair to ask how often Learn++.MF will not be able to classify a given instance. Formally, given that Learn++.MF needs at least one useable classifier that can handle the current missing feature combination, what is the probability that there will be at least one classifier – in an ensemble of T classifiers – that will be able to classify an instance with m of its f features missing, if each classifier is trained on nof features? A formal analysis to answer this question also allows us to determine the relationship among these parameters, and hence can guide us in appropriately selecting the algorithm's free parameters: nof and T.

A classifier C_t is only useable if none of the nof features selected for its training data matches one of those missing in x. Then, what is the probability of finding exactly such a classifier? Without loss of any generality, assume that the m missing features are fixed, and that we are choosing nof features to train C_t. Then, the probability of choosing the first feature – among f features – such that it does not coincide with any of the m missing ones is (f − m)/f. The probability of choosing the second such feature is (f − m − 1)/(f − 1), and similarly the probability of choosing the (nof)th such feature is (f − m − (nof − 1))/(f − (nof − 1)). Then the probability of finding a useable classifier C_t for x is

p = \frac{f-m}{f} \cdot \frac{f-m-1}{f-1} \cdot \frac{f-m-2}{f-2} \cdots \frac{f-m-(nof-1)}{f-(nof-1)}
  = \left(1-\frac{m}{f}\right)\left(1-\frac{m}{f-1}\right)\left(1-\frac{m}{f-2}\right)\cdots\left(1-\frac{m}{f-(nof-1)}\right) = \prod_{i=0}^{nof-1}\left(1-\frac{m}{f-i}\right)    (7)

Note that this is exactly the scenario described by the hypergeometric (HG) distribution: we have f features, m of which are missing; we then randomly choose nof of these f features. The probability that exactly k of these selected nof features will be one of the m missing features is

p \sim HG(k; f, m, nof) = \binom{m}{k}\binom{f-m}{nof-k} \Big/ \binom{f}{nof}    (8)

We need k = 0, i.e., none of the features selected is one of the missing ones. Then, the desired probability is

p \sim HG(k=0; f, m, nof) = \binom{m}{0}\binom{f-m}{nof} \Big/ \binom{f}{nof} = \frac{(f-m)!}{(f-m-nof)!} \cdot \frac{(f-nof)!}{f!}    (9)

Eqs. (7) and (9) are equivalent. However, Eq. (7) is preferable, since it does not use factorials, and hence will not result in numerical instabilities for large values of f. If we have T such classifiers, each trained on a randomly selected nof features, what is the probability that at least one will be useable, i.e., that at least one of them was trained with a subset of nof features that does not include any of the m missing features? The probability that a random single classifier is not useable is (1 − p). Since the feature selections are independent (specifically for β = 1), the probability of not finding any single useable classifier among all T classifiers is (1 − p)^T. Hence, the probability of finding at least one useable classifier is

P = 1-(1-p)^T = 1-\left(1-\prod_{i=0}^{nof-1}\left(1-\frac{m}{f-i}\right)\right)^T    (10)

If we know how many features are missing, Eq. (10) is exact. However, we usually do not know exactly how many features will be missing. At best, we may know the probability ρ that a given feature may be missing. We then have a binomial distribution: naming the event "a given feature is missing" a success, the probability p_m of having m of f features missing in any given instance is the probability of having m successes in f trials of a Bernoulli experiment, characterized by the binomial distribution

p_m = \binom{f}{m} \rho^m (1-\rho)^{f-m}    (11)

The actual probability of finding a useable classifier is then a weighted sum of the probabilities (1 − p)^T, where the weights are the probabilities of having exactly m features missing, with m varying from 0 (no missing features) to the maximum allowable number of f − nof missing features:

P = 1-\sum_{m=0}^{f-nof}\left[\binom{f}{m}\rho^m(1-\rho)^{f-m} \cdot \left(1-\prod_{i=0}^{nof-1}\left(1-\frac{m}{f-i}\right)\right)^T\right]    (12)

Eq. (12) computes the probability of finding at least one useable classifier by subtracting the probability of finding no useable classifier (in an ensemble of T classifiers) from one. A more informative way, however, is to compute the probability of finding at least t useable classifiers (out of T). If finding a single useable classifier is a success in a Bernoulli experiment with probability of success p, then the probability of finding at least t useable classifiers is

P(\text{at least } t \text{ useable classifiers when } m \text{ features are missing}) = p_m^{\ge t} = \sum_{\tau=t}^{T}\binom{T}{\tau} p^{\tau}(1-p)^{T-\tau}    (13)

Note that Eq. (13) still needs to be weighted with the (binomial) probabilities of "having exactly m missing features", and then summed over all values of m to yield the desired probability

P = \sum_{m=0}^{f-nof}\left[\binom{f}{m}\rho^m(1-\rho)^{f-m} \cdot p_m^{\ge t}\right]    (14)

By using the appropriate expressions from Eqs. (7) and (13) in Eq. (14), we obtain

P = \sum_{m=0}^{f-nof}\left[\binom{f}{m}\rho^m(1-\rho)^{f-m} \cdot \sum_{\tau=t}^{T}\binom{T}{\tau}\left(\prod_{i=0}^{nof-1}\left(1-\frac{m}{f-i}\right)\right)^{\tau}\left(1-\prod_{i=0}^{nof-1}\left(1-\frac{m}{f-i}\right)\right)^{T-\tau}\right]    (15)

Eq. (15), when evaluated for t = 1, is the probability of finding at least one useable classifier trained on nof features – from a pool of T classifiers – when each feature has a probability ρ of being missing. For t = 1, Eqs. (10) and (13) are identical. However, Eq. (13) allows computation of the probability of finding any number of useable classifiers (say, t_desired, not just one) simply by starting the summation from t = t_desired. These equations can help us choose the algorithm's free parameters, nof and T, to obtain a reasonable confidence that we can process most instances with missing features.

As an example, consider a 30-feature dataset. First, let us fix the probability that a feature is missing at ρ = 0.05. Then, on average, 1.5 features are missing from each instance. Fig. 3(a) shows the probability of finding at least one classifier for various values of nof as a function of T. Fig. 3(b) provides similar plots for a higher rate of ρ = 0.20 (20% missing features). The solid curve in each case is the curve obtained by Eq. (15), whereas the various indicators represent the actual probabilities obtained as a result of 1000 simulations. We observe that (i) the experimental data fit the theoretical curve very well; (ii) the probability of finding a useable classifier increases with the ensemble size; and (iii) for a fixed ρ and a fixed ensemble size, the probability of finding a useable classifier increases as nof decreases. This latter observation makes sense, as the fewer the number of features used in training, the larger the number of missing features that can be accommodated. In fact, if a classifier is trained with nof features out of a total of f features, then that classifier can accommodate up to f − nof missing features.

In Fig. 3(c), we now fix nof = 9 (30% of the total feature size), and calculate the probability of finding at least one useable classifier as a function of ensemble size, for various values of ρ. As expected, for a fixed value of nof and ensemble size, this probability decreases as ρ increases.
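These probabilities are straightforward to evaluate numerically. The short sketch below (our own code, not the authors'; the parameter values reproduce the f = 30, nof = 9, ρ = 0.05 example of Fig. 3) implements Eqs. (7), (11) and (13)–(14).

    from math import comb, prod

    def p_useable(f, m, nof):
        # Eq. (7): probability that a classifier trained on nof of f features
        # avoids all m missing ones.
        return prod(1 - m / (f - i) for i in range(nof))

    def p_at_least_t(f, nof, T, rho, t=1):
        # Eqs. (11), (13)-(14): probability of at least t useable classifiers
        # among T, when each feature is missing independently with probability rho.
        total = 0.0
        for m in range(f - nof + 1):                           # m missing features
            p_m = comb(f, m) * rho**m * (1 - rho)**(f - m)     # Eq. (11)
            p = p_useable(f, m, nof)
            tail = sum(comb(T, k) * p**k * (1 - p)**(T - k)    # Eq. (13)
                       for k in range(t, T + 1))
            total += p_m * tail                                # Eq. (14)
        return total

    # The 30-feature example of Fig. 3: rho = 0.05, nof = 9.
    for T in (10, 25, 50, 100):
        print(T, round(p_at_least_t(f=30, nof=9, T=T, rho=0.05), 4))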



Fig. 3. Probability of finding at least one useable classifier, as a function of the number of classifiers in the ensemble, for f = 30: (a) ρ = 0.05, variable nof (nof = 5, 9, 12, 15, 20); (b) ρ = 0.2, variable nof (nof = 3, 5, 7, 9, 12, 15); and (c) nof = 9, variable ρ (ρ = 0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.40).




In order to see how these results scale with larger feature sets, Fig. 4 shows the theoretical plots for the 216-feature Multiple Features (MFEAT) dataset [46]. All trends mentioned above can be observed in this case as well; however, the required ensemble size is now in the thousands. An ensemble with a few thousand classifiers is well within today's computational resources, and in fact not uncommon in many multiple classifier system applications. It is clear, however, that as the feature size grows beyond the thousands range, the computational burden of this approach greatly increases. Therefore, Learn++.MF is best suited for applications with moderately high dimensionality (fewer than 1000 features). This is not an overly restrictive requirement, since many applications typically have much smaller dimensionalities.

4. Simulation results

4.1. Experimental setup and overview of results

Learn++.MF was evaluated on several databases with various numbers of features. The algorithm resulted in similar trends in all cases, which we discuss in detail below. Due to the amount of detail involved with each dataset, as well as for brevity and to avoid repetition of similar outcomes, we present representative full results on four datasets here, and provide results for many additional real world and UCI benchmark datasets online [48], a summary of which is provided at the end of this section. The four datasets featured here encompass a broad range of feature and database sizes. These datasets are the Wine (13 features), Wisconsin Breast Cancer (WBC—30 features), Optical Character Recognition (OCR—62 features) and Multiple Features (MFEAT—216 features) datasets obtained from the UCI repository [46]. In all cases, multilayer perceptron type neural networks were used as the base classifier, though any supervised classifier can also be used. The training parameters of the base classifiers were not fine tuned, and were selected as fixed, reasonable values (error goal = 0.001–0.01, number of hidden layer nodes = 20–30, depending on the dataset).
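For readers who wish to set up a comparable experiment, the fragment below (our own approximation, not the authors' configuration; scikit-learn's MLPClassifier and its tol parameter merely stand in for the original error-goal training) shows one way to define such a base learner and plug it into the training sketch of Section 3.2.

    from sklearn.neural_network import MLPClassifier

    def make_base_classifier(hidden_nodes=25, error_goal=0.005, seed=0):
        # A single hidden layer of 20-30 nodes, trained until the loss
        # improvement falls below the (assumed) error goal.
        return MLPClassifier(hidden_layer_sizes=(hidden_nodes,),
                             tol=error_goal, max_iter=2000,
                             random_state=seed)

Any other supervised learner can be substituted, since the algorithm places no restriction on the base classifier.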



[Fig. 4: two panels plotting the probability of finding at least one usable classifier against the number of classifiers in the ensemble (0–4000), for f = 216. Panel (a) fixes ρ = 0.15 and varies nof = 15, 20, 25, 30, 35, 40, 50; panel (b) fixes nof = 40 and varies ρ = 0.05, 0.10, 0.15, 0.18, 0.20.]

Fig. 4. Probability of finding at least one usable classifier, f = 216: (a) ρ = 0.15, variable nof and (b) nof = 40, variable ρ.
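Curves of the kind shown in Fig. 4 can also be approximated empirically. The sketch below is an illustrative Monte Carlo estimate of our own (not the paper's closed-form expression): it simulates random nof-feature subsets and random missing-feature patterns to estimate the probability that at least one of T classifiers remains usable.

```python
import numpy as np

def prob_at_least_one_usable(f, nof, rho, T, n_trials=2000, seed=0):
    """Monte Carlo estimate of P(at least one of T classifiers is usable) when
    each feature is missing independently with probability rho and each
    classifier is trained on a random subset of `nof` of `f` features."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_trials):
        missing = rng.random(f) < rho                # missing-feature mask for one instance
        usable = False
        for _ in range(T):
            feats = rng.choice(f, size=nof, replace=False)
            if not missing[feats].any():             # usable if none of its features are missing
                usable = True
                break
        hits += usable
    return hits / n_trials

# Example with assumed values, mirroring Fig. 4(a): f = 216, rho = 0.15, nof = 20
# print(prob_at_least_one_usable(216, 20, 0.15, T=1000))
```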



Table 1
Datasets used in this study.

Dataset    Dataset size   Number of classes   Number of features   nof1   nof2   nof3   nof4   nof5   nof6        T
WINE            178              3                    13              3      4      5      6      7   variable    200
WBC             400              2                    30             10     12     14     16      –   variable   1000
OCR            5620             10                    62             16     20     24      –      –   variable   1000
MFEAT          2000             10                   216             10     20     30     40     60   variable   2000
VOC-I*          384              6                     6              2      3      –      –      –   variable    100
VOC-II*          84             12                    12              3      4      5      6      –   variable    200
ION*            270              2                    34              8     10     12     14      –   variable   1000
WATER*          374              4                    38             12     14     16     18      –   variable   1000
ECOLI*          347              5                     5              2      3      –      –      –   variable   1000
DERMA*          366              6                    34              8     10     12     14      –   variable   1000
PEN*         10,992             10                    16              6      7      8      9      –   variable    250

* The full results for these datasets are provided online due to space considerations and may be found at [48].



We define the global number of features in a dataset as the product of the number of features f and the number of instances N, and a single test trial as evaluating the algorithm with 0–30% (in steps of 2.5%) of the global number of features missing from the test data (which typically constitutes half of the total data size). We evaluate the algorithm with different numbers of features, nof, used in training the individual classifiers, including a variable nof, where the number of features used to train each classifier is determined by randomly drawing an integer in the 15–50% range of f. All results are averages of 100 trials on test data, reported with 95% confidence intervals, where training and test sets were randomly reshuffled for each trial. Missing features were simulated by randomly replacing actual values with sentinels.
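As an illustration of this evaluation protocol (the function names, the NaN sentinel, and the rounding convention are our own assumptions; the paper does not prescribe an implementation), missing features can be injected into a test set and a variable nof drawn as follows:

```python
import numpy as np

SENTINEL = np.nan  # assumed sentinel value marking a missing feature

def inject_missing(X_test, pmf, rng):
    """Randomly replace a fraction `pmf` of all test-set values with the sentinel."""
    X = X_test.astype(float)
    n_values = X.size
    n_missing = int(round(pmf * n_values))
    idx = rng.choice(n_values, size=n_missing, replace=False)
    X.ravel()[idx] = SENTINEL
    return X

def draw_variable_nof(f, rng, low=0.15, high=0.50):
    """Draw a per-classifier nof uniformly from the 15-50% range of f
    (exact rounding convention is an assumption)."""
    return rng.integers(int(np.ceil(low * f)), int(np.floor(high * f)) + 1)

rng = np.random.default_rng(0)
# Example: 20% of the values in a 100 x 13 test matrix replaced with the sentinel
X_missing = inject_missing(np.random.rand(100, 13), pmf=0.20, rng=rng)
nof = draw_variable_nof(13, rng)
```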
Table 1 summarizes the datasets and parameters used in the experiments: the cardinality of the dataset, the number of classes, the total number of features f, the specific values of nof used in the simulations, and the total number of classifiers T generated.

The behavior of the algorithm with respect to its parameters was very consistent across the databases. The two primary observations, discussed below in detail, were as follows: (i) the algorithm can handle data with missing values, 20% or more, with little or no performance drop compared to classifying data with all values intact; and (ii) the choice of nof presents a critical trade-off: higher performance can be obtained using a larger nof when fewer features are missing, yet a smaller nof yields higher and more stable performance, and leaves fewer instances unprocessed, when larger portions of the data are missing.

4.2. Wine database

The Wine database comes from the chemical analysis of 13 constituents in wines obtained from three cultivars. Previous experiments using optimized classifiers trained with all 13 features, and evaluated on a test dataset with no missing features, had zero classification error, setting the benchmark target performance for this database. Six values of nof were considered: 3, 4, 5, 6, 7 and variable, where each classifier was trained not with a fixed number of features, but with a random selection of one of these values.



Table 2
Performance on the WINE database.

nof = 3 (out of 13)              nof = 4 (out of 13)              nof = 5 (out of 13)
PMF     Ensemble Perf      PIP   PMF     Ensemble Perf      PIP   PMF     Ensemble Perf      PIP
 0.00   100.00 ± 0.00      100    0.00   100.00 ± 0.00      100    0.00   100.00 ± 0.00      100
 2.50   100.00 ± 0.00      100    2.50    99.67 ± 0.71      100    2.50   100.00 ± 0.00      100
 5.00   100.00 ± 0.00      100    5.00   100.00 ± 0.00      100    5.00    99.67 ± 0.71      100
 7.50    99.67 ± 0.71      100    7.50    99.67 ± 0.71      100    7.50    99.33 ± 1.43      100
10.00    99.33 ± 0.95      100   10.00    99.67 ± 0.71      100   10.00    99.00 ± 1.09      100
15.00    99.00 ± 1.09      100   15.00    98.00 ± 1.58      100   15.00    98.33 ± 1.60      100
20.00    99.00 ± 1.09      100   20.00    98.00 ± 1.91      100   20.00    96.67 ± 1.85      100
25.00    99.00 ± 1.09      100   25.00    97.66 ± 1.53      100   25.00    96.29 ± 1.31       99
30.00    96.33 ± 1.98      100   30.00    94.29 ± 1.54       99   30.00    95.22 ± 2.26       98

nof = 6 (out of 13)              nof = 7 (out of 13)              nof = variable (3–7 out of 13)
PMF     Ensemble Perf      PIP   PMF     Ensemble Perf      PIP   PMF     Ensemble Perf      PIP
 0.00   100.00 ± 0.00      100    0.00   100.00 ± 0.00      100    0.00   100.00 ± 0.00      100
 2.50    99.67 ± 0.71      100    2.50    99.33 ± 0.95      100    2.50    99.33 ± 0.95      100
 5.00    99.33 ± 0.95      100    5.00    99.33 ± 0.95      100    5.00   100.00 ± 0.00      100
 7.50    98.67 ± 1.58      100    7.50    98.67 ± 1.17      100    7.50    99.67 ± 0.71      100
10.00    98.33 ± 2.20      100   10.00    98.32 ± 1.60      100   10.00    99.67 ± 0.74      100
15.00    96.98 ± 2.00       99   15.00    96.94 ± 2.30       99   15.00    97.67 ± 1.53      100
20.00    96.98 ± 2.27       99   20.00    91.27 ± 3.55       96   20.00    98.00 ± 1.17      100
25.00    93.72 ± 2.66       95   25.00    93.03 ± 4.35       86   25.00    97.33 ± 1.78      100
30.00    95.29 ± 2.63       93   30.00    91.34 ± 4.20       78   30.00    94.67 ± 2.18      100

Table 2 summarizes the test performances using all values of nof, and the percentage of instances that could be processed – correctly or otherwise – with the existing ensemble (described below). The first row, with 0.0% missing features, is the algorithm's performance when the classifiers were trained on nof features but evaluated on fully intact test data. The algorithm is able to obtain 100% correct classification with any of the nof values used, indicating that this dataset does in fact include redundant features.

Now, since the feature subsets are selected at random, it is possible that the particular set of features available for a given instance does not match the feature combinations selected by any of the classifiers. Such an instance cannot be processed, as there would be no classifier trained with the unique combination of the available features. The column with the heading PIP (percentage of instances processed) indicates the percentage of those instances that can be processed by the generated ensemble. As expected, PIP decreases as the "% missing features (PMF)" increases. We note that with 30% of the features missing from the test data, 100% of this test dataset can be processed (at a performance of 96.3%) when nof = 3, whereas only 78% can be processed (at a performance of 91.3%) when nof = 7 features are used for training the classifiers. A few additional observations: first, we note that there is no (statistically significant) performance drop with up to 25% of features missing when nof = 3, or up to 10% when nof = 6, and only a few percentage points of drop (3.5% for nof = 3, 4.7% for nof = 6) when as many as 30% of the features are missing; hence the algorithm can handle large amounts of missing data with relatively little drop in classification performance. Second, virtually all test data can be processed, even when as much as 30% of the features are missing, if we use relatively few features for training the classifiers.
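To make the PIP bookkeeping concrete, the following sketch (our own illustrative code, assuming the ensemble representation from the earlier training sketch and NaN sentinels for missing values) classifies an instance by simple majority vote over the usable classifiers and reports the fraction of instances that can be processed:

```python
import numpy as np
from collections import Counter

def classify_with_missing(x, ensemble):
    """Majority vote over classifiers whose feature subsets avoid the missing values.

    `x` is a 1-D array with NaN marking missing features; `ensemble` is a list of
    (feature_indices, classifier) pairs. Returns (label, n_usable); the label is
    None if no classifier is usable, i.e., the instance cannot be processed.
    """
    missing = np.isnan(x)
    votes = []
    for feats, clf in ensemble:
        if not missing[feats].any():                  # classifier is usable for this instance
            votes.append(clf.predict(x[feats].reshape(1, -1))[0])
    if not votes:
        return None, 0
    return Counter(votes).most_common(1)[0][0], len(votes)

def percent_instances_processed(X_missing, ensemble):
    """PIP: percentage of instances for which at least one classifier is usable."""
    processed = sum(classify_with_missing(x, ensemble)[0] is not None for x in X_missing)
    return 100.0 * processed / len(X_missing)
```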
For a more in-depth analysis of the algorithm's behavior, consider the plots in Fig. 5. Fig. 5(a) illustrates the overall performance of the Learn++.MF ensemble. All performances in the 0–30% interval of PMF, with increments of 2.5%, are provided (while the tables skip some intermediate values for brevity). As expected, the performance declines as PMF increases – albeit only gradually. The decline is much milder for nof = 3 than it is for nof = 7, the reasons for which are discussed below. Also included in Fig. 5(a) for comparison is the performance of a single classifier (of the same architecture as the ensemble members) that employs the commonly used mean imputation (M.I.) to replace the missing data, as well as that of the E.M. algorithm [47]. We observe that the Learn++.MF ensemble (significantly) outperforms both E.M. and mean imputation for all values of PMF, an observation that was consistent for all databases we have evaluated (we compare Learn++.MF to naïve Bayes and RSM with mean imputation later in the paper). Fig. 5(b) shows the performance of a single usable classifier, averaged over all T classifiers, as a function of PMF. The average performance of any fixed single classifier is, as expected, independent of PMF; however, higher nof values are generally associated with better individual classifier performance. This makes sense, since there is more information available to each classifier with a larger nof. However, this comes at the cost of a smaller PIP. In Fig. 5(b) and (c) we compare the performance and the PIP of a single classifier vs. the ensemble. For example, consider 20% PMF, for which the corresponding PIP values for single classifiers are shown in Fig. 5(b). A single usable classifier trained with nof = 3 features is able to classify, on average, 51% of instances with a performance of 79%, whereas the ensemble has 100% PIP with a performance of 99%. Fig. 5(c) shows PIP as a function of PMF, both for single classifiers (lower curves) and the ensembles (upper curves). As expected, PIP decreases with increasing PMF for both the ensemble and a single usable classifier. However, the decline in PIP is much steeper for single classifiers. In fact, there is virtually no drop (PIP = 100%) up until 20% PMF when the ensemble is used (regardless of nof). Hence the impact of using an ensemble is two-fold: the ensemble can process a larger percentage of instances with missing features, and it also provides a higher classification performance on those instances, compared to a single classifier.

Also note in Fig. 5(c) that the decline in PIP is much steeper with nof = 7 than it is for nof = 3, similar to the decline in classification performance shown in Fig. 5(a). Hence, our main observation regarding the trade-off one must accept by choosing a specific nof value is as follows: classifiers trained with a larger nof may achieve a higher generalization performance individually, or initially when fewer features are missing, but are able to classify fewer instances as the PMF increases. This observation can be explained as follows: using a large nof for training (e.g., 10 out of 13) means that the same large number of features (10) is required for testing. Therefore, fewer missing features (only 3, in this case) can be accommodated. In fact, the probability of no classifier being available for a given combination of missing features increases with nof. Of course, if all features are used for training, then no classifier will be able to process any instance with even a single missing feature; hence the original motivation of this study.




[Fig. 5: four panels plotted against % Missing Feature (PMF), 0–30%, for nof = 3, 4, 5, 6, 7 and variable nof: (a) Learn++.MF ensemble performance vs. mean imputation (M.I.) and E.M.; (b) single usable classifier performance, annotated with single-classifier PIP values at 20% PMF (e.g., 51% for nof = 3, 26% for nof = 7); (c) percent instances processed (PIP) for single classifiers and the ensemble; (d) percent classifiers usable.]

Fig. 5. Detailed analysis of the algorithm on the WINE database with respect to PMF and nof: (a) Learn++.MF ensemble performances compared to that of mean imputation, (b) single classifier performance, (c) single classifier and ensemble PIP and (d) percentage of usable classifiers in the ensemble.
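For reference, the mean-imputation baseline in Fig. 5(a) and Fig. 7(a) amounts to filling each missing value with that feature's training-set mean before classification. A minimal sketch (our own, assuming NaN sentinels) is:

```python
import numpy as np

def mean_impute(X_train, X_test_missing):
    """Replace NaN entries in the test data with per-feature training-set means."""
    feature_means = np.nanmean(X_train, axis=0)     # column-wise means from training data
    X_filled = X_test_missing.copy()
    rows, cols = np.where(np.isnan(X_filled))
    X_filled[rows, cols] = feature_means[cols]
    return X_filled
```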




Finally, Fig. 5(d) analyzes the average percentage of usable classifiers (PUC) per given instance, with respect to PMF. The decline in PUC also makes sense: there are fewer classifiers available to classify instances that have a higher PMF. Furthermore, this decline is also steeper for higher nof.

In summary, then, a larger nof usually provides a better individual and/or initial performance than a smaller nof, but the ensemble is more resistant to unusable feature combinations when trained with a smaller nof. How do we, then, choose the nof value, since we cannot know the PMF of future datasets? A simple solution is to use a variable number of features, typically in the range of 15–50% of the total number of features. We observe from Table 2 and Fig. 5 that such a variable selection provides surprisingly good performance: a good compromise that leaves few or no instances unprocessed.

As for the other parameter, T, the number of classifiers, the trade-off is simply a matter of computational resources: a larger PIP and PUC will be obtained if a larger number of classifiers is trained. Fig. 5(d) shows PUC for various values of PMF, indicating the actual numbers obtained for this database after the classifiers were trained. Families of curves generated using Eq. (15), such as those in Fig. 3, as well as Monte Carlo simulations, can provide us with these numbers before training the classifiers. Fig. 6 shows the results of averaging 1000 such Monte Carlo simulations for f = 13, nof = 3–7, and PMF in the [0–100]% range, which agree with the actual results of Fig. 5(d) (up to 30% PMF is shown in Fig. 5(d)).

4.3. Wisconsin breast cancer (WBC) database

The WBC database consists of 569 instances of cell biopsies classified as malignant or benign based on 30 features obtained from the individual cells. The expected best performance on this dataset (when no features are missing) is 97% [46]. Five nof values, 10, 12, 14, 16 and variable, were tested.

Table 3 summarizes the test performances using all values of nof, and the PIP values, similar to those of the Wine database in Table 2. The results with this database, which has a larger number of features, show much the same trends and characteristics as those mentioned for the Wine database. Specifically, the proximity of the ensemble performance obtained with nof = 10, 12, 14 and 16 features and no missing features to the 97% figure reported in the literature indicates that this database also has a redundant feature set. Also, as with the Wine dataset, the algorithm is able to handle up to 20% missing features with little or no (statistically) significant performance drop, and only a gradual decline thereafter. As expected, the PIP drops with increasing PMF, and the drop is steeper for larger nof values compared to smaller ones. Using a variable number of features provides a good trade-off. These characteristics can also be seen in Fig. 7, where we plot the ensemble performance, and the PIP for single and ensemble classifiers, as in Fig. 5 for the Wine database. Fig. 7(a) also shows that the ensemble outperforms a single classifier (of identical architecture) that uses mean imputation to replace the missing features.



[Fig. 6: expected percentage of usable classifiers in the ensemble (0–100%) plotted against the probability of any given feature being missing (0–1), for nof = 3, 4, 5, 6, 7 with f = 13.]

Fig. 6. Estimating the required ensemble size ahead of time.
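A Monte Carlo estimate of the kind summarized in Fig. 6 can be sketched as follows. This is our own illustrative code; the paper's Eq. (15) provides the corresponding analytical expression, which is not reproduced here.

```python
import numpy as np

def expected_puc(f, nof, rho, T=200, n_trials=1000, seed=0):
    """Estimate the expected percentage of usable classifiers (PUC) when each
    feature is missing independently with probability rho, for an ensemble of
    T classifiers each trained on a random subset of `nof` of `f` features."""
    rng = np.random.default_rng(seed)
    puc = []
    for _ in range(n_trials):
        missing = rng.random(f) < rho
        usable = sum(
            not missing[rng.choice(f, size=nof, replace=False)].any()
            for _ in range(T)
        )
        puc.append(100.0 * usable / T)
    return float(np.mean(puc))

# Example with assumed values, in the spirit of Fig. 6: f = 13, nof = 3, rho = 0.2
# print(expected_puc(13, 3, 0.2))
```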


[Fig. 7: two panels plotted against % Missing Feature (PMF), 0–30%, for nof = 10, 12, 14, 16 and variable: (a) Learn++.MF ensemble performance vs. mean imputation (M.I.); (b) percent of processed instances (PIP) for single classifiers and the ensemble.]

Fig. 7. Analysis of the algorithm on the WBC database with respect to PMF and nof: (a) Learn++.MF ensemble performances compared to that of mean imputation and (b) single classifier and ensemble PIP.




Table 3
Performance on the WBC database.

nof = 10 (out of 30)             nof = 12 (out of 30)             nof = 14 (out of 30)
PMF     Ensemble Perf      PIP   PMF     Ensemble Perf      PIP   PMF     Ensemble Perf      PIP
 0.00    95.00 ± 0.00      100    0.00    96.00 ± 0.00      100    0.00    95.00 ± 0.00      100
 2.50    95.20 ± 0.24      100    2.50    95.80 ± 0.18      100    2.50    94.75 ± 0.24      100
 5.00    95.05 ± 0.30      100    5.00    95.55 ± 0.41      100    5.00    94.80 ± 0.33      100
 7.50    95.15 ± 0.42      100    7.50    95.55 ± 0.11      100    7.50    94.55 ± 0.56      100
10.00    95.55 ± 0.30      100   10.00    95.34 ± 0.40      100   10.00    94.39 ± 0.31      100
15.00    95.10 ± 0.69      100   15.00    94.88 ± 0.50      100   15.00    93.84 ± 0.95       98
20.00    94.63 ± 0.77      100   20.00    93.08 ± 1.10       97   20.00    92.35 ± 1.08       91
25.00    94.53 ± 0.81       98   25.00    92.78 ± 0.74       91   25.00    91.42 ± 1.20       76
30.00    92.78 ± 1.39       92   30.00    89.87 ± 2.02       75   30.00    89.01 ± 1.73       55

nof = 16 (out of 30)             nof = variable (10–16 out of 30)
PMF     Ensemble Perf      PIP   PMF     Ensemble Perf      PIP
 0.00    95.00 ± 0.00      100    0.00    95.50 ± 0.00      100
 2.50    94.80 ± 0.29      100    2.50    95.45 ± 0.19      100
 5.00    94.75 ± 0.37      100    5.00    95.30 ± 0.43      100
 7.50    95.04 ± 0.52      100    7.50    95.30 ± 0.46      100
10.00    94.24 ± 0.48       99   10.00    95.15 ± 0.39      100
15.00    92.50 ± 0.98       94   15.00    94.79 ± 0.85      100
20.00    90.20 ± 1.10       79   20.00    94.90 ± 0.79       97
25.00    86.98 ± 2.45       56   25.00    91.61 ± 1.23       91
30.00    86.34 ± 3.17       34   30.00    91.13 ± 1.32       78




Más contenido relacionado

La actualidad más candente

In data streams using classification and clustering
In data streams using classification and clusteringIn data streams using classification and clustering
In data streams using classification and clusteringeSAT Publishing House
 
An Optimal Approach For Knowledge Protection In Structured Frequent Patterns
An Optimal Approach For Knowledge Protection In Structured Frequent PatternsAn Optimal Approach For Knowledge Protection In Structured Frequent Patterns
An Optimal Approach For Knowledge Protection In Structured Frequent PatternsWaqas Tariq
 
Decision Tree - C4.5&CART
Decision Tree - C4.5&CARTDecision Tree - C4.5&CART
Decision Tree - C4.5&CARTXueping Peng
 
Efficient classification of big data using vfdt (very fast decision tree)
Efficient classification of big data using vfdt (very fast decision tree)Efficient classification of big data using vfdt (very fast decision tree)
Efficient classification of big data using vfdt (very fast decision tree)eSAT Journals
 
Dimensionality reduction by matrix factorization using concept lattice in dat...
Dimensionality reduction by matrix factorization using concept lattice in dat...Dimensionality reduction by matrix factorization using concept lattice in dat...
Dimensionality reduction by matrix factorization using concept lattice in dat...eSAT Journals
 
Decision tree lecture 3
Decision tree lecture 3Decision tree lecture 3
Decision tree lecture 3Laila Fatehy
 
Classifiers
ClassifiersClassifiers
ClassifiersAyurdata
 
論文サーベイ(Sasaki)
論文サーベイ(Sasaki)論文サーベイ(Sasaki)
論文サーベイ(Sasaki)Hajime Sasaki
 
Feature Subset Selection for High Dimensional Data using Clustering Techniques
Feature Subset Selection for High Dimensional Data using Clustering TechniquesFeature Subset Selection for High Dimensional Data using Clustering Techniques
Feature Subset Selection for High Dimensional Data using Clustering TechniquesIRJET Journal
 
Ijartes v1-i2-006
Ijartes v1-i2-006Ijartes v1-i2-006
Ijartes v1-i2-006IJARTES
 
Customer Segmentation with R - Deep Dive into flexclust
Customer Segmentation with R - Deep Dive into flexclustCustomer Segmentation with R - Deep Dive into flexclust
Customer Segmentation with R - Deep Dive into flexclustJim Porzak
 
Improving Tree augmented Naive Bayes for class probability estimation
Improving Tree augmented Naive Bayes for class probability estimationImproving Tree augmented Naive Bayes for class probability estimation
Improving Tree augmented Naive Bayes for class probability estimationBeat Winehouse
 
Observations
ObservationsObservations
Observationsbutest
 
Figure 1
Figure 1Figure 1
Figure 1butest
 
15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learning15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learningAnil Yadav
 
Dat 305 dat305 dat 305 education for service uopstudy.com
Dat 305 dat305 dat 305 education for service   uopstudy.comDat 305 dat305 dat 305 education for service   uopstudy.com
Dat 305 dat305 dat 305 education for service uopstudy.comNewUOPCourse
 
Protecting Attribute Disclosure for High Dimensionality and Preserving Publis...
Protecting Attribute Disclosure for High Dimensionality and Preserving Publis...Protecting Attribute Disclosure for High Dimensionality and Preserving Publis...
Protecting Attribute Disclosure for High Dimensionality and Preserving Publis...IOSR Journals
 
Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests Derek Kane
 

La actualidad más candente (20)

In data streams using classification and clustering
In data streams using classification and clusteringIn data streams using classification and clustering
In data streams using classification and clustering
 
An Optimal Approach For Knowledge Protection In Structured Frequent Patterns
An Optimal Approach For Knowledge Protection In Structured Frequent PatternsAn Optimal Approach For Knowledge Protection In Structured Frequent Patterns
An Optimal Approach For Knowledge Protection In Structured Frequent Patterns
 
Decision Tree - C4.5&CART
Decision Tree - C4.5&CARTDecision Tree - C4.5&CART
Decision Tree - C4.5&CART
 
169 s170
169 s170169 s170
169 s170
 
Efficient classification of big data using vfdt (very fast decision tree)
Efficient classification of big data using vfdt (very fast decision tree)Efficient classification of big data using vfdt (very fast decision tree)
Efficient classification of big data using vfdt (very fast decision tree)
 
Dimensionality reduction by matrix factorization using concept lattice in dat...
Dimensionality reduction by matrix factorization using concept lattice in dat...Dimensionality reduction by matrix factorization using concept lattice in dat...
Dimensionality reduction by matrix factorization using concept lattice in dat...
 
Decision tree lecture 3
Decision tree lecture 3Decision tree lecture 3
Decision tree lecture 3
 
Classifiers
ClassifiersClassifiers
Classifiers
 
論文サーベイ(Sasaki)
論文サーベイ(Sasaki)論文サーベイ(Sasaki)
論文サーベイ(Sasaki)
 
Feature Subset Selection for High Dimensional Data using Clustering Techniques
Feature Subset Selection for High Dimensional Data using Clustering TechniquesFeature Subset Selection for High Dimensional Data using Clustering Techniques
Feature Subset Selection for High Dimensional Data using Clustering Techniques
 
Ijartes v1-i2-006
Ijartes v1-i2-006Ijartes v1-i2-006
Ijartes v1-i2-006
 
Classification
ClassificationClassification
Classification
 
Customer Segmentation with R - Deep Dive into flexclust
Customer Segmentation with R - Deep Dive into flexclustCustomer Segmentation with R - Deep Dive into flexclust
Customer Segmentation with R - Deep Dive into flexclust
 
Improving Tree augmented Naive Bayes for class probability estimation
Improving Tree augmented Naive Bayes for class probability estimationImproving Tree augmented Naive Bayes for class probability estimation
Improving Tree augmented Naive Bayes for class probability estimation
 
Observations
ObservationsObservations
Observations
 
Figure 1
Figure 1Figure 1
Figure 1
 
15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learning15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learning
 
Dat 305 dat305 dat 305 education for service uopstudy.com
Dat 305 dat305 dat 305 education for service   uopstudy.comDat 305 dat305 dat 305 education for service   uopstudy.com
Dat 305 dat305 dat 305 education for service uopstudy.com
 
Protecting Attribute Disclosure for High Dimensionality and Preserving Publis...
Protecting Attribute Disclosure for High Dimensionality and Preserving Publis...Protecting Attribute Disclosure for High Dimensionality and Preserving Publis...
Protecting Attribute Disclosure for High Dimensionality and Preserving Publis...
 
Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests
 

Destacado

Hospira power point template 1
Hospira power point template 1Hospira power point template 1
Hospira power point template 1Jalaynea Cooper
 
Mags And Publications
Mags And PublicationsMags And Publications
Mags And PublicationsDorotea444
 
Living things
Living thingsLiving things
Living thingsW0064577
 

Destacado (7)

Lesson 2
Lesson 2Lesson 2
Lesson 2
 
Lesson 3
Lesson 3Lesson 3
Lesson 3
 
Artigo 2004 enapg243
Artigo   2004 enapg243Artigo   2004 enapg243
Artigo 2004 enapg243
 
Hospira power point template 1
Hospira power point template 1Hospira power point template 1
Hospira power point template 1
 
Mags And Publications
Mags And PublicationsMags And Publications
Mags And Publications
 
Living things
Living thingsLiving things
Living things
 
Lesson 1
Lesson 1Lesson 1
Lesson 1
 

Similar a Polikar10missing

Front End Data Cleaning And Transformation In Standard Printed Form Using Neu...
Front End Data Cleaning And Transformation In Standard Printed Form Using Neu...Front End Data Cleaning And Transformation In Standard Printed Form Using Neu...
Front End Data Cleaning And Transformation In Standard Printed Form Using Neu...ijcsa
 
SURVEY PAPER ON OUT LIER DETECTION USING FUZZY LOGIC BASED METHOD
SURVEY PAPER ON OUT LIER DETECTION USING FUZZY LOGIC BASED METHODSURVEY PAPER ON OUT LIER DETECTION USING FUZZY LOGIC BASED METHOD
SURVEY PAPER ON OUT LIER DETECTION USING FUZZY LOGIC BASED METHODIJCI JOURNAL
 
Data Analytics on Solar Energy Using Hadoop
Data Analytics on Solar Energy Using HadoopData Analytics on Solar Energy Using Hadoop
Data Analytics on Solar Energy Using HadoopIJMERJOURNAL
 
EFFICIENT FEATURE SUBSET SELECTION MODEL FOR HIGH DIMENSIONAL DATA
EFFICIENT FEATURE SUBSET SELECTION MODEL FOR HIGH DIMENSIONAL DATAEFFICIENT FEATURE SUBSET SELECTION MODEL FOR HIGH DIMENSIONAL DATA
EFFICIENT FEATURE SUBSET SELECTION MODEL FOR HIGH DIMENSIONAL DATAIJCI JOURNAL
 
SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMS
SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMSSCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMS
SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMSijdkp
 
A MIXTURE MODEL OF HUBNESS AND PCA FOR DETECTION OF PROJECTED OUTLIERS
A MIXTURE MODEL OF HUBNESS AND PCA FOR DETECTION OF PROJECTED OUTLIERSA MIXTURE MODEL OF HUBNESS AND PCA FOR DETECTION OF PROJECTED OUTLIERS
A MIXTURE MODEL OF HUBNESS AND PCA FOR DETECTION OF PROJECTED OUTLIERSZac Darcy
 
A Mixture Model of Hubness and PCA for Detection of Projected Outliers
A Mixture Model of Hubness and PCA for Detection of Projected OutliersA Mixture Model of Hubness and PCA for Detection of Projected Outliers
A Mixture Model of Hubness and PCA for Detection of Projected OutliersZac Darcy
 
Model Evaluation in the land of Deep Learning
Model Evaluation in the land of Deep LearningModel Evaluation in the land of Deep Learning
Model Evaluation in the land of Deep LearningPramit Choudhary
 
Volume 2-issue-6-1930-1932
Volume 2-issue-6-1930-1932Volume 2-issue-6-1930-1932
Volume 2-issue-6-1930-1932Editor IJARCET
 
Volume 2-issue-6-1930-1932
Volume 2-issue-6-1930-1932Volume 2-issue-6-1930-1932
Volume 2-issue-6-1930-1932Editor IJARCET
 
Data clustering using kernel based
Data clustering using kernel basedData clustering using kernel based
Data clustering using kernel basedIJITCA Journal
 
Fault detection of imbalanced data using incremental clustering
Fault detection of imbalanced data using incremental clusteringFault detection of imbalanced data using incremental clustering
Fault detection of imbalanced data using incremental clusteringIRJET Journal
 
IRJET- Survey of Feature Selection based on Ant Colony
IRJET- Survey of Feature Selection based on Ant ColonyIRJET- Survey of Feature Selection based on Ant Colony
IRJET- Survey of Feature Selection based on Ant ColonyIRJET Journal
 
Confiscation of Duplicate Tuples in The Relational Databases
Confiscation of Duplicate Tuples in The Relational DatabasesConfiscation of Duplicate Tuples in The Relational Databases
Confiscation of Duplicate Tuples in The Relational DatabasesIJERA Editor
 
A h k clustering algorithm for high dimensional data using ensemble learning
A h k clustering algorithm for high dimensional data using ensemble learningA h k clustering algorithm for high dimensional data using ensemble learning
A h k clustering algorithm for high dimensional data using ensemble learningijitcs
 
IRJET- Probability based Missing Value Imputation Method and its Analysis
IRJET- Probability based Missing Value Imputation Method and its AnalysisIRJET- Probability based Missing Value Imputation Method and its Analysis
IRJET- Probability based Missing Value Imputation Method and its AnalysisIRJET Journal
 

Similar a Polikar10missing (20)

Front End Data Cleaning And Transformation In Standard Printed Form Using Neu...
Front End Data Cleaning And Transformation In Standard Printed Form Using Neu...Front End Data Cleaning And Transformation In Standard Printed Form Using Neu...
Front End Data Cleaning And Transformation In Standard Printed Form Using Neu...
 
SURVEY PAPER ON OUT LIER DETECTION USING FUZZY LOGIC BASED METHOD
SURVEY PAPER ON OUT LIER DETECTION USING FUZZY LOGIC BASED METHODSURVEY PAPER ON OUT LIER DETECTION USING FUZZY LOGIC BASED METHOD
SURVEY PAPER ON OUT LIER DETECTION USING FUZZY LOGIC BASED METHOD
 
Data Analytics on Solar Energy Using Hadoop
Data Analytics on Solar Energy Using HadoopData Analytics on Solar Energy Using Hadoop
Data Analytics on Solar Energy Using Hadoop
 
EFFICIENT FEATURE SUBSET SELECTION MODEL FOR HIGH DIMENSIONAL DATA
EFFICIENT FEATURE SUBSET SELECTION MODEL FOR HIGH DIMENSIONAL DATAEFFICIENT FEATURE SUBSET SELECTION MODEL FOR HIGH DIMENSIONAL DATA
EFFICIENT FEATURE SUBSET SELECTION MODEL FOR HIGH DIMENSIONAL DATA
 
SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMS
SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMSSCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMS
SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMS
 
A MIXTURE MODEL OF HUBNESS AND PCA FOR DETECTION OF PROJECTED OUTLIERS
A MIXTURE MODEL OF HUBNESS AND PCA FOR DETECTION OF PROJECTED OUTLIERSA MIXTURE MODEL OF HUBNESS AND PCA FOR DETECTION OF PROJECTED OUTLIERS
A MIXTURE MODEL OF HUBNESS AND PCA FOR DETECTION OF PROJECTED OUTLIERS
 
A Mixture Model of Hubness and PCA for Detection of Projected Outliers
A Mixture Model of Hubness and PCA for Detection of Projected OutliersA Mixture Model of Hubness and PCA for Detection of Projected Outliers
A Mixture Model of Hubness and PCA for Detection of Projected Outliers
 
395 404
395 404395 404
395 404
 
Model Evaluation in the land of Deep Learning
Model Evaluation in the land of Deep LearningModel Evaluation in the land of Deep Learning
Model Evaluation in the land of Deep Learning
 
Volume 2-issue-6-1930-1932
Volume 2-issue-6-1930-1932Volume 2-issue-6-1930-1932
Volume 2-issue-6-1930-1932
 
Volume 2-issue-6-1930-1932
Volume 2-issue-6-1930-1932Volume 2-issue-6-1930-1932
Volume 2-issue-6-1930-1932
 
Data clustering using kernel based
Data clustering using kernel basedData clustering using kernel based
Data clustering using kernel based
 
Fault detection of imbalanced data using incremental clustering
Fault detection of imbalanced data using incremental clusteringFault detection of imbalanced data using incremental clustering
Fault detection of imbalanced data using incremental clustering
 
G44093135
G44093135G44093135
G44093135
 
Bi4101343346
Bi4101343346Bi4101343346
Bi4101343346
 
IRJET- Survey of Feature Selection based on Ant Colony
IRJET- Survey of Feature Selection based on Ant ColonyIRJET- Survey of Feature Selection based on Ant Colony
IRJET- Survey of Feature Selection based on Ant Colony
 
Confiscation of Duplicate Tuples in The Relational Databases
Confiscation of Duplicate Tuples in The Relational DatabasesConfiscation of Duplicate Tuples in The Relational Databases
Confiscation of Duplicate Tuples in The Relational Databases
 
A h k clustering algorithm for high dimensional data using ensemble learning
A h k clustering algorithm for high dimensional data using ensemble learningA h k clustering algorithm for high dimensional data using ensemble learning
A h k clustering algorithm for high dimensional data using ensemble learning
 
IRJET- Probability based Missing Value Imputation Method and its Analysis
IRJET- Probability based Missing Value Imputation Method and its AnalysisIRJET- Probability based Missing Value Imputation Method and its Analysis
IRJET- Probability based Missing Value Imputation Method and its Analysis
 
B0930610
B0930610B0930610
B0930610
 

Último

Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Generative AI for Technical Writer or Information Developers

The integrity and completeness of data are essential for any classification algorithm. After all, a trained classifier – unless specifically designed to address this issue – cannot process instances with missing features, as the missing value(s) in the input vectors would make the matrix operations involved in data processing impossible. To obtain a valid classification, the data to be classified should be complete, with no missing features (henceforth, we use missing data and missing features interchangeably). Missing data in real-world applications are not uncommon: bad sensors, failed pixels, unanswered questions in surveys, malfunctioning equipment, and medical tests that cannot be administered under certain conditions are all familiar scenarios that can result in missing features. Feature values that lie beyond the expected dynamic range of the data due to extreme noise, signal saturation, data corruption, etc. can also be treated as missing. Furthermore, if the entire dataset is not acquired under identical conditions (time/location, equipment, etc.), different data instances may be missing different features.

Having such a large proportion of randomly varying missing features (Fig. 1) may be viewed as an extreme and unlikely scenario, warranting reacquisition of the entire dataset. However, data reacquisition is often expensive, impractical, or sometimes even impossible, justifying the need for an alternative practical solution. The classification algorithm described in this paper is designed to provide such a solution, accommodating missing features subject to the condition of distributed redundancy (discussed in Section 3), which is satisfied surprisingly often in practice.

[Fig. 1. Handwritten character recognition as an illustration of the missing feature problem addressed in this study: a large proportion of feature values are missing (shaded), and the missing features vary randomly from one instance to another (the actual characters are the numbers 9, 7 and 6).]

1.2. Current techniques for accommodating missing data

The simplest approach for dealing with missing data is to ignore the instances with missing attributes. Commonly referred to as filtering or list-wise deletion, such techniques are clearly suboptimal when a large portion of the data have missing attributes [1], and are of course infeasible if every instance is missing at least one feature. A more pragmatic approach commonly used to accommodate missing data is imputation [2–5]: substitute the missing value with a meaningful estimate. Traditional examples of this approach include replacing the missing value with one of the existing data points (the most similar by some measure), as in hot-deck imputation [2,6,7], with the mean of that attribute across the data, or with the mean of its k-nearest neighbors [4].
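To make the imputation baseline concrete, the following minimal NumPy sketch (our own illustration, not part of the original study) fills missing entries – marked here with NaN – with the per-feature mean computed from the observed values; the toy matrix X and the NaN convention are assumptions made only for this example.

    import numpy as np

    # Toy data matrix with missing entries marked as NaN (illustrative values only).
    X = np.array([[5.1, 3.5, np.nan],
                  [4.9, np.nan, 1.4],
                  [6.2, 2.9, 4.3]])

    # Mean imputation: replace each missing value with the mean of its feature (column),
    # computed over the observed entries of that column.
    col_means = np.nanmean(X, axis=0)
    X_imputed = np.where(np.isnan(X), col_means, X)
    print(X_imputed)

A hot-deck variant would instead copy the value from the most similar complete instance, and k-NN imputation would average the values of the k most similar instances.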
In order for the estimate to be a faithful representative of the missing value, however, the training data must be sufficiently dense, a requirement rarely satisfied for datasets with even a modest number of features. Furthermore, imputation methods are known to produce biased estimates as the proportion of missing data increases. A related, but perhaps more rigorous, approach is to use polynomial regression to estimate substitutes for missing values [8]. However, regression techniques – besides being difficult to implement in high dimensions – assume that the data can reasonably be fit to a particular polynomial, which may not hold for many applications.

Theoretically rigorous methods with provable performance guarantees have also been developed. Many of these methods rely on model based estimation, such as Bayesian estimation [9–11], which calculates posterior probabilities by integrating over the missing portions of the feature space. Such an approach also requires a sufficiently dense data distribution; but more importantly, a prior distribution for all unknown parameters must be specified, which requires prior knowledge. Such knowledge is typically vague or non-existent, and inaccurate choices often lead to inferior performance.

An alternative strategy in model based methods is the Expectation Maximization (EM) algorithm [12–14,10], justly admired for its theoretical elegance. EM consists of two steps, the expectation (E) and the maximization (M) step, which iteratively maximize the expectation of the log-likelihood of the complete data, conditioned on the observed data. Conceptually, these steps are easy to construct, and the range of problems that can be handled by EM is quite broad. However, there are two potential drawbacks: (i) convergence can be slow if large portions of the data are missing; and (ii) the M step may be quite difficult if a closed form of the distribution is unknown, or if different instances are missing different features. In such cases, the theoretical simplicity of EM does not translate into practice [2]. EM also requires prior knowledge that is often unavailable, or estimation of an unknown underlying distribution, which may be computationally prohibitive for large dimensional datasets. Incorrect distribution estimation often leads to inconsistent results, whereas lack of sufficiently dense data typically causes loss of accuracy. Several variations have been proposed, such as using Gaussian mixture models [15,16] or Expected Conditional Maximization, to mitigate some of these difficulties [2].

Neural network based approaches have also been proposed. Gupta and Lam [17] looked at weight decay on datasets with missing values, whereas Yoon and Lee [18] proposed the Training-EStimation-Training (TEST) algorithm, which predicts the actual target value from a reasonably well estimated imputed one. Other approaches use neuro-fuzzy algorithms [19], where unknown values of the data are either estimated (or a classification is made using the existing features) by calculating the fuzzy membership of the data point to its nearest neighbors, clusters or hyperboxes. The parameters of the clusters and hyperboxes are determined from the existing data. Algorithms based on the general fuzzy min–max neural networks [20] or ARTMAP and fuzzy c-means clustering [21] are examples of this approach.

More recently, ensemble based approaches have also been proposed. For example, Melville et al. [22] showed that the algorithm DECORATE, which generates artificial data (with no missing values) from existing data (with missing values), is quite robust to missing data. On the other hand, Juszczak and Duin [23] proposed combining an ensemble of one-class classifiers, each trained on a single feature. This approach is capable of handling any combination of missing features, with the fewest number of classifiers possible. The approach can be very effective as long as single features can reasonably estimate the underlying decision boundaries, which is not always plausible.

In this contribution, we propose an alternative strategy, called Learn++.MF. It is inspired in part by our previously introduced ensemble-of-classifiers based incremental learning algorithm, Learn++, and in part by the random subspace method (RSM). In essence, Learn++.MF generates a sufficient number of classifiers, each trained on a random subset of features. An instance with one or more missing attributes is then classified by majority voting of those classifiers that did not use the missing features during their training. Hence, this approach differs in one fundamental aspect from the other techniques: Learn++.MF does not try to estimate the values of the missing data; instead, it tries to extract the most discriminatory classification information provided by the existing data, by taking advantage of a presumed redundancy in the feature set. Learn++.MF thus avoids many of the pitfalls of estimation and imputation based techniques.

As in most missing feature approaches, we assume that the probability of a feature being missing is independent of the value of that or any other feature in the dataset. This model is referred to as missing completely at random (MCAR) [2]. Hence, given the dataset X = (x_ij), where x_ij represents the jth feature of instance x_i, and a missing data indicator matrix M = (M_ij), where M_ij is 1 if x_ij is missing and 0 otherwise, we assume that p(M|X) = p(M). This is the most restrictive mechanism that generates missing data, the one that provides us with no additional information. However, it is also the only case in which list-wise deletion or most imputation approaches lead to no bias.

In the rest of this paper, we first provide a review of ensemble systems, followed by the description of the Learn++.MF algorithm, and provide a theoretical analysis to guide the choice of its free parameters: the number of features used to train each classifier, and the ensemble size. We then present results, analyze algorithm performance with respect to its free parameters, and compare performance metrics with theoretically expected values. We also compare the performance of Learn++.MF to that of a single classifier using mean imputation, to naïve Bayes, which can naturally handle missing features, to an intermediate approach that combines RSM with mean imputation, and to the one-class ensemble approach of [23]. We conclude with discussions and remarks.
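The MCAR assumption above is easy to simulate: each entry of the data matrix is deleted independently with the same probability, regardless of its value. A brief NumPy sketch (our own illustration; the matrix X and the choice rho = 0.2 are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    rho = 0.2                           # probability that any given feature value is missing
    X = rng.normal(size=(100, 13))      # a hypothetical 100 x 13 data matrix

    # Under MCAR the indicator matrix M is independent of X, i.e. p(M | X) = p(M):
    # each entry is deleted with the same probability rho regardless of its value.
    M = rng.random(X.shape) < rho       # M[i, j] == True marks x_ij as missing
    X_missing = np.where(M, np.nan, X)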
2. Background: ensemble and random subspace methods

An ensemble-based system is obtained by combining diverse classifiers. The diversity in the classifiers, typically achieved by using different training parameters for each classifier, allows individual classifiers to generate different decision boundaries [24–27]. If proper diversity is achieved, a different error is made by each classifier, strategic combination of which can then reduce the total error. Starting in the early 1990s, with such seminal works as [28–30,13,31–33], ensemble systems have since become an important research area in machine learning, whose recent reviews can be found in [34–36].

The diversity requirement can be achieved in several ways [24–27]: training individual classifiers using different (i) training data (sub)sets; (ii) parameters of a given classifier architecture; (iii) classifier models; or (iv) subsets of features. The last one, also known as the random subspace method (RSM), was originally proposed by Ho [37] for constructing an ensemble of decision trees (decision forests). In RSM, classifiers are trained using different random subsets of the features, allowing classifiers to err in different sub-domains of the feature space. Skurichina points out that RSM works particularly well when the database provides redundant information that is "dispersed" across all features [38].

The RSM has several desirable attributes: (i) working in a reduced dimensional space reduces the adverse consequences of the curse of dimensionality; (ii) RSM based ensemble classifiers may provide improved classification performance, because the synergistic classification capacity of a diverse set of classifiers compensates for any discriminatory information lost by choosing a smaller feature subset; (iii) implementing RSM is quite straightforward; and (iv) it provides a stochastic and faster alternative to optimal-feature-subset search algorithms. RSM approaches have been well researched for improving diversity, with well-established benefits for classification applications [39], regression problems [40] and optimal feature selection applications [38,41]. However, the feasibility of an RSM based approach has not been explored for the missing feature problem, and hence constitutes the focus of this paper.

Finally, a word on the algorithm name: Learn++ was developed by reconfiguring AdaBoost [33] to incrementally learn from new data that may later become available. Learn++ generates an ensemble of classifiers for each new database, and combines them through weighted majority voting. Learn++ draws its training data from a distribution, iteratively updated to force the algorithm to learn novel information not yet learned by the ensemble [42,43]. Learn++.MF combines the distribution update concepts of Learn++ with the random feature selection of RSM, to provide a novel solution to the missing feature problem.

3. Approach

3.1. Assumptions and targeted applications of the proposed approach

The essence of the proposed approach is to generate a sufficiently large number of classifiers, each trained with a random subset of features. When an instance x with missing feature(s) needs to be classified, only those classifiers trained with the features that are presently available in x are used for the classification. As such, Learn++.MF follows an alternative paradigm for the solution of the missing feature problem: the algorithm tries to extract most of the information from the available features, instead of trying to estimate, or impute, the values of the missing ones. Hence, Learn++.MF does not require very dense training data (though such data are certainly useful); it does not need to know or estimate the underlying distributions; and it therefore does not suffer from the adverse consequences of a potentially incorrect estimate.

Learn++.MF makes two assumptions. First, the feature set must be redundant, that is, the classification problem is solvable with unknown subset(s) of the available features (of course, if the identities of the redundant features were known, they would have already been excluded from the feature set). Second, the redundancy is distributed randomly over the feature set (hence, time series data would not be well suited for this approach). These assumptions are primarily due to the random nature of feature selection, and are shared with all RSM based approaches. We combine these two assumptions under the name distributed redundancy. Such datasets are not uncommon in practice (the scenario in Fig. 1, sensor applications, etc.), and it is these applications for which Learn++.MF is designed, and expected to be most effective.

3.2. Algorithm Learn++.MF

A trivial approach is to create a classifier for each of the 2^f − 1 possible non-empty subsets of the f features. Then, for any instance with missing features, one would use those classifiers whose training feature set did not include the missing ones. Such an approach, perfectly suitable for datasets with few features, has in fact been previously proposed [44]. For large feature sets, this exhaustive approach is infeasible. Besides, the probability of a particular feature combination being missing also diminishes exponentially as the number of features increases, so trying to account for every possible combination is inefficient. At the other end of the spectrum is Juszczak and Duin's approach [23], using f × c one-class classifiers, one for each feature and each of the c classes. This approach requires the fewest number of classifiers that can handle any feature combination, but comes at the cost of a potential performance drop due to disregarding any existing relationship between the features, as well as the information provided by the other existing features.

Learn++.MF offers a strategic compromise: it trains an ensemble of classifiers, each with a subset of the features randomly drawn from a feature distribution, which is iteratively updated to favor selection of those features that were previously undersampled. This allows Learn++.MF to achieve classification performances with little or no loss (compared to fully intact data) even when large portions of the data are missing. Without any loss of generality, we assume that the training dataset is complete. Otherwise, a sentinel can be used to flag the missing features, and the classifiers are then trained on a random selection of the existing features. For brevity, we focus on the more critical case of field data containing missing features.
[Fig. 2. Pseudocode of algorithm Learn++.MF.]

The pseudocode of Learn++.MF is provided in Fig. 2. The inputs to the algorithm are the training data S = {(x_i, y_i), i = 1,...,N}; the feature subset cardinality, i.e., the number of features, nof, used to train each classifier; a supervised classification algorithm (BaseClassifier); the number of classifiers to be created, T; and a sentinel value sen to designate a missing feature. At each iteration t, the algorithm creates one additional classifier C_t. The features used for training C_t are drawn from an iteratively updated distribution D_t that promotes further diversity with respect to the feature combinations. D_1 is initialized to be uniform, hence each feature is equally likely to be used in training classifier C_1 (step 1):

$$D_1(j) = \frac{1}{f}, \quad j = 1,\dots,f \qquad (1)$$

where f is the total number of features. For each iteration t = 1,...,T, the distribution D_t \in R^f that was updated in the previous iteration is first normalized to ensure that D_t stays a proper distribution:

$$D_t = D_t \Big/ \sum_{j=1}^{f} D_t(j) \qquad (2)$$

A random sample of nof features is drawn, without replacement, according to D_t, and the indices of these features are placed in a list, called F_selection(t) \in R^{nof} (step 2). This list keeps track of which features have been used for each classifier, so that the appropriate classifiers can be called during testing. Classifier C_t is trained (step 3) using the features in F_selection(t), and evaluated on S (step 4) to obtain

$$\mathrm{Perf}_t = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\big[\, C_t(x_i) = y_i \,\big] \qquad (3)$$

where \mathbf{1}[\cdot] evaluates to 1 if the predicate is true. We require C_t to perform better than some minimum threshold, such as 1/2 as in AdaBoost, or 1/c (better than a random guess for reasonably well-balanced c-class data), on its training data to ensure that it has a meaningful classification capacity. The distribution D_t is then updated (step 5) according to

$$D_{t+1}\big(F_{selection}(t)\big) = \beta \cdot D_t\big(F_{selection}(t)\big), \quad 0 < \beta \le 1 \qquad (4)$$

such that the weights of the features in the current F_selection(t) are reduced by a factor of \beta. Features not in F_selection(t) then effectively receive higher weights when D_t is normalized in the next iteration, which gives previously undersampled features a better chance to be selected into the next feature subset. Choosing \beta = 1 results in a feature distribution that remains uniform throughout the experiment. We recommend \beta = nof/f, which reduces the likelihood of previously selected features being selected in the next iteration. In our preliminary efforts, we obtained better performances by using \beta = nof/f over other choices such as \beta = 1 or \beta = 1/f, though not always with statistically significant margins [45].

During classification of a new instance z, missing (or known to be corrupt) values are first replaced by the sentinel sen (chosen as a value not expected to occur in the data). Then, the features with the value sen are identified and placed in M(z), the set of missing features:

$$M(z) = \big\{\, j : z(j) = sen,\ j = 1,\dots,f \,\big\} \qquad (5)$$

Finally, all classifiers C_t whose feature list F_selection(t) does not include the features listed in M(z) are combined through majority voting to obtain the ensemble classification of z:

$$E(z) = \arg\max_{y \in Y} \sum_{t:\,C_t(z)=y} \mathbf{1}\big[\, M(z) \cap F_{selection}(t) = \varnothing \,\big] \qquad (6)$$
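The training loop and voting rule in Eqs. (1)–(6) can be summarized in a short Python sketch. This is our own simplified rendering of Fig. 2, not the authors' code: the base learner (a scikit-learn decision tree), the use of NaN as the sentinel, and the function names are illustrative assumptions, and the step-4 performance check is omitted.

    import numpy as np
    from collections import Counter
    from sklearn.tree import DecisionTreeClassifier

    def train_learnpp_mf(X_train, y_train, nof, T, beta=None, rng=None):
        """Train T classifiers, each on nof features drawn from the distribution D_t."""
        rng = rng if rng is not None else np.random.default_rng(0)
        f = X_train.shape[1]
        beta = beta if beta is not None else nof / f        # recommended choice: beta = nof / f
        D = np.ones(f) / f                                  # Eq. (1): uniform initial distribution
        ensemble, feature_lists = [], []
        for t in range(T):
            D = D / D.sum()                                 # Eq. (2): renormalize D_t
            fsel = rng.choice(f, size=nof, replace=False, p=D)              # step 2: draw nof indices
            clf = DecisionTreeClassifier().fit(X_train[:, fsel], y_train)   # step 3: train C_t
            # Step 4 (discard C_t if its training accuracy is below 1/c) is omitted for brevity.
            ensemble.append(clf)
            feature_lists.append(fsel)                      # keep indices in training order
            D[fsel] *= beta                                 # Eq. (4): down-weight the used features
        return ensemble, feature_lists

    def classify_learnpp_mf(z, ensemble, feature_lists):
        """Eqs. (5)-(6): majority vote of the classifiers that used none of the missing features."""
        missing = set(np.flatnonzero(np.isnan(z)).tolist())  # M(z); NaN plays the role of the sentinel
        votes = []
        for clf, fsel in zip(ensemble, feature_lists):
            if missing.isdisjoint(fsel.tolist()):            # usable only if M(z) and Fselection(t) are disjoint
                votes.append(clf.predict(z[fsel].reshape(1, -1))[0])
        return Counter(votes).most_common(1)[0][0] if votes else None  # None: no usable classifier

Note that the feature indices are stored in the order in which they were drawn, so that each classifier sees its columns in the same order at training and test time.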
3.3. Choosing algorithm parameters

Considering that the feature subsets for each classifier are chosen randomly, and that we wish to use far fewer than the 2^f − 1 classifiers necessary to accommodate every feature subset combination, it is possible that a particular feature combination required to accommodate a specific set of missing features might not have been sampled by the algorithm. Such an instance cannot be processed by Learn++.MF. It is then fair to ask how often Learn++.MF will not be able to classify a given instance. Formally, given that Learn++.MF needs at least one useable classifier that can handle the current missing feature combination, what is the probability that there will be at least one classifier – in an ensemble of T classifiers – that will be able to classify an instance with m of its f features missing, if each classifier is trained on nof features? A formal analysis to answer this question also allows us to determine the relationship among these parameters, and hence can guide us in appropriately selecting the algorithm's free parameters, nof and T.

A classifier C_t is only useable if none of the nof features selected for its training data matches one of those missing in x. Then, what is the probability of finding exactly such a classifier? Without any loss of generality, assume that the m missing features are fixed, and that we are choosing nof features to train C_t. The probability of choosing the first feature – among f features – such that it does not coincide with any of the m missing ones is (f−m)/f. The probability of choosing the second such feature is (f−m−1)/(f−1), and similarly the probability of choosing the (nof)th such feature is (f−m−(nof−1))/(f−(nof−1)). Then the probability of finding a useable classifier C_t for x is

$$p = \frac{f-m}{f}\cdot\frac{f-m-1}{f-1}\cdot\frac{f-m-2}{f-2}\cdots\frac{f-m-(nof-1)}{f-(nof-1)} = \prod_{i=0}^{nof-1}\left(1-\frac{m}{f-i}\right) \qquad (7)$$

Note that this is exactly the scenario described by the hypergeometric (HG) distribution: we have f features, m of which are missing; we then randomly choose nof of these f features. The probability that exactly k of the selected nof features will be among the m missing ones is

$$p \sim HG(k;\, f, m, nof) = \binom{m}{k}\binom{f-m}{nof-k} \Big/ \binom{f}{nof} \qquad (8)$$

We need k = 0, i.e., none of the selected features is one of the missing ones. Then, the desired probability is

$$p \sim HG(k=0;\, f, m, nof) = \binom{f-m}{nof} \Big/ \binom{f}{nof} = \frac{(f-m)!}{(f-m-nof)!}\cdot\frac{(f-nof)!}{f!} \qquad (9)$$

Eqs. (7) and (9) are equivalent. However, Eq. (7) is preferable, since it does not use factorials, and hence will not result in numerical instabilities for large values of f. If we have T such classifiers, each trained on a randomly selected nof features, what is the probability that at least one will be useable, i.e., that at least one of them was trained with a subset of nof features that does not include any of the m missing features? The probability that a random single classifier is not useable is (1−p). Since the feature selections are independent (specifically for \beta = 1), the probability of not finding any single useable classifier among all T classifiers is (1−p)^T. Hence, the probability of finding at least one useable classifier is

$$P = 1-(1-p)^T = 1-\left(1-\prod_{i=0}^{nof-1}\left(1-\frac{m}{f-i}\right)\right)^{T} \qquad (10)$$

If we know how many features are missing, Eq. (10) is exact. However, we usually do not know exactly how many features will be missing. At best, we may know the probability \rho that a given feature may be missing. We then have a binomial distribution: naming the event "a given feature is missing" a success, the probability p_m of having m of f features missing in any given instance is the probability of having m successes in f trials of a Bernoulli experiment, characterized by the binomial distribution

$$p_m = \binom{f}{m}\rho^m(1-\rho)^{f-m} \qquad (11)$$

The actual probability of finding a useable classifier is then a weighted sum of the probabilities (1−p)^T, where the weights are the probabilities of having exactly m features missing, with m varying from 0 (no missing features) to the maximum allowable number of f−nof missing features:

$$P = 1-\sum_{m=0}^{f-nof}\left[\binom{f}{m}\rho^m(1-\rho)^{f-m}\cdot\left(1-\prod_{i=0}^{nof-1}\left(1-\frac{m}{f-i}\right)\right)^{T}\right] \qquad (12)$$

Eq. (12) computes the probability of finding at least one useable classifier by subtracting the probability of finding no useable classifier (in an ensemble of T classifiers) from one. A more informative way, however, is to compute the probability of finding t useable classifiers (out of T). If finding a single useable classifier is a success in a Bernoulli experiment with probability of success p, then the probability of finding at least t useable classifiers is

$$P(\ge t \text{ useable classifiers when } m \text{ features are missing}) = p_m^{\ge t} = \sum_{\hat{t}=t}^{T}\binom{T}{\hat{t}}\,p^{\hat{t}}\,(1-p)^{T-\hat{t}} \qquad (13)$$

Note that Eq. (13) still needs to be weighted with the (binomial) probabilities of having exactly m missing features, and then summed over all values of m, to yield the desired probability

$$P = \sum_{m=0}^{f-nof}\left[\binom{f}{m}\rho^m(1-\rho)^{f-m}\cdot p_m^{\ge t}\right] \qquad (14)$$

By using the appropriate expressions from Eqs. (7) and (13) in Eq. (14), we obtain

$$P = \sum_{m=0}^{f-nof}\left[\binom{f}{m}\rho^m(1-\rho)^{f-m}\sum_{\hat{t}=t}^{T}\binom{T}{\hat{t}}\left(\prod_{i=0}^{nof-1}\left(1-\frac{m}{f-i}\right)\right)^{\hat{t}}\left(1-\prod_{i=0}^{nof-1}\left(1-\frac{m}{f-i}\right)\right)^{T-\hat{t}}\right] \qquad (15)$$

Eq. (15), when evaluated for t = 1, is the probability of finding at least one useable classifier trained on nof features – from a pool of T classifiers – when each feature has a probability \rho of being missing. For t = 1, Eqs. (10) and (13) are identical. However, Eq. (13) allows computation of the probability of finding any number of useable classifiers (say, t_desired, not just one) simply by starting the summation from t_desired. These equations can help us choose the algorithm's free parameters, nof and T, to obtain a reasonable confidence that we can process most instances with missing features.
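Eqs. (7), (11) and (15) are straightforward to evaluate numerically. The sketch below (ours, not from the paper) computes the probability of finding at least t_desired useable classifiers for given f, nof, T and rho; swept over T, it produces curves of the kind shown in Fig. 3. The function names are our own.

    from math import comb, prod

    def p_useable(f, m, nof):
        """Eq. (7): probability that a classifier trained on nof features avoids all m missing ones."""
        return prod(1.0 - m / (f - i) for i in range(nof))

    def prob_at_least(f, nof, T, rho, t_desired=1):
        """Eq. (15): probability of finding at least t_desired useable classifiers among T."""
        total = 0.0
        for m in range(f - nof + 1):                          # m = 0, ..., f - nof
            p = p_useable(f, m, nof)
            p_m = comb(f, m) * rho**m * (1 - rho)**(f - m)    # Eq. (11): binomial weight
            tail = sum(comb(T, k) * p**k * (1 - p)**(T - k)   # Eq. (13): binomial tail from t_desired
                       for k in range(t_desired, T + 1))
            total += p_m * tail
        return total

    # Example in the spirit of Section 3.3: f = 30, nof = 9, rho = 0.05
    print(prob_at_least(f=30, nof=9, T=100, rho=0.05))

For t_desired = 1 the inner sum simplifies to 1 − (1 − p)^T, recovering Eqs. (10) and (12), which is noticeably faster for ensembles with thousands of classifiers.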
As an example, consider a 30-feature dataset. First, let us fix the probability that a feature is missing at \rho = 0.05. Then, on average, 1.5 features are missing from each instance. Fig. 3(a) shows the probability of finding at least one classifier for various values of nof as a function of T. Fig. 3(b) provides similar plots for a higher rate of \rho = 0.20 (20% missing features). The solid curve in each case is the curve obtained by Eq. (15), whereas the various indicators represent the actual probabilities obtained as a result of 1000 simulations. We observe that (i) the experimental data fit the theoretical curve very well; (ii) the probability of finding a useable classifier increases with the ensemble size; and (iii) for a fixed \rho and a fixed ensemble size, the probability of finding a useable classifier increases as nof decreases. This latter observation makes sense: the fewer the features used in training, the larger the number of missing features that can be accommodated. In fact, if a classifier is trained with nof features out of a total of f features, then that classifier can accommodate up to f − nof missing features.

In Fig. 3(c), we now fix nof = 9 (30% of the total feature size), and calculate the probability of finding at least one useable classifier as a function of ensemble size, for various values of \rho. As expected, for a fixed value of nof and ensemble size, this probability decreases as \rho increases.

[Fig. 3. Probability of finding at least one useable classifier for f = 30: (a) \rho = 0.05, variable nof; (b) \rho = 0.2, variable nof; and (c) nof = 9, variable \rho.]

In order to see how these results scale with larger feature sets, Fig. 4 shows the theoretical plots for the 216-feature multiple features (MFEAT) dataset [46]. All trends mentioned above can be observed in this case as well; however, the required ensemble size is now in the thousands. An ensemble with a few thousand classifiers is well within today's computational resources, and in fact not uncommon in many multiple classifier system applications. It is clear, however, that as the feature size grows beyond the thousands range, the computational burden of this approach greatly increases. Therefore, Learn++.MF is best suited for applications with moderately high dimensionality (fewer than 1000 features). This is not an overly restrictive requirement, since many applications have much smaller dimensionalities.

4. Simulation results

4.1. Experimental setup and overview of results

Learn++.MF was evaluated on several databases with various numbers of features. The algorithm resulted in similar trends in all cases, which we discuss in detail below. Due to the amount of detail involved with each dataset, as well as for brevity and to avoid repetition of similar outcomes, we present representative full results on four datasets here, and provide results for many additional real-world and UCI benchmark datasets online [48], a summary of which is provided at the end of this section. The four datasets featured here encompass a broad range of feature and database sizes: the Wine (13 features), Wisconsin Breast Cancer (WBC—30 features), Optical Character Recognition (OCR—62 features) and Multiple Features (MFEAT—216 features) datasets, obtained from the UCI repository [46]. In all cases, multilayer perceptron type neural networks were used as the base classifier, though any supervised classifier can also be used. The training parameters of the base classifiers were not fine tuned, and were selected as fixed reasonable values (error goal = 0.001–0.01, number of hidden layer nodes = 20–30, based on the dataset). Our goal is not to show the absolute best classification performances (which are quite good), but rather the effect of the missing features and the strategic choice of the algorithm parameters.
[Fig. 4. Probability of finding at least one classifier for f = 216: (a) \rho = 0.15, variable nof and (b) nof = 40, variable \rho.]

We define the global number of features in a dataset as the product of the number of features f and the number of instances N; and a single test trial as evaluating the algorithm with 0–30% (in steps of 2.5%) of the global number of features missing from the test data (typically half of the total data size). We evaluate the algorithm with different numbers of features, nof, used in training individual classifiers, including a variable nof, where the number of features used to train each classifier is determined by randomly drawing an integer in the 15–50% range of f. All results are averages of 100 trials on test data, reported with 95% confidence intervals, where training and test sets were randomly reshuffled for each trial. Missing features were simulated by randomly replacing actual values with sentinels.

Table 1 summarizes the datasets and parameters used in the experiments: the cardinality of the dataset, the number of classes, the total number of features f, the specific values of nof used in the simulations, and the total number of classifiers T generated.

Table 1. Datasets used in this study.

Dataset  | Dataset size | Classes | Features f | nof values                   | T
WINE     | 178          | 3       | 13         | 3, 4, 5, 6, 7, variable      | 200
WBC      | 400          | 2       | 30         | 10, 12, 14, 16, variable     | 1000
OCR      | 5620         | 10      | 62         | 16, 20, 24, variable         | 1000
MFEAT    | 2000         | 10      | 216        | 10, 20, 30, 40, 60, variable | 2000
VOC-I*   | 384          | 6       | 6          | 2, 3, variable               | 100
VOC-II*  | 84           | 12      | 12         | 3, 4, 5, 6, variable         | 200
ION*     | 270          | 2       | 34         | 8, 10, 12, 14, variable      | 1000
WATER*   | 374          | 4       | 38         | 12, 14, 16, 18, variable     | 1000
ECOLI*   | 347          | 5       | 5          | 2, 3, variable               | 1000
DERMA*   | 366          | 6       | 34         | 8, 10, 12, 14, variable      | 1000
PEN*     | 10,992       | 10      | 16         | 6, 7, 8, 9, variable         | 250

* The full results for these datasets are provided online, due to space considerations, and may be found at [48].

The behavior of the algorithm with respect to its parameters was very consistent across the databases. The two primary observations, discussed below in detail, were as follows: (i) the algorithm can handle data with missing values, 20% or more, with little or no performance drop compared to classifying data with all values intact; and (ii) the choice of nof presents a critical trade-off: higher performance can be obtained using a larger nof when fewer features are missing, yet a smaller nof yields higher and more stable performance, and leaves fewer instances unprocessed, when larger portions of the data are missing.
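The evaluation protocol of Section 4.1 is easy to reproduce: a fixed percentage of the global number of feature values in the test set is replaced with a sentinel. A short sketch under our own conventions (NumPy, with NaN standing in for the sentinel; X_test is an assumed N x f test array, not defined in the paper):

    import numpy as np

    def inject_missing(X_test, pmf, rng=None):
        """Replace a fraction pmf of all test feature values with NaN, uniformly at random."""
        rng = rng if rng is not None else np.random.default_rng(0)
        X = X_test.astype(float)                    # copy; float dtype so that NaN can be stored
        n_missing = int(round(pmf * X.size))        # pmf of the 'global number of features' (N x f)
        idx = rng.choice(X.size, size=n_missing, replace=False)
        X.ravel()[idx] = np.nan                     # flat indexing hits random (instance, feature) cells
        return X

    # One trial: sweep PMF from 0% to 30% in steps of 2.5%, as in the experiments
    for pmf in np.arange(0.0, 0.301, 0.025):
        X_missing = inject_missing(X_test, pmf)
        # ... classify X_missing with the ensemble, record performance and PIP ...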
4.2. Wine database

The Wine database comes from the chemical analysis of 13 constituents in wines obtained from three cultivars. Previous experiments using optimized classifiers trained with all 13 features, and evaluated on a test dataset with no missing features, had zero classification error, setting the benchmark target performance for this database. Six values of nof were considered: 3, 4, 5, 6, 7 and variable, where, in the variable case, each classifier was trained not with a fixed number of features but with a random selection of one of these values.

Table 2. Performance on the WINE database: ensemble performance (± 95% confidence interval), with PIP in parentheses.

PMF (%) | nof = 3             | nof = 4             | nof = 5             | nof = 6             | nof = 7             | variable (3–7)
0.00    | 100.00 ± 0.00 (100) | 100.00 ± 0.00 (100) | 100.00 ± 0.00 (100) | 100.00 ± 0.00 (100) | 100.00 ± 0.00 (100) | 100.00 ± 0.00 (100)
2.50    | 100.00 ± 0.00 (100) | 99.67 ± 0.71 (100)  | 100.00 ± 0.00 (100) | 99.67 ± 0.71 (100)  | 99.33 ± 0.95 (100)  | 99.33 ± 0.95 (100)
5.00    | 100.00 ± 0.00 (100) | 100.00 ± 0.00 (100) | 99.67 ± 0.71 (100)  | 99.33 ± 0.95 (100)  | 99.33 ± 0.95 (100)  | 100.00 ± 0.00 (100)
7.50    | 99.67 ± 0.71 (100)  | 99.67 ± 0.71 (100)  | 99.33 ± 1.43 (100)  | 98.67 ± 1.58 (100)  | 98.67 ± 1.17 (100)  | 99.67 ± 0.71 (100)
10.00   | 99.33 ± 0.95 (100)  | 99.67 ± 0.71 (100)  | 99.00 ± 1.09 (100)  | 98.33 ± 2.20 (100)  | 98.32 ± 1.60 (100)  | 99.67 ± 0.74 (100)
15.00   | 99.00 ± 1.09 (100)  | 98.00 ± 1.58 (100)  | 98.33 ± 1.60 (100)  | 96.98 ± 2.00 (99)   | 96.94 ± 2.30 (99)   | 97.67 ± 1.53 (100)
20.00   | 99.00 ± 1.09 (100)  | 98.00 ± 1.91 (100)  | 96.67 ± 1.85 (100)  | 96.98 ± 2.27 (99)   | 91.27 ± 3.55 (96)   | 98.00 ± 1.17 (100)
25.00   | 99.00 ± 1.09 (100)  | 97.66 ± 1.53 (100)  | 96.29 ± 1.31 (99)   | 93.72 ± 2.66 (95)   | 93.03 ± 4.35 (86)   | 97.33 ± 1.78 (100)
30.00   | 96.33 ± 1.98 (100)  | 94.29 ± 1.54 (99)   | 95.22 ± 2.26 (98)   | 95.29 ± 2.63 (93)   | 91.34 ± 4.20 (78)   | 94.67 ± 2.18 (100)

PMF: percentage of missing features; PIP: percentage of instances processed.

Table 2 summarizes the test performances using all values of nof, and the percentage of instances that could be processed – correctly or otherwise – with the existing ensemble (described below). The first row, with 0.0% missing features, is the algorithm's performance when the classifiers were trained on nof features but evaluated on fully intact test data. The algorithm obtains 100% correct classification with any of the nof values used, indicating that this dataset does in fact include redundant features. Now, since the feature subsets are selected at random, it is possible that the particular set of features available for a given instance does not match the feature combinations selected by any of the classifiers. Such an instance cannot be processed, as there would be no classifier trained with that unique combination of available features. The column with the heading PIP (percentage of instances processed) indicates the percentage of instances that can be processed by the generated ensemble. As expected, PIP decreases as the percentage of missing features (PMF) increases. We note that with 30% of the features missing from the test data, 100% of this test dataset can be processed (at a performance of 96.3%) when nof = 3, whereas only 78% can be processed (at a performance of 91.3%) when nof = 7 features are used for training the classifiers. A few additional observations: first, there is no (statistically significant) performance drop with up to 25% of the features missing when nof = 3, or up to 10% when nof = 6, and only a drop of a few percentage points (3.5% for nof = 3, 4.7% for nof = 6) when as many as 30% of the features are missing—hence the algorithm can handle large amounts of missing data with relatively little drop in classification performance. Second, virtually all test data can be processed—even when as much as 30% of the features are missing—if we use relatively few features for training the classifiers.

For a more in-depth analysis of the algorithm's behavior, consider the plots in Fig. 5. Fig. 5(a) illustrates the overall performance of the Learn++.MF ensemble. All performances in the 0–30% interval of PMF, in increments of 2.5%, are provided (while the tables skip some intermediate values for brevity). As expected, the performance declines as PMF increases – albeit only gradually. The decline is much milder for nof = 3 than for nof = 7, for reasons discussed below. Also included in Fig. 5(a), for comparison, are the performance of a single classifier (of the same architecture as the ensemble members) that employs the commonly used mean imputation (M.I.) to replace the missing data, and that of the E.M. algorithm [47]. We observe that the Learn++.MF ensemble (significantly) outperforms both E.M. and mean imputation for all values of PMF, an observation that was consistent for all databases we have evaluated (we compare Learn++.MF to naïve Bayes and to RSM with mean imputation later in the paper). Fig. 5(b) shows the performance of a single usable classifier, averaged over all T classifiers, as a function of PMF. The average performance of any fixed single classifier is, as expected, independent of PMF; however, higher nof values are generally associated with better individual classifier performance. This makes sense, since more information is available to each classifier with a larger nof. However, this comes at the cost of a smaller PIP. In Fig. 5(b) and (c) we compare the performance and the PIP of a single classifier vs. the ensemble. For example, consider 20% PMF, for which the corresponding PIP values for single classifiers are shown in Fig. 5(b). A single usable classifier trained with nof = 3 features is able to classify, on average, 51% of the instances with a performance of 79%, whereas the ensemble has a 100% PIP with a performance of 99%. Fig. 5(c) shows PIP as a function of PMF, both for single classifiers (lower curves) and for the ensembles (upper curves). As expected, PIP decreases with increasing PMF for both the ensemble and a single usable classifier. However, the decline in PIP is much steeper for single classifiers. In fact, there is virtually no drop (PIP = 100%) up until 20% PMF when the ensemble is used (regardless of nof). Hence the impact of using an ensemble is two-fold: the ensemble can process a larger percentage of instances with missing features, and it also provides a higher classification performance on those instances, compared to a single classifier.

Also note in Fig. 5(c) that the decline in PIP is much steeper for nof = 7 than for nof = 3, similar to the decline in classification performance shown in Fig. 5(a). Hence, our main observation regarding the trade-off one must accept by choosing a specific nof value is as follows: classifiers trained with a larger nof may achieve a higher generalization performance individually, or initially when fewer features are missing, but are able to classify fewer instances as the PMF increases. This observation can be explained as follows: using a large nof for training (e.g., 10 out of 13) means that the same large number of features (10) is required for testing; therefore, fewer missing features (only 3, in this case) can be accommodated. In fact, the probability of no classifier being available for a given combination of missing features increases with nof. Of course, if all features are used for training, then no classifier will be able to process any instance with even a single missing feature—hence the original motivation of this study.
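PIP, and the related per-instance percentage of usable classifiers (PUC, analyzed below), are simple functions of the stored feature lists and the missing-value pattern of each test instance. A sketch (ours, consistent with the earlier training sketch and its NaN-as-sentinel assumption):

    import numpy as np

    def pip_and_puc(X_missing, feature_lists):
        """Return (PIP, PUC): the percentage of instances with at least one usable classifier,
        and the average percentage of usable classifiers per instance."""
        T = len(feature_lists)
        usable_counts = []
        for z in X_missing:
            missing = set(np.flatnonzero(np.isnan(z)).tolist())      # M(z) for this instance
            usable = sum(1 for fsel in feature_lists
                         if missing.isdisjoint(fsel.tolist()))       # classifiers avoiding M(z)
            usable_counts.append(usable)
        usable_counts = np.asarray(usable_counts)
        pip = 100.0 * np.mean(usable_counts > 0)
        puc = 100.0 * np.mean(usable_counts / T)
        return pip, puc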
[Fig. 5. Detailed analysis of the algorithm on the WINE database with respect to PMF and nof: (a) Learn++.MF ensemble performances compared to that of mean imputation, (b) single classifier performance, (c) single classifier and ensemble PIP and (d) percentage of useable classifiers in the ensemble.]

Finally, Fig. 5(d) analyzes the average percentage of usable classifiers (PUC) per given instance, with respect to PMF. The decline in PUC also makes sense: there are fewer classifiers available to classify instances that have a higher PMF. Furthermore, this decline is also steeper for higher nof.

In summary, then, a larger nof usually provides a better individual and/or initial performance than a smaller nof, but the ensemble is more resistant to unusable feature combinations when trained with a smaller nof. How do we, then, choose the nof value, since we cannot know the PMF of future datasets? A simple solution is to use a variable number of features—typically in the range of 15–50% of the total number of features. We observe from Table 2 and Fig. 5 that such a variable selection provides surprisingly good performance: a good compromise with little or no loss of unprocessed instances.

As for the other parameter, T—the number of classifiers—the trade-off is simply a matter of computational resources: a larger PIP and PUC will be obtained if a larger number of classifiers is trained. Fig. 5(d) shows PUC for various values of PMF, indicating the actual numbers obtained for this database after the classifiers were trained. Families of curves generated using Eq. (15), such as those in Fig. 3, as well as Monte Carlo simulations, can provide us with these numbers before training the classifiers. Fig. 6 shows the results of averaging 1000 such Monte Carlo simulations for f = 13, nof = 3–7 and PMF in the 0–100% range, which agree with the actual results of Fig. 5(d) (which shows only the 0–30% PMF range); a small simulation sketch in this spirit is given after Table 3 below.

4.3. Wisconsin breast cancer (WBC) database

The WBC database consists of 569 instances of cell biopsies classified as malignant or benign based on 30 features obtained from the individual cells. The expected best performance on this dataset (when no features are missing) is 97% [46]. Five nof values—10, 12, 14, 16 and variable—were tested.

Table 3 summarizes the test performances using all values of nof, and the PIP values, similar to Table 2 for the Wine database. The results on this database with a larger number of features show much the same trends and characteristics mentioned for the Wine database. Specifically, the proximity of the ensemble performance with nof = 10, 12, 14 and 16 features and no missing features to the 97% figure reported in the literature indicates that this database also has a redundant feature set. Also, as in the Wine dataset, the algorithm is able to handle up to 20% missing features with little or no (statistically) significant performance drop, and only a gradual decline thereafter. As expected, the PIP drops with increasing PMF, and the drop is steeper for larger nof values compared to smaller ones. Using a variable number of features provides a good trade-off. These characteristics can also be seen in Fig. 7, where we plot the ensemble performance, and the PIP for single and ensemble classifiers, as in Fig. 5 for the Wine database. Fig. 7(a) also shows that the ensemble outperforms a single classifier (of identical architecture) that uses mean imputation to replace the missing features.
[Fig. 6. Estimating the required ensemble size ahead of time: expected percentage of useable classifiers in the ensemble as a function of the probability of any given feature being missing, for nof = 3–7 (f = 13).]

[Fig. 7. Analysis of the algorithm on the WBC database with respect to PMF and nof: (a) Learn++.MF ensemble performances compared to that of mean imputation and (b) single classifier and ensemble PIP.]

Table 3. Performance on the WBC database: ensemble performance (± 95% confidence interval), with PIP in parentheses.

PMF (%) | nof = 10           | nof = 12           | nof = 14           | nof = 16           | variable (10–16)
0.00    | 95.00 ± 0.00 (100) | 96.00 ± 0.00 (100) | 95.00 ± 0.00 (100) | 95.00 ± 0.00 (100) | 95.50 ± 0.00 (100)
2.50    | 95.20 ± 0.24 (100) | 95.80 ± 0.18 (100) | 94.75 ± 0.24 (100) | 94.80 ± 0.29 (100) | 95.45 ± 0.19 (100)
5.00    | 95.05 ± 0.30 (100) | 95.55 ± 0.41 (100) | 94.80 ± 0.33 (100) | 94.75 ± 0.37 (100) | 95.30 ± 0.43 (100)
7.50    | 95.15 ± 0.42 (100) | 95.55 ± 0.11 (100) | 94.55 ± 0.56 (100) | 95.04 ± 0.52 (100) | 95.30 ± 0.46 (100)
10.00   | 95.55 ± 0.30 (100) | 95.34 ± 0.40 (100) | 94.39 ± 0.31 (100) | 94.24 ± 0.48 (99)  | 95.15 ± 0.39 (100)
15.00   | 95.10 ± 0.69 (100) | 94.88 ± 0.50 (100) | 93.84 ± 0.95 (98)  | 92.50 ± 0.98 (94)  | 94.79 ± 0.85 (100)
20.00   | 94.63 ± 0.77 (100) | 93.08 ± 1.10 (97)  | 92.35 ± 1.08 (91)  | 90.20 ± 1.10 (79)  | 94.90 ± 0.79 (97)
25.00   | 94.53 ± 0.81 (98)  | 92.78 ± 0.74 (91)  | 91.42 ± 1.20 (76)  | 86.98 ± 2.45 (56)  | 91.61 ± 1.23 (91)
30.00   | 92.78 ± 1.39 (92)  | 89.87 ± 2.02 (75)  | 89.01 ± 1.73 (55)  | 86.34 ± 3.17 (34)  | 91.13 ± 1.32 (78)

PMF: percentage of missing features; PIP: percentage of instances processed.
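The Monte Carlo estimate behind Fig. 6 can be sketched as follows. This is our own reconstruction of the procedure described in Section 4.2, not the authors' code: for each candidate missing-feature probability rho, random missing-feature patterns and random nof-feature training subsets are drawn, and the fraction of the T classifiers that would remain usable is recorded. The uniform subset sampling (the beta = 1 case) and the parameter values are illustrative assumptions.

    import numpy as np

    def expected_puc(f, nof, T, rho, n_trials=1000, rng=None):
        """Monte Carlo estimate of the expected percentage of usable classifiers (PUC)."""
        rng = rng if rng is not None else np.random.default_rng(0)
        # Feature subsets of a simulated ensemble (uniform sampling, i.e. the beta = 1 case).
        subsets = [rng.choice(f, size=nof, replace=False) for _ in range(T)]
        fractions = []
        for _ in range(n_trials):
            missing = np.flatnonzero(rng.random(f) < rho)        # one MCAR missing-feature pattern
            usable = sum(1 for s in subsets if not np.isin(s, missing).any())
            fractions.append(usable / T)
        return 100.0 * np.mean(fractions)

    # Sweep rho as in Fig. 6 (f = 13, nof = 3, ..., 7; T = 200 as used for the WINE data)
    for nof in range(3, 8):
        print(nof, [round(expected_puc(13, nof, 200, rho), 1) for rho in (0.1, 0.3, 0.5)])

Such a sweep, plotted against rho, gives an ensemble-size planning curve of the kind shown in Fig. 6 before any classifier is actually trained.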