PyData London 2018 talk on feature selection
- 3.
Feature collinearity and scarcity of data mean we can't just give a model many features and let it decide which ones are useful and which ones are not.
There are multiple reasons to do feature selection when developing machine learning models:
• Computational burden: Limiting the number of features may reduce the computational burden of processing the data in the learning algorithm.
• Risk of overfitting: With limited data, a large set of (partially redundant) features makes it easier for the model to fit noise rather than signal.
• Interpretability: Removing redundant variables from the input data can make the results more interpretable for both the seasoned practitioner and any business stakeholders.
It pays off to do feature selection as part of the model development process
- 4.
Feature selection algorithms should:
• remove variables that contain redundant information about the target variable; and
• reduce the overlap in information between the variables in the subset of selected features.
A good feature selection algorithm also shouldn't look at variables purely in isolation:
• Two variables that are useless by themselves can be useful together.
• Very high variable correlation (or anti-correlation) does not mean absence of variable complementarity.
What are the components of a good feature selection algorithm?
- 5.
Two variables that are useless by themselves can be useful together
[1] Guyon, I. and Elisseeff, A., 2003. An introduction to variable and feature selection. Journal of Machine Learning Research, 3(Mar), pp. 1157-1182.
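A quick numeric sketch of this point, using a hypothetical XOR-style target (an illustrative construction, not an example from the talk): each feature alone carries essentially no mutual information about the target, while the two together determine it exactly.

```python
import numpy as np
from sklearn.metrics import mutual_info_score  # MI between two discrete variables

# Synthetic XOR data (illustrative only)
rng = np.random.default_rng(0)
x1 = rng.integers(0, 2, size=10_000)
x2 = rng.integers(0, 2, size=10_000)
y = x1 ^ x2  # XOR target: depends on both features jointly

print(mutual_info_score(x1, y))           # ~0 nats: x1 alone is useless
print(mutual_info_score(x2, y))           # ~0 nats: x2 alone is useless
print(mutual_info_score(2 * x1 + x2, y))  # ~log(2) nats: the pair determines y
```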
- 6.
Very high variable correlation (or anti-correlation) does not mean absence
of variable complementarity
[1] Guyon, I. and Elisseeff, A., 2003. An introduction to variable and feature selection. Journal of Machine Learning Research, 3(Mar), pp. 1157-1182.
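A small illustrative construction of this point (hypothetical data, not from the talk): the two variables below are almost perfectly correlated, yet their difference carries the class signal, so together they separate the classes far better than either one alone.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data (illustrative only)
rng = np.random.default_rng(0)
n = 5_000
y = rng.integers(0, 2, size=n)
c = rng.normal(size=n)                           # shared factor inducing the correlation
x1 = c + rng.normal(scale=0.05, size=n)
x2 = c - 0.2 * (2 * y - 1) + rng.normal(scale=0.05, size=n)
X = np.column_stack([x1, x2])

print(np.corrcoef(x1, x2)[0, 1])                 # ~0.98: almost perfectly correlated

for cols in ([0], [1], [0, 1]):
    acc = cross_val_score(LogisticRegression(max_iter=1000), X[:, cols], y, cv=5).mean()
    print(cols, round(acc, 3))                   # each variable alone: near chance; together: near perfect
```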
- 7.
Feature selection algorithms can be divided into three categories
[Diagram comparing the three categories of feature selection methods: wrapper methods, filter methods, and embedded methods]
- 8.
Wrapper methods
Wrapper models use learning algorithms on the original data and assess the features by the performance of the learning algorithm.
- 9.
Mlxtend is an open-source Python package that implements multiple
wrapper methods
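For example, a sequential forward selection run with mlxtend might look like the sketch below (the dataset, learner, and settings here are illustrative choices, not the talk's code):

```python
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

# Illustrative dataset and base learner
X, y = load_breast_cancer(return_X_y=True)

sfs = SFS(
    LogisticRegression(max_iter=5000),  # the learner the wrapper is built around
    k_features=5,         # size of the feature subset to select
    forward=True,         # sequential forward selection: add one feature at a time
    floating=False,       # no backtracking step
    scoring="accuracy",   # subsets are assessed by the learner's performance
    cv=5,
)
sfs = sfs.fit(X, y)
print(sfs.k_feature_idx_)  # indices of the selected features
print(sfs.k_score_)        # cross-validated score of that subset
```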
- 10.
Wrapper methods
Advantages
• Usually provides the best performing feature set for that particular type of model.
Disadvantages
• Wrapper methods may generate feature sets that are overly specific to the learner used.
• As wrapper methods train a new model for each subset, they are very computationally intensive.
Filter methods
Filter models do not use a learner on the original data, but only consider statistical characteristics of the data set.
- 11.
Filter methods example – mutual information
The mutual information quantifies the amount of information obtained about one random variable through another random variable. For two variables $X$ and $Y$, the mutual information is given by

$$I(X; Y) = \int_X \int_Y p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \, dx \, dy.$$

It measures how similar the joint distribution $p(x, y)$ is to the product of the factored marginal distributions $p(x)\, p(y)$.
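For discrete variables the integrals become sums, and the formula can be checked directly against scikit-learn's estimate. A minimal sketch with made-up data:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

# Made-up discrete data (illustrative only)
x = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 0])

# Joint distribution p(x, y) and marginals p(x), p(y) estimated from counts
p_xy = np.zeros((2, 2))
for xi, yi in zip(x, y):
    p_xy[xi, yi] += 1
p_xy /= p_xy.sum()
p_x = p_xy.sum(axis=1, keepdims=True)
p_y = p_xy.sum(axis=0, keepdims=True)

# I(X; Y) = sum over x, y of p(x, y) * log( p(x, y) / (p(x) p(y)) )
nonzero = p_xy > 0
mi = np.sum(p_xy[nonzero] * np.log(p_xy[nonzero] / (p_x @ p_y)[nonzero]))

print(mi)                       # manual computation (in nats)
print(mutual_info_score(x, y))  # scikit-learn gives the same value
```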
- 12.
Filter methods example – maximizing joint mutual information
In the feature selection problem, we would like to maximise the mutual information between the selected variables $X_S$ and the target $Y$:

$$\hat{S} = \arg\max_{S} I(X_S; Y) \quad \text{s.t.} \quad |S| = k,$$

where $k$ is the number of features we want to select.
This is an NP-hard problem, as the number of possible feature subsets grows exponentially with the number of features.
- 13.
Filter methods example – maximizing joint mutual information
A popular heuristic in the literature is to use a greedy forward selection method, where features are selected
incrementally, one feature at a time.
Let $X_{S^{t-1}} = \{x_{i_1}, \ldots, x_{i_{t-1}}\}$ be the set of selected features at time step $t - 1$. The greedy method selects the next feature $f^t$ such that

$$f^t = \arg\max_{f \notin S^{t-1}} I\left(X_{S^{t-1}} \cup f;\, Y\right).$$
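In code, the greedy loop has the shape sketched below; `joint_mi_with_target` is a hypothetical placeholder for an estimator of $I(X_{S} \cup f; Y)$, which is precisely the quantity that becomes intractable as the selected set grows.

```python
def greedy_forward_selection(features, joint_mi_with_target, k):
    """Schematic greedy forward selection; the scoring function is a hypothetical placeholder."""
    selected = []
    while len(selected) < k:
        # pick the candidate that maximises the joint mutual information with the target
        best = max(
            (f for f in features if f not in selected),
            key=lambda f: joint_mi_with_target(selected + [f]),
        )
        selected.append(best)
    return selected
```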
- 14.
Filter methods example – maximizing joint mutual information
One can show (proof omitted here) that this is equivalent to

$$f^t = \arg\max_{f \notin S^{t-1}} \; I(X_f; Y) - I(X_f; X_{S^{t-1}}) + I(X_f; X_{S^{t-1}} \mid Y).$$

However, the quantities involving $X_{S^{t-1}}$ quickly become computationally intractable because they are $(t-1)$-dimensional integrals!
- 15.
Mutual information based measures trade off relevancy of a variable
against the redundancy of the information a variable contains
We can use an approximation to the multidimensional integrals to make the computation more tractable:

$$f^t = \arg\max_{f \notin S^{t-1}} \; \underbrace{I(X_f; Y)}_{\text{relevancy}} - \underbrace{\left( \alpha \sum_{k=1}^{t-1} I(X_{s_k}; X_f) - \beta \sum_{k=1}^{t-1} I(X_{s_k}; X_f \mid Y) \right)}_{\text{redundancy}},$$

where $\alpha$ and $\beta$ are to be specified. This greedy algorithm parametrizes a family of mutual information based feature selection algorithms. The most prominent members of this family are:
1. Joint Mutual Information (JMI): $\alpha = \beta = \frac{1}{t-1}$.
2. Maximum relevancy minimum redundancy (MRMR): $\alpha = \frac{1}{t-1}$ and $\beta = 0$.
3. Mutual information maximisation (MIM): $\alpha = \beta = 0$.
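A rough sketch of this family, using scikit-learn's mutual information estimators and the MRMR setting ($\alpha = 1/(t-1)$, $\beta = 0$); this is a simplified illustration, not the talk's implementation, and dedicated packages implement these criteria more carefully.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmr_select(X, y, k, random_state=0):
    """Greedy relevancy-minus-redundancy selection (simplified MRMR sketch: alpha = 1/(t-1), beta = 0)."""
    relevancy = mutual_info_classif(X, y, random_state=random_state)  # I(X_f; Y) per feature
    selected = [int(np.argmax(relevancy))]          # start with the most relevant feature
    while len(selected) < k:
        best_f, best_score = None, -np.inf
        for f in range(X.shape[1]):
            if f in selected:
                continue
            # redundancy: average pairwise MI between the candidate and the selected features
            redundancy = mutual_info_regression(
                X[:, selected], X[:, f], random_state=random_state
            ).mean()
            score = relevancy[f] - redundancy       # relevancy - (1/(t-1)) * sum_k I(X_{s_k}; X_f)
            if score > best_score:
                best_f, best_score = f, score
        selected.append(best_f)
    return selected
```

With $\beta > 0$ the class-conditional redundancy term is added back in, moving from MRMR towards JMI.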
- 16.
There are many open-source Python modules available that do filter-based
feature selection
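For instance, scikit-learn's univariate filter ranks each feature by its mutual information with the target on its own, i.e. the MIM end of the family above (dataset and k chosen purely for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Illustrative dataset; k is an arbitrary choice
X, y = load_breast_cancer(return_X_y=True)

selector = SelectKBest(score_func=mutual_info_classif, k=10)  # keep the 10 highest-scoring features
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)                    # (569, 10)
print(selector.get_support(indices=True))  # indices of the selected features
```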
- 17.
Filter methods
Advantages
• Typically scale better to high-dimensional data sets than wrapper methods.
• Independent of the learning algorithm.
Disadvantages
• Ignore interaction with the learning algorithm.
• Often employ lower-dimensional approximations to make computations more tractable. This means they may ignore interactions between different features.
Embedded methods
Embedded methods are a catch-all group of techniques which perform feature selection as part of the model construction process.
- 18.
Embedded methods example – stability selection
• Stability selection wraps around a base learning algorithm that has a parameter controlling the amount of regularization.
• For every value of this parameter, we can get an estimate of which variables to select.
• Stability selection runs the learner on many bootstrap samples of the original data set, and keeps track of which variables get selected in every sample to form a set of 'stable' variables.
[Flow diagram: for each bootstrap sample and each value of the penalization parameter, estimate a LASSO on the bootstrapped sample and record which features get selected; then compute the posterior probability of inclusion and select the set of 'stable' features.]
- 19.
Stability selection is straightforward to implement in Python, and mature
implementations exist for both Python and R
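A minimal sketch of the loop over the penalization parameter and the bootstrap samples, assuming a LASSO base learner from scikit-learn (so a numeric target) and illustrative values for the number of bootstrap samples and the stability threshold; this is a simplified version, not the talk's code.

```python
import numpy as np
from sklearn.linear_model import Lasso

def stability_selection(X, y, alphas, n_bootstrap=100, threshold=0.6, random_state=0):
    """Simplified sketch: return the 'stable' feature indices and their empirical inclusion probabilities."""
    rng = np.random.default_rng(random_state)
    n_samples, n_features = X.shape
    selection_counts = np.zeros(n_features)

    # Iterate over the penalization parameter and the bootstrap samples
    for alpha in alphas:
        for _ in range(n_bootstrap):
            idx = rng.choice(n_samples, size=n_samples, replace=True)  # bootstrap resample
            lasso = Lasso(alpha=alpha, max_iter=10_000).fit(X[idx], y[idx])
            # Record which features get selected (non-zero coefficients) on this sample
            selection_counts += np.abs(lasso.coef_) > 1e-8

    # Empirical probability of inclusion across all runs
    inclusion_prob = selection_counts / (len(alphas) * n_bootstrap)
    stable_features = np.where(inclusion_prob >= threshold)[0]
    return stable_features, inclusion_prob
```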
- 20.
Embedded methods
Advantages
• Takes interaction between the feature subset search and the learning algorithm into account.
Disadvantages
• Computationally more expensive than filter methods.
- 21.
Each of these three approaches has its advantages and disadvantages, the primary distinguishing factors being speed of computation and the chance of overfitting:
• In terms of speed, filters are faster than embedded methods, which are in turn faster than wrappers.
• In terms of overfitting, wrappers have higher learning capacity so are more likely to overfit than embedded methods, which in turn are more likely to overfit than filter methods.
All of this of course changes at extremes of data and feature availability.
What type of algorithm should I use in practice?