PyData London
28th April, 2018
Thomas Huijskens
Senior Data Scientist
How to get better performance with less data
All content copyright © 2017 QuantumBlack, a McKinsey company
Feature collinearity and scarcity of data mean we can't just give a model many features and let it decide
which ones are useful and which ones are not.
There are multiple reasons to do feature selection when developing machine learning models:
• Computational burden: Limiting the number of features may reduce the computational burden of processing the data in
the learning algorithm.
• Risk of overfitting: Removing noisy or presumably redundant variables reduces the risk of fitting to noise, and may
yield better class separation.
• Interpretability: Removing redundant variables from the input data can make the results more interpretable, both for
the seasoned practitioner and for business stakeholders.
It pays off to do feature selection as part of the model development process
Feature selection algorithms should:
• select variables that contain relevant information about the target variable; and
• reduce the overlap in information between the variables in the subset of selected features.
A good feature selection algorithm also shouldn't look at variables purely in isolation:
• Two variables that are useless by themselves can be useful together.
• Very high variable correlation (or anti-correlation) does not mean absence of variable complementarity.
What are the components of a good feature selection algorithm?
Two variables that are useless by themselves can be useful together
¹ Guyon, I. and Elisseeff, A., 2003. An introduction to variable and feature selection. Journal of Machine Learning Research, 3(Mar), pp. 1157–1182.
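A quick numerical illustration of this point, using a hypothetical XOR target and scikit-learn's `mutual_info_score`:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
x1 = rng.integers(0, 2, 10_000)
x2 = rng.integers(0, 2, 10_000)
y = x1 ^ x2  # the target is the XOR of the two features

# Individually, each feature carries (almost) no information about y...
print(mutual_info_score(x1, y))  # close to 0
print(mutual_info_score(x2, y))  # close to 0

# ...but together they determine y exactly.
joint = x1 * 2 + x2  # encode the feature pair as one discrete variable
print(mutual_info_score(joint, y))  # close to log(2) ≈ 0.693 nats
```

Any univariate filter would rank both features as useless here, yet together they predict the target perfectly.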
Very high variable correlation (or anti-correlation) does not mean absence
of variable complementarity
¹ Guyon, I. and Elisseeff, A., 2003. An introduction to variable and feature selection. Journal of Machine Learning Research, 3(Mar), pp. 1157–1182.
Feature selection algorithms can be divided into three categories
[Diagram: set of all features → generate a subset → learning algorithm + performance (wrapper methods); set of all features → generate a subset → learning algorithm → performance (filter methods)]
Wrapper methods | Filter methods | Embedded methods
Wrapper methods
Wrapper methods use learning algorithms on the original data, and assess the features by the
performance of the learning algorithm.
Mlxtend is an open-source Python package that implements multiple
wrapper methods
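A forward-selection sketch of a wrapper method (shown here with scikit-learn's similar `SequentialFeatureSelector` so the snippet needs only scikit-learn; mlxtend's `SequentialFeatureSelector` additionally offers backward and floating variants):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Wrapper method: greedily add the feature whose inclusion most improves
# the cross-validated performance of the learner itself.
learner = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
sfs = SequentialFeatureSelector(learner, n_features_to_select=5,
                                direction="forward", cv=3).fit(X, y)
print(sfs.get_support(indices=True))  # indices of the 5 selected features
```

Note that every candidate subset triggers a full cross-validated refit of the learner, which is exactly why wrapper methods are computationally expensive.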
Wrapper methods
Advantages
• Usually provide the best-performing feature set for that particular type of model.
Disadvantages
• Wrapper methods may generate feature sets that are overly specific to the learner used.
• As wrapper methods train a new model for each candidate subset, they are very computationally intensive.

Filter methods
Filter methods do not use a learner on the original data, but only consider statistical
characteristics of the data set.
Filter methods example – mutual information
The mutual information quantifies the amount of information obtained about one random variable through another
random variable. For two variables \(X\) and \(Y\), the mutual information is given by

\[ I(X; Y) = \int_X \int_Y p(x, y) \, \log \frac{p(x, y)}{p(x) \, p(y)} \, dx \, dy. \]

It measures how different the joint distribution \(p(x, y)\) is from the product of the marginal distributions \(p(x) \, p(y)\).
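For discrete variables the integrals become sums over the support; a small estimator from empirical counts (an illustrative sketch, not a production estimator):

```python
import numpy as np

def mutual_information(x, y):
    """Discrete mutual information I(X; Y) in nats, estimated from samples."""
    # Empirical joint distribution p(x, y) from co-occurrence counts
    xs, x_idx = np.unique(x, return_inverse=True)
    ys, y_idx = np.unique(y, return_inverse=True)
    joint = np.zeros((len(xs), len(ys)))
    np.add.at(joint, (x_idx, y_idx), 1)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)  # marginal p(x)
    p_y = p_xy.sum(axis=0, keepdims=True)  # marginal p(y)
    nz = p_xy > 0  # 0 * log 0 is taken to be 0
    return float((p_xy[nz] * np.log(p_xy[nz] / (p_x @ p_y)[nz])).sum())

x = np.array([0, 0, 1, 1])
print(mutual_information(x, x))                       # H(X) = log 2 ≈ 0.693
print(mutual_information(x, np.array([0, 1, 0, 1])))  # independent: 0.0
```

A variable paired with itself attains its own entropy, while an independent variable yields zero, matching the two extremes of the definition above.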
Filter methods example – maximizing joint mutual information
In the feature selection problem, we would like to maximise the mutual information between the selected variables
\(X_S\) and the target \(Y\):

\[ S^* = \arg\max_S \, I(X_S; Y) \quad \text{s.t.} \quad |S| = k, \]

where \(k\) is the number of features we want to select.
This is an NP-hard problem, as the number of possible feature subsets grows exponentially.
Filter methods example – maximizing joint mutual information
A popular heuristic in the literature is a greedy forward selection method, where features are selected
incrementally, one feature at a time.

Let \(S^{t-1} = \{x_{i_1}, \ldots, x_{i_{t-1}}\}\) be the set of selected features at time step \(t - 1\). The greedy method selects the next
feature \(f^t\) such that

\[ f^t = \arg\max_{i \notin S^{t-1}} \, I(X_{S^{t-1} \cup \{i\}}; Y). \]
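The greedy loop itself is simple; a skeleton with a hypothetical `score` callback standing in for an estimate of \(I(X_S; Y)\):

```python
def greedy_forward_selection(score, n_features, k):
    """Generic greedy forward selection.

    `score(subset)` returns the quality of a candidate feature subset
    (conceptually, an estimate of I(X_S; Y)); the exact joint-MI objective
    is intractable, so in practice `score` is an approximation.
    """
    selected = []
    for _ in range(k):
        remaining = [i for i in range(n_features) if i not in selected]
        # Add the feature whose inclusion maximises the score.
        best = max(remaining, key=lambda i: score(selected + [i]))
        selected.append(best)
    return selected

# Toy score that prefers low feature indices (stands in for a real MI estimate).
print(greedy_forward_selection(lambda s: -sum(s), 10, 3))  # -> [0, 1, 2]
```

Only \(k \cdot d\) subset evaluations are needed instead of the \(\binom{d}{k}\) required by exhaustive search.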
Filter methods example – maximizing joint mutual information
One can show (proof omitted here) that this is equivalent to the following:

\[ f^t = \arg\max_{i \notin S^{t-1}} \left[ I(x_i; Y) - I(x_i; X_{S^{t-1}}) + I(x_i; X_{S^{t-1}} \mid Y) \right]. \]

However, the quantities involving \(X_{S^{t-1}}\) quickly become computationally intractable, because they are \((t - 1)\)-
dimensional integrals!
Mutual information based measures trade off relevancy of a variable
against the redundancy of the information a variable contains
We can use an approximation to the multidimensional integrals to make the computation more tractable:

\[ f^t = \arg\max_{i \notin S^{t-1}} \Bigg[ \underbrace{I(x_i; Y)}_{\text{relevancy}} \; \underbrace{- \, \beta \sum_{k=1}^{t-1} I(x_{i_k}; x_i) + \gamma \sum_{k=1}^{t-1} I(x_{i_k}; x_i \mid Y)}_{\text{redundancy}} \Bigg], \]

where \(\beta\) and \(\gamma\) are to be specified. This greedy algorithm parametrizes a family of mutual information based
feature selection algorithms. The most prominent members of this family are:
1. Joint mutual information (JMI): \(\beta = \gamma = \frac{1}{t-1}\).
2. Maximum relevancy minimum redundancy (MRMR): \(\beta = \frac{1}{t-1}\) and \(\gamma = 0\).
3. Mutual information maximisation (MIM): \(\beta = \gamma = 0\).
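A sketch of this family for discrete features, estimating the pairwise terms with scikit-learn's `mutual_info_score` (the `greedy_mi_selection` and `cond_mutual_info` helpers are illustrative, not a published implementation):

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def cond_mutual_info(a, b, y):
    """I(a; b | y) for discrete variables: per-class MI, weighted by p(y)."""
    return sum((y == c).mean() * mutual_info_score(a[y == c], b[y == c])
               for c in np.unique(y))

def greedy_mi_selection(X, y, k, beta_gamma):
    """Greedy selection under the beta/gamma family above (discrete features).

    `beta_gamma(t)` returns (beta, gamma) for current subset size t, e.g.
    JMI: (1/t, 1/t); MRMR: (1/t, 0); MIM: (0, 0).
    """
    n = X.shape[1]
    selected = []
    for _ in range(k):
        beta, gamma = beta_gamma(max(len(selected), 1))
        def score(i):
            relevancy = mutual_info_score(X[:, i], y)
            redundancy = sum(mutual_info_score(X[:, j], X[:, i]) for j in selected)
            cond = sum(cond_mutual_info(X[:, j], X[:, i], y) for j in selected)
            return relevancy - beta * redundancy + gamma * cond
        selected.append(max((i for i in range(n) if i not in selected), key=score))
    return selected

# Toy data: feature 0 = target, feature 1 = copy of feature 0, feature 2 = noise.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 500)
X = np.column_stack([y, y, rng.integers(0, 2, 500)])
print(greedy_mi_selection(X, y, 2, lambda t: (1.0 / t, 0.0)))  # MRMR
```

With the MRMR weights, the redundancy penalty cancels the relevancy of feature 1 (a duplicate of the already-selected feature 0), whereas pure MIM would happily pick both copies.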
There are many open-source Python modules available that do filter-based
feature selection
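For example, scikit-learn's `SelectKBest` with `mutual_info_classif` implements a univariate (MIM-style) mutual-information filter:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)

# Filter method: rank features by a statistic (here, estimated mutual
# information with the target) without training the downstream model.
selector = SelectKBest(score_func=mutual_info_classif, k=10).fit(X, y)
X_reduced = selector.transform(X)
print(X_reduced.shape)  # (569, 10)
```

Because the scoring is univariate, it is fast and learner-independent, but it ignores interactions between features.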
Filter methods
Advantages
• Typically scale better to high-dimensional data sets than wrapper methods.
• Independent of the learning algorithm.
Disadvantages
• Ignore interaction with the learning algorithm.
• Often employ lower-dimensional approximations to make computations more tractable, which means they may
ignore interactions between different features.

Embedded methods
Embedded methods are a catch-all group of techniques that perform feature selection as
part of the model construction process.
Embedded methods example – stability selection
• Stability selection wraps around a base learning algorithm that has a parameter controlling the amount of
regularization.
• For every value of this parameter, we can get an estimate of which variables to select.
• Stability selection runs the learner on many bootstrap samples of the original data set, and keeps track of which
variables get selected in every sample to form a set of ‘stable’ variables.
For each bootstrap sample and each value of the penalization parameter:
1. Generate a bootstrap sample.
2. Estimate the LASSO on the bootstrapped sample.
3. Record the features that get selected.
Finally, compute the posterior probability of inclusion for each feature, and select the set of ‘stable’ features.
Stability selection is straightforward to implement in Python, and mature
implementations exist for both Python and R
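A minimal sketch of the procedure, using scikit-learn's `Lasso` with half-subsampling in place of the bootstrap (the penalty grid and the 0.8 stability threshold are illustrative choices, not fixed parts of the method):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=20, n_informative=3,
                       random_state=0)

n_boot, alphas = 50, [0.5, 1.0, 2.0]
counts = np.zeros(X.shape[1])
rng = np.random.default_rng(0)

# Iterate over the penalization parameter and the resamples, recording
# which features receive a non-zero LASSO coefficient.
for alpha in alphas:
    for _ in range(n_boot):
        idx = rng.choice(len(X), size=len(X) // 2, replace=False)
        coef = Lasso(alpha=alpha, max_iter=5000).fit(X[idx], y[idx]).coef_
        counts += coef != 0

# Empirical probability of inclusion; keep the 'stable' features.
inclusion_prob = counts / (n_boot * len(alphas))
stable = np.where(inclusion_prob >= 0.8)[0]
print(stable)
```

The truly informative features are selected in nearly every resample at every penalty level, while noise features drop in and out and therefore fall below the threshold.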
Embedded methods
Advantages
• Take the interaction between the feature subset search and the learning algorithm into account.
Disadvantages
• Computationally more expensive than filter methods.
Each of these three approaches has its advantages and disadvantages; the primary distinguishing factors are
speed of computation and the chance of overfitting:
• In terms of speed, filters are faster than embedded methods, which are in turn faster than wrappers.
• In terms of overfitting, wrappers have higher learning capacity and so are more likely to overfit than embedded methods,
which in turn are more likely to overfit than filter methods.
All of this, of course, changes at the extremes of data/feature availability.
What type of algorithm should I use in practice?

Más contenido relacionado

La actualidad más candente

Iaetsd protecting privacy preserving for cost effective adaptive actions
Iaetsd protecting  privacy preserving for cost effective adaptive actionsIaetsd protecting  privacy preserving for cost effective adaptive actions
Iaetsd protecting privacy preserving for cost effective adaptive actions
Iaetsd Iaetsd
 

La actualidad más candente (20)

Tuning the Untunable - Insights on Deep Learning Optimization
Tuning the Untunable - Insights on Deep Learning OptimizationTuning the Untunable - Insights on Deep Learning Optimization
Tuning the Untunable - Insights on Deep Learning Optimization
 
AWS Forcecast: DeepAR Predictor Time-series
AWS Forcecast: DeepAR Predictor Time-series AWS Forcecast: DeepAR Predictor Time-series
AWS Forcecast: DeepAR Predictor Time-series
 
Advanced Optimization for the Enterprise Webinar
Advanced Optimization for the Enterprise WebinarAdvanced Optimization for the Enterprise Webinar
Advanced Optimization for the Enterprise Webinar
 
Ad Click Prediction - Paper review
Ad Click Prediction - Paper reviewAd Click Prediction - Paper review
Ad Click Prediction - Paper review
 
Python tutorial for ML
Python tutorial for MLPython tutorial for ML
Python tutorial for ML
 
Modeling at scale in systematic trading
Modeling at scale in systematic tradingModeling at scale in systematic trading
Modeling at scale in systematic trading
 
Big Data Spain 2018: How to build Weighted XGBoost ML model for Imbalance dat...
Big Data Spain 2018: How to build Weighted XGBoost ML model for Imbalance dat...Big Data Spain 2018: How to build Weighted XGBoost ML model for Imbalance dat...
Big Data Spain 2018: How to build Weighted XGBoost ML model for Imbalance dat...
 
Musings of kaggler
Musings of kagglerMusings of kaggler
Musings of kaggler
 
Alpine Tech Talk: System ML by Berthold Reinwald
Alpine Tech Talk: System ML by Berthold ReinwaldAlpine Tech Talk: System ML by Berthold Reinwald
Alpine Tech Talk: System ML by Berthold Reinwald
 
IRJET- A Comprehensive Study of Artificial Bee Colony (ABC) Algorithms and it...
IRJET- A Comprehensive Study of Artificial Bee Colony (ABC) Algorithms and it...IRJET- A Comprehensive Study of Artificial Bee Colony (ABC) Algorithms and it...
IRJET- A Comprehensive Study of Artificial Bee Colony (ABC) Algorithms and it...
 
Iaetsd protecting privacy preserving for cost effective adaptive actions
Iaetsd protecting  privacy preserving for cost effective adaptive actionsIaetsd protecting  privacy preserving for cost effective adaptive actions
Iaetsd protecting privacy preserving for cost effective adaptive actions
 
Data Trend Analysis by Assigning Polynomial Function For Given Data Set
Data Trend Analysis by Assigning Polynomial Function For Given Data SetData Trend Analysis by Assigning Polynomial Function For Given Data Set
Data Trend Analysis by Assigning Polynomial Function For Given Data Set
 
SigOpt at O'Reilly - Best Practices for Scaling Modeling Platforms
SigOpt at O'Reilly - Best Practices for Scaling Modeling PlatformsSigOpt at O'Reilly - Best Practices for Scaling Modeling Platforms
SigOpt at O'Reilly - Best Practices for Scaling Modeling Platforms
 
Estimating project development effort using clustered regression approach
Estimating project development effort using clustered regression approachEstimating project development effort using clustered regression approach
Estimating project development effort using clustered regression approach
 
ESTIMATING PROJECT DEVELOPMENT EFFORT USING CLUSTERED REGRESSION APPROACH
ESTIMATING PROJECT DEVELOPMENT EFFORT USING CLUSTERED REGRESSION APPROACHESTIMATING PROJECT DEVELOPMENT EFFORT USING CLUSTERED REGRESSION APPROACH
ESTIMATING PROJECT DEVELOPMENT EFFORT USING CLUSTERED REGRESSION APPROACH
 
Analysis and Implementation of Efficient Association Rules using K-mean and N...
Analysis and Implementation of Efficient Association Rules using K-mean and N...Analysis and Implementation of Efficient Association Rules using K-mean and N...
Analysis and Implementation of Efficient Association Rules using K-mean and N...
 
Presentation: Ad-Click Prediction, A Data-Intensive Problem
Presentation: Ad-Click Prediction, A Data-Intensive ProblemPresentation: Ad-Click Prediction, A Data-Intensive Problem
Presentation: Ad-Click Prediction, A Data-Intensive Problem
 
Tuning for Systematic Trading: Talk 1
Tuning for Systematic Trading: Talk 1Tuning for Systematic Trading: Talk 1
Tuning for Systematic Trading: Talk 1
 
Tuning for Systematic Trading: Talk 3: Training, Tuning, and Metric Strategy
Tuning for Systematic Trading: Talk 3: Training, Tuning, and Metric StrategyTuning for Systematic Trading: Talk 3: Training, Tuning, and Metric Strategy
Tuning for Systematic Trading: Talk 3: Training, Tuning, and Metric Strategy
 
Tuning for Systematic Trading: Talk 2: Deep Learning
Tuning for Systematic Trading: Talk 2: Deep LearningTuning for Systematic Trading: Talk 2: Deep Learning
Tuning for Systematic Trading: Talk 2: Deep Learning
 

Similar a PyData London 2018 talk on feature selection

Cloudsim a fast clustering-based feature subset selection algorithm for high...
Cloudsim  a fast clustering-based feature subset selection algorithm for high...Cloudsim  a fast clustering-based feature subset selection algorithm for high...
Cloudsim a fast clustering-based feature subset selection algorithm for high...
ecway
 
A fast clustering based feature subset selection algorithm for high-dimension...
A fast clustering based feature subset selection algorithm for high-dimension...A fast clustering based feature subset selection algorithm for high-dimension...
A fast clustering based feature subset selection algorithm for high-dimension...
ecway
 
Android a fast clustering-based feature subset selection algorithm for high-...
Android  a fast clustering-based feature subset selection algorithm for high-...Android  a fast clustering-based feature subset selection algorithm for high-...
Android a fast clustering-based feature subset selection algorithm for high-...
ecway
 
Deep Learning Vocabulary.docx
Deep Learning Vocabulary.docxDeep Learning Vocabulary.docx
Deep Learning Vocabulary.docx
jaffarbikat
 

Similar a PyData London 2018 talk on feature selection (20)

IRJET- Machine Learning Techniques for Code Optimization
IRJET-  	  Machine Learning Techniques for Code OptimizationIRJET-  	  Machine Learning Techniques for Code Optimization
IRJET- Machine Learning Techniques for Code Optimization
 
A Threshold fuzzy entropy based feature selection method applied in various b...
A Threshold fuzzy entropy based feature selection method applied in various b...A Threshold fuzzy entropy based feature selection method applied in various b...
A Threshold fuzzy entropy based feature selection method applied in various b...
 
Proposing an Appropriate Pattern for Car Detection by Using Intelligent Algor...
Proposing an Appropriate Pattern for Car Detection by Using Intelligent Algor...Proposing an Appropriate Pattern for Car Detection by Using Intelligent Algor...
Proposing an Appropriate Pattern for Car Detection by Using Intelligent Algor...
 
Cloudsim a fast clustering-based feature subset selection algorithm for high...
Cloudsim  a fast clustering-based feature subset selection algorithm for high...Cloudsim  a fast clustering-based feature subset selection algorithm for high...
Cloudsim a fast clustering-based feature subset selection algorithm for high...
 
A fast clustering based feature subset selection algorithm for high-dimension...
A fast clustering based feature subset selection algorithm for high-dimension...A fast clustering based feature subset selection algorithm for high-dimension...
A fast clustering based feature subset selection algorithm for high-dimension...
 
Android a fast clustering-based feature subset selection algorithm for high-...
Android  a fast clustering-based feature subset selection algorithm for high-...Android  a fast clustering-based feature subset selection algorithm for high-...
Android a fast clustering-based feature subset selection algorithm for high-...
 
Deep Learning Vocabulary.docx
Deep Learning Vocabulary.docxDeep Learning Vocabulary.docx
Deep Learning Vocabulary.docx
 
Feature Subset Selection for High Dimensional Data using Clustering Techniques
Feature Subset Selection for High Dimensional Data using Clustering TechniquesFeature Subset Selection for High Dimensional Data using Clustering Techniques
Feature Subset Selection for High Dimensional Data using Clustering Techniques
 
Datascience101presentation4
Datascience101presentation4Datascience101presentation4
Datascience101presentation4
 
Test PDF file
Test PDF fileTest PDF file
Test PDF file
 
Bug Triage: An Automated Process
Bug Triage: An Automated ProcessBug Triage: An Automated Process
Bug Triage: An Automated Process
 
churn_detection.pptx
churn_detection.pptxchurn_detection.pptx
churn_detection.pptx
 
Survey on Software Data Reduction Techniques Accomplishing Bug Triage
Survey on Software Data Reduction Techniques Accomplishing Bug TriageSurvey on Software Data Reduction Techniques Accomplishing Bug Triage
Survey on Software Data Reduction Techniques Accomplishing Bug Triage
 
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)
 
AIRLINE FARE PRICE PREDICTION
AIRLINE FARE PRICE PREDICTIONAIRLINE FARE PRICE PREDICTION
AIRLINE FARE PRICE PREDICTION
 
IRJET- Prediction of Crime Rate Analysis using Supervised Classification Mach...
IRJET- Prediction of Crime Rate Analysis using Supervised Classification Mach...IRJET- Prediction of Crime Rate Analysis using Supervised Classification Mach...
IRJET- Prediction of Crime Rate Analysis using Supervised Classification Mach...
 
Evaluating and Enhancing Efficiency of Recommendation System using Big Data A...
Evaluating and Enhancing Efficiency of Recommendation System using Big Data A...Evaluating and Enhancing Efficiency of Recommendation System using Big Data A...
Evaluating and Enhancing Efficiency of Recommendation System using Big Data A...
 
Parallel Evolutionary Algorithms for Feature Selection in High Dimensional Da...
Parallel Evolutionary Algorithms for Feature Selection in High Dimensional Da...Parallel Evolutionary Algorithms for Feature Selection in High Dimensional Da...
Parallel Evolutionary Algorithms for Feature Selection in High Dimensional Da...
 
IRJET- Sentimental Analysis for Online Reviews using Machine Learning Algorithms
IRJET- Sentimental Analysis for Online Reviews using Machine Learning AlgorithmsIRJET- Sentimental Analysis for Online Reviews using Machine Learning Algorithms
IRJET- Sentimental Analysis for Online Reviews using Machine Learning Algorithms
 
Competition16
Competition16Competition16
Competition16
 

Último

Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
amitlee9823
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
amitlee9823
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
JoseMangaJr1
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
AroojKhan71
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
amitlee9823
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
amitlee9823
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
amitlee9823
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
amitlee9823
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
MarinCaroMartnezBerg
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
amitlee9823
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
amitlee9823
 

Último (20)

Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
 

PyData London 2018 talk on feature selection

  • 2. PyData London 28th April, 2018 Thomas Huijskens Senior Data Scientist How to get better performance with less data
  • 3. All content copyright © 2017 QuantumBlack, a McKinsey company. Feature collinearity and scarcity of data mean we can't just give a model many features and let it decide which ones are useful and which ones are not. There are multiple reasons to do feature selection when developing machine learning models: • Computational burden: Limiting the number of features may reduce the computational burden of processing the data in the learning algorithm. • Risk of overfitting: Removing noisy or presumably redundant variables may reduce the risk of overfitting and yield better class separation. • Interpretability: Removing redundant variables from the input data can make the results more interpretable for both the seasoned practitioner and any business stakeholders. It pays off to do feature selection as part of the model development process.
  • 4. Feature selection algorithms should: • select variables that carry relevant information about the target variable; and • reduce the overlap in information between the variables in the subset of selected features. A good feature selection algorithm also shouldn't look at variables purely in isolation: • Two variables that are useless by themselves can be useful together. • Very high variable correlation (or anti-correlation) does not mean absence of variable complementarity. What are the components of a good feature selection algorithm?
  • 5. Two variables that are useless by themselves can be useful together. [1] Guyon, I. and Elisseeff, A., 2003. An introduction to variable and feature selection. Journal of Machine Learning Research, 3(Mar), pp. 1157-1182.
  • 6. Very high variable correlation (or anti-correlation) does not mean absence of variable complementarity. [1] Guyon, I. and Elisseeff, A., 2003. An introduction to variable and feature selection. Journal of Machine Learning Research, 3(Mar), pp. 1157-1182.
  • 7. Feature selection algorithms can be divided into three categories: wrapper methods, filter methods, and embedded methods. [Diagram: wrapper methods generate a subset from the set of all features and evaluate it with the learning algorithm and its performance; filter methods generate a subset from the set of all features and evaluate it separately from the learning algorithm's performance; embedded methods perform subset selection inside the learning algorithm itself.]
  • 8. Feature selection algorithms can be divided into three categories. Wrapper methods use learning algorithms on the original data, and assess the features by the performance of the learning algorithm.
  • 9. 9All content copyright © 2017 QuantumBlack, a McKinsey company Mlxtend is an open-source Python package that implements multiple wrapper methods
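The wrapper loop is simple enough to sketch by hand. Below is a minimal greedy forward wrapper selection built on scikit-learn cross-validation; mlxtend's SequentialFeatureSelector packages a polished version of the same idea. The function name, data set, estimator, and number of features here are illustrative choices, not from the talk.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_select(X, y, estimator, k, cv=5):
    """Greedy forward wrapper selection: at each step, add the candidate
    feature whose inclusion yields the best cross-validated score."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < k:
        scores = {f: cross_val_score(estimator, X[:, selected + [f]], y, cv=cv).mean()
                  for f in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data: with shuffle=False, the 3 informative features are columns 0-2.
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
chosen = forward_select(X, y, LogisticRegression(max_iter=1000), k=3)
print(chosen)
```

Note that a new model is cross-validated for every candidate feature at every step, which is exactly why wrapper methods are computationally intensive.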
  • 10. Feature selection algorithms can be divided into three categories. Wrapper methods use learning algorithms on the original data, and assess the features by the performance of the learning algorithm. Advantages: • Usually provides the best performing feature set for that particular type of model. Disadvantages: • Wrapper methods may generate feature sets that are overly specific to the learner used. • As wrapper methods train a new model for each subset, they are very computationally intensive. Filter methods do not use a learner on the original data, but only consider statistical characteristics of the data set.
  • 11. Filter methods example – mutual information. The mutual information quantifies the amount of information obtained about one random variable through another random variable. For two variables X and Y, the mutual information is given by I(X; Y) = ∫∫ p(x, y) log[ p(x, y) / (p(x) p(y)) ] dx dy. It determines how similar the joint distribution p(x, y) is to the product of the factored marginal distributions p(x) p(y).
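For discrete variables the integrals become sums over the empirical joint distribution, which is straightforward to compute directly. A small sketch in numpy (the function name and toy data are mine, not from the talk; the result is in nats since we use the natural log):

```python
import numpy as np

def mutual_information(x, y):
    """Mutual information (in nats) between two discrete samples,
    estimated from the empirical joint distribution."""
    xi = np.unique(np.asarray(x), return_inverse=True)[1]
    yi = np.unique(np.asarray(y), return_inverse=True)[1]
    joint = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(joint, (xi, yi), 1)          # joint counts
    p = joint / joint.sum()                # empirical p(x, y)
    outer = p.sum(1, keepdims=True) @ p.sum(0, keepdims=True)  # p(x) p(y)
    nz = p > 0
    return float((p[nz] * np.log(p[nz] / outer[nz])).sum())

# Identical variables: MI equals the entropy of the variable.
a = [0, 0, 1, 1]
print(mutual_information(a, a))                     # ≈ 0.693 (= log 2)
# Independent variables: MI is zero.
print(mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))  # 0.0
```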
  • 12. Filter methods example – maximizing joint mutual information. In the feature selection problem, we would like to maximise the mutual information between the selected variables X_S and the target Y: S* = arg max_S I(X_S; Y), s.t. |S| = k, where k is the number of features we want to select. This is an NP-hard problem, as the set of possible combinations of features grows exponentially.
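For a tiny discrete problem the exhaustive search is still feasible, which makes the blow-up concrete: p features yield C(p, k) candidate subsets. An illustrative sketch (the helper names and XOR toy data are assumptions for the demo, not from the talk):

```python
import itertools
import numpy as np

def joint_mi(X, cols, y):
    """I(X_S; Y) for discrete data: encode the selected columns as one
    discrete variable and compute MI from empirical counts."""
    codes = np.unique(X[:, list(cols)], axis=0, return_inverse=True)[1].ravel()
    yc = np.unique(np.asarray(y), return_inverse=True)[1]
    joint = np.zeros((codes.max() + 1, yc.max() + 1))
    np.add.at(joint, (codes, yc), 1)
    p = joint / joint.sum()
    outer = p.sum(1, keepdims=True) @ p.sum(0, keepdims=True)
    nz = p > 0
    return float((p[nz] * np.log(p[nz] / outer[nz])).sum())

def best_subset(X, y, k):
    """Exhaustive search: evaluates all C(p, k) subsets -- exponential in p."""
    return max(itertools.combinations(range(X.shape[1]), k),
               key=lambda s: joint_mi(X, s, y))

# XOR target: only the *pair* (x0, x1) is informative about y,
# so exhaustive search over subsets of size 2 finds it.
x0 = np.array([0, 0, 1, 1, 0, 0, 1, 1])
x1 = np.array([0, 1, 0, 1, 0, 1, 0, 1])
X = np.column_stack([x0, x1, np.zeros(8, dtype=int)])
y = x0 ^ x1
print(best_subset(X, y, 2))  # (0, 1)
```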
  • 13. Filter methods example – maximizing joint mutual information. A popular heuristic in the literature is to use a greedy forward selection method, where features are selected incrementally, one feature at a time. Let S_{t-1} = {f_1, ..., f_{t-1}} be the set of selected features at time step t - 1. The greedy method selects the next feature f_t such that f_t = arg max_{f ∉ S_{t-1}} I(X_{S_{t-1} ∪ {f}}; Y).
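This greedy step can be run with the exact joint mutual information on small discrete data (the next slide explains why that stops scaling). An illustrative sketch (helper names and the OR-gate toy data are mine, not from the talk):

```python
import numpy as np

def joint_mi(X, cols, y):
    """I(X_S; Y) for discrete data, from empirical counts."""
    codes = np.unique(X[:, list(cols)], axis=0, return_inverse=True)[1].ravel()
    yc = np.unique(np.asarray(y), return_inverse=True)[1]
    joint = np.zeros((codes.max() + 1, yc.max() + 1))
    np.add.at(joint, (codes, yc), 1)
    p = joint / joint.sum()
    outer = p.sum(1, keepdims=True) @ p.sum(0, keepdims=True)
    nz = p > 0
    return float((p[nz] * np.log(p[nz] / outer[nz])).sum())

def greedy_joint_mi(X, y, k):
    """At each step t, add the feature f maximizing I(X_{S ∪ {f}}; Y)."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(k):
        best = max(remaining, key=lambda f: joint_mi(X, selected + [f], y))
        selected.append(best)
        remaining.remove(best)
    return selected

# OR-gate target with a useless constant feature in column 0:
# the greedy picks one informative input, then its complement.
x0 = np.array([0, 0, 1, 1, 0, 0, 1, 1])
x1 = np.array([0, 1, 0, 1, 0, 1, 0, 1])
X = np.column_stack([np.zeros(8, dtype=int), x0, x1])
y = x0 | x1
print(greedy_joint_mi(X, y, 2))  # [1, 2]
```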
  • 14. Filter methods example – maximizing joint mutual information. One can show (proof omitted here) that this is equivalent to the following: f_t = arg max_{f ∉ S_{t-1}} I(X_f; Y) - [ I(X_f; X_{S_{t-1}}) - I(X_f; X_{S_{t-1}} | Y) ]. However, the quantities involving X_{S_{t-1}} quickly become computationally intractable, because they are (t - 1)-dimensional integrals!
  • 15. Mutual information based measures trade off the relevancy of a variable against the redundancy of the information the variable contains. We can use an approximation to the multidimensional integrals to make the computation more tractable: arg max_{f ∉ S_{t-1}} I(X_f; Y) [relevancy] - [ β Σ_{s ∈ S_{t-1}} I(X_s; X_f) - γ Σ_{s ∈ S_{t-1}} I(X_s; X_f | Y) ] [redundancy], where β and γ are to be specified. This greedy algorithm parametrizes a family of mutual information based feature selection algorithms. The most prominent members of this family are: 1. Joint mutual information (JMI): β = γ = 1/|S_{t-1}|. 2. Maximum relevancy minimum redundancy (MRMR): β = 1/|S_{t-1}| and γ = 0. 3. Mutual information maximisation (MIM): β = γ = 0.
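The whole family only needs pairwise (conditional) mutual information estimates, so it is cheap to sketch for discrete data. In the demo below (helper names and toy data are illustrative choices, not from the talk), feature 1 duplicates feature 0: MIM, which ignores redundancy, happily picks the duplicate, while MRMR and JMI penalize it.

```python
import numpy as np

def mi(a, b):
    """Discrete mutual information (in nats) from empirical counts."""
    a = np.unique(np.asarray(a), return_inverse=True)[1]
    b = np.unique(np.asarray(b), return_inverse=True)[1]
    joint = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(joint, (a, b), 1)
    p = joint / joint.sum()
    outer = p.sum(1, keepdims=True) @ p.sum(0, keepdims=True)
    nz = p > 0
    return float((p[nz] * np.log(p[nz] / outer[nz])).sum())

def cond_mi(a, b, y):
    """Conditional mutual information I(a; b | y) for discrete data."""
    a, b, y = map(np.asarray, (a, b, y))
    return sum((y == v).mean() * mi(a[y == v], b[y == v]) for v in np.unique(y))

def greedy_select(X, y, k, beta, gamma):
    """Greedy forward selection maximizing
    I(X_f; Y) - [beta(t) * sum_s I(X_s; X_f) - gamma(t) * sum_s I(X_s; X_f | Y)],
    with beta and gamma given as functions of |S_{t-1}|."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(k):
        def score(f):
            s = mi(X[:, f], y)
            if selected:
                t = len(selected)
                s -= beta(t) * sum(mi(X[:, j], X[:, f]) for j in selected)
                s += gamma(t) * sum(cond_mi(X[:, j], X[:, f], y) for j in selected)
            return s
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

def mim(X, y, k):  return greedy_select(X, y, k, lambda t: 0.0, lambda t: 0.0)
def mrmr(X, y, k): return greedy_select(X, y, k, lambda t: 1.0 / t, lambda t: 0.0)
def jmi(X, y, k):  return greedy_select(X, y, k, lambda t: 1.0 / t, lambda t: 1.0 / t)

# Feature 1 duplicates feature 0; feature 2 is uninformative alone
# but only weakly redundant with feature 0.
y  = np.array([0, 0, 0, 0, 1, 1, 1, 1])
x0 = np.array([0, 0, 0, 1, 1, 1, 1, 1])
X = np.column_stack([x0, x0, np.array([0, 1, 0, 1, 0, 0, 1, 1])])
print(mim(X, y, 2))   # [0, 1] -- no redundancy penalty, picks the duplicate
print(mrmr(X, y, 2))  # [0, 2] -- redundancy penalty rejects the duplicate
print(jmi(X, y, 2))   # [0, 2]
```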
  • 16. There are many open-source Python modules available that do filter-based feature selection.
  • 17. Feature selection algorithms can be divided into three categories. Wrapper methods use learning algorithms on the original data, and assess the features by the performance of the learning algorithm. Advantages: • Usually provides the best performing feature set for that particular type of model. Disadvantages: • Wrapper methods may generate feature sets that are overly specific to the learner used. • As wrapper methods train a new model for each subset, they are very computationally intensive. Filter methods do not use a learner on the original data, but only consider statistical characteristics of the data set. Advantages: • Typically scale better to high-dimensional data sets than wrapper methods. • Independent of the learning algorithm. Disadvantages: • Ignore interaction with the learning algorithm. • Often employ lower-dimensional approximations to make computations more tractable, which means they may ignore interactions between different features. Embedded methods are a catch-all group of techniques which perform feature selection as part of the model construction process.
  • 18. Embedded methods example – stability selection. • Stability selection wraps around a base learning algorithm that has a parameter controlling the amount of regularization. • For every value of this parameter, we can get an estimate of which variables to select. • Stability selection runs the learner on many bootstrap samples of the original data set, and keeps track of which variables get selected in every sample to form a set of 'stable' variables. [Flow: for each bootstrap sample and each value of the penalization parameter: generate a bootstrap sample; estimate a LASSO on the bootstrapped sample; record the features that get selected. Then compute the posterior probability of inclusion and select the set of 'stable' features.]
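A minimal sketch of that loop in scikit-learn (this is an illustration, not the talk's code: the subsample size, penalization grid, threshold, and toy data are arbitrary choices; following Meinshausen and Bühlmann's original formulation, it uses half-size subsamples rather than full bootstrap resamples):

```python
import numpy as np
from sklearn.linear_model import Lasso

def stability_selection(X, y, alphas, n_bootstrap=50, threshold=0.6, seed=0):
    """For each penalization value and each half-size subsample, fit a LASSO
    and record which coefficients are nonzero; a feature is 'stable' if its
    selection frequency exceeds the threshold for some penalization value."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    freq = np.zeros((len(alphas), p))
    for i, alpha in enumerate(alphas):
        for _ in range(n_bootstrap):
            idx = rng.choice(n, size=n // 2, replace=False)
            coef = Lasso(alpha=alpha).fit(X[idx], y[idx]).coef_
            freq[i] += np.abs(coef) > 1e-8
    freq /= n_bootstrap
    return np.where(freq.max(axis=0) >= threshold)[0]

# Toy data: 3 informative features out of 10, with known coefficients.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(scale=0.5, size=200)
stable = stability_selection(X, y, alphas=[0.2, 0.5, 1.0])
print(stable)
```

The noise features are only ever selected by chance on individual subsamples, so their selection frequency stays far below the threshold at every penalization value.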
  • 19. Stability selection is straightforward to implement in Python, and mature implementations exist for both Python and R.
  • 20. Feature selection algorithms can be divided into three categories. Wrapper methods use learning algorithms on the original data, and assess the features by the performance of the learning algorithm. Advantages: • Usually provides the best performing feature set for that particular type of model. Disadvantages: • Wrapper methods may generate feature sets that are overly specific to the learner used. • As wrapper methods train a new model for each subset, they are very computationally intensive. Filter methods do not use a learner on the original data, but only consider statistical characteristics of the data set. Advantages: • Typically scale better to high-dimensional data sets than wrapper methods. • Independent of the learning algorithm. Disadvantages: • Ignore interaction with the learning algorithm. • Often employ lower-dimensional approximations to make computations more tractable, which means they may ignore interactions between different features. Embedded methods are a catch-all group of techniques which perform feature selection as part of the model construction process. Advantages: • Take the interaction between the feature subset search and the learning algorithm into account. Disadvantages: • Computationally more expensive than filter methods.
  • 21. Each of these three approaches has its advantages and disadvantages, the primary distinguishing factors being speed of computation and the chance of overfitting: • In terms of speed, filters are faster than embedded methods, which are in turn faster than wrappers. • In terms of overfitting, wrappers have higher learning capacity, so they are more likely to overfit than embedded methods, which in turn are more likely to overfit than filter methods. All of this of course changes with extremes of data/feature availability. What type of algorithm should I use in practice?