Predicting Customer Conversion
with Random Forests
A Decision Trees Case Study




Daniel Gerlanc, Principal
Enplus Advisors, Inc.
www.enplusadvisors.com
dgerlanc@enplusadvisors.com
Topics

• Objectives: Research Question
• Data: Bank Prospect Conversion
• Methods: Decision Trees, Random Forests
• Results
Objective

• Which customers or prospects should
  you call today?
• To whom should you offer incentives?
Dataset

• Direct marketing campaign for bank loans
• http://archive.ics.uci.edu/ml/datasets/Bank+Marketing
• 45,211 records, 17 features
Decision Trees

Sunny?
├─ yes: Windy?
│        ├─ Coat
│        └─ No Coat
└─ no: Coat
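
The toy tree above amounts to a pair of nested if/else rules. A minimal sketch in R follows; the function name `needs_coat` is made up for illustration, and it assumes that on a sunny day the "Windy" node sends windy days to "Coat" and calm days to "No Coat", since the slide text does not label those two branches.

```r
# Hand-written decision rule mirroring the toy tree.
# Assumption: under "Sunny = yes", windy days lead to "Coat" and calm days
# to "No Coat"; the slide does not label those two branches.
needs_coat <- function(sunny, windy) {
  if (!sunny) {
    return("Coat")            # "Sunny = no" branch goes straight to Coat
  }
  if (windy) "Coat" else "No Coat"
}

needs_coat(sunny = TRUE, windy = FALSE)   # "No Coat"
needs_coat(sunny = FALSE, windy = FALSE)  # "Coat"
```

A statistical decision tree learns rules of this shape from data instead of having them written by hand.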
Statistical Decision Trees

• Randomness
• May not know the relationships ahead
  of time
Splitting




Deterministic process
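
A minimal sketch of why splitting is deterministic: every candidate threshold is scored and the best one wins, with no random draws involved. Gini impurity is assumed here as the scoring rule (it is the default for classification in rpart); the slide itself does not name a criterion. The helper names and the toy data are hypothetical.

```r
# Gini impurity of a vector of class labels.
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Score every threshold on one numeric predictor and return the one with the
# lowest weighted impurity. The same data always yields the same split.
best_split <- function(x, y) {
  xs <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2   # midpoints between values
  score <- sapply(cuts, function(cut) {
    left <- y[x < cut]
    right <- y[x >= cut]
    (length(left) * gini(left) + length(right) * gini(right)) / length(y)
  })
  list(cut = cuts[which.min(score)], impurity = min(score))
}

# Toy data (hypothetical): older prospects convert.
age <- c(22, 25, 31, 38, 45, 52, 60, 67)
converted <- c("no", "no", "no", "no", "yes", "yes", "yes", "yes")
best_split(age, converted)   # cut = 41.5, impurity = 0
```

A full tree repeats this search over every predictor and then recurses into each side of the chosen split.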
Decision Tree Code
 library(rpart)
 tree.1 <- rpart(takes.loan ~ ., data = bank)




• See the 'rpart' and 'rpart.plot' R packages.
• Many parameters available to control the fit.
Make Predictions
predict(tree.1, type="vector")
How'd it do?
 Naïve Accuracy: 11.7% (5,289 / 45,211)
 Decision Tree
   • Precision: 64.4% (1,845 / 2,863)
   • Recall: 34.9% (1,845 / 5,289)

            Actual
Predicted   no            yes
no          (1) 38,904    (3) 3,444
yes         (2) 1,018     (4) 1,845
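
The metrics can be recomputed directly from the four counts in the table. The sketch below uses only base R; the variable names are illustrative, and "base rate" here means the share of all prospects who actually convert, which is what the 11.7% figure works out to.

```r
# Confusion matrix from the decision tree slide: rows are predictions,
# columns are actual outcomes (matrix() fills column by column).
cm <- matrix(c(38904, 1018, 3444, 1845), nrow = 2,
             dimnames = list(Predicted = c("no", "yes"),
                             Actual = c("no", "yes")))

precision <- cm["yes", "yes"] / sum(cm["yes", ])   # of those called, how many convert
recall    <- cm["yes", "yes"] / sum(cm[, "yes"])   # of converters, how many were called
base_rate <- sum(cm[, "yes"]) / sum(cm)            # converters among all prospects

round(100 * c(precision = precision, recall = recall, base_rate = base_rate), 1)
# precision 64.4, recall 34.9, base_rate 11.7
```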
Decision Tree Problems

• Overfitting the data (high variance)
• May not use all relevant features
Random Forests


One Decision Tree

Many Decision Trees (Ensemble)
Building RF

• Sample from the data
• At each split, sample from the available
  variables
• Repeat for each tree
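
The two sampling steps above can be sketched in base R without fitting any trees. This is an illustration only: it assumes rows are sampled with replacement (a bootstrap sample), the predictor names are an illustrative subset, and `forest_plan` is a made-up structure rather than anything the randomForest package exposes.

```r
set.seed(42)                       # only so the sketch is reproducible

n_rows <- 10                       # hypothetical number of records
predictors <- c("age", "job", "balance", "housing", "duration", "campaign")
n_trees <- 3
mtry <- floor(sqrt(length(predictors)))   # 2 of the 6 candidates per split

forest_plan <- lapply(seq_len(n_trees), function(i) {
  list(
    # Step 1: sample row indices for this tree (assumed: with replacement).
    rows = sample(n_rows, n_rows, replace = TRUE),
    # Step 2: variables considered at one split; a real forest redraws
    # this subset at every split of every tree.
    split_vars = sample(predictors, mtry)
  )
})

str(forest_plan[[1]])
```

Because each tree sees different rows and different candidate variables, the trees end up disagreeing with one another in useful ways.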
Motivations for RF

• Create uncorrelated trees
• Variance reduction
• Subspace exploration
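
The link between uncorrelated trees and variance reduction can be made concrete with the standard formula for the variance of an average of B identically distributed predictions, each with variance sigma2 and pairwise correlation rho: rho * sigma2 + (1 - rho) * sigma2 / B. The function name below is made up for illustration, and the inputs are arbitrary.

```r
# Variance of the average of B identically distributed tree predictions,
# each with variance sigma2 and pairwise correlation rho.
ensemble_variance <- function(rho, sigma2, B) {
  rho * sigma2 + (1 - rho) * sigma2 / B
}

ensemble_variance(rho = 0.0, sigma2 = 1, B = 500)   # 0.002
ensemble_variance(rho = 0.5, sigma2 = 1, B = 500)   # 0.501
ensemble_variance(rho = 1.0, sigma2 = 1, B = 500)   # 1
```

Adding trees shrinks only the second term, so the correlation between trees sets the floor; that is why the forest works to decorrelate them.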
Random Forests
library(randomForest)
rffit.1 <- randomForest(takes.loan ~ ., data = bank)




Most important parameters are:

 Variable   Description                       Default
 ntree      Number of trees                   500
 mtry       Number of variables to randomly   • square root of # predictors for classification
            select at each node               • # predictors / 3 for regression
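
As a worked example of the mtry defaults in the table, the helper below reproduces the two rules, rounding down as the randomForest package does. The name `default_mtry` is made up, and the value 16 assumes that one of the 17 features is the response, leaving 16 predictors.

```r
# Default number of variables tried at each split, following the two rules
# in the table (both rounded down, with a minimum of 1 for regression).
default_mtry <- function(p, classification = TRUE) {
  if (classification) floor(sqrt(p)) else max(floor(p / 3), 1)
}

default_mtry(16)          # 4 (classification)
default_mtry(16, FALSE)   # 5 (regression)
```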
How'd it do?
Naïve Accuracy: 11.7%
Random Forest
  •    Precision: 64.5% (2,541 / 3,937)
  •    Recall: 48.0% (2,541 / 5,289)

            Actual
Predicted   yes           no
yes         (1) 2,541     (2) 1,396
no          (3) 2,748     (4) 38,526
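
In practice a table like this comes from cross-tabulating predicted and actual labels with `table()`. The sketch below rebuilds label vectors from the four counts, taking 3,937 as the number of predicted conversions and 5,289 as the number of actual conversions, as in the precision and recall bullets. With a fitted model, the two vectors would come from the predictions and the data instead.

```r
counts <- c(2541, 1396, 2748, 38526)
lv <- c("yes", "no")

# One label per prospect, cell by cell: (pred yes, actual yes),
# (pred yes, actual no), (pred no, actual yes), (pred no, actual no).
predicted <- factor(rep(c("yes", "yes", "no", "no"), times = counts), levels = lv)
actual    <- factor(rep(c("yes", "no", "yes", "no"), times = counts), levels = lv)

cm <- table(Predicted = predicted, Actual = actual)

precision <- cm["yes", "yes"] / sum(cm["yes", ])   # 2541 / 3937
recall    <- cm["yes", "yes"] / sum(cm[, "yes"])   # 2541 / 5289

round(100 * c(precision = precision, recall = recall), 1)
# precision 64.5, recall 48.0
```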
Tuning RF




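rffit.1 <- tuneRF(X, y, mtryStart=1, stepFactor=2, improve=0.05, doBest=TRUE)

• With doBest=TRUE, tuneRF returns the forest fit at the best mtry; by default it returns a matrix of mtry values and their out-of-bag errors.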
Benefits of RF

• Good accuracy with default settings
• Relatively easy to parallelize
• Many implementations
  • R, Weka, RapidMiner, Mahout
References

•   A. Liaw and M. Wiener (2002). Classification and Regression by randomForest. R News 2(3), 18-22.

•   Breiman, Leo. Classification and Regression Trees. Belmont, Calif: Wadsworth International Group, 1984. Print.

•   Breiman, Leo and Adele Cutler. Random forests. http://www.stat.berkeley.edu/~breiman/RandomForests/cc_contact.htm

•   S. Moro, R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM
    Methodology. In P. Novais et al. (Eds.), Proceedings of the European Simulation and Modelling Conference - ESM'2011,
    pp. 117-121, Guimarães, Portugal, October, 2011. EUROSIS.



Editor's Notes

  1. Tools that help you decide how to spend those limited resources.