5. Underfitting model vs. overfitting model
In the world of analytics, we often face the challenge of avoiding both underfitting and overfitting
models and finding a balance between them. A balanced model has a better chance of working on
previously unseen data. I would like to present this data mining project as a quest for a reasonable
model that fares well against all metrics, i.e. finding that “middle path”.
[Figure panels: underfitting model, overfitting model, optimal model]
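A minimal sketch of the two extremes, assuming a toy one-dimensional dataset and a k-nearest-neighbour classifier (neither comes from this project). With k = 1 the model memorises the training set, and with k equal to the training-set size it always predicts the majority class; a value in between is the kind of balance the slide is after.

```python
import random

def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0

def accuracy(train, data, k):
    """Fraction of points in `data` classified correctly using `train`."""
    hits = sum(knn_predict(train, x, k) == y for x, y in data)
    return hits / len(data)

def make_data(n, rng):
    """Hypothetical toy data: label is 1 when x > 0.5, flipped 20% of the time."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < 0.2:
            y = 1 - y  # label noise
        data.append((x, y))
    return data

rng = random.Random(0)
train, test = make_data(200, rng), make_data(200, rng)

# k = 1 memorises the noise; k = len(train) ignores x altogether.
for k in (1, 15, len(train)):
    print(k, accuracy(train, train, k), accuracy(train, test, k))
```

Comparing the training and test columns for each k shows the gap that the rest of the deck tries to keep small.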
6. Critical to avoiding an underfitting model:
• Gather enough, but not too much, data
• Identify and remove noise and outliers
• Remove irrelevant features - they can confuse models
• Clean and prepare the data well
• Identify nominal, ordinal and continuous features
• Condition the features before feeding them to models
7. Critical to avoiding an overfitting model:
• Test, test and test some more
• Cross-validate models against different mixes of the data
• Weigh models against multiple performance metrics
• If possible, test against real-time unseen data
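To illustrate why a single metric is not enough, here is a small sketch assuming hypothetical labels in which only 10% of customers sign up. The label values and function names are illustrative, not taken from the project.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels where 1 = signed up."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def recall(tp, tn, fp, fn):
    # Share of actual sign-ups that the model found.
    return tp / (tp + fn) if (tp + fn) else 0.0

def precision(tp, tn, fp, fn):
    # Share of predicted sign-ups that were real.
    return tp / (tp + fp) if (tp + fp) else 0.0

# Hypothetical imbalanced data: 10 sign-ups out of 100 customers.
y_true = [1] * 10 + [0] * 90
# A "model" that always predicts "no".
y_pred = [0] * 100

counts = confusion_counts(y_true, y_pred)
print("accuracy:", accuracy(*counts))    # looks good
print("recall:", recall(*counts))        # finds no sign-ups at all
print("precision:", precision(*counts))
```

The always-no model scores well on accuracy yet never identifies a single customer who would sign up, which is why the models here are weighed on several metrics.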
8. Let’s get started…
Business problem at hand
Feature analysis - interesting observations
Feature selection and transformation
Model building and evaluation
Conclusion
9. Business problem at hand
What we have
Data gathered from a recent campaign by a bank
The campaign was about getting people to sign up for term deposits
We have customer information, along with whether each
customer signed up for the term deposit
What we want
A machine learning model that can tell whether a new customer is likely to
sign up for a term deposit
11. Feature Analysis
Will a new customer sign up for term deposit?
Strong indicator for yes: previous outcome
65% of those who previously said yes said yes again!
Although many previous outcomes were unknown, this is still a good feature
[Bar chart: % who signed up for term deposit, for customers who previously said yes vs. previously said no]
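The per-category percentages shown on these feature-analysis slides can be computed along the following lines. This is a sketch only: the field names (`previous_outcome`, `signed_up`) and the four records are hypothetical, not the bank's data.

```python
from collections import defaultdict

def signup_rate_by(records, feature, target="signed_up"):
    """Percentage of 'yes' outcomes for each value of `feature`."""
    totals = defaultdict(int)
    yes = defaultdict(int)
    for record in records:
        value = record[feature]
        totals[value] += 1
        yes[value] += record[target] == "yes"
    return {value: 100.0 * yes[value] / totals[value] for value in totals}

# Illustrative records only.
records = [
    {"previous_outcome": "success", "signed_up": "yes"},
    {"previous_outcome": "success", "signed_up": "no"},
    {"previous_outcome": "failure", "signed_up": "no"},
    {"previous_outcome": "failure", "signed_up": "no"},
]

rates = signup_rate_by(records, "previous_outcome")
print(rates)
```

A feature whose values give clearly different rates is a candidate strong indicator; one whose rates are nearly flat is a weak one.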
12. Feature Analysis
Will a new customer sign up for term deposit?
Strong indicator for yes: housing loan
20% of those who did not have a housing loan said yes!
[Bar chart: % who signed up for term deposit, no housing loan vs. housing loan]
13. Feature Analysis
Will a new customer sign up for term deposit?
Strong indicator for yes: loan default
13% of those who had no loan default said yes.
Nobody with a loan default said yes - the type of information that classification algorithms can use
[Bar chart: % who signed up for term deposit, no loan default vs. loan default]
14. Feature Analysis
Will a new customer sign up for term deposit?
Moderate indicator for yes: age
The percentage is almost constant across a wide age range - not much of a differentiating factor
[Bar chart: % who signed up for term deposit (axis 0 to 40), by age (21 to 60)]
16. Feature selection and transformation
Feature Selection table
Feature                         Description   Pre-processing
age                             Continuous    None
job                             Categorical   Converted to binary matrix
marital status                  Categorical   Converted to binary matrix
education                       Categorical   Converted to ordinal: 1 = primary, 2 = secondary, 3 = tertiary
average yearly balance          Continuous    Numerically scaled
contact communication mode      Categorical   Discarded, feature irrelevant
last contact day of the month   Categorical   Discarded, feature irrelevant
last contact month of year      Categorical   Discarded, feature irrelevant
last contact duration           Continuous    Numerically scaled
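A minimal sketch of what "converted to binary matrix" means for a categorical feature (often called one-hot encoding). The job values below are hypothetical, and the slides do not say which tool was actually used for the conversion.

```python
def to_binary_matrix(values):
    """One column per category; each row has a single 1 marking its category."""
    categories = sorted(set(values))
    rows = [[int(value == category) for category in categories] for value in values]
    return categories, rows

# Hypothetical job values.
jobs = ["admin", "technician", "admin", "services"]
categories, matrix = to_binary_matrix(jobs)

print(categories)
for row in matrix:
    print(row)
```

Each category gets its own 0/1 column, so no artificial ordering is imposed on values such as job or marital status.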
17. Feature selection and transformation
Feature Selection table
Feature                                                             Description   Pre-processing
number of contacts performed during this campaign                   Continuous    Numerically scaled
number of days that passed by after the client was last contacted   Continuous    Weak feature, discarded
marital status                                                      Categorical   Converted to binary matrix
outcome of the previous marketing campaign                          YES/NO        Converted to binary
has credit in default?                                              YES/NO        Converted to binary
has housing loan?                                                   YES/NO        Converted to binary
has personal loan?                                                  YES/NO        Converted to binary
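The two remaining transformations in the tables can be sketched as follows. The slides do not say which scaling method was used, so min-max scaling to the range 0 to 1 is an assumption here, and the sample values are hypothetical.

```python
def min_max_scale(values):
    """Scale numbers linearly to [0, 1] (assumed method; the slides only say 'numerically scaled')."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature carries no information
    return [(value - lo) / (hi - lo) for value in values]

def yes_no_to_binary(values):
    """Map YES/NO answers to 1/0, ignoring case and surrounding spaces."""
    return [1 if value.strip().lower() == "yes" else 0 for value in values]

# Hypothetical values for illustration.
balances = [100, 200, 300]
housing_loan = ["yes", "NO", "Yes"]

print(min_max_scale(balances))
print(yes_no_to_binary(housing_loan))
```

Scaling keeps features with large raw ranges, such as balance or call duration, from dominating models that are sensitive to feature magnitude.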
18. Feature selection and transformation
Special mention of the pre-processing done on education:
Analysis showed that the higher the education level, the greater the chance of a person signing up for a term
deposit. Converting education to a binary matrix would have lost this ordering.
Therefore, the categories were manually converted to a numerical scale of 1, 2 and 3, with 1 = primary,
2 = secondary and 3 = tertiary
[Bar chart: % who signed up for term deposit, by education level (primary, secondary, tertiary)]
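The manual ordinal conversion described on this slide can be sketched as a simple lookup. The slides do not say how missing or unknown education values were handled, so returning `None` for them is an assumption.

```python
# Mapping taken from the slide: 1 = primary, 2 = secondary, 3 = tertiary.
EDUCATION_LEVELS = {"primary": 1, "secondary": 2, "tertiary": 3}

def encode_education(values, unknown=None):
    """Replace each education label with its rank; unmapped labels get `unknown` (assumed behaviour)."""
    return [EDUCATION_LEVELS.get(value.strip().lower(), unknown) for value in values]

encoded = encode_education(["Primary", "tertiary", "secondary"])
print(encoded)
```

Unlike a binary matrix, this single column preserves the fact that tertiary ranks above secondary, which ranks above primary.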
19. Feature selection and transformation
The special processing of the “education” feature improved the MCC score of several algorithms, especially
gradient descent and AdaBoost, which rely heavily on previous errors
[Bar chart: MCC scores for Gradient Descent and AdaBoost (axis 0.38 to 0.42)]
21. Model building and evaluation
Choice of models - ensemble models
Random Forest - everyone’s favourite
An ensemble model that combines decision trees
Parameters used
Depth = 5
No of classifiers = 100
AdaBoost - acclaimed
Introduced by Freund and Schapire in the mid-1990s (Gödel Prize, 2003), it is considered one of the
best out-of-the-box classifiers. It combines several
weak learners and learns from their mistakes.
Less susceptible to overfitting
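To make "combines several weak learners and learns from their mistakes" concrete, here is a simplified AdaBoost sketch using one-feature threshold stumps and labels of +1 and -1. It is an illustration under those assumptions, not the implementation or library used in this project, and the ten data points are a toy example.

```python
import math

def stump_predict(x, threshold, polarity):
    """Weak learner: predict `polarity` at or above the threshold, the opposite below it."""
    return polarity if x >= threshold else -polarity

def best_stump(xs, ys, weights):
    """Pick the threshold and polarity with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for w, x, y in zip(weights, xs, ys)
                      if stump_predict(x, threshold, polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost_fit(xs, ys, rounds=10):
    n = len(xs)
    weights = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        err, threshold, polarity = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # vote strength of this stump
        model.append((alpha, threshold, polarity))
        # Misclassified samples gain weight, so the next stump focuses on them.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model

def adaboost_predict(model, x):
    score = sum(alpha * stump_predict(x, threshold, polarity)
                for alpha, threshold, polarity in model)
    return 1 if score >= 0 else -1

# Toy data that no single threshold can separate.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

model = adaboost_fit(xs, ys, rounds=3)
predictions = [adaboost_predict(model, x) for x in xs]
print(predictions)
```

No single stump classifies this data correctly, but the weighted vote of three stumps does, which is the core idea behind the ensemble.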
22. Model building and evaluation
Choice of models - non-ensemble models - linear models worked well on the data!
23. Model building and evaluation
Matthews Correlation Coefficient scores of each model
Any model with an MCC score greater
than 0.40 is considered strong.
According to the scores, 4 different models
qualify, with gradient descent scoring
the highest. Does that mean gradient descent
is the right choice? Is it a good fit?
The real question is: does it overfit?
[Bar chart: MCC score per model (Gradient Descent, AdaBoost, Regression, Neural Net), with bands marking moderate and strong scores]
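For reference, the Matthews Correlation Coefficient is computed from the four confusion-matrix counts and ranges from -1 (every prediction wrong) through 0 (no better than chance) to +1 (perfect). A minimal sketch, with illustrative counts rather than this project's results:

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # conventional value when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator

print(mcc(50, 50, 0, 0))    # every prediction correct
print(mcc(0, 0, 50, 50))    # every prediction wrong
print(mcc(40, 45, 5, 10))   # a mixed case
```

Because it uses all four counts, MCC stays informative on imbalanced data where accuracy alone can look deceptively high.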
24. Model building and evaluation
Let’s seek the answer using evaluation metrics from 5-fold cross-validation
[Chart: 5-fold cross-validation, metric: accuracy (Gradient Descent, AdaBoost, Regression, Neural Net)]
[Chart: 5-fold cross-validation, metric: ROC score (Gradient Descent, AdaBoost, Regression, Neural Net)]
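A rough sketch of how 5-fold cross-validation works: the data is split into five parts, and each part takes one turn as the test set while the other four are used for training. The `fit` and `score` callables, the majority-class baseline and the toy labels below are hypothetical stand-ins, not the models or data from this project.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k shuffled folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]

def cross_validate(fit, score, X, y, k=5):
    """Train on k-1 folds, score on the held-out fold, repeat k times."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores.append(score(model,
                            [X[i] for i in test_idx],
                            [y[i] for i in test_idx]))
    return scores

# Hypothetical baseline "model": always predict the majority class.
def fit_majority(X, y):
    return int(2 * sum(y) >= len(y))

def score_accuracy(model, X, y):
    return sum(model == label for label in y) / len(y)

X = list(range(20))
y = [0] * 15 + [1] * 5   # toy labels, 25% positive

scores = cross_validate(fit_majority, score_accuracy, X, y)
print(scores, sum(scores) / len(scores))
```

A model whose score swings widely from fold to fold is a warning sign of overfitting, even if its average looks strong.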
25. Model building and evaluation
Preferred Model
AdaBoost
MCC score = 0.41, accuracy = 90%, ROC score = 0.88
27. Conclusion
Linear and ensemble models fitted well
With more effort, better relationships between the features can be gleaned. For
example, marital status is strongly related to financial position. Such
information can help improve the models further
The quest for an optimal model demonstrated that cross-validation is quite a
useful strategy that can not only save time in testing but also assist in making
a better choice of model
In a real-world scenario, it won’t harm to test all 4 top models on unseen data
28. Final thoughts
May the light of Buddha’s wisdom shine
on all of us and guide us towards good-fitting
models.