Predicting delinquency on debt
What is the problem?
• X Store has a retail credit card available to customers
• There can be a number of sources of loss from this product, but one is customers defaulting on their debt
• This prevents the store from collecting payment for products and services rendered
Is this problem big enough to matter?
• Examining a slice of the customer database (150,000 customers), we find that 6.6% of customers were seriously delinquent in payment in the last two years
• If only 5% of their carried debt was on the store credit card, this potentially represents:
• An average loss of $8.12 per customer
• A potential overall loss of $1.2 million
What can be done?
• There are numerous models that can be used to predict which customers will default
• This could be used to decrease credit limits or cancel credit lines for current risky customers to minimize potential loss
• Or better screen which customers are approved for the card
How will I do this?
• This is a basic classification problem with important business implications
• We’ll examine a few simplistic models to get an idea of performance
• Explore decision tree methods to achieve better performance
How will the models predict delinquency?
Each customer has a number of attributes:

John Smith
Delinquent: Yes
Age: 23
Income: $1600
Number of Lines: 4

Mary Rasmussen
Delinquent: No
Age: 73
Income: $2200
Number of Lines: 2

...

We will use the customer attributes to predict whether they were delinquent
How do we make sure that our solution actually has predictive power?
We have two slices of the customer dataset:

Train: 150,000 customers; delinquency in dataset
Test: 101,000 customers; delinquency not in dataset

None of the customers in the test dataset are used to train the model
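
As a concrete starting point, here is a minimal loading sketch with pandas. The file names and the SeriousDlqin2yrs label column are assumptions based on the Kaggle "Give me some credit" release this deck follows; adjust to the actual export.

```python
import pandas as pd

# Assumed file/column names from the Kaggle "Give me some credit" release.
train = pd.read_csv("cs-training.csv", index_col=0)
test = pd.read_csv("cs-test.csv", index_col=0)

y_train = train.pop("SeriousDlqin2yrs")           # label: train slice only
X_train = train
X_test = test.drop(columns=["SeriousDlqin2yrs"])  # empty in the test slice

print(len(X_train), len(X_test))  # ~150,000 and ~101,000 customers
```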
Internally we validate our model performance with cross-validation
Using only the train dataset we can get a sense of how well our model performs without externally validating it

[Diagram: the train dataset is split into folds (Train 1, Train 2, Train 3); the algorithm is trained on two folds and tested on the held-out fold, rotating through the folds]
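
A minimal sketch of that fold rotation with scikit-learn's cross_val_score, assuming the X_train / y_train frames from the loading sketch above; three folds mirror the diagram, and the simple model here is only a placeholder.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Rotate through 3 folds of the train slice: fit on two folds, score on
# the held-out fold (X_train / y_train assumed from the loading sketch).
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X_train.fillna(0), y_train, cv=3)
print(scores.mean())  # average held-out accuracy across the folds
```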
What matters is how well we can predict
the test dataset
We judge this using the accuracy: the number of correct predictions out of the total number of predictions made
So with 100,000 customers and an 80% accuracy we
will have correctly predicted whether 80,000
customers will default or not in the next two years
Putting accuracy in context
We could save $600,000 over two years if we correctly predicted 50% of the customers that would default and changed their account to prevent it
The potential loss is minimized by ~$8,000 for every 100,000 customers with each percentage point increase in accuracy
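
The arithmetic behind those figures, spelled out as a short sketch using only the slide's own numbers:

```python
# Worked arithmetic behind the slide's figures.
customers = 150_000
avg_loss = 8.12                          # dollars per customer
total_loss = customers * avg_loss        # ~= $1.2 million overall

saved_at_50pct = 0.5 * total_loss        # ~= $600,000 over two years
per_point = 100_000 * 0.01 * avg_loss    # ~= $8,120 per accuracy point
print(total_loss, saved_at_50pct, per_point)
```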
Looking at the actual data

[Data plots of the customer attributes, annotated "Assume $2,500" and "Assume 0"]
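
If those annotations refer to filling in missing values before modeling (an assumption; the deck does not say, and the column names below follow the Kaggle schema), the imputation is a one-liner per field:

```python
# Assumption: "Assume $2,500" / "Assume 0" mean imputing missing values,
# with column names from the Kaggle "Give me some credit" schema.
X_train["MonthlyIncome"] = X_train["MonthlyIncome"].fillna(2500)
X_train["NumberOfDependents"] = X_train["NumberOfDependents"].fillna(0)
```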
There is a continuum of algorithmic choices to tackle the problem, from simpler and quicker to more complex and slower:

Random Chance: 50%
Simple Classification
For simple classification we pick a single attribute and find the best split in the customers

[Histogram: Number of Customers vs. Times Past Due, with candidate split points (1, 2, ...) dividing the population into True Positives, True Negatives, False Positives, and False Negatives]
We evaluate possible splits using accuracy, precision, and sensitivity

Acc = Number correct / Total number
Prec = True positives / Number of people predicted delinquent
Sens = True positives / Number of people actually delinquent

[Plot: accuracy, precision, and sensitivity (0-0.8) versus the split point on Number of Times 30-59 Days Past Due (0-100)]

0.61 KGI on Test Set
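
A sketch of that split search on one attribute, computing all three metrics at each candidate threshold; the column name is an assumption from the Kaggle schema, and this is an illustration rather than the deck's exact code.

```python
import numpy as np

def split_metrics(x, y, threshold):
    """Predict delinquent when the attribute exceeds the threshold."""
    pred = x > threshold
    tp = np.sum(pred & (y == 1))
    acc = np.mean(pred == y)
    prec = tp / max(pred.sum(), 1)       # of those predicted delinquent
    sens = tp / max(np.sum(y == 1), 1)   # of those actually delinquent
    return acc, prec, sens

# Assumed Kaggle column name for "times 30-59 days past due".
x = X_train["NumberOfTime30-59DaysPastDueNotWorse"].to_numpy()
y = y_train.to_numpy()
for t in [0, 1, 2, 5, 10]:
    print(t, split_metrics(x, y, t))
```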
However, not all fields are as informative
Using the number of times past due 60-89 days we achieve a KGI of 0.5
The approach is naive and could be improved, but our time is better spent on different algorithms
Exploring algorithmic choices further, from simpler and quicker to more complex and slower:

Random Chance: 0.50
Simple Classification: 0.50-0.61
Random Forests
A random forest starts from a decision tree

[Diagram: Customer Data; find the best split in a set of randomly chosen attributes, e.g. "Is age <30?"; Yes: 25,000 customers <30, No: 75,000 customers >30, and so on down the tree]
A random forest is composed of many decision trees

[Diagram: many independent trees, each splitting Customer Data at its best split into a Yes branch (Data Set 1) and a No branch (Data Set 2)]

Class assignment of a customer is based on how many of the decision trees “vote” for each class

We use a large number of trees to avoid over-fitting to the training data
The Random Forest algorithm is easily implemented in Python or R for initial testing and validation
It can also be parallelized with Mahout and Hadoop, since there is no dependence from one tree to the next
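
A minimal scikit-learn sketch (one plausible realization, reusing the frames assumed above); n_jobs=-1 exploits the same tree-level independence that makes the Hadoop version possible.

```python
from sklearn.ensemble import RandomForestClassifier

# Fit 150 independent trees in parallel, then score the test slice with
# class probabilities (X_train / y_train / X_test assumed from above).
rf = RandomForestClassifier(n_estimators=150, n_jobs=-1, random_state=0)
rf.fit(X_train.fillna(0), y_train)
prob_default = rf.predict_proba(X_test.fillna(0))[:, 1]
```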
A random forest performs well on the test set

Random Forest
10 trees: 0.779 KGI
150 trees: 0.843 KGI
1000 trees: 0.850 KGI

[Bar chart: accuracy (0.4-0.9) for Random, Classification, and Random Forests]
Exploring algorithmic choices further, from simpler and quicker to more complex and slower:

Random Chance: 0.50
Simple Classification: 0.50-0.61
Random Forests: 0.78-0.85
Gradient Tree Boosting
Boosting Trees is similar to a Random Forest

[Diagram: Customer Data split at "Is age <30?" into Yes (customers <30 data) and No (customers >30 data), and so on]

But instead of splitting on a set of randomly chosen attributes, it does an exhaustive search for the best split
How Gradient Boosting Trees differs from Random Forest

[Diagram: a single tree splitting Customer Data at its best split into Data Set 1 (Yes) and Data Set 2 (No)]

The first tree is optimized to minimize a loss function describing the data
The next tree is then optimized to fit whatever variability the first tree didn’t fit
This is a sequential process, in comparison to the random forest
We also run the risk of over-fitting to the data, hence the learning rate
Implementing Gradient Boosted Trees
In Python or R it is easy for initial testing and validation
There are implementations that use Hadoop, but it’s more complicated to achieve the best performance
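
A matching scikit-learn sketch (an illustrative example, not the deck's exact code); learning_rate is the shrinkage knob mentioned above.

```python
from sklearn.ensemble import GradientBoostingClassifier

# Sequential boosting: each new tree fits what the previous trees missed.
# learning_rate shrinks each tree's contribution to limit over-fitting.
gbt = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 random_state=0)
gbt.fit(X_train.fillna(0), y_train)
prob_default = gbt.predict_proba(X_test.fillna(0))[:, 1]
```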
Gradient Boosting Trees performs well on the dataset

100 trees, 0.1 learning rate: 0.865022 KGI
1000 trees, 0.1 learning rate: 0.865248 KGI

[Plot: KGI (0.75-0.85) versus learning rate (0-0.8)]
[Bar chart: accuracy (0.4-0.9) for Random, Classification, Random Forests, and Boosting Trees]
Moving one step further in complexity, from simpler and quicker to more complex and slower:

Random Chance: 0.50
Simple Classification: 0.50-0.61
Random Forests: 0.78-0.85
Gradient Tree Boosting: 0.71-0.8659
Blended Method
Or more accurately, an ensemble of ensemble methods

Algorithm progression: Random Forest, Extremely Random Forest, Gradient Tree Boosting

Train data probabilities (one column per algorithm):
0.1, 0.5, 0.01, 0.8, 0.7, ...
0.15, 0.6, 0.0, 0.75, 0.68, ...
Combine all of the model information

Train data probabilities (one column per algorithm):
0.1, 0.5, 0.01, 0.8, 0.7, ...
0.15, 0.6, 0.0, 0.75, 0.68, ...

Optimize the set of train probabilities to the known delinquencies
Apply the same weighting scheme to the set of test data probabilities
Implementation can be done in a number of ways
Testing in Python or R is slower, due to the sequential nature of applying the algorithms
It could be made faster by parallelizing: running each algorithm separately and combining the results
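
One plausible realization of the blend with scikit-learn, under the assumptions above. The deck does not specify the weighting scheme; a logistic regression over out-of-fold probabilities is used here as a stand-in.

```python
import numpy as np
from sklearn.ensemble import (ExtraTreesClassifier,
                              GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

models = [RandomForestClassifier(n_estimators=100, random_state=0),
          ExtraTreesClassifier(n_estimators=100, random_state=0),
          GradientBoostingClassifier(random_state=0)]

X, y = X_train.fillna(0), y_train

# One column of out-of-fold train probabilities per algorithm
# (the probability columns on the slide).
train_probs = np.column_stack(
    [cross_val_predict(m, X, y, cv=3, method="predict_proba")[:, 1]
     for m in models])

# Optimize a weighting of the columns against the known delinquencies...
blender = LogisticRegression().fit(train_probs, y)

# ...then apply the same weighting to the test-data probabilities.
test_probs = np.column_stack(
    [m.fit(X, y).predict_proba(X_test.fillna(0))[:, 1] for m in models])
blended = blender.predict_proba(test_probs)[:, 1]
```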
Assessing model performance

Blending performance, 100 trees: 0.864394 KGI

But this performance and the possibility of additional gains come at a distinct time cost.

[Bar chart: accuracy (0.4-0.9) for Random, Classification, Random Forests, Boosting Trees, and Blended]
Examining the continuum of choices, from simpler and quicker to more complex and slower:

Random Chance: 0.50
Simple Classification: 0.50-0.61
Random Forests: 0.78-0.85
Gradient Tree Boosting: 0.71-0.8659
Blended Method: 0.864
What would be best to implement?

There is a large amount of optimization in the blended method that could be done
However, this algorithm takes the longest to run. This constraint will apply in testing and validation also

Random Forests returns a reasonably good result. It is quick and easily parallelized

Gradient Tree Boosting returns the best result and runs reasonably fast. It is not as easily parallelized, though
Increases in predictive performance have real business value
Using any of the more complex algorithms, we achieve an increase of roughly 35 percentage points over random chance
Potential decrease of ~$420k in losses by identifying customers likely to default in the training set alone
Thank you for your time

Más contenido relacionado

La actualidad más candente

Predicting Credit Card Defaults using Machine Learning Algorithms
Predicting Credit Card Defaults using Machine Learning AlgorithmsPredicting Credit Card Defaults using Machine Learning Algorithms
Predicting Credit Card Defaults using Machine Learning AlgorithmsSagar Tupkar
 
Credit Scoring
Credit ScoringCredit Scoring
Credit ScoringMABSIV
 
Delopment and testing of a credit scoring model
Delopment and testing of a credit scoring modelDelopment and testing of a credit scoring model
Delopment and testing of a credit scoring modelMattia Ciprian
 
Applications of Data Science in Banking and Financial sector.pptx
Applications of Data Science in Banking and Financial sector.pptxApplications of Data Science in Banking and Financial sector.pptx
Applications of Data Science in Banking and Financial sector.pptxkarnika21
 
Predictive Model for Loan Approval Process using SAS 9.3_M1
Predictive Model for Loan Approval Process using SAS 9.3_M1Predictive Model for Loan Approval Process using SAS 9.3_M1
Predictive Model for Loan Approval Process using SAS 9.3_M1Akanksha Jain
 
Managing Your Credit
Managing Your CreditManaging Your Credit
Managing Your Crediticsarmiento
 
Taiwanese Credit Card Client Fraud detection
Taiwanese Credit Card Client Fraud detectionTaiwanese Credit Card Client Fraud detection
Taiwanese Credit Card Client Fraud detectionRavi Gupta
 
Whitepaper - Leveraging Analytics to build collection strategies
Whitepaper - Leveraging Analytics to build collection strategiesWhitepaper - Leveraging Analytics to build collection strategies
Whitepaper - Leveraging Analytics to build collection strategiesArup Das
 
Consumer Credit Scoring Using Logistic Regression and Random Forest
Consumer Credit Scoring Using Logistic Regression and Random ForestConsumer Credit Scoring Using Logistic Regression and Random Forest
Consumer Credit Scoring Using Logistic Regression and Random ForestHirak Sen Roy
 
Machine Learning Project - Default credit card clients
Machine Learning Project - Default credit card clients Machine Learning Project - Default credit card clients
Machine Learning Project - Default credit card clients Vatsal N Shah
 
Estimation of the probability of default : Credit Rish
Estimation of the probability of default : Credit RishEstimation of the probability of default : Credit Rish
Estimation of the probability of default : Credit RishArsalan Qadri
 
Ways to Reduce the Customer Churn Rate
Ways to Reduce the Customer Churn RateWays to Reduce the Customer Churn Rate
Ways to Reduce the Customer Churn RateFORMCEPT
 
Exploratory Data Analysis For Credit Risk Assesment
Exploratory Data Analysis For Credit Risk AssesmentExploratory Data Analysis For Credit Risk Assesment
Exploratory Data Analysis For Credit Risk AssesmentVishalPatil527
 
Case Study Analysis-Credit Cards- MBA-PPT
Case Study Analysis-Credit Cards- MBA-PPTCase Study Analysis-Credit Cards- MBA-PPT
Case Study Analysis-Credit Cards- MBA-PPTChandra Shekar Immani
 
Machine Learning Applications in Credit Risk
Machine Learning Applications in Credit RiskMachine Learning Applications in Credit Risk
Machine Learning Applications in Credit RiskQuantUniversity
 

La actualidad más candente (20)

Predicting Credit Card Defaults using Machine Learning Algorithms
Predicting Credit Card Defaults using Machine Learning AlgorithmsPredicting Credit Card Defaults using Machine Learning Algorithms
Predicting Credit Card Defaults using Machine Learning Algorithms
 
Credit Scoring
Credit ScoringCredit Scoring
Credit Scoring
 
Delopment and testing of a credit scoring model
Delopment and testing of a credit scoring modelDelopment and testing of a credit scoring model
Delopment and testing of a credit scoring model
 
Applications of Data Science in Banking and Financial sector.pptx
Applications of Data Science in Banking and Financial sector.pptxApplications of Data Science in Banking and Financial sector.pptx
Applications of Data Science in Banking and Financial sector.pptx
 
Predictive Model for Loan Approval Process using SAS 9.3_M1
Predictive Model for Loan Approval Process using SAS 9.3_M1Predictive Model for Loan Approval Process using SAS 9.3_M1
Predictive Model for Loan Approval Process using SAS 9.3_M1
 
Managing Your Credit
Managing Your CreditManaging Your Credit
Managing Your Credit
 
Taiwanese Credit Card Client Fraud detection
Taiwanese Credit Card Client Fraud detectionTaiwanese Credit Card Client Fraud detection
Taiwanese Credit Card Client Fraud detection
 
Ch12 bb
Ch12 bbCh12 bb
Ch12 bb
 
Whitepaper - Leveraging Analytics to build collection strategies
Whitepaper - Leveraging Analytics to build collection strategiesWhitepaper - Leveraging Analytics to build collection strategies
Whitepaper - Leveraging Analytics to build collection strategies
 
Consumer Credit Scoring Using Logistic Regression and Random Forest
Consumer Credit Scoring Using Logistic Regression and Random ForestConsumer Credit Scoring Using Logistic Regression and Random Forest
Consumer Credit Scoring Using Logistic Regression and Random Forest
 
Machine Learning Project - Default credit card clients
Machine Learning Project - Default credit card clients Machine Learning Project - Default credit card clients
Machine Learning Project - Default credit card clients
 
Estimation of the probability of default : Credit Rish
Estimation of the probability of default : Credit RishEstimation of the probability of default : Credit Rish
Estimation of the probability of default : Credit Rish
 
Credit card
Credit cardCredit card
Credit card
 
Customer Segmentation
Customer SegmentationCustomer Segmentation
Customer Segmentation
 
Ways to Reduce the Customer Churn Rate
Ways to Reduce the Customer Churn RateWays to Reduce the Customer Churn Rate
Ways to Reduce the Customer Churn Rate
 
Credit Card Issuers
Credit Card IssuersCredit Card Issuers
Credit Card Issuers
 
Exploratory Data Analysis For Credit Risk Assesment
Exploratory Data Analysis For Credit Risk AssesmentExploratory Data Analysis For Credit Risk Assesment
Exploratory Data Analysis For Credit Risk Assesment
 
Case Study Analysis-Credit Cards- MBA-PPT
Case Study Analysis-Credit Cards- MBA-PPTCase Study Analysis-Credit Cards- MBA-PPT
Case Study Analysis-Credit Cards- MBA-PPT
 
Machine Learning Applications in Credit Risk
Machine Learning Applications in Credit RiskMachine Learning Applications in Credit Risk
Machine Learning Applications in Credit Risk
 
Payment
PaymentPayment
Payment
 

Destacado

Kaggle presentation
Kaggle presentationKaggle presentation
Kaggle presentationHJ van Veen
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature EngineeringHJ van Veen
 
Reading Birnbaum's (1962) paper, by Li Chenlu
Reading Birnbaum's (1962) paper, by Li ChenluReading Birnbaum's (1962) paper, by Li Chenlu
Reading Birnbaum's (1962) paper, by Li ChenluChristian Robert
 
Q trade presentation
Q trade presentationQ trade presentation
Q trade presentationewig123
 
MLHEP 2015: Introductory Lecture #3
MLHEP 2015: Introductory Lecture #3MLHEP 2015: Introductory Lecture #3
MLHEP 2015: Introductory Lecture #3arogozhnikov
 
Comparison Study of Decision Tree Ensembles for Regression
Comparison Study of Decision Tree Ensembles for RegressionComparison Study of Decision Tree Ensembles for Regression
Comparison Study of Decision Tree Ensembles for RegressionSeonho Park
 
Introduction to Some Tree based Learning Method
Introduction to Some Tree based Learning MethodIntroduction to Some Tree based Learning Method
Introduction to Some Tree based Learning MethodHonglin Yu
 
Bias-variance decomposition in Random Forests
Bias-variance decomposition in Random ForestsBias-variance decomposition in Random Forests
Bias-variance decomposition in Random ForestsGilles Louppe
 
Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)Zihui Li
 
Machine Learning and Data Mining: 16 Classifiers Ensembles
Machine Learning and Data Mining: 16 Classifiers EnsemblesMachine Learning and Data Mining: 16 Classifiers Ensembles
Machine Learning and Data Mining: 16 Classifiers EnsemblesPier Luca Lanzi
 
Tda presentation
Tda presentationTda presentation
Tda presentationHJ van Veen
 
Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)
Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)
Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)Sri Ambati
 
Winning Kaggle 101: Introduction to Stacking
Winning Kaggle 101: Introduction to StackingWinning Kaggle 101: Introduction to Stacking
Winning Kaggle 101: Introduction to StackingTed Xiao
 
Tree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptionsTree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptionsGilles Louppe
 
Decision Tree Ensembles - Bagging, Random Forest & Gradient Boosting Machines
Decision Tree Ensembles - Bagging, Random Forest & Gradient Boosting MachinesDecision Tree Ensembles - Bagging, Random Forest & Gradient Boosting Machines
Decision Tree Ensembles - Bagging, Random Forest & Gradient Boosting MachinesDeepak George
 
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)Machine learning basics using trees algorithm (Random forest, Gradient Boosting)
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)Parth Khare
 
Matrix Factorisation (and Dimensionality Reduction)
Matrix Factorisation (and Dimensionality Reduction)Matrix Factorisation (and Dimensionality Reduction)
Matrix Factorisation (and Dimensionality Reduction)HJ van Veen
 

Destacado (20)

Kaggle presentation
Kaggle presentationKaggle presentation
Kaggle presentation
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature Engineering
 
Reading Birnbaum's (1962) paper, by Li Chenlu
Reading Birnbaum's (1962) paper, by Li ChenluReading Birnbaum's (1962) paper, by Li Chenlu
Reading Birnbaum's (1962) paper, by Li Chenlu
 
Q trade presentation
Q trade presentationQ trade presentation
Q trade presentation
 
Tree advanced
Tree advancedTree advanced
Tree advanced
 
MLHEP 2015: Introductory Lecture #3
MLHEP 2015: Introductory Lecture #3MLHEP 2015: Introductory Lecture #3
MLHEP 2015: Introductory Lecture #3
 
Comparison Study of Decision Tree Ensembles for Regression
Comparison Study of Decision Tree Ensembles for RegressionComparison Study of Decision Tree Ensembles for Regression
Comparison Study of Decision Tree Ensembles for Regression
 
Introduction to Some Tree based Learning Method
Introduction to Some Tree based Learning MethodIntroduction to Some Tree based Learning Method
Introduction to Some Tree based Learning Method
 
L4. Ensembles of Decision Trees
L4. Ensembles of Decision TreesL4. Ensembles of Decision Trees
L4. Ensembles of Decision Trees
 
Bias-variance decomposition in Random Forests
Bias-variance decomposition in Random ForestsBias-variance decomposition in Random Forests
Bias-variance decomposition in Random Forests
 
Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)
 
Machine Learning and Data Mining: 16 Classifiers Ensembles
Machine Learning and Data Mining: 16 Classifiers EnsemblesMachine Learning and Data Mining: 16 Classifiers Ensembles
Machine Learning and Data Mining: 16 Classifiers Ensembles
 
Tda presentation
Tda presentationTda presentation
Tda presentation
 
Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)
Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)
Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)
 
Winning Kaggle 101: Introduction to Stacking
Winning Kaggle 101: Introduction to StackingWinning Kaggle 101: Introduction to Stacking
Winning Kaggle 101: Introduction to Stacking
 
Tree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptionsTree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptions
 
Decision Tree Ensembles - Bagging, Random Forest & Gradient Boosting Machines
Decision Tree Ensembles - Bagging, Random Forest & Gradient Boosting MachinesDecision Tree Ensembles - Bagging, Random Forest & Gradient Boosting Machines
Decision Tree Ensembles - Bagging, Random Forest & Gradient Boosting Machines
 
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)Machine learning basics using trees algorithm (Random forest, Gradient Boosting)
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)
 
Matrix Factorisation (and Dimensionality Reduction)
Matrix Factorisation (and Dimensionality Reduction)Matrix Factorisation (and Dimensionality Reduction)
Matrix Factorisation (and Dimensionality Reduction)
 
Kaggle: Crowd Sourcing for Data Analytics
Kaggle: Crowd Sourcing for Data AnalyticsKaggle: Crowd Sourcing for Data Analytics
Kaggle: Crowd Sourcing for Data Analytics
 

Similar a Kaggle "Give me some credit" challenge overview

Project crm submission sonali
Project crm submission sonaliProject crm submission sonali
Project crm submission sonaliSonali Gupta
 
Important Terminologies In Statistical Inference
Important  Terminologies In  Statistical  InferenceImportant  Terminologies In  Statistical  Inference
Important Terminologies In Statistical InferenceZoha Qureshi
 
Tpmg Manage Cust Prof Final
Tpmg Manage Cust Prof FinalTpmg Manage Cust Prof Final
Tpmg Manage Cust Prof FinalJohn Tyler
 
Bank churn with Data Science
Bank churn with Data ScienceBank churn with Data Science
Bank churn with Data ScienceCarolyn Knight
 
Customer Lifetime Value for Insurance Agents
Customer Lifetime Value for Insurance AgentsCustomer Lifetime Value for Insurance Agents
Customer Lifetime Value for Insurance AgentsScott Boren
 
How to apply CRM using data mining techniques.
How to apply CRM using data mining techniques.How to apply CRM using data mining techniques.
How to apply CRM using data mining techniques.customersforever
 
Real Estate Executive Summary (MKT460 Lab #5)
Real Estate Executive Summary (MKT460 Lab #5)Real Estate Executive Summary (MKT460 Lab #5)
Real Estate Executive Summary (MKT460 Lab #5)Mira McKee
 
Being Right Starts By Knowing You're Wrong
Being Right Starts By Knowing You're WrongBeing Right Starts By Knowing You're Wrong
Being Right Starts By Knowing You're WrongData Con LA
 
David apple typeform retention story - saa stock (1)
David apple   typeform retention story - saa stock (1)David apple   typeform retention story - saa stock (1)
David apple typeform retention story - saa stock (1)SaaStock
 
Predictive Analytics for Customer Targeting: A Telemarketing Banking Example
Predictive Analytics for Customer Targeting: A Telemarketing Banking ExamplePredictive Analytics for Customer Targeting: A Telemarketing Banking Example
Predictive Analytics for Customer Targeting: A Telemarketing Banking ExamplePedro Ecija Serrano
 
Playing the Long Game: Revitalize Your Digital Marketing by Spending on the R...
Playing the Long Game: Revitalize Your Digital Marketing by Spending on the R...Playing the Long Game: Revitalize Your Digital Marketing by Spending on the R...
Playing the Long Game: Revitalize Your Digital Marketing by Spending on the R...3 Birds Marketing LLC
 
Stage Presentation
Stage PresentationStage Presentation
Stage PresentationSCI INFO
 
Transform Your Credit and Collections with Predictive Analytics
Transform Your Credit and Collections with Predictive AnalyticsTransform Your Credit and Collections with Predictive Analytics
Transform Your Credit and Collections with Predictive AnalyticsFIS
 
7 Sat Essay Score
7 Sat Essay Score7 Sat Essay Score
7 Sat Essay ScoreBeth Hall
 
Valiance Portfolio
Valiance PortfolioValiance Portfolio
Valiance PortfolioRohit Pandey
 
Speech To Omega Scorebaord 2009 Conference 041509
Speech To Omega Scorebaord 2009 Conference 041509Speech To Omega Scorebaord 2009 Conference 041509
Speech To Omega Scorebaord 2009 Conference 041509gnorth
 
Introduction to predictive modeling v1
Introduction to predictive modeling v1Introduction to predictive modeling v1
Introduction to predictive modeling v1Venkata Reddy Konasani
 

Similar a Kaggle "Give me some credit" challenge overview (20)

Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
Project crm submission sonali
Project crm submission sonaliProject crm submission sonali
Project crm submission sonali
 
Important Terminologies In Statistical Inference
Important  Terminologies In  Statistical  InferenceImportant  Terminologies In  Statistical  Inference
Important Terminologies In Statistical Inference
 
Tpmg Manage Cust Prof Final
Tpmg Manage Cust Prof FinalTpmg Manage Cust Prof Final
Tpmg Manage Cust Prof Final
 
Bank churn with Data Science
Bank churn with Data ScienceBank churn with Data Science
Bank churn with Data Science
 
Customer Lifetime Value for Insurance Agents
Customer Lifetime Value for Insurance AgentsCustomer Lifetime Value for Insurance Agents
Customer Lifetime Value for Insurance Agents
 
How to apply CRM using data mining techniques.
How to apply CRM using data mining techniques.How to apply CRM using data mining techniques.
How to apply CRM using data mining techniques.
 
Real Estate Executive Summary (MKT460 Lab #5)
Real Estate Executive Summary (MKT460 Lab #5)Real Estate Executive Summary (MKT460 Lab #5)
Real Estate Executive Summary (MKT460 Lab #5)
 
Being Right Starts By Knowing You're Wrong
Being Right Starts By Knowing You're WrongBeing Right Starts By Knowing You're Wrong
Being Right Starts By Knowing You're Wrong
 
David apple typeform retention story - saa stock (1)
David apple   typeform retention story - saa stock (1)David apple   typeform retention story - saa stock (1)
David apple typeform retention story - saa stock (1)
 
Predictive Analytics for Customer Targeting: A Telemarketing Banking Example
Predictive Analytics for Customer Targeting: A Telemarketing Banking ExamplePredictive Analytics for Customer Targeting: A Telemarketing Banking Example
Predictive Analytics for Customer Targeting: A Telemarketing Banking Example
 
PATH | WD
PATH | WDPATH | WD
PATH | WD
 
Playing the Long Game: Revitalize Your Digital Marketing by Spending on the R...
Playing the Long Game: Revitalize Your Digital Marketing by Spending on the R...Playing the Long Game: Revitalize Your Digital Marketing by Spending on the R...
Playing the Long Game: Revitalize Your Digital Marketing by Spending on the R...
 
Stage Presentation
Stage PresentationStage Presentation
Stage Presentation
 
Transform Your Credit and Collections with Predictive Analytics
Transform Your Credit and Collections with Predictive AnalyticsTransform Your Credit and Collections with Predictive Analytics
Transform Your Credit and Collections with Predictive Analytics
 
7 Sat Essay Score
7 Sat Essay Score7 Sat Essay Score
7 Sat Essay Score
 
7 Sat Essay Score
7 Sat Essay Score7 Sat Essay Score
7 Sat Essay Score
 
Valiance Portfolio
Valiance PortfolioValiance Portfolio
Valiance Portfolio
 
Speech To Omega Scorebaord 2009 Conference 041509
Speech To Omega Scorebaord 2009 Conference 041509Speech To Omega Scorebaord 2009 Conference 041509
Speech To Omega Scorebaord 2009 Conference 041509
 
Introduction to predictive modeling v1
Introduction to predictive modeling v1Introduction to predictive modeling v1
Introduction to predictive modeling v1
 

Más de Adam Pah

Why Python?
Why Python?Why Python?
Why Python?Adam Pah
 
Quest overview
Quest overviewQuest overview
Quest overviewAdam Pah
 
A quick overview of why to use and how to set up iPython notebooks for research
A quick overview of why to use and how to set up iPython notebooks for researchA quick overview of why to use and how to set up iPython notebooks for research
A quick overview of why to use and how to set up iPython notebooks for researchAdam Pah
 
Pah res-potentia-netsci emailable-stagebuild
Pah res-potentia-netsci emailable-stagebuildPah res-potentia-netsci emailable-stagebuild
Pah res-potentia-netsci emailable-stagebuildAdam Pah
 
D3 interactivity Linegraph basic example
D3 interactivity Linegraph basic exampleD3 interactivity Linegraph basic example
D3 interactivity Linegraph basic exampleAdam Pah
 
Mercurial Tutorial
Mercurial TutorialMercurial Tutorial
Mercurial TutorialAdam Pah
 
Introduction to Mercurial, or "Why we're switching from SVN no matter what"
Introduction to Mercurial, or "Why we're switching from SVN no matter what"Introduction to Mercurial, or "Why we're switching from SVN no matter what"
Introduction to Mercurial, or "Why we're switching from SVN no matter what"Adam Pah
 

Más de Adam Pah (7)

Why Python?
Why Python?Why Python?
Why Python?
 
Quest overview
Quest overviewQuest overview
Quest overview
 
A quick overview of why to use and how to set up iPython notebooks for research
A quick overview of why to use and how to set up iPython notebooks for researchA quick overview of why to use and how to set up iPython notebooks for research
A quick overview of why to use and how to set up iPython notebooks for research
 
Pah res-potentia-netsci emailable-stagebuild
Pah res-potentia-netsci emailable-stagebuildPah res-potentia-netsci emailable-stagebuild
Pah res-potentia-netsci emailable-stagebuild
 
D3 interactivity Linegraph basic example
D3 interactivity Linegraph basic exampleD3 interactivity Linegraph basic example
D3 interactivity Linegraph basic example
 
Mercurial Tutorial
Mercurial TutorialMercurial Tutorial
Mercurial Tutorial
 
Introduction to Mercurial, or "Why we're switching from SVN no matter what"
Introduction to Mercurial, or "Why we're switching from SVN no matter what"Introduction to Mercurial, or "Why we're switching from SVN no matter what"
Introduction to Mercurial, or "Why we're switching from SVN no matter what"
 

Último

FULL ENJOY Call Girls In Majnu Ka Tilla, Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Majnu Ka Tilla, Delhi Contact Us 8377877756FULL ENJOY Call Girls In Majnu Ka Tilla, Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Majnu Ka Tilla, Delhi Contact Us 8377877756dollysharma2066
 
The Path to Product Excellence: Avoiding Common Pitfalls and Enhancing Commun...
The Path to Product Excellence: Avoiding Common Pitfalls and Enhancing Commun...The Path to Product Excellence: Avoiding Common Pitfalls and Enhancing Commun...
The Path to Product Excellence: Avoiding Common Pitfalls and Enhancing Commun...Aggregage
 
Eluru Call Girls Service ☎ ️93326-06886 ❤️‍🔥 Enjoy 24/7 Escort Service
Eluru Call Girls Service ☎ ️93326-06886 ❤️‍🔥 Enjoy 24/7 Escort ServiceEluru Call Girls Service ☎ ️93326-06886 ❤️‍🔥 Enjoy 24/7 Escort Service
Eluru Call Girls Service ☎ ️93326-06886 ❤️‍🔥 Enjoy 24/7 Escort ServiceDamini Dixit
 
Cheap Rate Call Girls In Noida Sector 62 Metro 959961乂3876
Cheap Rate Call Girls In Noida Sector 62 Metro 959961乂3876Cheap Rate Call Girls In Noida Sector 62 Metro 959961乂3876
Cheap Rate Call Girls In Noida Sector 62 Metro 959961乂3876dlhescort
 
Whitefield CALL GIRL IN 98274*61493 ❤CALL GIRLS IN ESCORT SERVICE❤CALL GIRL
Whitefield CALL GIRL IN 98274*61493 ❤CALL GIRLS IN ESCORT SERVICE❤CALL GIRLWhitefield CALL GIRL IN 98274*61493 ❤CALL GIRLS IN ESCORT SERVICE❤CALL GIRL
Whitefield CALL GIRL IN 98274*61493 ❤CALL GIRLS IN ESCORT SERVICE❤CALL GIRLkapoorjyoti4444
 
Call Girls In Nangloi Rly Metro ꧂…….95996 … 13876 Enjoy ꧂Escort
Call Girls In Nangloi Rly Metro ꧂…….95996 … 13876 Enjoy ꧂EscortCall Girls In Nangloi Rly Metro ꧂…….95996 … 13876 Enjoy ꧂Escort
Call Girls In Nangloi Rly Metro ꧂…….95996 … 13876 Enjoy ꧂Escortdlhescort
 
Chandigarh Escorts Service 📞8868886958📞 Just📲 Call Nihal Chandigarh Call Girl...
Chandigarh Escorts Service 📞8868886958📞 Just📲 Call Nihal Chandigarh Call Girl...Chandigarh Escorts Service 📞8868886958📞 Just📲 Call Nihal Chandigarh Call Girl...
Chandigarh Escorts Service 📞8868886958📞 Just📲 Call Nihal Chandigarh Call Girl...Sheetaleventcompany
 
BAGALUR CALL GIRL IN 98274*61493 ❤CALL GIRLS IN ESCORT SERVICE❤CALL GIRL
BAGALUR CALL GIRL IN 98274*61493 ❤CALL GIRLS IN ESCORT SERVICE❤CALL GIRLBAGALUR CALL GIRL IN 98274*61493 ❤CALL GIRLS IN ESCORT SERVICE❤CALL GIRL
BAGALUR CALL GIRL IN 98274*61493 ❤CALL GIRLS IN ESCORT SERVICE❤CALL GIRLkapoorjyoti4444
 
Falcon's Invoice Discounting: Your Path to Prosperity
Falcon's Invoice Discounting: Your Path to ProsperityFalcon's Invoice Discounting: Your Path to Prosperity
Falcon's Invoice Discounting: Your Path to Prosperityhemanthkumar470700
 
Call Girls Hebbal Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Hebbal Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Hebbal Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Hebbal Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangaloreamitlee9823
 
Call Girls Ludhiana Just Call 98765-12871 Top Class Call Girl Service Available
Call Girls Ludhiana Just Call 98765-12871 Top Class Call Girl Service AvailableCall Girls Ludhiana Just Call 98765-12871 Top Class Call Girl Service Available
Call Girls Ludhiana Just Call 98765-12871 Top Class Call Girl Service AvailableSeo
 
Quick Doctor In Kuwait +2773`7758`557 Kuwait Doha Qatar Dubai Abu Dhabi Sharj...
Quick Doctor In Kuwait +2773`7758`557 Kuwait Doha Qatar Dubai Abu Dhabi Sharj...Quick Doctor In Kuwait +2773`7758`557 Kuwait Doha Qatar Dubai Abu Dhabi Sharj...
Quick Doctor In Kuwait +2773`7758`557 Kuwait Doha Qatar Dubai Abu Dhabi Sharj...daisycvs
 
Call Girls in Delhi, Escort Service Available 24x7 in Delhi 959961-/-3876
Call Girls in Delhi, Escort Service Available 24x7 in Delhi 959961-/-3876Call Girls in Delhi, Escort Service Available 24x7 in Delhi 959961-/-3876
Call Girls in Delhi, Escort Service Available 24x7 in Delhi 959961-/-3876dlhescort
 
Russian Call Girls In Rajiv Chowk Gurgaon ❤️8448577510 ⊹Best Escorts Service ...
Russian Call Girls In Rajiv Chowk Gurgaon ❤️8448577510 ⊹Best Escorts Service ...Russian Call Girls In Rajiv Chowk Gurgaon ❤️8448577510 ⊹Best Escorts Service ...
Russian Call Girls In Rajiv Chowk Gurgaon ❤️8448577510 ⊹Best Escorts Service ...lizamodels9
 
Dr. Admir Softic_ presentation_Green Club_ENG.pdf
Dr. Admir Softic_ presentation_Green Club_ENG.pdfDr. Admir Softic_ presentation_Green Club_ENG.pdf
Dr. Admir Softic_ presentation_Green Club_ENG.pdfAdmir Softic
 
Al Mizhar Dubai Escorts +971561403006 Escorts Service In Al Mizhar
Al Mizhar Dubai Escorts +971561403006 Escorts Service In Al MizharAl Mizhar Dubai Escorts +971561403006 Escorts Service In Al Mizhar
Al Mizhar Dubai Escorts +971561403006 Escorts Service In Al Mizharallensay1
 
Call Girls From Pari Chowk Greater Noida ❤️8448577510 ⊹Best Escorts Service I...
Call Girls From Pari Chowk Greater Noida ❤️8448577510 ⊹Best Escorts Service I...Call Girls From Pari Chowk Greater Noida ❤️8448577510 ⊹Best Escorts Service I...
Call Girls From Pari Chowk Greater Noida ❤️8448577510 ⊹Best Escorts Service I...lizamodels9
 
Call Girls From Raj Nagar Extension Ghaziabad❤️8448577510 ⊹Best Escorts Servi...
Call Girls From Raj Nagar Extension Ghaziabad❤️8448577510 ⊹Best Escorts Servi...Call Girls From Raj Nagar Extension Ghaziabad❤️8448577510 ⊹Best Escorts Servi...
Call Girls From Raj Nagar Extension Ghaziabad❤️8448577510 ⊹Best Escorts Servi...lizamodels9
 
The Abortion pills for sale in Qatar@Doha [+27737758557] []Deira Dubai Kuwait
The Abortion pills for sale in Qatar@Doha [+27737758557] []Deira Dubai KuwaitThe Abortion pills for sale in Qatar@Doha [+27737758557] []Deira Dubai Kuwait
The Abortion pills for sale in Qatar@Doha [+27737758557] []Deira Dubai Kuwaitdaisycvs
 

Último (20)

FULL ENJOY Call Girls In Majnu Ka Tilla, Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Majnu Ka Tilla, Delhi Contact Us 8377877756FULL ENJOY Call Girls In Majnu Ka Tilla, Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Majnu Ka Tilla, Delhi Contact Us 8377877756
 
The Path to Product Excellence: Avoiding Common Pitfalls and Enhancing Commun...
The Path to Product Excellence: Avoiding Common Pitfalls and Enhancing Commun...The Path to Product Excellence: Avoiding Common Pitfalls and Enhancing Commun...
The Path to Product Excellence: Avoiding Common Pitfalls and Enhancing Commun...
 
Eluru Call Girls Service ☎ ️93326-06886 ❤️‍🔥 Enjoy 24/7 Escort Service
Eluru Call Girls Service ☎ ️93326-06886 ❤️‍🔥 Enjoy 24/7 Escort ServiceEluru Call Girls Service ☎ ️93326-06886 ❤️‍🔥 Enjoy 24/7 Escort Service
Eluru Call Girls Service ☎ ️93326-06886 ❤️‍🔥 Enjoy 24/7 Escort Service
 
Cheap Rate Call Girls In Noida Sector 62 Metro 959961乂3876
Cheap Rate Call Girls In Noida Sector 62 Metro 959961乂3876Cheap Rate Call Girls In Noida Sector 62 Metro 959961乂3876
Cheap Rate Call Girls In Noida Sector 62 Metro 959961乂3876
 
Whitefield CALL GIRL IN 98274*61493 ❤CALL GIRLS IN ESCORT SERVICE❤CALL GIRL
Whitefield CALL GIRL IN 98274*61493 ❤CALL GIRLS IN ESCORT SERVICE❤CALL GIRLWhitefield CALL GIRL IN 98274*61493 ❤CALL GIRLS IN ESCORT SERVICE❤CALL GIRL
Whitefield CALL GIRL IN 98274*61493 ❤CALL GIRLS IN ESCORT SERVICE❤CALL GIRL
 
Call Girls In Nangloi Rly Metro ꧂…….95996 … 13876 Enjoy ꧂Escort
Call Girls In Nangloi Rly Metro ꧂…….95996 … 13876 Enjoy ꧂EscortCall Girls In Nangloi Rly Metro ꧂…….95996 … 13876 Enjoy ꧂Escort
Call Girls In Nangloi Rly Metro ꧂…….95996 … 13876 Enjoy ꧂Escort
 
Chandigarh Escorts Service 📞8868886958📞 Just📲 Call Nihal Chandigarh Call Girl...
Chandigarh Escorts Service 📞8868886958📞 Just📲 Call Nihal Chandigarh Call Girl...Chandigarh Escorts Service 📞8868886958📞 Just📲 Call Nihal Chandigarh Call Girl...
Chandigarh Escorts Service 📞8868886958📞 Just📲 Call Nihal Chandigarh Call Girl...
 
BAGALUR CALL GIRL IN 98274*61493 ❤CALL GIRLS IN ESCORT SERVICE❤CALL GIRL
BAGALUR CALL GIRL IN 98274*61493 ❤CALL GIRLS IN ESCORT SERVICE❤CALL GIRLBAGALUR CALL GIRL IN 98274*61493 ❤CALL GIRLS IN ESCORT SERVICE❤CALL GIRL
BAGALUR CALL GIRL IN 98274*61493 ❤CALL GIRLS IN ESCORT SERVICE❤CALL GIRL
 
Falcon's Invoice Discounting: Your Path to Prosperity
Falcon's Invoice Discounting: Your Path to ProsperityFalcon's Invoice Discounting: Your Path to Prosperity
Falcon's Invoice Discounting: Your Path to Prosperity
 
Call Girls Hebbal Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Hebbal Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Hebbal Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Hebbal Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Call Girls Ludhiana Just Call 98765-12871 Top Class Call Girl Service Available
Call Girls Ludhiana Just Call 98765-12871 Top Class Call Girl Service AvailableCall Girls Ludhiana Just Call 98765-12871 Top Class Call Girl Service Available
Call Girls Ludhiana Just Call 98765-12871 Top Class Call Girl Service Available
 
Quick Doctor In Kuwait +2773`7758`557 Kuwait Doha Qatar Dubai Abu Dhabi Sharj...
Quick Doctor In Kuwait +2773`7758`557 Kuwait Doha Qatar Dubai Abu Dhabi Sharj...Quick Doctor In Kuwait +2773`7758`557 Kuwait Doha Qatar Dubai Abu Dhabi Sharj...
Quick Doctor In Kuwait +2773`7758`557 Kuwait Doha Qatar Dubai Abu Dhabi Sharj...
 
Call Girls in Delhi, Escort Service Available 24x7 in Delhi 959961-/-3876
Call Girls in Delhi, Escort Service Available 24x7 in Delhi 959961-/-3876Call Girls in Delhi, Escort Service Available 24x7 in Delhi 959961-/-3876
Call Girls in Delhi, Escort Service Available 24x7 in Delhi 959961-/-3876
 
unwanted pregnancy Kit [+918133066128] Abortion Pills IN Dubai UAE Abudhabi
unwanted pregnancy Kit [+918133066128] Abortion Pills IN Dubai UAE Abudhabiunwanted pregnancy Kit [+918133066128] Abortion Pills IN Dubai UAE Abudhabi
unwanted pregnancy Kit [+918133066128] Abortion Pills IN Dubai UAE Abudhabi
 
Russian Call Girls In Rajiv Chowk Gurgaon ❤️8448577510 ⊹Best Escorts Service ...
Russian Call Girls In Rajiv Chowk Gurgaon ❤️8448577510 ⊹Best Escorts Service ...Russian Call Girls In Rajiv Chowk Gurgaon ❤️8448577510 ⊹Best Escorts Service ...
Russian Call Girls In Rajiv Chowk Gurgaon ❤️8448577510 ⊹Best Escorts Service ...
 
Dr. Admir Softic_ presentation_Green Club_ENG.pdf
Dr. Admir Softic_ presentation_Green Club_ENG.pdfDr. Admir Softic_ presentation_Green Club_ENG.pdf
Dr. Admir Softic_ presentation_Green Club_ENG.pdf
 
Al Mizhar Dubai Escorts +971561403006 Escorts Service In Al Mizhar
Al Mizhar Dubai Escorts +971561403006 Escorts Service In Al MizharAl Mizhar Dubai Escorts +971561403006 Escorts Service In Al Mizhar
Al Mizhar Dubai Escorts +971561403006 Escorts Service In Al Mizhar
 
Call Girls From Pari Chowk Greater Noida ❤️8448577510 ⊹Best Escorts Service I...
Call Girls From Pari Chowk Greater Noida ❤️8448577510 ⊹Best Escorts Service I...Call Girls From Pari Chowk Greater Noida ❤️8448577510 ⊹Best Escorts Service I...
Call Girls From Pari Chowk Greater Noida ❤️8448577510 ⊹Best Escorts Service I...
 
Call Girls From Raj Nagar Extension Ghaziabad❤️8448577510 ⊹Best Escorts Servi...
Call Girls From Raj Nagar Extension Ghaziabad❤️8448577510 ⊹Best Escorts Servi...Call Girls From Raj Nagar Extension Ghaziabad❤️8448577510 ⊹Best Escorts Servi...
Call Girls From Raj Nagar Extension Ghaziabad❤️8448577510 ⊹Best Escorts Servi...
 
The Abortion pills for sale in Qatar@Doha [+27737758557] []Deira Dubai Kuwait
The Abortion pills for sale in Qatar@Doha [+27737758557] []Deira Dubai KuwaitThe Abortion pills for sale in Qatar@Doha [+27737758557] []Deira Dubai Kuwait
The Abortion pills for sale in Qatar@Doha [+27737758557] []Deira Dubai Kuwait
 

Kaggle "Give me some credit" challenge overview

  • 2. What is the problem?
  • 3. What is the problem? • X Store has a retail credit card available to customers
  • 4. What is the problem? • X Store has a retail credit card available to customers • There can be a number of sources of loss from this product, but one is customer’s defaulting on their debt
  • 5. What is the problem? • X Store has a retail credit card available to customers • There can be a number of sources of loss from this product, but one is customer’s defaulting on their debt • This prevents the store from collecting payment for products and services rendered
  • 6. Is this problem big enough to matter?
  • 7. Is this problem big enough to matter? • Examining a slice of the customer database (150,000 customers) we find that 6.6% of customers were seriously delinquent in payment the last two years
  • 8. Is this problem big enough to matter? • Examining a slice of the customer database (150,000 customers) we find that 6.6% of customers were seriously delinquent in payment over the last two years • If only 5% of their carried debt was on the store credit card this is potentially an:
  • 9. Is this problem big enough to matter? • Examining a slice of the customer database (150,000 customers) we find that 6.6% of customers were seriously delinquent in payment over the last two years • If only 5% of their carried debt was on the store credit card this is potentially an: • Average loss of $8.12 per customer
  • 10. Is this problem big enough to matter? • Examining a slice of the customer database (150,000 customers) we find that 6.6% of customers were seriously delinquent in payment over the last two years • If only 5% of their carried debt was on the store credit card this is potentially an: • Average loss of $8.12 per customer • Potential overall loss of $1.2 million
  • 11. What can be done?
  • 12. What can be done? • There are numerous models that can be used to predict which customers will default
  • 13. What can be done? • There are numerous models that can be used to predict which customers will default • This could be used to decrease credit limits or cancel credit lines for current risky customers to minimize potential loss
  • 14. What can be done? • There are numerous models that can be used to predict which customers will default • This could be used to decrease credit limits or cancel credit lines for current risky customers to minimize potential loss • Or better screen which customers are approved for the card
  • 15. How will I do this?
  • 16. How will I do this? • This is a basic classification problem with important business implications
  • 17. How will I do this? • This is a basic classification problem with important business implications • We’ll examine a few simplistic models to get an idea of performance
  • 18. How will I do this? • This is a basic classification problem with important business implications • We’ll examine a few simplistic models to get an idea of performance • Explore decision tree methods to achieve better performance
  • 19. How will the models predict delinquency? Each customer has a number of attributes
  • 20. How will the models predict delinquency? Each customer has a number of attributes John Smith Delinquent: Yes Age: 23 Income: $1600 Number of Lines: 4
  • 21. How will the models predict delinquency? Each customer has a number of attributes John Smith Delinquent: Yes Age: 23 Income: $1600 Number of Lines: 4 Mary Rasmussen Delinquent: No Age: 73 Income: $2200 Number of Lines: 2
  • 22. How will the models predict delinquency? Each customer has a number of attributes John Smith Delinquent: Yes Age: 23 Income: $1600 Number of Lines: 4 Mary Rasmussen Delinquent: No Age: 73 Income: $2200 Number of Lines: 2 ...
  • 23. How will the models predict delinquency? Each customer has a number of attributes John Smith Delinquent: Yes Age: 23 Income: $1600 Number of Lines: 4 Mary Rasmussen Delinquent: No Age: 73 Income: $2200 Number of Lines: 2 ... We will use the customer attributes to predict whether they were delinquent
  • 24. How do we make sure that our solution actually has predictive power?
  • 25. How do we make sure that our solution actually has predictive power? We have two slices of the customer dataset
  • 26. How do we make sure that our solution actually has predictive power? We have two slices of the customer dataset Train 150,000 customers Delinquency in dataset
  • 27. How do we make sure that our solution actually has predictive power? We have two slices of the customer dataset Train Test 150,000 customers Delinquency in dataset 101,000 customers Delinquency not in dataset
  • 28. How do we make sure that our solution actually has predictive power? We have two slices of the customer dataset Train Test 150,000 customers Delinquency in dataset 101,000 customers Delinquency not in dataset None of the customers in the test dataset are used to train the model
  • 29. Internally we validate our model performance with k-fold cross-validation Using only the train dataset we can get a sense of how well our model performs without externally validating it Train
  • 30. Internally we validate our model performance with k-fold cross-validation Using only the train dataset we can get a sense of how well our model performs without externally validating it Train Train 1 Train 2 Train 3
  • 31. Internally we validate our model performance with k-fold cross-validation Using only the train dataset we can get a sense of how well our model performs without externally validating it Train Train 1 Train 2 Train 3 Train 1 Train 2 Algorithm Training
  • 32. Internally we validate our model performance with k-fold cross-validation Using only the train dataset we can get a sense of how well our model performs without externally validating it Train Train 1 Train 2 Train 3 Train 1 Train 2 Algorithm Training Algorithm Testing Train 3
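To make the cross-validation step concrete, here is a minimal Python sketch. The file and column names are assumptions taken from the public Kaggle "Give me some credit" data (cs-training.csv, label column SeriousDlqin2yrs); they are not shown on the slides.

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumed file/column names from the public Kaggle "Give Me Some Credit" data
train = pd.read_csv("cs-training.csv", index_col=0)
y = train["SeriousDlqin2yrs"]
X = train.drop(columns=["SeriousDlqin2yrs"]).fillna(0)  # placeholder imputation

# Three folds mirror the Train 1 / Train 2 / Train 3 split on the slide:
# each fold takes a turn as the held-out "Algorithm Testing" slice
model = DecisionTreeClassifier(random_state=0)
scores = cross_val_score(model, X, y, cv=3, scoring="roc_auc")
print(scores.mean())
```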
  • 33. What matters is how well we can predict the test dataset We judge this using the accuracy, which is the number of our predictions correct out of the total number of predictions made So with 100,000 customers and an 80% accuracy we will have correctly predicted whether 80,000 customers will default or not in the next two years
  • 35. Putting accuracy in context We could save $600,000 over two years if we correctly predicted 50% of the customers that would default and changed their account to prevent it
  • 36. Putting accuracy in context We could save $600,000 over two years if we correctly predicted 50% of the customers that would default and changed their account to prevent it The potential loss is minimized by ~$8,000 for every 100,000 customers with each percentage point increase in accuracy
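These dollar figures follow from the numbers on slide 10; a quick arithmetic check using the slide's own $8.12 average loss per customer:

```python
# Figures from the slides: 150,000 customers, $8.12 average loss per customer
customers, avg_loss = 150_000, 8.12
total_loss = customers * avg_loss        # ~$1.218M, the "$1.2 million" on slide 10
print(0.5 * total_loss)                  # ~$609k: catching 50% of defaulters
print(100_000 * avg_loss * 0.01)         # ~$8.1k per accuracy point per 100k customers
```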
  • 37. Looking at the actual data
  • 38. Looking at the actual data
  • 39. Looking at the actual data
  • 40. Looking at the actual data Assume $2,500
  • 41. Looking at the actual data Assume $2,500 Assume 0
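The "Assume $2,500" and "Assume 0" annotations appear to be imputation choices for the two fields with missing values; in the public Kaggle data those fields are MonthlyIncome and NumberOfDependents, so a plausible reading (an assumption here, not stated on the slides) is:

```python
import pandas as pd

train = pd.read_csv("cs-training.csv", index_col=0)

# Assumption: "$2,500" fills missing MonthlyIncome and "0" fills
# missing NumberOfDependents, per the annotations on slides 40-41
train["MonthlyIncome"] = train["MonthlyIncome"].fillna(2500)
train["NumberOfDependents"] = train["NumberOfDependents"].fillna(0)
```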
  • 42. There is a continuum of algorithmic choices to tackle the problem Simpler, Quicker Complex, Slower
  • 43. There is a continuum of algorithmic choices to tackle the problem Simpler, Quicker Complex, Slower Random Chance
  • 44. There is a continuum of algorithmic choices to tackle the problem Simpler, Quicker Complex, Slower Random Chance 50%
  • 45. There is a continuum of algorithmic choices to tackle the problem Simpler, Quicker Complex, Slower Random Chance 50%
  • 46. There is a continuum of algorithmic choices to tackle the problem Simpler, Quicker Complex, Slower Random Chance 50% Simple Classification
  • 47. For simple classification we pick a single attribute and find the best split in the customers
  • 48. For simple classification we pick a single attribute and find the best split in the customers
  • 49. For simple classification we pick a single attribute and find the best split in the customers NumberofCustomers Times Past Due
  • 50. For simple classification we pick a single attribute and find the best split in the customers NumberofCustomers Times Past Due True Positive True Negative False Positive False Negative 1
  • 51. For simple classification we pick a single attribute and find the best split in the customers NumberofCustomers Times Past Due True Positive True Negative False Positive False Negative 1 2
  • 52. For simple classification we pick a single attribute and find the best split in the customers NumberofCustomers Times Past Due True Positive True Negative False Positive False Negative 1 2
  • 53. For simple classification we pick a single attribute and find the best split in the customers NumberofCustomers Times Past Due True Positive True Negative False Positive False Negative 1 2
  • 54. For simple classification we pick a single attribute and find the best split in the customers NumberofCustomers Times Past Due True Positive True Negative False Positive False Negative 1 2 ...
  • 55. We evaluate possible splits using accuracy, precision, and sensitivity Acc = Number correct / Total Number
  • 56. We evaluate possible splits using accuracy, precision, and sensitivity Acc = Number correct / Total Number Prec = True Positives / Number of People Predicted Delinquent
  • 57. We evaluate possible splits using accuracy, precision, and sensitivity Acc = Number correct / Total Number Prec = True Positives / Number of People Predicted Delinquent Sens = True Positives / Number of People Actually Delinquent
  • 58.-60. We evaluate possible splits using accuracy, precision, and sensitivity [figure: accuracy, precision, and sensitivity plotted against the split value for Number of Times 30-59 Days Past Due; x axis 0-100, y axis 0-0.8] The best split achieves 0.61 KGI on the test set
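A minimal sketch of this single-attribute split search, computing the three metrics at each candidate threshold. The column name comes from the public Kaggle data, and predicting "delinquent" above the threshold is an assumption about the split direction; the train frame is the one loaded earlier.

```python
import numpy as np

def split_metrics(x, y, threshold):
    """Metrics when we predict delinquent for attribute values >= threshold."""
    pred = x >= threshold
    tp = np.sum(pred & (y == 1))
    tn = np.sum(~pred & (y == 0))
    acc = (tp + tn) / len(y)              # number correct / total number
    prec = tp / max(pred.sum(), 1)        # TP / number predicted delinquent
    sens = tp / max((y == 1).sum(), 1)    # TP / number actually delinquent
    return acc, prec, sens

x = train["NumberOfTime30-59DaysPastDueNotWorse"].to_numpy()
y = train["SeriousDlqin2yrs"].to_numpy()
for t in range(0, 101, 10):  # scan candidate splits, as in the figure
    print(t, split_metrics(x, y, t))
```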
  • 61. However, not all fields are as informative Using the number of times past due 60-89 days we achieve a KGI of 0.5
  • 62. However, not all fields are as informative Using the number of times past due 60-89 days we achieve a KGI of 0.5 The approach is naive and could be improved but our time is better spent on different algorithms
  • 63. Exploring algorithmic choices further Simpler, Quicker Complex, Slower Random Chance 0.50 Simple Classification 0.50-0.61
  • 64. Exploring algorithmic choices further Simpler, Quicker Complex, Slower Random Chance 0.50 Simple Classification 0.50-0.61 Random Forests
  • 65. A random forest starts from a decision tree Customer Data
  • 66. A random forest starts from a decision tree Customer Data Find the best split in a set of randomly chosen attributes
  • 67. A random forest starts from a decision tree Customer Data Find the best split in a set of randomly chosen attributes Is age <30?
  • 68. A random forest starts from a decision tree Customer Data Find the best split in a set of randomly chosen attributes Is age <30? No 75,000 Customers>30
  • 69. A random forest starts from a decision tree Customer Data Find the best split in a set of randomly chosen attributes Is age <30? No 75,000 Customers>30 Yes 25,000 Customers <30
  • 70. A random forest starts from a decision tree Customer Data Find the best split in a set of randomly chosen attributes Is age <30? No 75,000 Customers>30 Yes 25,000 Customers <30 ...
  • 71. A random forest is composed of many decision trees [diagram: a decision tree splitting Customer Data at the best split into Yes/No customer subsets]
  • 72. A random forest is composed of many decision trees [diagram: six such trees side by side]
  • 73. A random forest is composed of many decision trees [diagram] Class assignment of a customer is based on how many of the decision trees “vote” for each class
  • 74. A random forest is composed of many decision trees [diagram] Class assignment of a customer is based on how many of the decision trees “vote” for each class We use a large number of trees so as not to over-fit the training data
  • 75. The Random Forest algorithm is easily implemented in Python or R for initial testing and validation
  • 76. The Random Forest algorithm is easily implemented in Python or R for initial testing and validation
  • 77. The Random Forest algorithm is easily implemented in Python or R for initial testing and validation It can also be parallelized with Mahout and Hadoop since there is no dependence from one tree to the next
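A minimal scikit-learn sketch of the Python route, reusing X and y from the earlier cross-validation sketch. Scoring out-of-fold probabilities with AUC reflects my reading of KGI as the Kaggle leaderboard score for this competition (an assumption).

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

# n_estimators mirrors the tree counts tried on the next slides (10/150/1000);
# n_jobs=-1 exploits the tree-level parallelism noted above
forest = RandomForestClassifier(n_estimators=150, n_jobs=-1, random_state=0)
proba = cross_val_predict(forest, X, y, cv=3, method="predict_proba")[:, 1]
print(roc_auc_score(y, proba))
```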
  • 78. A random forest performs well on the test set Random Forest 10 trees: 0.779 KGI
  • 79. A random forest performs well on the test set Random Forest 10 trees: 0.779 KGI 150 trees: 0.843 KGI
  • 80. A random forest performs well on the test set Random Forest 10 trees: 0.779 KGI 150 trees: 0.843 KGI 1000 trees: 0.850 KGI
  • 81.-82. A random forest performs well on the test set Random Forest 10 trees: 0.779 KGI 150 trees: 0.843 KGI 1000 trees: 0.850 KGI [bar chart: accuracy (0.4-0.9) for Random, Classification, Random Forests]
  • 83. Exploring algorithmic choices further Simpler, Quicker Complex, Slower Random Chance 0.50 Simple Classification 0.50-0.61 Random Forests 0.78-0.85
  • 84. Exploring algorithmic choices further Simpler, Quicker Complex, Slower Random Chance 0.50 Simple Classification 0.50-0.61 Random Forests 0.78-0.85 Gradient Tree Boosting
  • 85. Boosting Trees is similar to a Random Forest Customer Data Find the best split in a set of randomly chosen attributes Is age <30? No Customers >30 Data Yes Customers <30 Data ...
  • 86. Boosting Trees is similar to a Random Forest Customer Data Is age <30? No Customers >30 Data Yes Customers <30 Data ... Do an exhaustive search for best split
  • 87. How Gradient Boosting Trees differs from Random Forest ... Customer Data Best Split No Customers Data Set 2 Yes Customers Data Set 1 The first tree is optimized to minimize a loss function describing the data
  • 88. How Gradient Boosting Trees differs from Random Forest ... Customer Data Best Split No Customers Data Set 2 Yes Customers Data Set 1 The first tree is optimized to minimize a loss function describing the data The next tree is then optimized to fit whatever variability the first tree didn’t fit
  • 89. How Gradient Boosting Trees differs from Random Forest ... Customer Data Best Split No Customers Data Set 2 Yes Customers Data Set 1 The first tree is optimized to minimize a loss function describing the data The next tree is then optimized to fit whatever variability the first tree didn’t fit This is a sequential process in comparison to the random forest
  • 90. How Gradient Boosting Trees differs from Random Forest ... Customer Data Best Split No Customers Data Set 2 Yes Customers Data Set 1 The first tree is optimized to minimize a loss function describing the data The next tree is then optimized to fit whatever variability the first tree didn’t fit This is a sequential process in comparison to the random forest We also run the risk of over-fitting to the data, hence the use of a learning rate to damp each tree’s contribution
  • 91. Implementing Gradient Boosted Trees in Python or R is easy for initial testing and validation
  • 92. Implementing Gradient Boosted Trees in Python or R is easy for initial testing and validation There are implementations that use Hadoop, but it’s more complicated to achieve the best performance
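A corresponding sketch for the Python route, holding out part of the training slice since the test labels are not public. The 100-tree, 0.1 learning-rate setting matches the next slide; everything else is an assumption.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hold out a quarter of the training slice for validation
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# Sequential boosting: each tree fits what the previous trees missed,
# damped by the learning rate to limit over-fitting
gbt = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=0)
gbt.fit(X_tr, y_tr)
print(roc_auc_score(y_val, gbt.predict_proba(X_val)[:, 1]))
```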
  • 93. Gradient Boosting Trees performs well on the dataset 100 trees, 0.1 Learning: 0.865022 KGI
  • 94. Gradient Boosting Trees performs well on the dataset 100 trees, 0.1 Learning: 0.865022 KGI 1000 trees, 0.1 Learning: 0.865248 KGI
  • 95.-96. Gradient Boosting Trees performs well on the dataset 100 trees, 0.1 Learning: 0.865022 KGI 1000 trees, 0.1 Learning: 0.865248 KGI [line plot: KGI (0.75-0.85) vs. learning rate (0-0.8); bar chart: accuracy (0.4-0.9) for Random, Classification, Random Forests, Boosting Trees]
  • 97. Moving one step further in complexity Simpler, Quicker Complex, Slower Random Chance 0.50 Simple Classification 0.50-0.61 Random Forests 0.78-0.85 Gradient Tree Boosting 0.71-0.8659 Blended Method
  • 98. Or more accurately an ensemble of ensemble methods Algorithm Progression
  • 99. Or more accurately an ensemble of ensemble methods Algorithm Progression Random Forest
  • 100. Or more accurately an ensemble of ensemble methods Algorithm Progression Random Forest Extremely Random Forest
  • 101. Or more accurately an ensemble of ensemble methods Algorithm Progression Random Forest Extremely Random Forest Gradient Tree Boosting
  • 102. Or more accurately an ensemble of ensemble methods Algorithm Progression Train Data Probabilities Random Forest Extremely Random Forest Gradient Tree Boosting 0.1 0.5 0.01 0.8 0.7 . . .
  • 103. Or more accurately an ensemble of ensemble methods Algorithm Progression Train Data Probabilities Random Forest Extremely Random Forest Gradient Tree Boosting 0.1 0.5 0.01 0.8 0.7 . . . 0.15 0.6 0.0 0.75 0.68 . . .
  • 104. Or more accurately an ensemble of ensemble methods Algorithm Progression Train Data Probabilities Random Forest Extremely Random Forest Gradient Tree Boosting 0.1 0.5 0.01 0.8 0.7 . . . 0.15 0.6 0.0 0.75 0.68 . . .
  • 105. Combine all of the model information Train Data Probabilities 0.1 0.5 0.01 0.8 0.7 . . . 0.15 0.6 0.0 0.75 0.68 . . .
  • 106. Combine all of the model information Train Data Probabilities 0.1 0.5 0.01 0.8 0.7 . . . 0.15 0.6 0.0 0.75 0.68 . . . Optimize the set of train probabilities to the known delinquencies
  • 107. Combine all of the model information Train Data Probabilities 0.1 0.5 0.01 0.8 0.7 . . . 0.15 0.6 0.0 0.75 0.68 . . . Optimize the set of train probabilities to the known delinquencies Apply the same weighting scheme to the set of test data probabilities
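A sketch of the blending step. The slides do not name the optimizer, so a logistic regression over the three models' out-of-fold probability columns stands in for "optimize the train probabilities to the known delinquencies"; X_test, the test-slice features, is assumed to be loaded separately.

```python
import numpy as np
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

models = [RandomForestClassifier(n_estimators=100, random_state=0),
          ExtraTreesClassifier(n_estimators=100, random_state=0),  # "extremely random forest"
          GradientBoostingClassifier(n_estimators=100, random_state=0)]

# One column of out-of-fold train probabilities per model, as in the diagram
train_probs = np.column_stack(
    [cross_val_predict(m, X, y, cv=3, method="predict_proba")[:, 1] for m in models])

# Learn a weighting of the columns against the known delinquencies
# (logistic regression is an assumption; the slides don't name the optimizer)
blender = LogisticRegression().fit(train_probs, y)

# Apply the same weighting scheme to the test-set probabilities
test_probs = np.column_stack(
    [m.fit(X, y).predict_proba(X_test)[:, 1] for m in models])
blend = blender.predict_proba(test_probs)[:, 1]
```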
  • 108. Implementation can be done in a number of ways Testing in Python or R is slower due to the sequential nature of applying the algorithms; it could be faster parallelized, running each algorithm separately and combining the results
  • 109. Assessing model performance Blending Performance, 100 trees: 0.864394 KGI
  • 110. Assessing model performance Blending Performance, 100 trees: 0.864394 KGI [bar chart: accuracy (0.4-0.9) for Random, Classification, Random Forests, Boosting Trees, Blended]
  • 111. Assessing model performance Blending Performance, 100 trees: 0.864394 KGI But this performance, and the possibility of additional gains, comes at a distinct time cost [bar chart repeated]
  • 112. Examining the continuum of choices Simpler, Quicker Complex, Slower Random Chance 0.50 Simple Classification 0.50-0.61 Random Forests 0.78-0.85 Gradient Tree Boosting 0.71-0.8659 Blended Method 0.864
  • 113. What would be best to implement?
  • 114. What would be best to implement? There is a large amount of optimization in the blended method that could be done
  • 115. What would be best to implement? There is a large amount of optimization in the blended method that could be done However, this algorithm takes the longest to run. This constraint will apply in testing and validation also
  • 116. What would be best to implement? There is a large amount of optimization in the blended method that could be done However, this algorithm takes the longest to run. This constraint will apply in testing and validation also Random Forests returns a reasonably good result. It is quick and easily parallelized
  • 117. What would be best to implement? There is a large amount of optimization in the blended method that could be done However, this algorithm takes the longest to run. This constraint will apply in testing and validation also Random Forests returns a reasonably good result. It is quick and easily parallelized Gradient Tree Boosting returns the best result and runs reasonably fast. It is not as easily parallelized though
  • 118. What would be best to implement? Random Forests returns a reasonably good result. It is quick and easily parallelized Gradient Tree Boosting returns the best result and runs reasonably fast. It is not as easily parallelized though
  • 119. Increases in predictive performance have real business value Using any of the more complex algorithms we achieve an increase of 35 percentage points in comparison to random
  • 120. Increases in predictive performance have real business value Using any of the more complex algorithms we achieve an increase of 35 percentage points in comparison to random Potential decrease of ~$420k in losses by identifying customers likely to default in the training set alone
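The ~$420k figure is consistent with the per-point rate worked out earlier: 35 points × ~$8,120 per accuracy point per 100,000 customers × 1.5 (scaling to the 150,000-customer training slice) ≈ $426,000, i.e. roughly the ~$420k quoted.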
  • 121. Thank you for your time