Dynamic Bayesian modeling for risk prediction in credit operations (SCAI2015)

Dynamic Bayesian modeling for risk
prediction in credit operations
Hanen Borchani (1), Ana M. Martínez (1), Andrés R. Masegosa (2),
Helge Langseth (2), Thomas D. Nielsen (1), Antonio Salmerón (3),
Antonio Fernández (4), Anders L. Madsen (1,5), Ramón Sáez (4)
(1) Department of Computer Science, Aalborg University, Denmark
(2) Department of Computer and Information Science,
The Norwegian University of Science and Technology, Norway
(3) Department of Mathematics, University of Almería, Spain
(4) Banco de Crédito Cooperativo, Spain
(5) Hugin Expert A/S, Aalborg, Denmark
Scandinavian Conference on Artificial Intelligence, Halmstad, November 5–6, 2015

Outline
1 Introduction
2 The financial data set
3 Risk prediction using dynamic Bayesian networks
4 Experimental results
5 Conclusion

Introduction
Efficient solutions for risk prediction in banks can be crucial for reducing
losses due to inefficient business procedures.
Such solutions can serve as tools for monitoring how the credit operations
risk of customers evolves, which helps to increase the solvency of the
banking institutions.

Introduction
From a machine learning perspective, credit scoring has traditionally
been approached as a supervised classification problem.
However, the problem now presents additional challenging
characteristics that separate it from standard supervised
classification problems.

Challenges
Classification in a streaming context: a stream of multiple sequences is received
over time, each sequence representing a particular client. That is, at every time
step t, we receive the data $D_t$ containing information about all the clients.

Challenges
Delayed class feedback: the class label for each sample/client corresponds to
the client’s defaulting behavior in the following twelve months, so this
information only becomes available after a twelve-month delay. Thus, the
available data is a mixture of labeled and unlabeled samples.
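
As a small illustration of the delayed feedback, the sketch below computes which months already have class labels at a given current month, using the twelve-month delay stated above. The helper names (`add_months`, `latest_labeled_month`, `is_labeled`) are hypothetical and not part of the presented work.

```python
from datetime import date

LAMBDA = 12  # label delay in months, as stated in the slides


def add_months(d: date, k: int) -> date:
    """Shift a first-of-month date by k months (k may be negative)."""
    idx = d.year * 12 + (d.month - 1) + k
    return date(idx // 12, idx % 12 + 1, 1)


def latest_labeled_month(current: date, delay: int = LAMBDA) -> date:
    """Most recent month whose class labels are already known at `current`."""
    return add_months(current, -delay)


def is_labeled(month: date, current: date, delay: int = LAMBDA) -> bool:
    """True if the labels for `month` are available at `current`."""
    return month <= latest_labeled_month(current, delay)


now = date(2014, 3, 1)
print(latest_labeled_month(now))           # last month with known labels
print(is_labeled(date(2013, 6, 1), now))   # a month that is still unlabeled
```

Every month after the one returned by `latest_labeled_month` contributes only unlabeled samples, which is the labeled/unlabeled mixture described above.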

Challenges
Concept drift: the domain exhibits a form of concept drift where the data
distribution as well as the set of feature variables relevant for classification
may vary over time.
Objective: Explore the credit scoring problem based on a real-world data set.

The ﬁnancial data set
Provided by a Spanish bank in the Almería region: Banco de Crédito
Cooperativo (BCC).
It contains monthly aggregated information for a set of BCC clients for the
period from April 2007 to March 2014.

Only “active” clients are considered, meaning that we restrict our attention to
individuals between 18 and 65 years of age, who have at least one automatic
bill payment or direct debit in the bank.
BCC employees are excluded since they have special conditions.
The resulting data set includes 50 000 clients each month.

44 feature variables, denoted $X_t$: 11 variables describing the financial
status of a client (VARXX) and 33 socio-demographic variables (SOCXX).
Variable ID   Description
VAR01         Total credit amount
VAR02         Income
VAR03         Expenses
VAR04         Account balance
VAR05         Risk balance in mortgages
VAR06         Risk balance in consumer loans
VAR07         Unpaid amount in mortgages
VAR08         Unpaid amount in personal loans
VAR09         Unpaid amount in credit cards
VAR10         Unpaid amount in bank account deficit
VAR11         Unpaid amount in other products
SOC01-33      Set of 33 socio-demographic variables
Each client u has an associated class variable $C_t^{(u)}$ for each time step t that
indicates if that particular client will default during the following 12 months.

Dynamic Bayesian classiﬁers
A dynamic probabilistic model for prediction in the BCC domain.
At time T (the current time), we predict the defaulting status ($C_T$) of a
particular client based on previous socio-economic observations and the
client’s known defaulting status λ = 12 months earlier.
[...] in fact apply to most credit scoring problems as well as many other domains. We
will discuss this issue further in Section 5, which also serves to demonstrate the
broader relevance of the above-mentioned problems.
In this paper we present a first approach to address the BCC credit scoring
problem based on the use of a simple dynamic probabilistic graphical model [5].
A rough visual description of this model is given in Figure 1. Our preliminary
approach is implemented based on the AMIDST Toolbox. This toolbox provides
an efficient implementation of approximate inference and learning methods for
streaming data using the Bayesian networks modeling framework [5] as well as
variational Bayes inference and learning procedures [6].
Figure 1: A dynamic probabilistic model for prediction in the BCC domain, with
class nodes $C_{T-12}, C_{T-11}, \ldots, C_T$ and feature nodes $X_{T-11}, \ldots, X_T$.
Square/round boxes indicate data which is available/unavailable
when predicting the defaulting status of the clients at month T.

A 2-time-slices Dynamic Naïve Bayes classiﬁer
[Figure: two-time-slice structure with class nodes $C_{t-1}$ and $C_t$ and feature
nodes $X_{1,t-1}, X_{2,t-1}, \ldots, X_{n,t-1}$ and $X_{1,t}, X_{2,t}, \ldots, X_{n,t}$.]
It assumes that only the class variables are connected across time and that all
the predictive variables at time step t are conditionally independent given the
class variable at time t.
The joint probability factorizes as
$$p(c_{1:T}, x_{1:T}) = \prod_{t=1}^{T} p(c_t \mid c_{t-1}) \prod_{i=1}^{n} p(x_{i,t} \mid c_t).$$
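
The factorization above can be evaluated directly. The following is a minimal sketch, assuming discrete features stored in plain lookup tables and assuming that the first factor, $p(c_1 \mid c_0)$, is read as a prior $p(c_1)$; the function name, the table layout and the toy probabilities are illustrative and are not taken from the presented model.

```python
import math


def joint_log_prob(classes, features, prior, trans, emit):
    """log p(c_{1:T}, x_{1:T}) for a dynamic naive Bayes model.

    classes:  class labels c_1..c_T
    features: feature tuples x_t = (x_{1,t}, ..., x_{n,t})
    prior:    prior[c]         = p(c_1 = c)  (assumed reading of p(c_1 | c_0))
    trans:    trans[a][c]      = p(c_t = c | c_{t-1} = a)
    emit:     emit[i][c][x]    = p(x_{i,t} = x | c_t = c)
    """
    logp = 0.0
    prev = None
    for c, x in zip(classes, features):
        # Class factor: prior at the first step, transition afterwards.
        logp += math.log(prior[c] if prev is None else trans[prev][c])
        # Feature factors: conditionally independent given the class.
        for i, xi in enumerate(x):
            logp += math.log(emit[i][c][xi])
        prev = c
    return logp


# Toy tables (made-up numbers): 2 classes, 1 binary feature.
prior = [0.9, 0.1]
trans = [[0.95, 0.05], [0.3, 0.7]]
emit = [[[0.8, 0.2], [0.3, 0.7]]]

lp = joint_log_prob([0, 1], [(0,), (1,)], prior, trans, emit)
print(math.exp(lp))  # product 0.9 * 0.8 * 0.05 * 0.7
```

Working in log space keeps the product numerically stable when T and n are large.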

Learning the model
Bayesian approach for multinomial and normally distributed data.
$p(x_{i,t} \mid c_t)$ are learned from the labeled data $D_{T-\lambda}$.
$p(c_t \mid c_{t-1})$ are learned using the class transitions from $D_{T-\lambda-1}$ to $D_{T-\lambda}$.
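
To make the transition-learning step concrete, here is a minimal sketch of a Bayesian estimate of $p(c_t \mid c_{t-1})$ from two consecutive labeled months. It assumes a symmetric Dirichlet prior and reports the posterior mean; the prior strength `alpha`, the dictionary-based data layout and the function name are assumptions of this example, and the slides do not specify the prior that was actually used.

```python
from collections import Counter


def learn_transitions(prev_labels, curr_labels, n_classes=2, alpha=1.0):
    """Posterior-mean estimate of p(c_t | c_{t-1}) under a symmetric
    Dirichlet(alpha) prior (alpha is an assumption of this sketch).

    prev_labels, curr_labels: dicts client_id -> class for the months
    T-lambda-1 and T-lambda; only clients present in both are counted.
    """
    counts = Counter(
        (prev_labels[u], curr_labels[u])
        for u in prev_labels.keys() & curr_labels.keys()
    )
    table = []
    for a in range(n_classes):
        row_total = sum(counts[(a, b)] for b in range(n_classes))
        table.append([
            (counts[(a, b)] + alpha) / (row_total + alpha * n_classes)
            for b in range(n_classes)
        ])
    return table


# Toy labels (made up): 0 = non-defaulting, 1 = defaulting.
prev = {"u1": 0, "u2": 0, "u3": 1, "u4": 0}
curr = {"u1": 0, "u2": 1, "u3": 1, "u4": 0, "u5": 1}
trans = learn_transitions(prev, curr)
print(trans)
```

The prior pseudo-counts keep rare transitions (for example defaulting to non-defaulting) from receiving zero probability when few such clients are observed in a month.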
Prediction
It amounts to calculating the conditional probability for the class label for each
client u at time T given all the information collected so far, $D_{1:T}$.
$$p(c_t^{(u)} \mid x_{t-\lambda+1:t}^{(u)}, c_{t-\lambda}^{(u)}) \propto p(x_t^{(u)} \mid c_t^{(u)}) \sum_{c_{t-1}^{(u)}} p(c_t^{(u)} \mid c_{t-1}^{(u)}) \, p(c_{t-1}^{(u)} \mid x_{t-\lambda+1:t-1}^{(u)}, c_{t-\lambda}^{(u)}).$$
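
The recursion above can be unrolled as a forward pass that starts from the known label at time t − λ and absorbs one month of features per step. The sketch below assumes discrete features and lookup tables with made-up probabilities; `predict_class` and the table layout are hypothetical names used only for illustration.

```python
def predict_class(known_class, feature_seq, trans, emit):
    """Filtered posterior p(c_t | x_{t-lambda+1:t}, c_{t-lambda}).

    known_class:   observed class at time t - lambda
    feature_seq:   feature tuples x_{t-lambda+1}, ..., x_t
    trans[a][c]    = p(c_t = c | c_{t-1} = a)
    emit[i][c][x]  = p(x_{i,t} = x | c_t = c)
    """
    n_classes = len(trans)
    # Start from a point mass on the known, delayed label.
    belief = [1.0 if c == known_class else 0.0 for c in range(n_classes)]
    for x in feature_seq:
        new_belief = []
        for c in range(n_classes):
            # Sum over c_{t-1}: transition probability times previous belief.
            predicted = sum(trans[a][c] * belief[a] for a in range(n_classes))
            # Naive Bayes likelihood of the current month's features.
            likelihood = 1.0
            for i, xi in enumerate(x):
                likelihood *= emit[i][c][xi]
            new_belief.append(likelihood * predicted)
        # Normalize, which resolves the proportionality in the formula.
        total = sum(new_belief)
        belief = [b / total for b in new_belief]
    return belief


# Toy tables (made-up numbers): 2 classes, 1 binary feature.
trans = [[0.95, 0.05], [0.3, 0.7]]
emit = [[[0.8, 0.2], [0.3, 0.7]]]

one_step = predict_class(0, [(1,)], trans, emit)
twelve_steps = predict_class(0, [(1,)] * 12, trans, emit)
print(one_step, twelve_steps)
```

Normalizing at every step avoids numerical underflow over the λ = 12 months between the last known label and the prediction time.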

Feature subset selection
The relevance of the variables may vary over time.

We consider a wrapper feature selection method with the Naïve Bayes model
as the base classifier combined with greedy search.

The area under the ROC curve (AUC) was used as the objective function, because
AUC is usually a reliable measure even if the data has class imbalance.
In our case, the feature selection method is performed at each time step to
infer which variables are helpful in separating defaulters from non-defaulters.
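
A wrapper of this kind can be sketched in a few lines. The example below assumes forward greedy search, a Gaussian naive Bayes base classifier, a held-out validation set for computing AUC, and synthetic data; all of these are assumptions of the sketch, since the slides only state that a Naïve Bayes wrapper with greedy search and AUC was used.

```python
import math
import random


def auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) statistic."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_gnb(X, y, cols):
    """Per-class prior, means and variances on the chosen columns."""
    model = {}
    for c in (0, 1):
        rows = [x for x, yc in zip(X, y) if yc == c]
        stats = []
        for j in cols:
            vals = [r[j] for r in rows]
            mu = sum(vals) / len(vals)
            var = sum((v - mu) ** 2 for v in vals) / len(vals) + 1e-6
            stats.append((mu, var))
        model[c] = (len(rows) / len(X), stats)
    return model


def score_gnb(model, x, cols):
    """Log-odds of class 1 (defaulting) versus class 0."""
    def loglik(c):
        prior, stats = model[c]
        ll = math.log(prior)
        for j, (mu, var) in zip(cols, stats):
            ll += -0.5 * math.log(2 * math.pi * var) - (x[j] - mu) ** 2 / (2 * var)
        return ll
    return loglik(1) - loglik(0)


def greedy_select(X_tr, y_tr, X_va, y_va):
    """Forward greedy wrapper: add the feature that most improves validation AUC."""
    selected, best = [], 0.5
    remaining = set(range(len(X_tr[0])))
    while remaining:
        trial = {}
        for j in remaining:
            cols = selected + [j]
            m = fit_gnb(X_tr, y_tr, cols)
            trial[j] = auc([score_gnb(m, x, cols) for x in X_va], y_va)
        j_best = max(trial, key=trial.get)
        if trial[j_best] <= best:
            break  # no candidate improves the AUC any further
        selected.append(j_best)
        best = trial[j_best]
        remaining.remove(j_best)
    return selected, best


# Synthetic data: feature 0 is informative, feature 1 is pure noise.
rng = random.Random(0)


def make(n):
    X, y = [], []
    for _ in range(n):
        label = 1 if rng.random() < 0.2 else 0
        X.append([rng.gauss(2.0 * label, 1.0), rng.gauss(0.0, 1.0)])
        y.append(label)
    return X, y


X_tr, y_tr = make(400)
X_va, y_va = make(400)
selected, best_auc = greedy_select(X_tr, y_tr, X_va, y_va)
print(selected, best_auc)
```

Rerunning `greedy_select` on each month's labeled data would give one selected subset per time step, which is how the relevance of variables can be tracked over time.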

AMIDST toolbox
Open source Java toolbox http://amidst.github.io/toolbox/

Predictive performance analysis
The feature subset selection helps to improve the value of the AUC.
The AUC value increases over time: the problem becomes easier to solve.
[Plot: monthly AUC (y-axis, 0.65 to 1.00) from May 2008 to March 2014 for Dynamic NB with FS and Dynamic NB.]
Figure 2: AUC results for the Dynamic Naive Bayes (NB) classifier with and
without feature selection (FS).

Analysis of relevant features
In general, the socio-demographic features play a minor role in terms of
predictive performance.
[Plot: selected features per month, May 2007 to March 2013. Features shown: VAR01, VAR02, VAR04 to VAR11, SOC01, SOC02, SOC03, SOC05, SOC06, SOC07, SOC10, SOC11, SOC12, SOC14, SOC16, SOC17, SOC18, SOC20, SOC22, SOC26, SOC28, SOC31.]
Figure 3: The set of selected features throughout the months.

Analysis of relevant features
The most frequently selected variables, such as VAR04 and VAR08, consistently
separate the two types of clients.
[Plots: monthly averages from June 2007 to March 2014 for non-defaulting and defaulting clients. Panel VAR04: y-axis from −0.3 to 0.1. Panel VAR08: y-axis from 0.0 to 2.5.]
Figure 4: Time-dependent averages of variables VAR04 (“Account balance”)
and VAR08 (“Unpaid amount in personal loans”) for non-defaulting and
defaulting clients.

Conclusion
A first step towards analyzing risk prediction in credit operations for the bank
Banco de Crédito Cooperativo.
A dynamic Naïve Bayes classifier with a wrapper feature subset selection.

Conclusion
The feature subset selection helps to improve the results and gives insight into
which attributes are most relevant as a function of time.

Conclusion
The AMIDST toolbox performs inference and learning under a Bayesian
framework and provides functionality to improve the presented model:
Use of more expressive network structures.
Extend the feature subset selection method to take the set of selected
features from the previous time-steps into account.

Thank you for your attention
Questions?
Acknowledgments: This project has received funding from the European Union’s
Seventh Framework Programme for research, technological development and
demonstration under grant agreement no. 619209.
