In this paper we perform an exploratory analysis of a financial data set from a Spanish bank. Our goal is to do risk prediction in credit operations, and as data is collected continuously and reported on a monthly basis, this gives rise to a streaming data classification problem. Our analysis reveals some practical problems that have not previously been thoroughly analyzed in the context of streaming data analysis: the class labels are not immediately available and the relevant predictive features and entities under study (in this case the set of customers) may vary over time. In order to address these problems, we propose to use a dynamic classifier with a wrapper feature subset selection to find relevant features at different time steps. The proposed model is a special case of a more general framework that can also accommodate more expressive models containing latent variables as well as more sophisticated feature selection schemes.
Full text link: http://www.idi.ntnu.no/~helgel/papers/BorchaniMartinezMasegosaLangsethNielsenSalmeronFernandezMadsenSaezSCAI15.pdf
Dynamic Bayesian modeling for risk prediction in credit operations (SCAI2015)
1. Dynamic Bayesian modeling for risk prediction in credit operations
Hanen Borchani (1), Ana M. Martínez (1), Andrés R. Masegosa (2),
Helge Langseth (2), Thomas D. Nielsen (1), Antonio Salmerón (3),
Antonio Fernández (4), Anders L. Madsen (1, 5), Ramón Sáez (4)
(1) Department of Computer Science, Aalborg University, Denmark
(2) Department of Computer and Information Science,
The Norwegian University of Science and Technology, Norway
(3) Department of Mathematics, University of Almería, Spain
(4) Banco de Crédito Cooperativo, Spain
(5) Hugin Expert A/S, Aalborg, Denmark
Scandinavian Conference on Artificial Intelligence, Halmstad, November 5–6, 2015 1
2. Outline
1 Introduction
2 The financial data set
3 Risk prediction using dynamic Bayesian networks
4 Experimental results
5 Conclusion
6. Introduction
Efficient solutions for risk prediction in banks can be crucial for reducing
losses due to inefficient business procedures.
Such solutions can be used as tools for monitoring the evolution of
customers in terms of credit operations risk to increase solvency of the
banking institutions.
From a machine learning perspective, credit scoring has traditionally
been approached as a supervised classification problem.
More recently, however, the problem has come to present additional challenging
characteristics that set it apart from standard supervised
classification problems.
10. Challenges
Classification in a streaming context: a stream of multiple sequences received
over time, each sequence representing a particular client. That is, at every time
step t, we receive the data Dt containing information about all the clients.
Delayed class feedback: the class label for each sample/client corresponds to
the client’s defaulting behavior in the following twelve months, so this
information only becomes available after a twelve-month delay. Thus, the
available data is a mixture of labeled and unlabeled samples.
Concept drift: the domain exhibits a form of concept drift where the data
distribution as well as the set of feature variables relevant for classification
may vary over time.
Objective: Explore the credit scoring problem based on a real-world data set.
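The effect of the twelve-month label delay can be illustrated with a small Python sketch. The function name `split_by_label_availability` and the integer indexing of months are assumptions made for this illustration; only the delay of 12 months comes from the slides.

```python
from typing import List, Tuple

LAMBDA = 12  # label delay in months, as stated on the slide


def split_by_label_availability(current_t: int, delay: int = LAMBDA) -> Tuple[List[int], List[int]]:
    """Return (labeled, unlabeled) time steps among 1..current_t.

    The label attached to time step t describes months t+1..t+delay,
    so it is only known once time step t+delay has been observed.
    """
    steps = range(1, current_t + 1)
    labeled = [t for t in steps if t + delay <= current_t]
    unlabeled = [t for t in steps if t + delay > current_t]
    return labeled, unlabeled


labeled, unlabeled = split_by_label_availability(current_t=15)
print(labeled)    # [1, 2, 3]
print(unlabeled)  # [4, 5, ..., 15]
```

At month 15 only the first three monthly batches carry labels, which is the mixture of labeled and unlabeled samples described above.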
14. The financial data set
Provided by a Spanish bank in the Almería region: Banco de Crédito
Cooperativo (BCC).
It contains monthly aggregated information for a set of BCC clients for the
period from April 2007 to March 2014.
Only “active” clients are considered, meaning that we restrict our attention to
individuals between 18 and 65 years of age, who have at least one automatic
bill payment or direct debit in the bank.
BCC employees are excluded since they have special conditions.
The resulting data set includes 50 000 clients each month.
16. The financial data set
44 feature variables, denoted $X_t$: 11 variables describing the financial
status of a client (VARXX) and 33 socio-demographic variables (SOCXX).
Variable ID   Description
VAR01         Total credit amount
VAR02         Income
VAR03         Expenses
VAR04         Account balance
VAR05         Risk balance in mortgages
VAR06         Risk balance in consumer loans
VAR07         Unpaid amount in mortgages
VAR08         Unpaid amount in personal loans
VAR09         Unpaid amount in credit cards
VAR10         Unpaid amount in bank account deficit
VAR11         Unpaid amount in other products
SOC01-33      Set of 33 socio-demographic variables
Each client u has an associated class variable $C_t^{(u)}$ for each time step t that
indicates whether that particular client will default during the following 12 months.
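As a rough illustration of how such a class variable could be derived from a monthly default history, consider the sketch below. The input format (one boolean per month) and the function name `build_labels` are hypothetical; the slides do not describe how the bank encodes default events. Months whose 12-month look-ahead window is not yet fully observed are left unlabeled, which is a simplifying assumption.

```python
from typing import List, Optional


def build_labels(default_events: List[bool], horizon: int = 12) -> List[Optional[int]]:
    """Derive the class label for every month of one client.

    default_events[t] is True if the client defaults in month t (0-based).
    The label for month t is 1 if any default occurs in months
    t+1..t+horizon, 0 if none occurs, and None while that window
    is not yet fully observed.
    """
    labels: List[Optional[int]] = []
    for t in range(len(default_events)):
        window = default_events[t + 1 : t + 1 + horizon]
        if len(window) < horizon:
            labels.append(None)  # label not available yet
        else:
            labels.append(int(any(window)))
    return labels


# 15 months of history with a single default in month 13.
events = [False] * 15
events[13] = True
print(build_labels(events))  # [0, 1, 1, None, ..., None]
```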
18. Dynamic Bayesian classifiers
A dynamic probabilistic model for doing prediction in the BCC domain
At time T (the current time), we predict the defaulting status ($C_T$) of a
particular client based on previous socio-economic observations and the
client’s known defaulting status λ = 12 months earlier.
[Figure 1: A dynamic probabilistic model for doing prediction in the BCC domain, with class nodes $C_{T-12}, C_{T-11}, C_{T-10}, \ldots, C_{T-1}, C_T$ and feature nodes $X_{T-11}, X_{T-10}, \ldots, X_{T-1}, X_T$. Square/round boxes indicate data that is available/not available when predicting the defaulting status of the clients at month T.]
21. Dynamic Bayesian classifiers
A 2-time-slices Dynamic Naïve Bayes classifier
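[Figure: network structure with class nodes $C_{t-1}$ and $C_t$ and feature nodes $X_{1,t-1}, X_{2,t-1}, \ldots, X_{n,t-1}$ and $X_{1,t}, X_{2,t}, \ldots, X_{n,t}$.]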
It assumes that only the class variables are connected across time and that all
the predictive variables at time step t are conditionally independent given the
class variable at time t.
The joint probability factorizes as
$$p(c_{1:T}, x_{1:T}) = \prod_{t=1}^{T} p(c_t \mid c_{t-1}) \prod_{i=1}^{n} p(x_{i,t} \mid c_t).$$
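The factorization can be evaluated directly for discrete variables, as in the following sketch. The parameter layout (`prior[c]`, `transition[c_prev][c]`, `emission[i][c][x]`) and all numeric values are invented for illustration, and the term for t = 1 is assumed to be a prior over the initial class, since the slide does not spell out how $p(c_1 \mid c_0)$ is defined.

```python
import math
from typing import List, Sequence


def log_joint(
    classes: Sequence[int],
    features: Sequence[Sequence[int]],
    prior: List[float],
    transition: List[List[float]],
    emission: List[List[List[float]]],
) -> float:
    """Log of the joint probability of a class sequence and its features.

    classes[t] is the class at step t, features[t][i] the value of
    feature i at step t; emission[i][c][x] is p(x_i = x | c).
    """
    total = 0.0
    for t, (c, x) in enumerate(zip(classes, features)):
        # Class term: prior at the first step, transition afterwards.
        total += math.log(prior[c] if t == 0 else transition[classes[t - 1]][c])
        # Feature terms: conditionally independent given the current class.
        for i, value in enumerate(x):
            total += math.log(emission[i][c][value])
    return total


prior = [0.9, 0.1]
transition = [[0.95, 0.05], [0.2, 0.8]]
emission = [[[0.7, 0.3], [0.2, 0.8]]]  # one binary feature
value = math.exp(log_joint([0, 1], [[0], [1]], prior, transition, emission))
print(round(value, 4))  # 0.9 * 0.7 * 0.05 * 0.8 = 0.0252
```

Working in log space avoids numerical underflow when sequences are long or many features are used.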
23. Dynamic Bayesian classifiers
Learning the model
Bayesian approach for multinomial and normally distributed data.
$p(x_{i,t} \mid c_t)$ are learned from the labeled data $D_{T-\lambda}$.
$p(c_t \mid c_{t-1})$ are learned using the class transitions from $D_{T-\lambda-1}$ to $D_{T-\lambda}$.
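One simple way to picture the Bayesian estimate of the class transition probabilities is a posterior mean under a symmetric Dirichlet prior, sketched below. This is an assumption made for illustration: the slides only state that a Bayesian approach is used, and the prior strength `alpha`, the dictionary-based data layout, and the function name are all hypothetical. Clients present in only one of the two months are skipped, reflecting that the set of clients may change over time.

```python
from typing import Dict, List


def posterior_mean_transition(
    prev_labels: Dict[str, int],
    curr_labels: Dict[str, int],
    n_classes: int = 2,
    alpha: float = 1.0,
) -> List[List[float]]:
    """Estimate p(c_t | c_{t-1}) from two consecutive labeled months.

    prev_labels / curr_labels map a client id to its class label.
    alpha is the pseudo-count of a symmetric Dirichlet prior.
    """
    counts = [[alpha] * n_classes for _ in range(n_classes)]
    for client, c_prev in prev_labels.items():
        if client in curr_labels:  # ignore clients that left the data set
            counts[c_prev][curr_labels[client]] += 1.0
    # Normalize each row to obtain the posterior mean of the transition row.
    return [[v / sum(row) for v in row] for row in counts]


prev_month = {"a": 0, "b": 0, "c": 1, "d": 0}
curr_month = {"a": 0, "b": 1, "c": 1, "e": 0}
print(posterior_mean_transition(prev_month, curr_month))
# [[0.5, 0.5], [0.333..., 0.666...]]
```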
Prediction
It amounts to calculating the conditional probability for the class label for each
client u at time T given all the information collected so far, D1:T .
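$$p\left(c_t^{(u)} \mid x_{t-\lambda+1:t}^{(u)}, c_{t-\lambda}^{(u)}\right) \propto p\left(x_t^{(u)} \mid c_t^{(u)}\right) \sum_{c_{t-1}^{(u)}} p\left(c_t^{(u)} \mid c_{t-1}^{(u)}\right) p\left(c_{t-1}^{(u)} \mid x_{t-\lambda+1:t-1}^{(u)}, c_{t-\lambda}^{(u)}\right).$$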
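The recursion above can be unrolled as a forward pass that starts from the last known class label and absorbs one month of observations at a time. The sketch below assumes discrete features and uses a hypothetical parameter layout (`transition[c_prev][c]`, `emission[i][c][x]`) with made-up numbers; it is not the AMIDST implementation.

```python
from typing import List, Sequence


def predict_class(
    known_class: int,
    observations: Sequence[Sequence[int]],
    transition: List[List[float]],
    emission: List[List[List[float]]],
) -> List[float]:
    """Posterior over the current class of one client.

    known_class is the label observed lambda months ago; observations
    holds the feature vectors of the months since then, oldest first.
    """
    n_classes = len(transition)
    # Start from a point mass on the known (delayed) class label.
    belief = [1.0 if c == known_class else 0.0 for c in range(n_classes)]
    for x in observations:
        # Sum over the previous class, weighted by the transition model.
        predicted = [
            sum(belief[cp] * transition[cp][c] for cp in range(n_classes))
            for c in range(n_classes)
        ]
        # Multiply by the likelihood of this month's features and normalize.
        updated = []
        for c in range(n_classes):
            likelihood = 1.0
            for i, value in enumerate(x):
                likelihood *= emission[i][c][value]
            updated.append(likelihood * predicted[c])
        norm = sum(updated)
        belief = [u / norm for u in updated]
    return belief


transition = [[0.95, 0.05], [0.2, 0.8]]
emission = [[[0.7, 0.3], [0.2, 0.8]]]  # one binary feature
print(predict_class(0, [[1]], transition, emission))
```

With λ = 12, `observations` would contain the twelve monthly feature vectors following the last labeled month.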
27. Dynamic Bayesian classifiers
Feature subset selection
The relevance of the variables may vary over time.
We consider a wrapper feature selection method with the Naïve Bayes model
as the base classifier combined with greedy search.
The area under the ROC curve (AUC) was used as the objective function, because
AUC remains informative even when the classes are imbalanced.
In our case, feature selection is applied at each time step to
infer which variables are helpful in separating defaulters from non-defaulters.
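A wrapper of this kind can be sketched in plain Python as follows. Several details are assumptions that the slides do not specify: the search direction (forward, adding one feature at a time), the use of a held-out validation split to compute AUC, the Gaussian Naïve Bayes model for continuous features, and the synthetic data. The helper names are hypothetical.

```python
import math
import random
from typing import Dict, List, Sequence, Tuple


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random positive outscores a random negative (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_gaussian_nb(X, y, features: List[int]) -> Dict[int, tuple]:
    """Per-class log prior plus (mean, variance) for each selected feature."""
    model = {}
    for c in (0, 1):
        rows = [x for x, label in zip(X, y) if label == c]
        stats = []
        for f in features:
            values = [r[f] for r in rows]
            mean = sum(values) / len(values)
            var = sum((v - mean) ** 2 for v in values) / len(values) + 1e-6
            stats.append((mean, var))
        model[c] = (math.log(len(rows) / len(X)), stats)
    return model


def log_odds(model, x, features: List[int]) -> float:
    """Log-odds of class 1 versus class 0 under the fitted model."""
    log_post = {}
    for c, (log_prior, stats) in model.items():
        total = log_prior
        for f, (mean, var) in zip(features, stats):
            total += -0.5 * math.log(2 * math.pi * var) - (x[f] - mean) ** 2 / (2 * var)
        log_post[c] = total
    return log_post[1] - log_post[0]


def greedy_forward_selection(X_train, y_train, X_val, y_val) -> Tuple[List[int], float]:
    """Add the feature that most improves validation AUC until none helps."""
    n_features = len(X_train[0])
    selected, best_auc = [], 0.5  # 0.5 is the AUC of a random ranking
    while True:
        best_feature = None
        for f in range(n_features):
            if f in selected:
                continue
            trial = selected + [f]
            model = fit_gaussian_nb(X_train, y_train, trial)
            value = auc([log_odds(model, x, trial) for x in X_val], y_val)
            if value > best_auc:
                best_auc, best_feature = value, f
        if best_feature is None:
            return selected, best_auc
        selected.append(best_feature)


# Synthetic, imbalanced data: feature 0 is informative, feature 1 is noise.
random.seed(0)
y = [1 if i % 5 == 0 else 0 for i in range(400)]
X = [[random.gauss(2.0 * label, 1.0), random.gauss(0.0, 1.0)] for label in y]
selected, best_auc = greedy_forward_selection(X[:200], y[:200], X[200:], y[200:])
print(selected, round(best_auc, 3))
```

In the streaming setting described above, this search would be rerun on each month's labeled data, so the selected subset can change from one time step to the next.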
29. AMIDST toolbox
Open source Java toolbox http://amidst.github.io/toolbox/
30. Predictive performance analysis
The feature subset selection helps to improve the value of the AUC.
The AUC value increases over time: the problem becomes easier to solve.
[Figure 2: AUC results for the Dynamic Naive Bayes (NB) classifier with and without feature selection (FS). Vertical axis: AUC, from 0.65 to 1.00. Horizontal axis: months from May 2008 to March 2014. Series: Dynamic NB with FS; Dynamic NB.]
31. Analysis of relevant features
In general, the sociodemographic features play a minor role in terms of
predictive performance.
[Figure 3: The set of selected features throughout the months. Features shown: VAR01, VAR02, VAR04 to VAR11, SOC01, SOC02, SOC03, SOC05, SOC06, SOC07, SOC10, SOC11, SOC12, SOC14, SOC16, SOC17, SOC18, SOC20, SOC22, SOC26, SOC28, SOC31. Time axis: months from May 2007 to March 2013.]
32. Analysis of relevant features
The most frequently selected variables, such as VAR04 and VAR08, consistently
separate the two types of clients.
[Figure 4: Time-dependent averages of variables VAR04 (“Account balance”) and VAR08 (“Unpaid amount in personal loans”) for non-defaulting and defaulting clients. Two panels: VAR04 (value axis from −0.3 to 0.1) and VAR08 (value axis from 0.0 to 2.5). Time axis: months from June 2007 to March 2014.]
37. Conclusion
A first step towards analyzing risk prediction in credit operations for the bank
Banco de Crédito Cooperativo.
A dynamic Naïve Bayes classifier with a wrapper feature subset selection.
The feature subset selection helps to improve the results and gives insight into
which attributes are most relevant as a function of time.
The AMIDST toolbox performs inference and learning under a Bayesian
framework and provides functionality to improve the presented model:
- Use more expressive network structures.
- Extend the feature subset selection method to take the features
  selected at previous time steps into account.
38. Thank you for your attention
Questions?
Acknowledgments: This project has received funding from the European Union’s
Seventh Framework Programme for research, technological development and
demonstration under grant agreement no. 619209.