
# Mykola Herasymovych: Optimizing Acceptance Threshold in Credit Scoring using Reinforcement Learning


Presentation of the winning works of the 2018 Eesti Pank research award laureates



1. Optimizing Acceptance Threshold in Credit Scoring using Reinforcement Learning. Student: Mykola Herasymovych; supervisors: Oliver Lukason (PhD) and Karl Märka (MSc).
2. Credit Scoring Problem (Crook et al., 2007; Lessmann et al., 2015; Thomas et al., 2017).
• Predict the probability of a loan application being bad: $\Pr\{Bad \mid characteristics\ \boldsymbol{x}\} = p(y = 1 \mid \boldsymbol{x}) = \hat{y}$;
• Transform it into a credit score reflecting the application's creditworthiness level: $s^{CS}(\boldsymbol{x}) = s^{CS}(\hat{y}, z)$, where $s^{CS}$ is the credit score, $\hat{y}$ the estimated probability and $z$ other factors (e.g. policy rules).
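The slide leaves the score mapping abstract. As one minimal illustration (not the deck's actual model), a log-odds "points-to-double-odds" scaling is a common convention for turning an estimated probability of default plus a policy adjustment into a score; all parameters below are assumptions for illustration only:

```python
import math

def credit_score(p_bad, policy_penalty=0.0, base=600, pdo=50):
    """Map an estimated probability of being bad to a credit score.

    Uses the common log-odds (points-to-double-odds) scaling; the deck
    does not specify the actual mapping, so base/pdo are illustrative.
    """
    odds_good = (1.0 - p_bad) / p_bad          # odds of the loan being good
    score = base + pdo * math.log2(odds_good)  # higher score = safer applicant
    return score - policy_penalty              # z: other factors, e.g. policy rules

# A 5% estimated PD maps to a higher score than a 30% one.
low_risk = credit_score(0.05)
high_risk = credit_score(0.30)
```

With this scaling, a 50% probability of being bad (even odds) lands exactly on the base score, and each halving of the bad-odds adds `pdo` points.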
3. Credit Business Process (Creditstar Group). Loan application → estimate credit score. Credit score is high (50%): give loan; credit score is low (50%): reject application. Client repays: money gain; client doesn't repay: money loss. Profits feed back into the credit score.
4. Acceptance Threshold Optimization. Acceptance threshold (Viaene and Dedene, 2005; Verbraken et al., 2014; Skarestad, 2017); selection bias (Banasik et al., 2003; Wu and Hand, 2007; Dey, 2010); population drift (Sousa et al., 2013; Bellotti and Crook, 2013; Nikolaidis, 2017).
5. Credit Scoring Literature 1 (number of published articles with the "credit scoring" keyword). [Chart: articles by year, with a general trend line.] Note: adapted from Louzada et al. (2016) and updated by the author based on literature review.
6. Credit Scoring Literature 2 (percentage of papers published on the topic in 1992-2015). [Chart: shares by category — new method to propose rating; comparison of traditional techniques; conceptual discussion; variable selection; literature review; performance measures; other issues; acceptance threshold optimization.] Note: adapted from Louzada et al. (2016) and updated by the author based on literature review.
7. Shortcomings of the Traditional Approach.
• Is static and backward-looking;
• ignores the credit scoring model's performance uncertainty (Thomas et al., 2017);
• ignores selection bias (Hand, 2006; Dey, 2010);
• ignores population drift (Sousa et al., 2013; Nikolaidis, 2017);
• oversimplifies the lender's utility function (Finlay, 2010; Skarestad, 2017).
8. Solution: a Reinforcement Learning (RL) agent —
• a dynamic, forward-looking system
• that adapts to live data feedback
• and adjusts the acceptance threshold
• to maximize an accurately specified lender's utility function.
9. RL Achievements.
• Forex, stock and securities trading (Neuneier, 1996);
• resource allocation (Tesauro et al., 2006);
• tax and debt collection optimization (Abe et al., 2010);
• dynamic pricing (Kim et al., 2016);
• behavioral marketing (Sato, 2016);
• bank portfolio optimization (Strydom, 2017).
• To the best of our knowledge, RL has not yet been applied to credit scoring.
10. Where We Fit. [Venn diagram: the intersection of Portfolio Optimization, Credit Scoring and Artificial Intelligence.]
11. RL Agent and Credit Business Environment (full scheme). The agent and the environment interact at a weekly frequency ($t$ is the week number).
• Acceptance rate state (summarizes characteristics of the loan portfolio): $S(A) = \frac{1}{n_t}\sum_{i=1}^{n_t} \mathbb{1}\left[s_i^{CS} \ge t^{AT}(a_{t-1})\right]$;
• Acceptance threshold action (mapped to one of 20 discrete threshold values): $A(S) = \pi(Q(w, X(S)))$;
• Profit reward per action: $R^a(S, A) = \sum_{j=0}^{t} Profits_j^a$;
• RBF features: $X(S) = \sqrt{2/k}\,\cos\left(w^{RBF} S(A) + c^{RBF}\right)$, $w^{RBF} \sim N(0, 2\gamma^{RBF})$, $c^{RBF} \sim U(0, 2\pi)$;
• Q-values (approximated with Stochastic Gradient Descent (SGD) models): $Q(w, X) = wX$, per action $Q^a(w_a, X) = w_a X$;
• Q-value function: $Q^\pi(S, A) = \mathbb{E}_\pi\left[\sum_{i=0}^{\infty} \gamma^i R_{t+i} \mid S_t = s, A_t = a\right]$;
• Policy: $\pi(A \mid S) = \mathbb{P}[A_t = a \mid S_t = s]$, explorative during training episodes (Boltzmann, $\pi^{Boltzmann}(A \mid S) = e^{Q(S,A)/\tau} / \sum_{A' \in \mathcal{A}} e^{Q(S,A')/\tau}$) and exploitative during test ones (greedy, $\pi^{Greedy}(Q) = \arg\max_a Q(S, A)$); lower $\tau$ leads to a more greedy policy, higher $\tau$ to a more random one;
• Q-value function update rule: $w_a \leftarrow w_a + \alpha_t\left[R_t + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)\right]\frac{\partial Q(s,a)}{\partial w_a}$, with $\alpha_t = \alpha_0 / t^{power\_t}$ and $\frac{\partial Q(s,a)}{\partial w_a} = S_t$.
Parameters: $\alpha$ — learning rate; $\gamma$ — discount rate; $power\_t$ — inverse scaling parameter of the learning rate; $n$ — number of loan applications; $s^{CS}$ — credit score; $t^{AT}$ — acceptance threshold; $w^{RBF}$ — RBF weights; $c^{RBF}$ — RBF offset values; $\gamma^{RBF}$ — variance parameter of the normal distribution; $k$ — number of RBF components; $\tau$ — temperature parameter of the Boltzmann distribution. The RL agent is less responsive during training and more responsive during test episodes.
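The weekly agent–environment loop on this slide can be sketched with a toy environment step. The simulator below is hypothetical (the deck's actual environment is Creditstar's credit business data): it accepts applications whose score clears the threshold, and returns the acceptance-rate state and the profit reward.

```python
import random

def run_week(threshold, scores, profits_if_accepted):
    """Hypothetical one-week environment step: accept applications whose
    credit score clears the acceptance threshold; return the acceptance-rate
    state S and the profit reward R for the week."""
    accepted = [i for i, s in enumerate(scores) if s >= threshold]
    acceptance_rate = len(accepted) / len(scores)           # state S
    reward = sum(profits_if_accepted[i] for i in accepted)  # reward R
    return acceptance_rate, reward

random.seed(0)
# Illustrative data: scores in [0, 1), noisy per-application profits in EUR.
scores = [random.random() for _ in range(1000)]
profits = [random.gauss(10, 50) for _ in range(1000)]

state, reward = run_week(threshold=0.5, scores=scores, profits_if_accepted=profits)
```

The agent would observe `state`, pick the next week's threshold via its policy, and use `reward` in the Q-value update; raising the threshold lowers the acceptance rate and changes the realized profit.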
12. Reinforcement Learning (RL) (Sutton and Barto, 2017). [Diagram: the RL agent (illustrated as a dog) observes the acceptance rate state, takes an acceptance threshold action, and receives a profit (or loss) reward from the credit business environment; the following week's acceptance rate becomes the next state, and higher profit yields a higher reward.]
13. Learned Value Function Shape (after 6,000 simulated weeks of training). Notes: state denotes the application acceptance rate during the previous week, action denotes the acceptance threshold for the following week, value is the prediction of the Value Function model for a particular state-action pair, and optimum shows the state-action pair that corresponds to the highest value in the state-action space.
14. Traditional Approach (Baseline). Notes: the baseline approach follows the methodology of Verbraken et al. (2014) and Skarestad (2017).
15. Test Simulation Results 1 (shift in score distribution). Notes: figures show 100 simulation runs and their average. In each scenario the distribution of total profit differences is significantly above zero according to a one-tailed t-test. Profit is measured in thousands of euros.
16. Test Simulation Results 2 (shift in default rates). Notes: figures show 100 simulation runs and their average. In each scenario the distribution of total profit differences is significantly above zero according to a one-tailed t-test. Profit is measured in thousands of euros.
17. Performance on the Real Data 1 (acceptance threshold policy). Notes: the figure shows the difference between action variables and the baseline action. Baseline denotes the acceptance threshold optimized using the traditional approach, RL chosen denotes the one used by the RL agent, and Value Function-optimal denotes the one optimal according to the Value Function model.
18. Performance on the Real Data 2 (profits received). Note: the figure shows the difference between reward variables and the baseline reward. Baseline denotes the profits received with the acceptance threshold optimized using the traditional approach; RL received weekly and total denote profits received by the RL agent. Profit is measured in thousands of euros.
19. Implications.
• The work improves on the traditional acceptance threshold optimization approach in credit scoring of Verbraken et al. (2014) and Skarestad (2017);
• solves the problem of optimization in a dynamic, partially observed credit business environment outlined in Thomas et al. (2017) and Nikolaidis (2017);
• provides more evidence on the superiority of RL-based systems over traditional methodology, in line with Strydom (2017) and Sutton and Barto (2017);
• produces practical benefit to Creditstar Group as a decision support system.
20. Conclusions.
• The credit scoring literature usually omits the problem of acceptance threshold optimization, despite its significant impact on credit business efficiency;
• the traditional approach fails to optimize the acceptance threshold due to issues like population drift and selection bias;
• the developed RL algorithm manages to correct for flawed knowledge and successfully adapt to the real environment, significantly outperforming the traditional approach;
• being a proof of concept, our work leaves ample room for further research on and improvement of acceptance threshold optimization. Q&A
21. Supplementary Materials
22. Acceptance Threshold Optimization.
23. Traditional Approach 1 (Viaene and Dedene, 2005; Hand, 2009; Lessmann et al., 2015).
• Construct the misclassification cost function: $MC(t^{AT}; c_B, c_G) = c_B \pi_B^{PP} \left(1 - F_B(t^{AT})\right) + c_G \pi_G^{PP} F_G(t^{AT})$;
• Minimize using the first-order condition w.r.t. the acceptance threshold: $\frac{f_B(T^{AT})}{f_G(T^{AT})} = \frac{\pi_G^{PP}}{\pi_B^{PP}} \cdot \frac{c_G}{c_B}$.
Here $t^{AT}$ is the acceptance threshold, $T^{AT}$ the optimal acceptance threshold, $c_B$ and $c_G$ the average costs per misclassified bad (Type I error) and good (Type II error) application respectively, $\pi_G^{PP}$ and $\pi_B^{PP}$ the prior probabilities of being a good and a bad application respectively, and $f_G(t^{AT})$ and $f_B(t^{AT})$ the probability densities of the scores at the cut-off point $t^{AT}$ for good and bad applications respectively.
24. Traditional Approach 2 (Viaene and Dedene, 2005; Hand, 2009; Lessmann et al., 2015). Note: based on Crook et al. (2007), Hand (2009) and Verbraken et al. (2014). $s^{CS}(\boldsymbol{x})$ — the application's credit score estimated from the application data $\boldsymbol{x}$; $f_G(s^{CS})$ and $f_B(s^{CS})$ — the credit score's probability density functions for actually good and bad applications respectively; $t^{AT}$ — the acceptance threshold for the credit score; $F_B(t^{AT})$ — correctly classified bad applications; $1 - F_G(t^{AT})$ — correctly classified good applications; $1 - F_B(t^{AT})$ — bad applications misclassified as good; $F_G(t^{AT})$ — good applications misclassified as bad; the blue line is the estimated potential profit (in thousands of euros for illustration purposes); grey dotted lines show alternative acceptance thresholds $t_i^{AT}$ and the corresponding levels of potential profit; the vertical red dotted line is the estimated optimal acceptance threshold $T^{AT}$, while horizontal red dotted lines show the corresponding potential profit and the shares of correctly classified and misclassified good and bad applications.
25. RL Benefits.
• Solves optimization problems with little or no prior information about the environment (Kim et al., 2016);
• learns directly from real-time data without simplifying assumptions (Rana and Oliveira, 2015);
• dynamically adjusts the policy over the learning period, adapting to environmental changes (Abe et al., 2010);
• avoids potentially costly poor performance by training in a simulated environment or learning off-policy (Aihe and Gonzalez, 2015);
• satisfies conflicting performance goals (Varela et al., 2016);
• has been found effective in portfolio optimization problems, mainly stock and forex trading (Strydom, 2017).
26. Credit Business Environment and RL Agent (simplified scheme). The environment sends the acceptance rate state $S(A)$ and the profit reward $R(S, A)$ to the agent; the Value Function predicts Q-values $Q(S, A)$, and the Policy selects the acceptance threshold action $A(Q)$. Value update target: $Q(S, A) + \alpha\left[R + \gamma \max_a Q(S', a) - Q(S, A)\right]$. Parameters: $\alpha$ — learning rate; $\gamma$ — discount rate.
27. Value Function. The action value function (also called the Q-value function) gives the expected discounted reward of taking action $a$ in state $s$ and following policy $\pi$ thereafter: $Q^\pi(s, a) = \mathbb{E}_\pi\left[R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots \mid S_t = s, A_t = a\right]$, where $\gamma$ is the discount rate. Usually the value function is approximated by a model; in our case, we use a Gaussian Radial Basis Function (RBF) approximator and a set of Stochastic Gradient Descent (SGD) models.
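The discounted sum inside the expectation above is easy to state concretely. A small helper, for a finite reward trace:

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**i * R_{t+i} over a finite reward sequence --
    the quantity the Q-value function estimates in expectation."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# Three weekly profit rewards with discount rate 0.9:
g = discounted_return([10.0, 10.0, 10.0], gamma=0.9)  # 10 + 9 + 8.1 = 27.1
```

With `gamma` near 1 the agent values future weekly profits almost as much as the current one; with `gamma` = 0 only the immediate reward matters.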
28. Value Function (scheme). [Diagram: state → RBF transformation → 2,000 transformed features → SGD weights → 20 action values → policy → action.]
29. Forward Propagation (Prediction). Gaussian Radial Basis Function (RBF) transformation: $x = \sqrt{2/k}\,\cos\left(w^{RBF} s + c^{RBF}\right)$, $w^{RBF} \sim N(0, 2\gamma^{RBF})$, $c^{RBF} \sim U(0, 2\pi)$, where $x$ is the resulting transformed feature vector, $s$ is the input state variable, $k$ is the number of Monte Carlo samples per original feature, $w^{RBF}$ is a $k$-element vector of randomly generated RBF weights, $c^{RBF}$ is a $k$-element vector of randomly generated RBF offset values and $\gamma^{RBF}$ is the variance parameter of the normal distribution. Stochastic Gradient Descent (SGD) model for each action: $Q(w_a, s) = w_a\, RBF(s) = w_a x$, where $w_a$ is the vector of regression weights for action $a$, $s$ is the state variable, $RBF$ is the RBF transformation function, $x$ is the resulting feature vector and $Q$ is the value of action $a$ in state $s$ corresponding to the feature vector $x$. Choose the action according to the current policy: $a = \pi^{Greedy}(s) = \arg\max_a Q(s, a)$, or $\pi^{Boltzmann}(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s] = \frac{e^{Q(s,a)/\tau}}{\sum_{a' \in \mathcal{A}} e^{Q(s,a')/\tau}}$, where $\mathcal{A}$ is the set of all actions, $a'$ indexes the actions in $\mathcal{A}$ and $\tau$ is the temperature parameter of the Boltzmann distribution.
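The forward pass on this slide can be sketched end to end. The constants (`K`, `N_ACTIONS`, `GAMMA_RBF`, `TAU`) are illustrative stand-ins for the deck's tuned values, and the RBF transform is the random-Fourier-feature approximation the formula describes:

```python
import math
import random

random.seed(42)
K = 100          # number of RBF (random Fourier) components
N_ACTIONS = 20   # discrete acceptance-threshold actions
GAMMA_RBF = 1.0  # variance parameter of the weight distribution
TAU = 0.5        # Boltzmann temperature

# Randomly generated, then fixed, RBF weights and offsets:
# w ~ N(0, 2*gamma_RBF), c ~ U(0, 2*pi).
w_rbf = [random.gauss(0, math.sqrt(2 * GAMMA_RBF)) for _ in range(K)]
c_rbf = [random.uniform(0, 2 * math.pi) for _ in range(K)]

def rbf_features(s):
    """x = sqrt(2/k) * cos(w*s + c), one cosine per component."""
    return [math.sqrt(2 / K) * math.cos(w * s + c) for w, c in zip(w_rbf, c_rbf)]

# One linear SGD model (weight vector) per action, zero-initialized.
w_a = [[0.0] * K for _ in range(N_ACTIONS)]

def q_values(s):
    """Q(w_a, s) = w_a . x for every action a."""
    x = rbf_features(s)
    return [sum(wi * xi for wi, xi in zip(w, x)) for w in w_a]

def boltzmann_policy(s):
    """Sample an action with probability proportional to exp(Q / tau)."""
    q = q_values(s)
    exps = [math.exp(qa / TAU) for qa in q]
    total = sum(exps)
    r, cum = random.random(), 0.0
    for a, e in enumerate(exps):
        cum += e / total
        if r < cum:
            return a
    return N_ACTIONS - 1

action = boltzmann_policy(s=0.5)
```

With zero-initialized weights all Q-values are equal, so the Boltzmann policy samples uniformly; as training separates the Q-values, a small `TAU` concentrates probability on the best action.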
30. Backward Propagation (Learning). The approximation error is $R_t + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)$. To adjust the SGD model weights in the direction of the steepest error descent we use the update rule $w_a \leftarrow w_a + \alpha\left[R_t + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)\right]\frac{\partial Q(s,a)}{\partial w_a}$, which, under the assumption that $\gamma \max_a Q(S_{t+1}, a)$ does not depend on $w_a$, simplifies to the general SGD update rule $w_a \leftarrow w_a + \alpha\left[R_t + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)\right] S_t$, where $Q(S_t, A_t)$ can be thought of as the current model prediction, $R_t + \gamma \max_a Q(S_{t+1}, a)$ as the target, and $S_t$ as the gradient with respect to the weights.
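One step of this update is a few lines of code. The slide writes the gradient as $S_t$; in the linear-in-features model the gradient of $Q = w_a \cdot x$ with respect to $w_a$ is the feature vector of the state, so the sketch below takes that vector `x` as an argument (a minimal, hypothetical single-feature example follows):

```python
def sgd_q_update(w_a, x, a, reward, q_next_max, q_current, alpha, gamma):
    """One Q-learning SGD step on the linear model for action a:
    w_a <- w_a + alpha * [R + gamma * max_a' Q(S', a') - Q(S, A)] * x,
    where x (the state's feature vector) is the gradient of Q w.r.t. w_a."""
    td_error = reward + gamma * q_next_max - q_current
    return [w + alpha * td_error * xi for w, xi in zip(w_a[a], x)]

# Hypothetical single step: one feature, zero-initialized weights.
w_a = [[0.0], [0.0]]
x = [1.0]
new_w = sgd_q_update(w_a, x, a=0, reward=10.0, q_next_max=0.0,
                     q_current=0.0, alpha=0.1, gamma=0.9)
# TD error = 10, so the weight moves by 0.1 * 10 * 1.0 = 1.0
```

Treating the bootstrap term $\gamma \max_a Q(S_{t+1}, a)$ as a constant target, as the slide assumes, is what makes this the standard semi-gradient Q-learning update.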
31. Learning Episode. [Timeline, in weeks: the simulation starts at week −52 (Warming-up Phase); learning and state-action generation start at week 0 (Interaction Phase); state-action generation ends at week 60 (Delayed Learning Phase); learning and the simulation end at week 82.]
32. Value Function Model Convergence. [Panels: 1st episode; whole run.] Note: state denotes the application acceptance rate during the previous iteration, action denotes the acceptance threshold for the following iteration, value is the prediction of the Value Function model for a particular state-action pair, and optimum shows the state-action pair that corresponds to the highest value in the state-action space.
33. Results of the t-test for Various Distortion Scenarios.

| Scenario | t-statistic | p-value |
| --- | --- | --- |
| 1: downwards shift in score distribution | 29.56631 | 1.55E-51 |
| 2: upwards shift in score distribution | 42.72066 | 2.45E-66 |
| 3: downwards shift in default rates | 5.172688 | 5.95E-07 |
| 4: upwards shift in default rates | 4.600158 | 6.20E-06 |

Note: the t-test null hypothesis is that the mean difference between the episode reward received by the RL agent and the episode reward received using the traditional approach throughout 100 episodes is equal to or lower than zero.
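The t-statistic in this table is a one-sample test on per-episode reward differences. A minimal sketch of the computation (the reward differences below are made-up illustrative numbers, not the study's data):

```python
import statistics

def one_tailed_t_stat(diffs):
    """t-statistic for H0: mean(diffs) <= 0 vs H1: mean(diffs) > 0,
    one-sample, applied to per-episode reward differences
    (RL agent reward minus traditional-approach reward)."""
    n = len(diffs)
    mean = statistics.fmean(diffs)
    se = statistics.stdev(diffs) / n ** 0.5  # standard error of the mean
    return mean / se

# Hypothetical reward differences over eight episodes, mostly positive:
t = one_tailed_t_stat([1.2, 0.8, 1.5, 0.9, 1.1, 1.3, 0.7, 1.0])
```

The p-value then comes from the upper tail of the t-distribution with n − 1 degrees of freedom; large positive t-statistics like those in the table reject the null that the RL agent does no better than the baseline.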
34. Credit Scoring Literature.
• Thomas, L. C., D. B. Edelman, and J. N. Crook. Credit Scoring and Its Applications. SIAM, Philadelphia (2017);
• Crook, Jonathan N., David B. Edelman, and Lyn C. Thomas. "Recent developments in consumer credit risk assessment." European Journal of Operational Research 183.3 (2007): 1447-1465;
• Hand, David J. "Measuring classifier performance: a coherent alternative to the area under the ROC curve." Machine Learning 77.1 (2009): 103-123;
• Verbraken, Thomas, et al. "Development and application of consumer credit scoring models using profit-based classification performance measures." European Journal of Operational Research 238.2 (2014): 505-513;
• Viaene, Stijn, and Guido Dedene. "Cost-sensitive learning and decision making revisited." European Journal of Operational Research 166.1 (2005): 212-220;
• Lessmann, Stefan, et al. "Benchmarking state-of-the-art classification algorithms for credit scoring: An update of research." European Journal of Operational Research 247.1 (2015): 124-136;
• Oliver, R. M., and L. C. Thomas. "Optimal score cutoffs and pricing in regulatory capital in retail credit portfolios." (2009);
• Bellotti, Tony, and Jonathan Crook. "Forecasting and stress testing credit card default using dynamic models." International Journal of Forecasting 29.4 (2013): 563-574.
35. RL Literature.
• Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2017;
• Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529;
• Neuneier, Ralph. "Optimal asset allocation using adaptive dynamic programming." Advances in Neural Information Processing Systems, 1996;
• Tesauro, Gerald, et al. "A hybrid reinforcement learning approach to autonomic resource allocation." Autonomic Computing, 2006. ICAC'06. IEEE International Conference on. IEEE, 2006;
• Abe, Naoki, et al. "Optimizing debt collections using constrained reinforcement learning." Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2010;
• Kim, Byung-Gook, et al. "Dynamic pricing and energy consumption scheduling with reinforcement learning." IEEE Transactions on Smart Grid 7.5 (2016): 2187-2198;
• Sato, Masamichi. "Quantitative Realization of Behavioral Economic Heuristics by Cognitive Category: Consumer Behavior Marketing with Reinforcement Learning." (2016);
• Strydom, Petrus. "Funding optimization for a bank integrating credit and liquidity risk." Journal of Applied Finance and Banking 7.2 (2017): 1;
• Aihe, David O., and Avelino J. Gonzalez. "Correcting flawed expert knowledge through reinforcement learning." Expert Systems with Applications 42.17-18 (2015): 6457-6471;
• Rana, Rupal, and Fernando S. Oliveira. "Dynamic pricing policies for interdependent perishable products or services using reinforcement learning." Expert Systems with Applications 42.1 (2015): 426-436;
• Varela, Martín, Omar Viera, and Franco Robledo. "A q-learning approach for investment decisions." Trends in Mathematical Economics. Springer, Cham, 2016. 347-368.