Predicting Bank Customer Churn Using Classification
Assignment - 02
RMIT University
Authors
1) Hewayalage Vishva Lahiru Kantha Abeyrathne (s3735195)
Student
RMIT University, Melbourne City Campus
s3735195@student.rmit.edu.au
2) Kodithuwakku Arachchige Iresh Udara Kaushalya (s3704769)
Student
RMIT University, Melbourne City Campus
s3704769@student.rmit.edu.au
Date of Report: 2nd of June 2019
Table of Contents
1. Introduction
2. Methodology
2.1 Dataset
2.2 Data Pre-processing
2.3 Data Exploration
2.3.1 Exploring columns
2.3.2 Relationship of Features with Target Feature
2.4 Feature Selection and Ranking
2.5 Model Fitting
2.5.1 K-Nearest Neighbour (KNN)
2.5.2 Hyper Parameter Tuning with K-Nearest Neighbour (KNN)
2.5.3 Decision Tree (DT)
2.5.4 Hyper Parameter Tuning with Decision Tree (DT)
3. Results
3.1 Evaluation of K-Nearest Neighbour (KNN) with default parameters
3.2 Evaluation of Decision Tree (DT) with default parameters
3.3 Evaluation of K-Nearest Neighbour (KNN) with Hyper Parameter Tuning
3.4 Evaluation of Decision Tree (DT) with Hyper Parameter Tuning
3.5 Confusion matrix of KNN and DT
3.6 Classification Error Rate of KNN and DT
3.7 Precision, Recall and F1-Score of KNN and DT
4. Discussion
5. Conclusion
6. References
Abstract
The main purpose of this report was to predict customer churn using data about a particular bank
and its customers with the support of classification. The dataset was obtained from the Kaggle
repository, and the necessary pre-processing tasks were performed prior to classification. K-Nearest
Neighbours and Decision Trees were comparatively evaluated with parameter tuning in order to identify
the best model. The Decision Tree classifier with parameter tuning was selected as the best model
after the evaluation process. The report concludes that the selected model performs well on unseen
data for customers who are not going to churn, whereas its predictive power decays for customers who
are going to churn. It is recommended to feed more data on customers who are going to churn into the
model in order to obtain better outcomes.
1. Introduction
With the advancement of the data science and data analytics fields, their technologies have become
immensely popular across many domains and industries. They are even more valuable for the banking
domain, since banks deal with large volumes of data on a daily basis. One of the main requirements of
a bank is to predict customer churn in order to retain its most valuable customers rather than lose
them to another bank. Churn prediction has been studied in previous years owing to increasing demand
from the banking sector [1-2]. This report discusses a project that aims to identify the better
classification model out of K-Nearest Neighbours and Decision Trees in order to better predict the
customers who are going to churn using related data.
2. Methodology
The main steps of the model building procedure are data collection, data pre-processing, training
two classification algorithms on the data, and evaluating the built models using test data. The
methodology is divided into subsections: Section 2.1 describes the dataset, Section 2.2 describes the
necessary data pre-processing steps, Section 2.3 explores the data and their relationships using
several visualizations, Section 2.4 covers feature selection and ranking, and Section 2.5 discusses
the classification tasks using the selected algorithms. Finally, the algorithms are evaluated using
test data to identify the best-suited model for solving the problem.
2.1 Dataset
The dataset relates to the banking domain, as the main goal of this project is to predict customer
churn using classification. It was obtained from the Kaggle repository, which is popular in the data
science research area. It consists of 10,000 observations and 14 columns, where each column
represents data related to a customer. ‘Exited’ is the binary target feature of the dataset, which
states whether a customer is going to churn or not. All the attributes, their descriptions and their
possible ranges are shown in Table 2.1.
Table 2.1. Description of the dataset
2.2 Data Pre-processing
Python Pandas functions were used to search for unnecessary whitespace and data entry errors in the
categorical variables as the very first step of data pre-processing. In the next step, each column
was checked for missing values, and none were found in the dataset. Summary statistics were used to
detect any outliers or impossible values in the numerical attributes. Finally, the attributes Row
Number, Customer ID and Surname were removed from the dataset, since they have no impact on the final
prediction results of the classification.
Attribute | Description | Possible Ranges
Row Number (Integer) | Row number in the dataset | 1 – 10 000
Customer ID (Integer) | ID of the bank customer | Random numbers
Surname (Character) | Surname of the customer |
Credit Score (Integer) | Score based on customer behaviour | 350 – 850
Geography (Categorical) | Countries of the respective customers | France, Germany, Spain
Gender (Categorical) | Gender of the customer | Male, Female
Age (Integer) | Age of the customer | 18 – 92
Tenure (Integer) | Period of having the account in months | 0 – 10
Balance | Balance of the customer’s account | 0 – 25 089.09
Number of Products (Integer) | Number of accounts that customer has | 1 – 4
Has Credit Card (Categorical) | Does the customer have a credit card | Yes = 1, No = 0
Is Active Member (Categorical) | Is customer an active member | Yes = 1, No = 0
Estimated Salary | Estimated salary of customer | 11.58 – 199992.48
Exited (Categorical) | Customer is going to churn or not | Yes = 1, No = 0
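The pre-processing steps described in Section 2.2 could be sketched roughly as follows. This is a minimal illustration, not the code used in the project: the column names (RowNumber, CustomerId, Surname and so on), the helper name preprocess and the three sample rows are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical three-row sample; column names and values are assumed for illustration.
raw = pd.DataFrame({
    "RowNumber": [1, 2, 3],
    "CustomerId": [101, 102, 103],
    "Surname": ["Smith", "Jones", "Brown"],
    "Geography": [" France", "Spain ", "France"],
    "Gender": ["Female", "Female ", "Male"],
    "CreditScore": [619, 608, 502],
    "Exited": [1, 0, 1],
})

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Strip stray whitespace in categorical columns and drop identifier columns."""
    out = df.copy()
    for col in ["Geography", "Gender"]:
        out[col] = out[col].str.strip()
    return out.drop(columns=["RowNumber", "CustomerId", "Surname"])

clean = preprocess(raw)
missing_per_column = clean.isna().sum()  # number of missing values in each column
summary = clean.describe()               # summary statistics for numerical columns
```

Inspecting missing_per_column and summary corresponds to the missing-value and outlier checks mentioned in the text.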
2.3 Data Exploration
2.3.1 Exploring columns
A pie chart of the target variable shows that 79.63% of the data relate to retained customers, while
the other 20.37% relate to exited customers. A bar chart shows that the majority of customers are
from France, with Germany and Spain in second and third place respectively. Another bar chart shows
that the majority of customers are male. Further bar charts were used to observe customers with
credit cards and active members: the majority of customers are active members, and most of them hold
credit cards with the bank.
Numerical columns were explored using histograms. The Credit Score feature showed a symmetric
distribution. Age showed a right-skewed histogram, indicating fewer older customers. The Tenure
column displayed an almost uniform distribution, with lower counts at 0 and 10. A symmetric
distribution was observed for the Balance variable, apart from a spike at 0 representing customers
with zero balance. The histogram of the Number of Products variable indicates that most customers
have one or two accounts. Finally, the histogram of Estimated Salary showed a uniform distribution
across the range of customers' estimated salaries.
2.3.2 Relationship of Features with Target Feature
This section focuses on the relationship between each feature and the target feature, in order to
assess the impact of each feature on the final outcome of the prediction.
Figure 2.3.2.1. Relationship between Categorical Features and Target Feature
As can be observed from the above figure, France has the majority of the customers dealing with the
bank, while Germany has the lowest proportion. The graph was plotted with the expectation that France
would produce the most customers who are going to churn. However, there is almost an inverse
relationship with the target feature: customers from Germany have the highest probability of leaving
the bank. The bank needs to focus more on countries where it has fewer customers.
The majority of customers are male, so more male customers were expected to leave the bank.
Surprisingly, according to the above figure, female customers are more likely to move to another
bank. Focusing more on female customers, or offering them benefits with their accounts, would be one
way to stop them churning.
Customers without credit cards were expected to make up more of the churned proportion. However,
customers with a credit card have a much higher probability of churning than customers who do not
hold any credit card with the bank. This is a surprising aspect of the dataset.
As expected, the above figure shows that customers who are not active with the bank have the highest
probability of leaving, so the bank should find a way to engage them.
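The comparisons between categorical features and the target can also be expressed numerically. The sketch below uses a hypothetical six-row table (the values and the column names Geography, IsActiveMember and Exited are assumptions) to show how a class share and a per-category churn rate could be computed with Pandas.

```python
import pandas as pd

# Hypothetical toy records; values are illustrative only, not taken from the dataset.
df = pd.DataFrame({
    "Geography": ["France", "France", "Germany", "Germany", "Spain", "France"],
    "IsActiveMember": [1, 1, 0, 0, 1, 1],
    "Exited": [0, 0, 1, 1, 0, 1],
})

# Share of each target class (the quantity a pie chart of the target displays).
class_share = df["Exited"].value_counts(normalize=True)

# Churn rate per category: the mean of a 0/1 target is the proportion of 1s.
churn_by_country = df.groupby("Geography")["Exited"].mean()
churn_by_activity = df.groupby("IsActiveMember")["Exited"].mean()
```

Grouping by the feature and averaging the binary target gives the same information as the stacked bar charts, in a form that is easy to compare across categories.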
Figure 2.3.2.2. Relation between Numerical Features and Target Feature
As expected, the above figure shows little difference between the groups, indicating that a
customer's credit score has no impact on the target feature. According to the above figure, older
customers tend to churn at a higher rate than younger customers, and the bank needs to reconsider
its approach to its target market. This was contrary to the hypothesis made when plotting the graph.
Customers who have spent a short time or a much longer time with the bank are more likely to churn
than customers with an average tenure, which is an expected result.
According to the above figure, and contrary to expectation, the bank is more likely to lose customers
with higher account balances. This would have a negative effect on the bank, so necessary facilities
or various loan options have to be introduced to retain those valuable customers. The number of
accounts does not seem to have an impact on the target feature, in line with the hypothesis. From the
above figure, it is clear that a customer's estimated salary has little predictive power for the
final outcome, as expected under the hypothesis.
2.4 Feature Selection and Ranking
Features were ranked according to their importance, or contribution to the final outcome, using
Random Forest Importance. The attributes 'Age', 'Estimated Salary', 'Credit Score', 'Balance' and
'Number of Products' gained the highest scores, in that order, while the other five attributes had
scores similar to one another and below those of the top five. Finally, all 10 attributes were
considered for the model building phase.
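A ranking of this kind could be produced along the following lines. This is only a sketch: it uses synthetic data from scikit-learn's make_classification as a stand-in for the bank dataset, and the feature names, number of trees and random seeds are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the 10 features; not the bank data.
X, y = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Sort features from most to least important.
order = np.argsort(forest.feature_importances_)[::-1]
ranking = [(feature_names[i], float(forest.feature_importances_[i])) for i in order]
```

The importance scores sum to 1, so each value can be read as a feature's relative share of the forest's total impurity reduction.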
2.5 Model Fitting
Since this is a classification problem, K-Nearest Neighbour (KNN) and Decision Tree (DT) classifiers
were fitted to the banking data in order to identify the better classifier of the two. Prior to
fitting the data with the selected classifiers, the categorical attributes in the dataset were
converted to numerical attributes using one-hot encoding so that they could be fed into the
algorithms.
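One-hot encoding of the categorical attributes might look like the sketch below, which uses pandas.get_dummies on a hypothetical three-row sample. The column names and values are assumptions; the report does not state which function was used.

```python
import pandas as pd

# Hypothetical sample with two categorical columns and one numerical column.
df = pd.DataFrame({
    "Geography": ["France", "Germany", "Spain"],
    "Gender": ["Male", "Female", "Male"],
    "Age": [42, 35, 51],
})

# Each category becomes its own 0/1 indicator column; numerical columns are kept as-is.
encoded = pd.get_dummies(df, columns=["Geography", "Gender"], dtype=int)
```

After encoding, Geography is represented by three indicator columns and Gender by two, so distance-based methods such as KNN no longer see an arbitrary ordering between categories.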
2.5.1 K-Nearest Neighbour (KNN)
As the first step of training with the KNN algorithm, an adequate value of k must be specified to
achieve good results. K-fold cross-validation was used to find the best k over the range 1 – 7. The
optimal value was selected as 5, as it gave the lowest misclassification error.
Figure 2.5.1.1. Result of misclassification error vs k.
The KNN algorithm was trained using several train/test splits (80%:20%, 60%:40% and 50%:50%) in
order to identify the best split. Default parameters were used as the initial step.
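The search for k described above can be sketched as follows. The data here are synthetic (make_classification), and the number of folds is not stated in the report, so cv=10 is an assumption; only the range 1 to 7 comes from the text.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the encoded bank data.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# Mean misclassification error for each candidate k, estimated by cross-validation.
errors = {}
for k in range(1, 8):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y,
                             cv=10, scoring="accuracy")
    errors[k] = 1.0 - scores.mean()

# The k with the lowest cross-validated error is selected.
best_k = min(errors, key=errors.get)
```

Plotting errors against k would give a curve of the kind shown in Figure 2.5.1.1.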
2.5.2 Hyper Parameter Tuning with K-Nearest Neighbour (KNN)
Having identified the best train/test split with the default parameter settings, the parameter values
were varied in order to maximise the performance of the KNN model. Distance-based weighting was used
instead of the default ‘uniform’ parameter, with p values in the range [0-2], in order to have the
Minkowski, Manhattan and Euclidean distance-based methods, respectively, in contention.
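A tuning loop of this kind could be written as below, assuming scikit-learn's KNeighborsClassifier. In that class the Minkowski metric with p=1 is the Manhattan distance and with p=2 is the Euclidean distance; p = 0 is left out of the sketch because the Minkowski distance is not defined for it. The data are synthetic and the split seed is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the encoded bank data, split 80%:20%.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

accuracy_by_p = {}
for p in (1, 2):  # p=1: Manhattan distance, p=2: Euclidean distance
    knn = KNeighborsClassifier(n_neighbors=5, weights="distance",
                               metric="minkowski", p=p)
    knn.fit(X_train, y_train)
    accuracy_by_p[p] = knn.score(X_test, y_test)  # accuracy on the held-out 20%
```

With weights="distance", closer neighbours receive a larger vote than distant ones, in contrast to the default "uniform" setting where all k neighbours count equally.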
2.5.3 Decision Tree (DT)
After KNN, the data were separately trained using DT with default parameters, using the same
train/test splits (80%:20%, 60%:40% and 50%:50%) as for KNN.
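Comparing the three splits with a default Decision Tree might be sketched as follows. Synthetic data stand in for the bank dataset, and the random seeds are assumptions added so the example is repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded bank data.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

accuracy_by_split = {}
for test_size in (0.2, 0.4, 0.5):  # 80%:20%, 60%:40% and 50%:50% splits
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=0)
    tree = DecisionTreeClassifier(random_state=0)  # otherwise default parameters
    tree.fit(X_tr, y_tr)
    accuracy_by_split[test_size] = tree.score(X_te, y_te)
```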
2.5.4 Hyper Parameter Tuning with Decision Tree (DT)
The default parameter values were changed so that the maximum depth of the tree was 3, the minimum
number of samples for a leaf node was 5 and the minimum number of samples required for a split was 2.
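With scikit-learn's DecisionTreeClassifier, these settings would correspond to the max_depth, min_samples_leaf and min_samples_split arguments, as in the sketch below. The mapping to that class is an assumption, and the data are again synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded bank data, split 80%:20%.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Parameter values as stated in the text.
tuned_tree = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, min_samples_split=2, random_state=0)
tuned_tree.fit(X_tr, y_tr)

tuned_accuracy = tuned_tree.score(X_te, y_te)
fitted_depth = tuned_tree.get_depth()  # never exceeds max_depth
```

Limiting the depth and leaf size constrains how closely the tree can fit the training data, which is one plausible reason a shallow tree can score better on held-out data than an unconstrained one.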
3. Results
Evaluation results of both the classification models are discussed in this section.
3.1 Evaluation of K-Nearest Neighbour (KNN) with default parameters
Table 3.1.1. Classification Results of KNN model
Classification Model | Train/Test Split | Accuracy
KNN | 50% - 50% | 75.1%
KNN | 60% - 40% | 75.63%
KNN | 80% - 20% | 76.45%
The maximum accuracy was produced by the 80% - 20% train/test split, with an accuracy level of
76.45%.
3.2 Evaluation of Decision Tree (DT) with default parameters
Table 3.2.1. Classification Results of DT model
Classification Model | Train/Test Split | Accuracy
DT | 50% - 50% | 78.42%
DT | 60% - 40% | 78.23%
DT | 80% - 20% | 79.5%
The 80% - 20% train/test split provided the maximum accuracy for DT as well, with an accuracy level
of 79.5%.
3.3 Evaluation of K-Nearest Neighbour (KNN) with Hyper Parameter Tuning
Table 3.3.1. Classification Results of KNN model with parameter tuning
Classification Model | Accuracy
KNN (Minkowski) | 74.0%
KNN (Manhattan) | 74.7%
KNN (Euclidean) | 74.0%
With parameter tuning, the Manhattan distance method provided the best result, 74.7%, compared to
the other distance-based methods. However, it was not able to surpass the accuracy level obtained
using the default parameter settings.
3.4 Evaluation of Decision Tree (DT) with Hyper Parameter Tuning
Table 3.4.1. Classification Results of DT model with parameter tuning
Classification Model | Parameters | Accuracy
DT | Max depth=3, Min sample split=2, Min sample leaf=5 | 84.25%
With the changed parameter values, the DT algorithm performed well on this particular dataset, with
a higher accuracy rate of 84.25%.
3.5 Confusion matrix of KNN and DT
A confusion matrix is a means of assessing the performance of a model.
[[1430 65]
[ 355 50]]
The above confusion matrix was obtained from the KNN model with default parameters, which provided
the best accuracy. From the matrix, the model predicts class value 0 as 0 for 1430 times, while it
predicts 0 as 1 for 65 times. The model is less effective at predicting class value 1: it predicts
1 as 1 only 50 times, while it predicts 1 as 0 for 355 times.
[[1576 19]
[ 296 109]]
The above matrix shows the results of the DT model with hyper parameter tuning, which had the best
accuracy for that model. The results show that DT also predicts class value 0 with a high level of
accuracy, while its prediction of class value 1 is quite poor.
3.6 Classification Error Rate of KNN and DT
Classification error can be stated, in the simplest terms, as 1 – accuracy. From the results of the
KNN and DT models in the previous sections, their best accuracy levels were 76.45% and 84.25%
respectively. Therefore, the error rate of the KNN model is 23.55%, whereas DT has an error rate of
15.75%.
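As a worked check, the accuracy and error rate of the tuned DT model can be recomputed directly from its confusion matrix in Section 3.5. The variable names below are illustrative; the counts are those reported in the matrix.

```python
# Confusion matrix of the tuned DT model from Section 3.5:
# rows are actual classes (0, 1), columns are predicted classes (0, 1).
dt_matrix = [[1576, 19],
             [296, 109]]

correct = dt_matrix[0][0] + dt_matrix[1][1]     # diagonal: correctly classified
total = sum(sum(row) for row in dt_matrix)      # all test observations

accuracy = correct / total
error_rate = 1 - accuracy
```

Here 1685 of the 2000 test observations lie on the diagonal, which gives the reported accuracy of 84.25% and error rate of 15.75%.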
3.7 Precision, Recall and F1-Score of KNN and DT
Precision, Recall and F1-Score are measures used to interpret a model's performance. Precision is
the ratio of correctly predicted positive observations to the total predicted positive observations.
Recall is the ratio of correctly predicted positive observations to all observations in class ‘yes’.
F1-Score is the harmonic mean of Precision and Recall.
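These definitions can be applied to the tuned DT confusion matrix from Section 3.5 as a worked example. The helper name class_metrics is hypothetical, and the values are computed from the matrix counts rather than read from the figures in this section.

```python
def class_metrics(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) for one class from its counts."""
    precision = tp / (tp + fp)   # correct positives / all predicted positives
    recall = tp / (tp + fn)      # correct positives / all actual positives
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Tuned DT matrix: [[1576, 19], [296, 109]] (rows actual, columns predicted).
# Treating class 1 (churn) as positive: TP=109, FP=19, FN=296.
churn_precision, churn_recall, churn_f1 = class_metrics(tp=109, fp=19, fn=296)

# Treating class 0 (retained) as positive: TP=1576, FP=296, FN=19.
stay_precision, stay_recall, stay_f1 = class_metrics(tp=1576, fp=296, fn=19)
```

On these counts, recall for class 1 is about 0.27 (109 of 405 churners found), compared with about 0.99 for class 0 (1576 of 1595), which is where the gap between the two classes shows most clearly.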
Figure 3.7.1. Performance measurements of optimized KNN
Figure 3.7.2. Performance measurements of optimized DT
Having analysed all the performance measurements, it can be stated that both algorithms work better
for class 0 and are not ideal for predicting class 1, though DT has somewhat more capability for
this than KNN.
4. Discussion
Having assessed the results for both classification models, the Decision Tree classifier can be
selected as the best model for this project, with an accuracy rate of 84.25% surpassing the 76.45%
accuracy level of the K-Nearest Neighbour classifier. This can be considered a very high accuracy
rate given the data. The only drawback found in the model is its inability to predict class 1 with
high accuracy. The precision value for that class confirmed this, being lower than that of class 0.
Class 1 of the target feature relates to the customers who are going to churn, so this model works
well for predicting the customers who are not going to move from this bank to another.
5. Conclusion
In this report, a comparative study of the K-Nearest Neighbours classifier and the Decision Tree
classifier was carried out to predict the customers who are going to move to another bank, using
banking data from the Kaggle repository. The Decision Tree model with parameter tuning was selected
as the best model based on the accuracy results and performance measures. The selected model lacks
the ability to predict customers who are going to churn, while performing well for customers who are
not going to churn. This can be further improved by feeding more data related to that particular
class over time.
6. References
[1] Hadden, J., Tiwari, A., Roy, R., & Ruta, D. (2007). Computer assisted customer churn
management: State-of-the-art and future trends. Computers & Operations Research, 34(10), 2902-2917.
[2] Sayed, H., Abdel-Fattah, M. A., & Kholief, S. (2018). Predicting Potential Banking Customer
Churn using Apache Spark ML and MLlib Packages: A Comparative Study. International Journal of
Advanced Computer Science and Applications, 9(11), 674-677.

 
3Individual Assignment Social, Ethical and Legal Implicat.docx
3Individual Assignment Social, Ethical and Legal Implicat.docx3Individual Assignment Social, Ethical and Legal Implicat.docx
3Individual Assignment Social, Ethical and Legal Implicat.docx
 
Review on Sentiment Analysis on Customer Reviews
Review on Sentiment Analysis on Customer ReviewsReview on Sentiment Analysis on Customer Reviews
Review on Sentiment Analysis on Customer Reviews
 
NMIMS Semester 1 Assignment Solution Dec 2021
NMIMS  Semester 1 Assignment Solution Dec  2021 NMIMS  Semester 1 Assignment Solution Dec  2021
NMIMS Semester 1 Assignment Solution Dec 2021
 
IRJET- E-Commerce Recommender System using Data Mining Algorithms
IRJET-  	  E-Commerce Recommender System using Data Mining AlgorithmsIRJET-  	  E-Commerce Recommender System using Data Mining Algorithms
IRJET- E-Commerce Recommender System using Data Mining Algorithms
 
Bank Customer Segmentation & Insurance Claim Prediction
Bank Customer Segmentation & Insurance Claim PredictionBank Customer Segmentation & Insurance Claim Prediction
Bank Customer Segmentation & Insurance Claim Prediction
 
Machine Learning Approaches to Predict Customer Churn in Telecommunications I...
Machine Learning Approaches to Predict Customer Churn in Telecommunications I...Machine Learning Approaches to Predict Customer Churn in Telecommunications I...
Machine Learning Approaches to Predict Customer Churn in Telecommunications I...
 
Clustering
ClusteringClustering
Clustering
 
Automated Feature Selection and Churn Prediction using Deep Learning Models
Automated Feature Selection and Churn Prediction using Deep Learning ModelsAutomated Feature Selection and Churn Prediction using Deep Learning Models
Automated Feature Selection and Churn Prediction using Deep Learning Models
 
IRJET - An Overview of Machine Learning Algorithms for Data Science
IRJET - An Overview of Machine Learning Algorithms for Data ScienceIRJET - An Overview of Machine Learning Algorithms for Data Science
IRJET - An Overview of Machine Learning Algorithms for Data Science
 
Big data analytics for telecom operators final use cases 0712-2014_prof_m erdas
Big data analytics for telecom operators final use cases 0712-2014_prof_m erdasBig data analytics for telecom operators final use cases 0712-2014_prof_m erdas
Big data analytics for telecom operators final use cases 0712-2014_prof_m erdas
 
Big data analytics for telecom operators final use cases 0712-2014_prof_m erdas
Big data analytics for telecom operators final use cases 0712-2014_prof_m erdasBig data analytics for telecom operators final use cases 0712-2014_prof_m erdas
Big data analytics for telecom operators final use cases 0712-2014_prof_m erdas
 
Essay On Math 533
Essay On Math 533Essay On Math 533
Essay On Math 533
 

Último

THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONHumphrey A Beña
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...Nguyen Thanh Tu Collection
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4MiaBumagat1
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designMIPLM
 
ROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxVanesaIglesias10
 
Music 9 - 4th quarter - Vocal Music of the Romantic Period.pptx
Music 9 - 4th quarter - Vocal Music of the Romantic Period.pptxMusic 9 - 4th quarter - Vocal Music of the Romantic Period.pptx
Music 9 - 4th quarter - Vocal Music of the Romantic Period.pptxleah joy valeriano
 
Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4JOYLYNSAMANIEGO
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptxmary850239
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Mark Reed
 
Karra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxKarra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxAshokKarra1
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfJemuel Francisco
 
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...Postal Advocate Inc.
 
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...JojoEDelaCruz
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for BeginnersSabitha Banu
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxAnupkumar Sharma
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSJoshuaGantuangco2
 

Último (20)

THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4
 
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptxYOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-design
 
ROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptx
 
Raw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptxRaw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptx
 
Music 9 - 4th quarter - Vocal Music of the Romantic Period.pptx
Music 9 - 4th quarter - Vocal Music of the Romantic Period.pptxMusic 9 - 4th quarter - Vocal Music of the Romantic Period.pptx
Music 9 - 4th quarter - Vocal Music of the Romantic Period.pptx
 
Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4
 
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptxFINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)
 
Karra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxKarra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptx
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
 
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
 
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
 
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptxLEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for Beginners
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
 

Report 190804110930

Abstract

The main purpose of this report was to predict customer churn using data related to a particular bank and its customers with the support of classification. The dataset was obtained from the Kaggle repository, and the necessary pre-processing tasks were performed prior to classification. K-Nearest Neighbours and Decision Trees were comparatively evaluated with parameter tuning in order to identify the best model. The Decision Tree classifier with parameter tuning was selected as the best model after the evaluation process. The report concludes that the selected model performs well on unseen data for customers who are not going to churn, whereas its predictive power decays on data for customers who are going to churn. It is recommended to feed more data on customers who are going to churn into the model in order to obtain better outcomes.

1. Introduction

With the advancement of data science and data analytics, their technologies have become immensely popular across many domains and industries. They are even more valuable in the banking domain, since banks deal with large volumes of data on a daily basis. One of the main requirements of a bank is to predict customer churn in order to retain its most valuable customers rather than lose them to another bank. Churn prediction has been a subject of research in previous years due to the increasing demand from the banking sector [1-2]. This report discusses a project that aims to identify the better classification model out of K-Nearest Neighbours and Decision Trees for predicting which customers are going to churn.

2. Methodology

The main steps of the model building procedure are data collection, data pre-processing, training with two classification algorithms, and evaluation of the built models using test data.
The methodology is divided into subsections: one analyses the dataset, one describes the necessary data pre-processing steps, one explores the data and their relationships using several visualizations, and one discusses the classification tasks using the selected algorithms. Finally, the algorithms are evaluated using test data to identify the model best suited to the problem.

2.1 Dataset

The dataset comes from the banking domain, as the main goal of this project is to predict customer churn using classification. It was obtained from the Kaggle repository, which is popular in data science research. It consists of 10,000 observations and 14 columns, where each column holds data related to a customer. 'Exited' is the binary target feature of the dataset and states whether a customer is going to churn or not. All attributes, with their descriptions and possible ranges, are shown below in Table 2.1.
Table 2.1. Description of the dataset

| Attribute | Description | Possible Ranges |
|---|---|---|
| Row Number (Integer) | Row number in the dataset | 1 - 10 000 |
| Customer ID (Integer) | ID of the bank customer | Random numbers |
| Surname (Character) | Surname of the customer | |
| Credit Score (Integer) | Score based on customer behaviour | 350 - 850 |
| Geography (Categorical) | Countries of the respective customers | France, Germany, Spain |
| Gender (Categorical) | Gender of the customer | Male, Female |
| Age (Integer) | Age of the customer | 18 - 92 |
| Tenure (Integer) | Period of having the account in months | 0 - 10 |
| Balance | Balance of the customer's account | 0 - 25 089.09 |
| Number of Products (Integer) | Number of accounts that the customer has | 1 - 4 |
| Has Credit Card (Categorical) | Does the customer have a credit card | Yes - 1, No - 0 |
| Is Active Member (Categorical) | Is the customer an active member | Yes - 1, No - 0 |
| Estimated Salary | Estimated salary of the customer | 11.58 - 199992.48 |
| Exited (Categorical) | Customer is going to churn or not | Yes - 1, No - 0 |

2.2 Data Pre-processing

As the first pre-processing step, Python Pandas functions were used to search for unnecessary whitespace and data entry errors in the categorical variables. Next, each column was checked for missing values, and none were found in the dataset. Summary statistics were used to detect any outliers or impossible values in the numerical attributes. Finally, the attributes Row Number, Customer ID and Surname were removed from the dataset, since they have no impact on the final prediction results of the classification.
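The pre-processing steps of Section 2.2 can be sketched roughly as follows. This is only an illustration under stated assumptions: the three-row DataFrame, its values and the column names (`RowNumber`, `CustomerId`, and so on) are hypothetical stand-ins, not the report's actual data or code.

```python
import pandas as pd

# Hypothetical miniature stand-in for the bank dataset (values are invented).
raw = pd.DataFrame({
    "RowNumber": [1, 2, 3],
    "CustomerId": [101, 102, 103],
    "Surname": ["Alpha", "Beta", "Gamma"],
    "Geography": [" France", "Spain ", "Germany"],
    "Gender": ["Female", "Male", "Female "],
    "Age": [42, 41, 39],
    "Exited": [1, 0, 1],
})

cleaned = raw.copy()

# Step 1: remove stray whitespace from the categorical columns.
for col in ["Geography", "Gender"]:
    cleaned[col] = cleaned[col].str.strip()

# Step 2: count missing values per column.
missing_per_column = cleaned.isna().sum()

# Step 3: summary statistics to spot outliers or impossible values.
summary = cleaned.describe()

# Step 4: drop identifier columns that carry no predictive information.
cleaned = cleaned.drop(columns=["RowNumber", "CustomerId", "Surname"])

print(missing_per_column.sum())
print(list(cleaned.columns))
```

On the real data the same calls would be applied to the full 10,000-row table; the summary statistics would then be inspected by hand rather than checked automatically.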
2.3 Data Exploration

2.3.1 Exploring columns

A pie chart was plotted to explore the class proportions of the target variable: 79.63% of the data relate to retained customers, while the other 20.37% relate to exited customers. A bar chart showed that the majority of customers are from France, with Germany and Spain in second and third place respectively. Another bar chart showed that the majority of customers are male. Further bar charts were used to examine credit card ownership and active membership: the majority of customers are active members, and most of them hold a credit card with the bank.

Numerical columns were explored using histograms. The Credit Score feature showed a symmetric distribution. Age showed a right-skewed histogram, reflecting a smaller number of older customers. The Tenure column displayed an almost uniform distribution, with lower counts at 0 and 10. A symmetric distribution was observed in the Balance variable, apart from a spike at 0 that indicates customers with zero balance. The histogram of the Number of Products variable shows that most customers have one or two accounts. Finally, the histogram of Estimated Salary showed a uniform distribution spread across the different salary ranges of the customers.

2.3.2 Relationship of Features with Target Feature

This section focuses on the relationship of each feature with the target feature, in order to find out the impact of those features on the final outcome of the prediction.

Figure 2.3.2.1. Relationship between Categorical Features and Target Feature
As can be observed from the above figure, France has the majority of the customers dealing with the bank, while Germany has the lowest proportion. The graph was plotted with the expectation that France would produce the most customers who are going to churn. However, there is almost an inverse relationship with the target feature: customers from Germany have the highest probability of leaving the bank. The bank needs to focus more on countries where it has fewer customers.

The majority of the customers are male, so more male customers were expected to leave the bank. Surprisingly, according to the above figure, female customers are more likely to move to another bank. Focusing more on women, or offering them benefits with their accounts, would be one way to stop them churning.

Customers with no credit card were expected to be more common in the churned proportion. However, customers with a credit card have a higher probability of churning than customers who do not hold any credit card with the bank. This is a surprising aspect of the dataset.

As can be seen in the above figure, and as expected, customers who are not active with the bank have the highest probability of leaving, and the bank should find a way to engage them.

Figure 2.3.2.2. Relation between Numerical Features and Target Feature
As expected, the above figure shows little difference between the two classes with respect to credit score, which has no visible impact on the target feature.

According to the above figure, older customers tend to churn at a higher rate than younger customers, and the bank needs to reconsider its approach to its target market. This was contrary to the hypothesis made when plotting the graph.

Customers who have spent either a short time or a much longer time with the bank are more likely to churn than customers with an average tenure, which is an expected result.

According to the above figure, and contrary to expectation, the bank is more likely to lose customers with higher account balances. This would have a negative effect on the bank, and the necessary facilities or various loan options have to be introduced to retain those valuable customers.

The number of accounts does not seem to have an impact on the target feature, in line with the hypothesis.

From the above figure, it is clear that the estimated salary of a customer does not have much predictive power for the final outcome, as expected according to the hypothesis.

2.4 Feature Selection and Ranking

Features were ranked according to their importance, or contribution to the final outcome, using Random Forest Importance. The attributes 'Age', 'Estimated Salary', 'Credit Score', 'Balance' and 'Number of Products' gained the highest scores, in that order, while the other five attributes shared almost the same score as one another. Finally, all 10 attributes were considered for the model building phase.

2.5 Model Fitting

Since this is a classification problem, K-Nearest Neighbour (KNN) and Decision Tree (DT) classifiers were used to fit the model to the banking data in order to identify the better classifier of the two.
Prior to fitting the data with the selected classifiers, the categorical attributes in the dataset were converted to numerical attributes using one-hot encoding so that they could be fed into the algorithms.

2.5.1 K-Nearest Neighbour (KNN)

As the first step of training with the KNN algorithm, an adequate value of k must be specified to achieve good results. K-fold cross-validation was used to find the best k over the range 1 - 7. The optimal value was selected as 5, as it gave the lowest misclassification error.
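The one-hot encoding and the cross-validated search for k described above might look roughly like the sketch below. It assumes scikit-learn, which the report does not name, and runs on synthetic data; the column names, the 10 folds and the random seed are illustrative assumptions, so the selected k will not necessarily match the report's value of 5.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the numeric features and the binary 'Exited' target.
X_num, y = make_classification(n_samples=300, n_features=5, random_state=0)
features = pd.DataFrame(X_num, columns=[f"num_{i}" for i in range(5)])

# Hypothetical categorical column, one-hot encoded before fitting.
features["Geography"] = ["France", "Germany", "Spain"] * 100
encoded = pd.get_dummies(features, columns=["Geography"], dtype=float)

# Mean cross-validated misclassification error for each candidate k in 1..7.
errors = {}
for k in range(1, 8):
    model = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(model, encoded, y, cv=10, scoring="accuracy")
    errors[k] = 1.0 - scores.mean()

# Pick the k with the lowest misclassification error.
best_k = min(errors, key=errors.get)
print(best_k, round(errors[best_k], 3))
```

Plotting `errors` against k would give a curve of misclassification error versus k.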
Figure 2.5.1.1. Result of misclassification error vs k.

The dataset was trained using the KNN algorithm with several train/test splits (80%:20%, 60%:40% and 50%:50%) in order to identify the best split. Default parameters were used as the initial step.

2.5.2 Hyper Parameter Tuning with K-Nearest Neighbour (KNN)

Having identified the best train/test split with the default parameter settings, the parameter values were varied in order to maximise the performance of the KNN model. Distance-based methods were used instead of the default 'uniform' parameter, with the p value varied over the range [0-2] so that the Minkowski, Manhattan and Euclidean distance-based methods were all in contention.

2.5.3 Decision Tree (DT)

After KNN, the data were separately trained using DT with default parameters and the same train/test splits as KNN (80%:20%, 60%:40% and 50%:50%).

2.5.4 Hyper Parameter Tuning with Decision Tree (DT)

The default parameter values were changed so that the maximum depth of the tree was 3, the minimum number of samples for a leaf node was 5, and the minimum number of samples for a split was 2.
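As a rough sketch of the Decision Tree settings above, the snippet below fits a default tree and a tuned tree on an 80%:20% split. It assumes scikit-learn and synthetic data; neither the library, the random seed nor the stratified split is stated in the report, so the accuracies it prints are not comparable with the reported results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 10 encoded features and the binary target.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 80%:20% train/test split (stratification is an assumption of this sketch).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Decision tree with default parameters.
default_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Decision tree with the tuned values quoted in Section 2.5.4.
tuned_tree = DecisionTreeClassifier(
    max_depth=3,
    min_samples_leaf=5,
    min_samples_split=2,
    random_state=0,
).fit(X_train, y_train)

default_accuracy = default_tree.score(X_test, y_test)
tuned_accuracy = tuned_tree.score(X_test, y_test)
print(default_accuracy, tuned_accuracy)
```

Limiting the depth and leaf size restricts how closely the tree can fit the training data, which is one common reason a tuned tree can generalise better than an unrestricted one.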
3. Results

The evaluation results of both classification models are discussed in this section.

3.1 Evaluation of K-Nearest Neighbour (KNN) with default parameters

Table 3.1.1. Classification Results of KNN model

| Classification Model | Train/Test Split | Accuracy |
|---|---|---|
| KNN | 50% - 50% | 75.1% |
| KNN | 60% - 40% | 75.63% |
| KNN | 80% - 20% | 76.45% |

The maximum accuracy was produced by the 80% - 20% train/test split, with an accuracy level of 76.45%.

3.2 Evaluation of Decision Tree (DT) with default parameters

Table 3.2.1. Classification Results of DT model

| Classification Model | Train/Test Split | Accuracy |
|---|---|---|
| DT | 50% - 50% | 78.42% |
| DT | 60% - 40% | 78.23% |
| DT | 80% - 20% | 79.5% |

The 80% - 20% train/test split provided the maximum accuracy for DT as well, with an accuracy level of 79.5%.
3.3 Evaluation of K-Nearest Neighbour (KNN) with Hyper Parameter Tuning

Table 3.3.1. Classification Results of KNN model with parameter tuning

| Classification Model | Accuracy |
|---|---|
| KNN (Minkowski) | 74.0% |
| KNN (Manhattan) | 74.7% |
| KNN (Euclidean) | 74.0% |

With parameter tuning, the Manhattan distance method gave the best result of 74.7% among the distance-based methods. However, it was not able to surpass the accuracy level obtained using the default parameter settings.

3.4 Evaluation of Decision Tree (DT) with Hyper Parameter Tuning

Table 3.4.1. Classification Results of DT model with parameter tuning

| Classification Model | Parameters | Accuracy |
|---|---|---|
| DT | Max depth=3, Min sample split=2, Min sample leaf=5 | 84.25% |

With the changed parameter values, the DT algorithm performed well on this particular dataset, with a higher accuracy rate of 84.25%.

3.5 Confusion matrix of KNN and DT

A confusion matrix is a measurement used to assess the performance of a model.

    [[1430   65]
     [ 355   50]]

The above confusion matrix was obtained from the KNN model with default parameters, which provided the best accuracy. From the matrix, the model predicts class value 0 as 0 in 1430 cases, while it predicts 0 as 1 in 65 cases. The model does not predict class value 1 well: it predicts 1 as 1 in only 50 cases, while it predicts 1 as 0 in 355 cases.

    [[1576   19]
     [ 296  109]]

The above matrix shows the results of the DT model with hyper parameter tuning, which had the best accuracy for that model. From the results, it is clear that DT also predicts class value 0 with great accuracy, while its prediction of class value 1 is considerably weaker.

3.6 Classification Error Rate of KNN and DT

In the simplest terms, classification error can be stated as 1 - accuracy. As assessed in the previous sections, the KNN and DT models had best accuracy levels of 76.45% and 84.25% respectively. Therefore, the error rate of the KNN model is 23.55%, whereas DT has an error rate of 15.75%.

3.7 Precision, Recall and F1-Score of KNN and DT

Precision, Recall and F1-Score are measures used to interpret a model's performance. Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. Recall is the ratio of correctly predicted positive observations to all observations in class 'yes'. F1-Score is the harmonic mean of Precision and Recall.

Figure 3.7.1. Performance measurements of optimized KNN

Figure 3.7.2. Performance measurements of optimized DT
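To make the definitions in Sections 3.6 and 3.7 concrete, the sketch below recomputes them for class 1 from the tuned DT confusion matrix quoted in Section 3.5. The function name `binary_metrics` is illustrative, and the row/column layout (rows are actual classes, columns are predicted classes) follows the way the matrices are read in the text.

```python
def binary_metrics(matrix):
    """Metrics for class 1 from a 2x2 matrix laid out as
    [[TN, FP], [FN, TP]] (rows: actual class, columns: predicted class)."""
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    precision = tp / (tp + fp)   # correct positives / predicted positives
    recall = tp / (tp + fn)      # correct positives / actual positives
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Confusion matrix of the tuned Decision Tree, as quoted in Section 3.5.
dt_matrix = [[1576, 19], [296, 109]]
dt_metrics = binary_metrics(dt_matrix)

for name, value in dt_metrics.items():
    print(f"{name}: {value:.4f}")
# accuracy: 0.8425
# error_rate: 0.1575
# precision: 0.8516
# recall: 0.2691
# f1: 0.4090
```

The accuracy and error rate reproduce the 84.25% and 15.75% reported for the tuned DT model.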
Having analysed all the performance measurements, it can be stated that both algorithms work better for class 0 and are not ideal for predicting class 1, although DT has somewhat more capability there than KNN.

4. Discussion

Having assessed the results for both classification models, the Decision Tree classifier can be selected as the best model for this project, with an accuracy rate of 84.25% surpassing the 76.45% accuracy level of the K-Nearest Neighbour classifier. This can be considered a very high accuracy rate given the data. The only drawback found in the model is its inability to predict class 1 with high accuracy. The Precision value of the class confirmed this with a lower value compared to class 0. Class 1 of the target feature relates to the customers who are going to churn, so this model works well for predicting the customers who are not going to move from this bank to another.

5. Conclusion

In this report, a comparative study between the K-Nearest Neighbours classifier and the Decision Tree classifier was carried out to predict the customers who are going to move to another bank, using banking data from the Kaggle repository. The Decision Tree model with parameter tuning was selected as the best model based on the accuracy results and performance measures. The selected model lacks the ability to predict customers who are going to churn, while performing well for customers who are not going to churn. This can be further improved by feeding in more data related to that particular class over time.

6. References

[1] Hadden, J., Tiwari, A., Roy, R., & Ruta, D. (2007). Computer assisted customer churn management: State-of-the-art and future trends. Computers & Operations Research, 34(10), 2902-2917.

[2] Sayed, H., Abdel-Fattah, M. A., & Kholief, S. (2018). Predicting Potential Banking Customer Churn using Apache Spark ML and MLlib Packages: A Comparative Study. International Journal of Advanced Computer Science and Applications, 9(11), 674-677.