This presentation, delivered by a Boston Institute of Analytics student, explores the critical world of credit card fraud detection. Join us as we delve into the various methods and technologies employed to identify and prevent fraudulent activity, safeguarding both consumers and financial institutions.
The presentation covers:
The ever-evolving landscape of credit card fraud
Techniques for analyzing transaction data
Machine learning's role in fraud detection
Strategies for preventing fraudulent transactions
This session is ideal for anyone interested in data analytics, cybersecurity, or the financial services industry. Visit https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/ for more data science insights.
3. What is Credit Card Fraud?
Examples:
Insider Fraud
Phishing
Skimming
Identity Theft
Credit card fraud is the unauthorized use of someone's card for purchases, causing financial loss and inconvenience.
“To predict whether transactions are fraudulent or not”
4. Workflow
Data Acquisition & Description
Data Preprocessing
Exploratory Data Analysis
Data Preparation
Model Selection and Model Training
Conclusion
5. Credit Card Fraud Detection Dataset 2023
Dataset: 568,630 rows, 31 columns
Features (Columns):
id: Unique identifier for each transaction
V1-V28: Anonymized features representing various transaction attributes (e.g., time, location, etc.)
Amount: The transaction amount
Class: Binary label indicating whether the transaction is fraudulent (1) or not (0)
🤷‍♂️ No missing values.
🤷‍♂️ No duplicates.
👍 Data types also look fine.
Data preprocessing
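The preprocessing checks above (no missing values, no duplicates) can be sketched with pandas. The DataFrame here is a tiny synthetic stand-in for the real 568,630-row dataset, with made-up values:

```python
import pandas as pd

# Synthetic stand-in for the 2023 fraud dataset (the real file has
# 568,630 rows and 31 columns: id, V1-V28, Amount, Class).
df = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "V1": [0.1, -0.2, 0.3, 0.05],
    "Amount": [12.5, 99.0, 3.2, 57.1],
    "Class": [0, 1, 0, 0],
})

n_missing = int(df.isnull().sum().sum())   # total missing cells
n_duplicates = int(df.duplicated().sum())  # fully duplicated rows
print(n_missing, n_duplicates)             # 0 0 for this clean frame
```

Both counts come back zero here, mirroring what the slides report for the full dataset.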
6. A well-structured dataset with 568,630 rows and 31 columns, featuring no null values and a balanced class distribution, provides a reliable foundation for in-depth analysis across various domains.
“Enhanced security, reduced financial losses, and improved customer trust through identification of fraudulent credit card transactions”
7. EDA: Exploratory Data Analysis
Exploratory Data Analysis (EDA) is a statistical approach to analyzing and visualizing data sets, helping to discover patterns, relationships, and insights for better understanding and decision-making.
df.info(): provides concise information about a DataFrame, including data types, non-null counts, and memory usage.
df.describe(): summarizes DataFrame statistics such as mean, standard deviation, and quartiles, offering insight into the distribution and central tendencies of numerical data.
df.shape: displays the number of rows and columns in a DataFrame.
df.dtypes: shows the data type of each column in a DataFrame (note: dtypes is an attribute, not a method).
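The four inspection calls described above can be demonstrated on a small synthetic frame (made-up values standing in for the real dataset):

```python
import pandas as pd

# Synthetic stand-in frame; the real dataset is 568,630 rows x 31 columns.
df = pd.DataFrame({
    "id": [0, 1, 2],
    "V1": [0.1, -0.2, 0.3],
    "Amount": [12.5, 99.0, 3.2],
    "Class": [0, 1, 0],
})

df.info()              # dtypes, non-null counts, memory usage (printed)
stats = df.describe()  # mean, std, quartiles for numeric columns
shape = df.shape       # (rows, columns) tuple
dtypes = df.dtypes     # data type of each column
print(shape)           # (3, 4)
```

On the full dataset, df.shape would return (568630, 31), matching the counts on the previous slide.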
8. Observations from the correlation heatmap:
1. V17 and V18 are highly correlated.
2. V16 and V17 are highly correlated.
3. V9 and V10 are also positively correlated.
4. V14 has a negative correlation with V4.
A heatmap visually represents data intensity using color variations, with warmer colors indicating higher values and cooler colors indicating lower values.
import matplotlib.pyplot as plt
import seaborn as sns
# Let's look at the data in a heatmap
paper = plt.figure(figsize=[20, 12])
sns.heatmap(df.corr(), cmap='BuPu', annot=True)
plt.title('Correlation Heatmap', color='red')
plt.show()
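Pairwise observations like the ones above can also be pulled out programmatically by ranking the upper triangle of df.corr(). A minimal sketch on synthetic data (column names V4/V16/V17/V18 mirror the slide; the values are made up so that V16, V17, and V18 move together):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=500)
df = pd.DataFrame({
    "V16": base + rng.normal(scale=0.1, size=500),
    "V17": base + rng.normal(scale=0.1, size=500),
    "V18": base + rng.normal(scale=0.1, size=500),
    "V4": rng.normal(size=500),  # independent column
})

corr = df.corr()
# Keep only the upper triangle so each pair appears once, then rank.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()
top = pairs.abs().sort_values(ascending=False).head(3)
print(top)  # the three V16/V17/V18 pairs dominate
```

On the real dataset this would surface the V16/V17 and V17/V18 pairs noted on the slide.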
10. df['Amount'].plot.box()
A box plot, or box-and-whisker plot, displays the distribution of a dataset,
showing the median, quartiles, and outliers. It provides a visual summary of
central tendency and spread.
11. # Assuming 'df' is a DataFrame and 'Amount' is a column in it
sns.kdeplot(data=df['Amount'], fill=True)
plt.show()
A KDE (Kernel Density Estimate) plot depicts the probability density function of a
continuous variable, smoothing data distribution visually.
Observations: Amount is fairly normally distributed.
12. # Let's plot a histogram
paper, axes = plt.subplots(2, 2, figsize=(10, 6))
df['V1'].plot(kind='hist', ax=axes[0,0], title='Distribution of V1')
df['V10'].plot(kind='hist', ax=axes[0,1], title='Distribution of V10')
df['V12'].plot(kind='hist', ax=axes[1,0], title='Distribution of V12')
df['V23'].plot(kind='hist', ax=axes[1,1], title='Distribution of V23')
plt.suptitle('Distribution of V1,V10,V12 and V23',size=14)
plt.tight_layout()
plt.show()
13. Data Preparation
Dividing the dataset into "X" and "Y"
Shape of X: (568630, 29)
Shape of Y: (568630,)
Feature Scaling
Let's standardize all our features to bring them onto the same scale.
# I have used StandardScaler
sc = StandardScaler()
x_scaled = sc.fit_transform(x)
x_scaled_df = pd.DataFrame(x_scaled, columns=x.columns)
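The X/Y split itself isn't shown on the slide. A minimal sketch, assuming X is every column except id and the Class target (synthetic rows stand in for the real data):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in rows; real X is (568630, 29), real Y is (568630,).
df = pd.DataFrame({
    "id": range(6),
    "V1": [0.1, -0.2, 0.3, 0.05, -0.4, 0.2],
    "Amount": [12.5, 99.0, 3.2, 57.1, 8.8, 20.0],
    "Class": [0, 1, 0, 0, 1, 0],
})

x = df.drop(columns=["id", "Class"])  # feature matrix X
y = df["Class"]                       # target vector Y

# Standardize features so they share the same scale.
sc = StandardScaler()
x_scaled = sc.fit_transform(x)
x_scaled_df = pd.DataFrame(x_scaled, columns=x.columns)
print(x_scaled_df.shape, y.shape)  # (6, 2) (6,)
```

After scaling, each feature column has mean 0 and unit variance, which keeps large-valued columns like Amount from dominating distance-based comparisons.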
14. Model Selection and Model Training
Dividing the dataset into training data and testing data
# Let's split our dataset into train and test
x_train, x_test, y_train, y_test = train_test_split(
    x_scaled_df, y, test_size=0.25, random_state=15, stratify=y)
Decision Tree Model
Model classification report
Accuracy Score: 99.80022228787687
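The split-and-fit pipeline above can be reproduced end to end on synthetic, easily separable data. The accuracy figure on the slide comes from the actual dataset; the numbers here are purely illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

rng = np.random.default_rng(15)
n = 400
# Two well-separated clusters so the tree can classify them cleanly.
x = pd.DataFrame({
    "V1": np.r_[rng.normal(-2, 1, n // 2), rng.normal(2, 1, n // 2)],
    "Amount": rng.normal(0, 1, n),
})
y = pd.Series([0] * (n // 2) + [1] * (n // 2), name="Class")

# Stratified 75/25 split, mirroring the slide's parameters.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=15, stratify=y)

model = DecisionTreeClassifier(random_state=15)
model.fit(x_train, y_train)
acc = accuracy_score(y_test, model.predict(x_test))
print(classification_report(y_test, model.predict(x_test)))
```

stratify=y keeps the fraud/non-fraud ratio identical in both splits, which matters whenever the classes are not perfectly balanced.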
17. Conclusion
• We performed Exploratory Data Analysis on the different features.
• We prepared our data and built different ML models.
• We compared two different models and how they perform with respect to accuracy and precision.
• The Decision Tree model has a higher accuracy score on the test dataset.
• We created a confusion matrix to examine the prediction accuracy of each model in detail.
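The confusion matrix mentioned in the last bullet can be produced with scikit-learn. The labels here are hypothetical, standing in for the real test-set predictions:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and model predictions.
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_pred = [0, 0, 1, 0, 0, 1, 0, 1]

cm = confusion_matrix(y_true, y_pred)  # rows: actual, cols: predicted
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # 4 0 1 3
```

For fraud detection, the false-negative cell (fraud predicted as legitimate) is usually the costliest, so it deserves attention beyond the headline accuracy score.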