This presentation, delivered by a Boston Institute of Analytics student, explores the critical world of credit card fraud detection. Join us as we delve into the various methods and technologies employed to identify and prevent fraudulent activity, safeguarding both consumers and financial institutions.
The presentation covers:
The ever-evolving landscape of credit card fraud
Techniques for analyzing transaction data
Machine learning's role in fraud detection
Strategies for preventing fraudulent transactions
This session is ideal for anyone interested in data analytics, cybersecurity, or the financial services industry. Visit https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/ for more data science insights.
3. What is Credit Card Fraud?
Examples:
Insider Fraud
Phishing
Skimming
Identity Theft
Credit card fraud is the unauthorized use of someone's card for purchases, causing financial loss and inconvenience.
“To predict whether transactions are fraudulent or not”
4. Workflow
Data Acquisition & Description
Data Preprocessing
Exploratory Data Analysis
Data Preparation
Model Selection and Model Training
Conclusion
5. Credit Card Fraud Detection Dataset 2023
Dataset: 568,630 rows, 31 columns
Features (Columns):
id: Unique identifier for each transaction
V1-V28: Anonymized features representing various transaction attributes (e.g., time, location, etc.)
Amount: The transaction amount
Class: Binary label indicating whether the transaction is fraudulent (1) or not (0)
🤷‍♂️ No missing values.
🤷‍♂️ No duplicates.
👍 Data types also look fine.
Data preprocessing
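The preprocessing checks above (no missing values, no duplicates) can be sketched with pandas. The DataFrame here is a tiny synthetic stand-in for the real 568,630-row dataset, with made-up values:

```python
import pandas as pd

# Synthetic stand-in for the 2023 fraud dataset (the real file has
# 568,630 rows and 31 columns: id, V1-V28, Amount, Class).
df = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "V1": [0.1, -0.2, 0.3, 0.05],
    "Amount": [12.5, 99.0, 3.2, 57.1],
    "Class": [0, 1, 0, 0],
})

n_missing = int(df.isnull().sum().sum())   # total missing cells
n_duplicates = int(df.duplicated().sum())  # fully duplicated rows
print(n_missing, n_duplicates)             # 0 0 for this clean frame
```

Both counts come back zero here, mirroring what the slides report for the full dataset.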
6. A well-structured dataset with 568,630 rows and 31 columns, featuring no null values and a balanced class distribution, provides a reliable foundation for in-depth analysis across various domains.
“Enhanced security, reduced financial losses, and improved customer trust through identification of fraudulent credit card transactions”
7. EDA: Exploratory Data Analysis
Exploratory Data Analysis (EDA) is a statistical approach to analyzing and visualizing data sets, helping to discover patterns, relationships, and insights for better understanding and decision-making.
df.info(): provides concise information about a DataFrame, including data types, non-null counts, and memory usage.
df.describe(): summarizes DataFrame statistics such as mean, standard deviation, and quartiles, offering insight into the distribution and central tendencies of numerical data.
df.shape: displays the number of rows and columns in a DataFrame.
df.dtypes: shows the data type of each column in a DataFrame (note: dtypes is an attribute, not a method).
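The four inspection calls described above can be demonstrated on a small synthetic frame (made-up values standing in for the real dataset):

```python
import pandas as pd

# Synthetic stand-in frame; the real dataset is 568,630 rows x 31 columns.
df = pd.DataFrame({
    "id": [0, 1, 2],
    "V1": [0.1, -0.2, 0.3],
    "Amount": [12.5, 99.0, 3.2],
    "Class": [0, 1, 0],
})

df.info()              # dtypes, non-null counts, memory usage (printed)
stats = df.describe()  # mean, std, quartiles for numeric columns
shape = df.shape       # (rows, columns) tuple
dtypes = df.dtypes     # data type of each column
print(shape)           # (3, 4)
```

On the full dataset, df.shape would return (568630, 31), matching the counts on the previous slide.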
8. Observations from the correlation heatmap:
1. V17 and V18 are highly correlated.
2. V16 and V17 are highly correlated.
3. V9 and V10 are also positively correlated.
4. V14 has a negative correlation with V4.
A heatmap visually represents data intensity using color variations, with warmer colors indicating higher values and cooler colors indicating lower values.
import matplotlib.pyplot as plt
import seaborn as sns
# Let's look at the data in a heatmap
paper = plt.figure(figsize=[20, 12])
sns.heatmap(df.corr(), cmap='BuPu', annot=True)
plt.title('Correlation Heatmap', color='red')
plt.show()
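Pairwise observations like the ones above can also be pulled out programmatically by ranking the upper triangle of df.corr(). A minimal sketch on synthetic data (column names V4/V16/V17/V18 mirror the slide; the values are made up so that V16, V17, and V18 move together):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=500)
df = pd.DataFrame({
    "V16": base + rng.normal(scale=0.1, size=500),
    "V17": base + rng.normal(scale=0.1, size=500),
    "V18": base + rng.normal(scale=0.1, size=500),
    "V4": rng.normal(size=500),  # independent column
})

corr = df.corr()
# Keep only the upper triangle so each pair appears once, then rank.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()
top = pairs.abs().sort_values(ascending=False).head(3)
print(top)  # the three V16/V17/V18 pairs dominate
```

On the real dataset this would surface the V16/V17 and V17/V18 pairs noted on the slide.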
10. df['Amount'].plot.box()
A box plot, or box-and-whisker plot, displays the distribution of a dataset,
showing the median, quartiles, and outliers. It provides a visual summary of
central tendency and spread.
11. # Assuming 'df' is a DataFrame and 'Amount' is a column in it
sns.kdeplot(data=df['Amount'], fill=True)
plt.show()
A KDE (Kernel Density Estimate) plot depicts the probability density function of a
continuous variable, smoothing data distribution visually.
Observations: Amount is fairly normally distributed.
12. # Let's plot a histogram
paper, axes = plt.subplots(2, 2, figsize=(10, 6))
df['V1'].plot(kind='hist', ax=axes[0,0], title='Distribution of V1')
df['V10'].plot(kind='hist', ax=axes[0,1], title='Distribution of V10')
df['V12'].plot(kind='hist', ax=axes[1,0], title='Distribution of V12')
df['V23'].plot(kind='hist', ax=axes[1,1], title='Distribution of V23')
plt.suptitle('Distribution of V1,V10,V12 and V23',size=14)
plt.tight_layout()
plt.show()
13. Data Preparation
Dividing the dataset into "X" and "Y"
Shape of X: (568630, 29)
Shape of Y: (568630,)
Feature Scaling
Let's standardize all our features to bring them onto the same scale.
# I have used StandardScaler
sc = StandardScaler()
x_scaled = sc.fit_transform(x)
x_scaled_df = pd.DataFrame(x_scaled, columns=x.columns)
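The X/Y split itself isn't shown on the slide. A minimal sketch, assuming X is every column except id and the Class target (synthetic rows stand in for the real data):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in rows; real X is (568630, 29), real Y is (568630,).
df = pd.DataFrame({
    "id": range(6),
    "V1": [0.1, -0.2, 0.3, 0.05, -0.4, 0.2],
    "Amount": [12.5, 99.0, 3.2, 57.1, 8.8, 20.0],
    "Class": [0, 1, 0, 0, 1, 0],
})

x = df.drop(columns=["id", "Class"])  # feature matrix X
y = df["Class"]                       # target vector Y

# Standardize features so they share the same scale.
sc = StandardScaler()
x_scaled = sc.fit_transform(x)
x_scaled_df = pd.DataFrame(x_scaled, columns=x.columns)
print(x_scaled_df.shape, y.shape)  # (6, 2) (6,)
```

After scaling, each feature column has mean 0 and unit variance, which keeps large-valued columns like Amount from dominating distance-based comparisons.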
14. Model Selection and Model Training
Dividing the dataset into training data and testing data
# Let's split our dataset into train and test
x_train, x_test, y_train, y_test = train_test_split(
    x_scaled_df, y, test_size=0.25, random_state=15, stratify=y)
Decision Tree Model
Model classification report
Accuracy Score: 99.80022228787687
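The split-and-fit pipeline above can be reproduced end to end on synthetic, easily separable data. The accuracy figure on the slide comes from the actual dataset; the numbers here are purely illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

rng = np.random.default_rng(15)
n = 400
# Two well-separated clusters so the tree can classify them cleanly.
x = pd.DataFrame({
    "V1": np.r_[rng.normal(-2, 1, n // 2), rng.normal(2, 1, n // 2)],
    "Amount": rng.normal(0, 1, n),
})
y = pd.Series([0] * (n // 2) + [1] * (n // 2), name="Class")

# Stratified 75/25 split, mirroring the slide's parameters.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=15, stratify=y)

model = DecisionTreeClassifier(random_state=15)
model.fit(x_train, y_train)
acc = accuracy_score(y_test, model.predict(x_test))
print(classification_report(y_test, model.predict(x_test)))
```

stratify=y keeps the fraud/non-fraud ratio identical in both splits, which matters whenever the classes are not perfectly balanced.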
17. Conclusion
• We performed Exploratory Data Analysis on the different features.
• We prepared our data and built different ML models.
• We compared two different models and how they perform with respect to accuracy and precision.
• The Decision Tree model has a higher accuracy score on the test dataset.
• We created a confusion matrix to examine the prediction accuracy of each model in detail.
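The confusion matrix mentioned in the last bullet can be produced with scikit-learn. The labels here are hypothetical, standing in for the real test-set predictions:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and model predictions.
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_pred = [0, 0, 1, 0, 0, 1, 0, 1]

cm = confusion_matrix(y_true, y_pred)  # rows: actual, cols: predicted
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # 4 0 1 3
```

For fraud detection, the false-negative cell (fraud predicted as legitimate) is usually the costliest, so it deserves attention beyond the headline accuracy score.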