E-Commerce Fraud Machine Learning Models.pptx

E-COMMERCE
FRAUD-MACHINE
LEARNING
MODELS
Ximena Bustamante

INTRODUCTION
According to Statista, “e-commerce losses to online payment fraud were
estimated at 41 billion U.S. dollars globally in 2022, up from the previous
year. The figure is expected to grow further to 48 billion U.S. dollars by
2023” (Statista, “Value of e-commerce losses to online payment fraud
worldwide from 2020 to 2023”).
Machine learning algorithms are often used to identify potentially
fraudulent transactions
Come explore two models with me, logistic regression and decision trees,
which were used to identify variables significantly correlated with fraud
2023 E-Commerce Fraud Machine Learning Models-Ximena
Bustamante
2

DATASET
Variables
customerEmail
Multiple
Duplicated
customerPhone
customerDevice
customerIPAddress
customerBillingAddress
No_Transactions
No_Orders
No_Payments
transactionId
orderId
paymentMethodId
paymentMethodRegistrationFailure
paymentMethodType
paymentMethodProvider
transactionAmount
transactionFailed
orderState
Fraud

KEY INSIGHTS
SIGNIFICANT VARIABLES
• The dataset consisted of 19 variables: 1 dependent variable (Fraud) and 18 independent variables
• Only 7 of the 18 independent variables were found to be significant, and the algorithms were run on these 7

KEY INSIGHTS
LOGISTIC REGRESSION
• A logistic regression model was created with one dependent variable (fraud: Y/N) and 7 independent variables
• The confusion matrix used to evaluate it on the test data showed a highly accurate model
• As seen in the confusion matrix on the slide, it achieved 88% accuracy, 85% sensitivity, 91% specificity, 90% precision and 87% negative predictive value (with fraud as the positive class)
• Out of 65 non-fraud transactions in the test data, it correctly identified 59
• Out of 64 fraud transactions in the test data, it correctly identified 55
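The percentages above follow directly from the two counts on this slide. As a minimal sketch (written in Python for illustration, not the R code used in the project; the function name `confusion_metrics` is hypothetical), they can be recomputed like this:

```python
def confusion_metrics(tp, fn, tn, fp):
    """Return the standard metrics derived from a 2x2 confusion matrix."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # share of actual positives found
        "specificity": tn / (tn + fp),   # share of actual negatives found
        "precision": tp / (tp + fp),     # share of positive calls that are right
        "npv": tn / (tn + fn),           # share of negative calls that are right
    }

# Counts from the slide, with fraud as the positive class:
# 55 of 64 fraud and 59 of 65 non-fraud test transactions were correct.
logit = confusion_metrics(tp=55, fn=64 - 55, tn=59, fp=65 - 59)
for name, value in logit.items():
    print(f"{name}: {value:.3f}")
```

Sensitivity works out to 55/64, about 85.9%, which the slide shows as 85%; the other four values round to the figures quoted above.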

KEY INSIGHTS
DECISION TREES
• A decision tree model was also created with the same dependent and independent variables
• The confusion matrix used to evaluate it also showed a highly accurate model
• This model achieved 96% sensitivity, 83% specificity, 85% positive predictive value and 95% negative predictive value (these figures treat non-fraud as the positive class, so 96% is the share of non-fraud transactions identified correctly and 83% the share of fraud transactions)
• Out of a total of 260 non-fraud transactions, it correctly identified 249
• Out of a total of 257 fraud transactions, it correctly identified 213
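Which class counts as "positive" changes which number is called sensitivity. The sketch below (Python for illustration only; the helper `rates` is hypothetical) recomputes the decision tree figures from the two counts on this slide under both conventions:

```python
# Counts from the slide for the decision tree.
correct = {"non-fraud": 249, "fraud": 213}
total = {"non-fraud": 260, "fraud": 257}

def rates(positive, negative):
    """Sensitivity, specificity, PPV and NPV when `positive` is the positive class."""
    tp, fn = correct[positive], total[positive] - correct[positive]
    tn, fp = correct[negative], total[negative] - correct[negative]
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }

non_fraud_positive = rates("non-fraud", "fraud")   # matches the slide's figures
fraud_positive = rates("fraud", "non-fraud")       # same model, labels swapped
accuracy = sum(correct.values()) / sum(total.values())

for name in non_fraud_positive:
    print(f"{name}: {non_fraud_positive[name]:.3f} vs {fraud_positive[name]:.3f}")
print(f"accuracy: {accuracy:.3f}")
```

With fraud as the positive class, the same model has about 83% sensitivity and 96% specificity, which is the convention to use when comparing it against the logistic regression slide. Accuracy is unaffected by the choice.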

DATA PROCESS-ACQUISITION,
PREPARATION, ANALYSIS AND
VISUALIZATION

DATA ACQUISITION, PREPARATION AND
ANALYSIS
EXCEL & ACCESS
• Data was acquired from Kaggle, and the analysis was inspired by University of Illinois Urbana-Champaign Professor Hudson (Machine Learning Algorithms with R in Business Analytics)
• Tables with transaction data and customer data were initially joined in Access and then explored in Excel
• Initial exploration of the data revealed that some customers had multiple e-mail addresses associated with them
• This led to a new binary variable being created to flag transactions for customers with MULTIPLE emails
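The MULTIPLE flag was built in Access and Excel. As a rough Python equivalent of the same idea, the sketch below counts distinct e-mails per customer and flags every row of a customer with more than one. The `customerId` key and the sample rows are assumptions for illustration; the deck does not say which field identified a customer.

```python
from collections import defaultdict

# Hypothetical joined rows; `customerId` is an assumed customer key.
transactions = [
    {"customerId": "c1", "customerEmail": "a@example.com"},
    {"customerId": "c1", "customerEmail": "b@example.com"},
    {"customerId": "c2", "customerEmail": "c@example.com"},
    {"customerId": "c2", "customerEmail": "c@example.com"},
]

# Collect the distinct e-mails seen for each customer.
emails_per_customer = defaultdict(set)
for row in transactions:
    emails_per_customer[row["customerId"]].add(row["customerEmail"])

# Flag each transaction: 1 if its customer has more than one e-mail, else 0.
for row in transactions:
    row["Multiple"] = int(len(emails_per_customer[row["customerId"]]) > 1)

multiple_flags = [row["Multiple"] for row in transactions]
print(multiple_flags)
```

Here customer `c1` has two different e-mails, so both of its rows are flagged; `c2` repeats the same e-mail and is not.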

ANALYSIS
POWER BI
• Power Query in Power BI was used to conduct a more in-depth analysis of the variables
• The “Column Distribution” view made it evident that some IP addresses, devices and billing addresses were being used by multiple customers (DUPLICATED)
• Thus, a new “Duplicated” column was created to flag these transactions
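The “Duplicated” column was created in Power Query. A hedged Python sketch of the same logic is shown below: a row is flagged when its IP address, device or billing address also appears under a different customer. The `customerId` key and the sample values are hypothetical; only the three attribute names come from the dataset.

```python
from collections import defaultdict

SHARED_FIELDS = ("customerIPAddress", "customerDevice", "customerBillingAddress")

# Hypothetical rows; `customerId` is an assumed customer key.
rows = [
    {"customerId": "c1", "customerIPAddress": "10.0.0.1",
     "customerDevice": "d1", "customerBillingAddress": "1 Main St"},
    {"customerId": "c2", "customerIPAddress": "10.0.0.1",
     "customerDevice": "d2", "customerBillingAddress": "2 Oak Ave"},
    {"customerId": "c3", "customerIPAddress": "10.0.0.3",
     "customerDevice": "d3", "customerBillingAddress": "3 Elm Rd"},
]

# Map each (field, value) pair to the set of customers seen with it.
customers_by_value = defaultdict(set)
for row in rows:
    for field in SHARED_FIELDS:
        customers_by_value[(field, row[field])].add(row["customerId"])

# Flag a row if any of its three attributes is shared by more than one customer.
for row in rows:
    row["Duplicated"] = int(
        any(len(customers_by_value[(f, row[f])]) > 1 for f in SHARED_FIELDS)
    )

duplicated_flags = [row["Duplicated"] for row in rows]
print(duplicated_flags)
```

Customers `c1` and `c2` share an IP address, so both rows are flagged, while `c3` shares nothing and is not.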

ANALYSIS
R STUDIO
• RStudio was used to create the 2 Machine Learning (ML) algorithms
• For the complete code, please visit my GitHub repository
• To create both ML models, I loaded the necessary libraries, converted strings to factors, created a confusion matrix, visualized the balance of the dataset, split the data into training and testing sets, trained the models and then evaluated them on the test data, made predictions, and finally used a confusion matrix to measure accuracy
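The actual models were built in R, and the full code lives in the GitHub repository. Purely to illustrate the shape of the workflow described above, here is a small self-contained Python sketch. Everything in it is an assumption: the data is synthetic, only two of the flag columns are used, the 75/25 split ratio is arbitrary, and logistic regression is fitted with plain batch gradient descent rather than R's model-fitting functions. The decision tree is omitted for brevity.

```python
import math
import random
from collections import Counter

rng = random.Random(42)

# Synthetic stand-in for the real data: two binary flags and a "Y"/"N" label.
rows = []
for _ in range(200):
    fraud = rng.random() < 0.5
    p_flag = 0.8 if fraud else 0.2
    rows.append({
        "Duplicated": int(rng.random() < p_flag),
        "Multiple": int(rng.random() < p_flag),
        "Fraud": "Y" if fraud else "N",
    })

# Encode the string label as 0/1 (the analogue of converting strings to factors).
X = [[r["Duplicated"], r["Multiple"]] for r in rows]
y = [1 if r["Fraud"] == "Y" else 0 for r in rows]
balance = Counter(y)  # class balance check before modelling

# Shuffle and split 75/25 into training and test sets.
idx = list(range(len(rows)))
rng.shuffle(idx)
cut = int(0.75 * len(idx))
train, test = idx[:cut], idx[cut:]

def predict_proba(w, b, x):
    """Logistic model: probability that a row is fraud."""
    z = b + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

# Train with batch gradient descent on the log-loss.
w, b, lr = [0.0, 0.0], 0.0, 0.5
for _ in range(300):
    grad_w, grad_b = [0.0, 0.0], 0.0
    for i in train:
        err = predict_proba(w, b, X[i]) - y[i]
        grad_b += err
        for j in range(2):
            grad_w[j] += err * X[i][j]
    b -= lr * grad_b / len(train)
    w = [wj - lr * gj / len(train) for wj, gj in zip(w, grad_w)]

# Predict on the held-out test set and tabulate the confusion matrix.
cm = Counter()
for i in test:
    pred = int(predict_proba(w, b, X[i]) > 0.5)
    cm[(y[i], pred)] += 1  # key is (actual, predicted)

accuracy = (cm[(1, 1)] + cm[(0, 0)]) / len(test)
print(dict(balance), dict(cm), f"accuracy={accuracy:.2f}")
```

The confusion matrix is keyed by (actual, predicted), so `cm[(1, 1)]` counts fraud rows correctly flagged and `cm[(0, 1)]` counts false alarms.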

DATA VISUALIZATION
POWER BI
• Power BI was used to create a map to show the geographical location of all transactions, color coded by fraud and non-fraud

CHALLENGES AND COOL TECHNIQUES

• Challenge: High number of correlated variables
• Cool Technique: Feature engineering. Created two columns (with binary values) to flag transactions that had duplicated/multiple addresses, phone numbers and IP addresses, instead of creating one column for each
• Challenge: Unbalanced dataset
• Cool Technique: Balanced it using RUS (random undersampling) to create a dataset with roughly the same number of fraud and non-fraud transactions
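Random undersampling keeps all rows of the smaller class and randomly drops rows of the larger class. A minimal Python sketch of the idea follows; the function name `random_undersample`, the seed and the 90/10 sample data are assumptions for illustration, not the R code used in the project.

```python
import random

def random_undersample(rows, label_key, seed=0):
    """Randomly drop majority-class rows until every class has equal counts."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    n_min = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        # Sample without replacement; the minority class is kept whole.
        balanced.extend(rng.sample(group, n_min))
    rng.shuffle(balanced)
    return balanced

# Hypothetical imbalanced data: 10 fraud rows and 90 non-fraud rows.
rows = [{"id": i, "Fraud": "Y" if i < 10 else "N"} for i in range(100)]
balanced = random_undersample(rows, "Fraud")
counts = {label: sum(r["Fraud"] == label for r in balanced) for label in ("Y", "N")}
print(counts)
```

This version balances the classes exactly, whereas the project's dataset ended up only roughly balanced. The trade-off of the technique is that most majority-class rows are discarded.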
What If I had More Time?
• If I had more time, I would have done a social network analysis to see how transactions may relate to one another

THANK YOU FOR
CHECKING OUT MY
PROJECT!
• Follow me for more project ideas
• If you have any questions, comments, feedback, JOB OFFERS, feel free to DM me
