2. INTRODUCTION
According to Statista, “e-commerce losses to online payment fraud were estimated at 41 billion U.S. dollars globally in 2022, up from the previous year. The figure is expected to grow further to 48 billion U.S. dollars by 2023” (Statista, “Value of e-commerce losses to online payment fraud worldwide from 2020 to 2023”).
Machine learning algorithms are often used to identify potentially fraudulent transactions.
Come explore with me two models, logistic regression and decision trees, that were used to identify variables significantly correlated with fraud.
2023 E-Commerce Fraud Machine Learning Models-Ximena
Bustamante
5. KEY INSIGHTS
SIGNIFICANT VARIABLES
• The dataset consisted of 19 variables
• Of these, 1 was the dependent variable and 18 were independent variables; only 7 of the independent variables were found to be significant, and the algorithms were run on these
6. KEY INSIGHTS
LOGISTIC REGRESSION
• A logistic regression model was created with one dependent variable (fraud: Y/N) and 7 independent variables
• It resulted in a highly accurate model according to the confusion matrix used to evaluate it
• As seen in the image on the right, it achieved 88% accuracy, 85% sensitivity, 91% specificity, 90% precision and 87% negative predictive value
• Out of 65 non-fraud transactions in the test data, it correctly identified 59
• Out of 64 fraud transactions in the test data, it correctly identified 55
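The reported percentages can be checked from the two counts above. The following is a minimal Python sketch, not the author's R code; the function name `confusion_metrics` is made up for illustration, and fraud is assumed to be the positive class because that assumption matches the reported figures.

```python
def confusion_metrics(tp, fn, tn, fp):
    """Return the standard confusion-matrix metrics as fractions."""
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),   # share of fraud transactions caught
        "specificity": tn / (tn + fp),   # share of non-fraud transactions cleared
        "precision": tp / (tp + fp),     # share of fraud flags that were correct
        "npv": tn / (tn + fn),           # share of non-fraud calls that were correct
    }

# Counts from the slide: 55 of 64 fraud and 59 of 65 non-fraud identified correctly.
# Assumption: fraud is the positive class.
metrics = confusion_metrics(tp=55, fn=64 - 55, tn=59, fp=65 - 59)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Sensitivity works out to 55/64, or about 85.9%, which the slide shows as 85%; the other four values round to the figures on the slide.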
7. KEY INSIGHTS
DECISION TREES
• A decision tree model was also created with the same dependent and independent variables
• It also resulted in a highly accurate model according to the confusion matrix used to evaluate it
• This model achieved 96% sensitivity, 83% specificity, 85% positive predictive value and 95% negative predictive value
• Out of a total of 260 non-fraud transactions, it correctly identified 249
• Out of a total of 257 fraud transactions, it correctly identified 213
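The slide does not say which class was treated as positive. A short Python sketch (the helper `rates` is hypothetical, not part of the author's code) shows that the reported figures line up with the counts when non-fraud is taken as the positive class, and that swapping the positive class swaps sensitivity with specificity and positive with negative predictive value.

```python
def rates(correct_pos, total_pos, correct_neg, total_neg):
    """Sensitivity, specificity, PPV and NPV for a chosen positive class."""
    tp, fn = correct_pos, total_pos - correct_pos
    tn, fp = correct_neg, total_neg - correct_neg
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }

# Slide counts: 249 of 260 non-fraud and 213 of 257 fraud identified correctly.
non_fraud_positive = rates(249, 260, 213, 257)  # matches the slide's percentages
fraud_positive = rates(213, 257, 249, 260)      # same counts, roles swapped

for label, result in [("non-fraud positive", non_fraud_positive),
                      ("fraud positive", fraud_positive)]:
    print(label, {name: f"{value:.0%}" for name, value in result.items()})
```

Read with fraud as the class of interest, the tree catches about 83% of fraud transactions and clears about 96% of non-fraud ones.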
9. DATA ACQUISITION, PREPARATION AND ANALYSIS
EXCEL & ACCESS
• Data was acquired from Kaggle, and the analysis was inspired by Professor Hudson of the University of Illinois Urbana-Champaign (Machine Learning Algorithms with R in Business Analytics)
• Tables with transaction data and customer data were initially joined in Access and then explored in Excel
• Initial exploration of the data led to the identification of multiple customer e-mails associated with one customer
• This led to a new variable of binary values being created to reflect transactions for customers with MULTIPLE emails
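The join and the MULTIPLE-emails flag described above were done in Access and Excel. As a rough illustration only, the same idea can be sketched in plain Python; the rows and the column names (`customer_id`, `email`, `transaction_id`, `multiple_emails`) are invented for this example and are not taken from the Kaggle dataset.

```python
from collections import defaultdict

# Hypothetical toy rows; the real column names are not given in the slides.
customers = [
    {"customer_id": "c1", "email": "ana@example.com"},
    {"customer_id": "c1", "email": "ana.alt@example.com"},
    {"customer_id": "c2", "email": "bo@example.com"},
]
transactions = [
    {"transaction_id": "t1", "customer_id": "c1"},
    {"transaction_id": "t2", "customer_id": "c2"},
    {"transaction_id": "t3", "customer_id": "c1"},
]

# Collect the distinct e-mails seen for each customer.
emails_by_customer = defaultdict(set)
for row in customers:
    emails_by_customer[row["customer_id"]].add(row["email"])

# Join each transaction to its customer and add the binary flag.
joined = [
    {**t, "multiple_emails": int(len(emails_by_customer[t["customer_id"]]) > 1)}
    for t in transactions
]

for row in joined:
    print(row)
```

Here customer `c1` has two e-mails, so both of that customer's transactions get the flag.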
10. DATA ACQUISITION, PREPARATION AND ANALYSIS
POWER BI
• Power BI-Power Query was used to conduct more in-depth analysis of the variables
• Based on “Column Distribution”, it was evident that some IP addresses, devices and billing addresses were being used by multiple customers (DUPLICATED)
• Thus, a new “Duplicated” column was created to reflect these transactions
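The “Duplicated” column was built in Power Query. A minimal Python sketch of the same logic, under the assumption that a transaction is flagged when any of its IP address, device or billing address is shared by two or more customers, might look like this. The field names and rows are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy transactions; field names are assumptions for illustration.
transactions = [
    {"customer_id": "c1", "ip": "10.0.0.1", "device": "d1", "billing": "1 Main St"},
    {"customer_id": "c2", "ip": "10.0.0.1", "device": "d2", "billing": "2 Oak Ave"},
    {"customer_id": "c3", "ip": "10.0.0.9", "device": "d3", "billing": "3 Elm Rd"},
]
SHARED_FIELDS = ("ip", "device", "billing")

# For every (field, value) pair, record which customers used it.
customers_by_value = defaultdict(set)
for t in transactions:
    for field in SHARED_FIELDS:
        customers_by_value[(field, t[field])].add(t["customer_id"])

# Flag a transaction when any of its values is shared by two or more customers.
for t in transactions:
    t["duplicated"] = int(any(
        len(customers_by_value[(field, t[field])]) > 1 for field in SHARED_FIELDS
    ))

print([t["duplicated"] for t in transactions])
```

Customers `c1` and `c2` share an IP address, so their transactions are flagged while the third is not.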
11. DATA ACQUISITION, PREPARATION AND ANALYSIS
R STUDIO
• R Studio was used to create the 2 Machine Learning (ML) algorithms
• For the complete code, please visit my GitHub repository
• To create both ML models, I loaded the necessary libraries, converted strings to factors, created a confusion matrix, visualized the balance of the dataset, split the data into training and testing sets, trained the models and then evaluated them on the test data, made predictions, and finally used a confusion matrix to measure accuracy
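The author's R code lives in the GitHub repository and is not reproduced here. To make the listed steps concrete, here is a standard-library Python sketch on synthetic, made-up data: it checks class balance, splits the data, trains a logistic regression, predicts on the test set and tallies a confusion matrix. A plain gradient-descent fit stands in for whatever R routines were actually used, and the feature names, split ratio, learning rate and seed are all assumptions.

```python
import math
import random
from collections import Counter

# Synthetic data: two binary features that usually agree with the fraud label.
rows = []
for i in range(400):
    fraud = i % 2
    multiple_emails = 1 - fraud if i % 10 in (0, 1) else fraud  # wrong about 20% of the time
    duplicated = 1 - fraud if i % 7 == 0 else fraud             # wrong about 14% of the time
    rows.append(([multiple_emails, duplicated], fraud))

# Check the balance of the dataset.
balance = Counter(label for _, label in rows)

# Split into training (75%) and testing (25%) sets.
rng = random.Random(42)
rng.shuffle(rows)
cut = int(0.75 * len(rows))
train, test = rows[:cut], rows[cut:]

def predict_proba(weights, bias, x):
    """Logistic function applied to the linear score."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

# Train with batch gradient descent on the log loss.
weights, bias, lr = [0.0, 0.0], 0.0, 0.5
for _ in range(500):
    grad_w, grad_b = [0.0, 0.0], 0.0
    for x, y in train:
        err = predict_proba(weights, bias, x) - y
        grad_b += err
        for j, xj in enumerate(x):
            grad_w[j] += err * xj
    bias -= lr * grad_b / len(train)
    weights = [w - lr * g / len(train) for w, g in zip(weights, grad_w)]

# Predict on the test set and tally the confusion matrix (fraud = positive).
matrix = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
for x, y in test:
    pred = int(predict_proba(weights, bias, x) >= 0.5)
    key = ("t" if pred == y else "f") + ("p" if pred == 1 else "n")
    matrix[key] += 1
accuracy = (matrix["tp"] + matrix["tn"]) / len(test)
print(balance, matrix, f"accuracy={accuracy:.2f}")
```

The numbers this prints describe the synthetic data only and say nothing about the results reported in the slides.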
12. DATA VISUALIZATION
POWER BI
• Power BI was used to create a map to show the geographical location of all transactions, color coded by fraud and non-fraud
14. CHALLENGES AND COOL TECHNIQUES
• Challenge: High number of correlated variables
• Cool Technique: Feature engineering, creating two columns (with binary values) to reflect transactions that had duplicated/multiple addresses, phone numbers and IP addresses, instead of creating one column for
• Challenge: Unbalanced dataset
• Cool Technique: Balanced it using RUS (random undersampling) to create a dataset with roughly the same number of fraud and non-fraud transactions
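Random undersampling keeps every row of the smaller class and draws a random subset of the larger class. A minimal Python sketch on made-up labels follows; the 10% fraud rate, the seed and the field names are assumptions, and this version balances the classes exactly where the project describes them as roughly equal.

```python
import random
from collections import Counter

# Made-up imbalanced data: 1 = fraud, 0 = non-fraud (about 10% fraud).
rows = [{"id": i, "fraud": int(i % 10 == 0)} for i in range(1000)]

fraud = [r for r in rows if r["fraud"] == 1]
non_fraud = [r for r in rows if r["fraud"] == 0]
minority, majority = sorted([fraud, non_fraud], key=len)

# Keep every minority row and a random, equally sized sample of the majority.
rng = random.Random(0)
balanced = minority + rng.sample(majority, k=len(minority))
rng.shuffle(balanced)

print(Counter(r["fraud"] for r in balanced))
```

The trade-off is that the discarded majority rows are never seen by the model, which is usually acceptable when the majority class is large.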
What If I had More Time?
• If I had more time, I would have used social network analysis to see how transactions may be associated with one another
15. THANK YOU FOR CHECKING OUT MY PROJECT!
Follow me for more project ideas
If you have any questions, comments, feedback or JOB OFFERS, feel free to DM me