Fraud Detection by
Stacking Cost-Sensitive
Decision Trees
Alejandro Correa Bahnsen, PhD
Chief Data Scientist & Head of Research
acorrea@easysol.net
Who am I?
Chief Data Scientist at Easy Solutions
Industrial Engineer
PhD in Machine Learning from Luxembourg University
Scikit-Learn contributor
Organizer of Science Bogota Meetups
About Easy Solutions®
A leading global provider of electronic fraud
prevention for financial institutions and
enterprise customers
430+ customers
In 30 countries
115 million
Users protected
30 billion
Online connections monitored
Industry recognition
Total Fraud Protection
Discuss what makes a data science project successful
[Figure: product areas: Risk Based Authentication, Phishing URL Classification, Phishing Brand ID, Fraud Detection, HTML Injection, Biometrics]
Research / Data Science Spectrum
• Basic Research: Maybe someday, someone can use this
• Applied Research: I might be able to use this
• Working Prototype: I can use this (sometimes)
• Quality Code: Software engineers can use this
• Tool or Service: People can use this
[Figure: the stages above arranged along two axes, innovation and practicality]
Credit Card Fraud Detection
Estimate the probability that a transaction is fraudulent by analyzing
customer patterns and recent fraudulent behavior.
Issues when constructing a fraud detection system:
• Skewness of the data
• Cost-sensitivity
• Short response time of the system
• Dimensionality of the search space
• Feature preprocessing
• Model selection
[Figure: network diagram labeled "Fraud??"]
Data
• Large European card processing company
• 2012 & 2013 card-present transactions
• 20MM transactions
• 40,000 frauds
• 0.467% fraud rate
• ~2MM EUR lost due to fraud on the test dataset
[Figure: monthly timeline, Jan to Dec, showing the train/test split]
Features: raw features
Attribute name      Description
Transaction ID      Transaction identification number
Time                Date and time of the transaction
Account number      Identification number of the customer
Card number         Identification of the credit card
Transaction type    e.g. Internet, ATM, POS, ...
Entry mode          e.g. chip and PIN, magnetic stripe, ...
Amount              Amount of the transaction in euros
Merchant code       Identification of the merchant type
Merchant group      Merchant group identification
Country             Country of the transaction
Country 2           Country of residence
Type of card        e.g. Visa debit, Mastercard, American Express, ...
Gender              Gender of the card holder
Age                 Card holder age
Bank                Issuer bank of the card
Financial Evaluation
Credit card fraud detection is a cost-sensitive problem, because the cost of a
false positive differs from the cost of a false negative.
• False positives: when a legitimate transaction is predicted as fraudulent,
the financial institution incurs an administrative cost.
• False negatives: when a fraud goes undetected, the amount of that transaction
is lost.
Moreover, it is not enough to assume a constant cost difference between
false positives and false negatives, because transaction amounts vary
considerably.
Cost matrix
Prediction                      Actual Positive (y_i = 1)    Actual Negative (y_i = 0)
Predicted Positive (c_i = 1)    C_TP_i = C_a                 C_FP_i = C_a
Predicted Negative (c_i = 0)    C_FN_i = Amt_i               C_TN_i = 0

$$ Cost(f(S)) = \sum_{i=1}^{N} \Big( y_i \big( c_i C_{TP_i} + (1 - c_i) C_{FN_i} \big) + (1 - y_i) \big( c_i C_{FP_i} + (1 - c_i) C_{TN_i} \big) \Big) $$
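As a rough illustration of the cost matrix and the total-cost sum, here is a minimal Python sketch. It assumes that C_a is the fixed administrative cost and Amt_i the amount of transaction i, as the Financial Evaluation slide suggests; the function name `total_cost` and the sample numbers are made up for the example.

```python
def total_cost(y_true, y_pred, amounts, admin_cost):
    """Sum the example-dependent cost of a set of predictions.

    y_true[i] is the actual label y_i, y_pred[i] the predicted label c_i.
    """
    cost = 0.0
    for y, c, amt in zip(y_true, y_pred, amounts):
        c_tp = admin_cost  # fraud caught: administrative cost C_a
        c_fp = admin_cost  # false alarm: administrative cost C_a
        c_fn = amt         # missed fraud: the transaction amount is lost
        c_tn = 0.0         # legitimate transaction left alone: no cost
        cost += y * (c * c_tp + (1 - c) * c_fn) + (1 - y) * (c * c_fp + (1 - c) * c_tn)
    return cost


# Hypothetical data: one TP, one FN, one FP, one TN.
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 1, 0]
amounts = [100.0, 50.0, 20.0, 10.0]
print(total_cost(y_true, y_pred, amounts, admin_cost=5.0))  # 5 + 50 + 5 + 0 = 60.0
```

The missed 50 EUR fraud dominates the total, which is the point of using example-dependent costs rather than a single misclassification rate.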
Cost-Sensitive Algorithms
• Cost Proportionate Sampling
• Bayes minimum risk
• Cost-sensitive logistic regression
• Cost-sensitive decision trees
• Stacking Cost-sensitive decision trees
Cost Proportionate Sampling
Normalized cost weight
$$ w_i = \begin{cases} C_{FP_i} & \text{if } y_i = 0 \\ C_{FN_i} & \text{if } y_i = 1 \end{cases} $$
$$ \bar{w}_i = \frac{w_i}{\max_j w_j} $$
Cost Proportionate Over Sampling*
Example    y_i    w_i
1          0      1
2          1      10
3          0      2
4          1      20
5          0      1
Initial dataset:
(1,0,1)
(2,1,10)
(3,0,2)
(4,1,20)
(5,0,1)
Cost-proportionate dataset:
(1,0,1)
(2,1,1), (2,1,1), …, (2,1,1)
(3,0,2), (3,0,2)
(4,1,1), (4,1,1), (4,1,1), …, (4,1,1), (4,1,1)
(5,0,1)
*Elkan, C. (2001). The Foundations of Cost-Sensitive Learning.
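The over-sampling example can be sketched in a few lines of Python. This is only an illustration under two assumptions: weights are positive integers, and every replicated copy carries unit weight (the slide prints example 3 with its original weight, so the exact convention there is unclear). The function name is hypothetical.

```python
def cost_proportionate_oversample(examples):
    """Replicate each (example_id, y, w) tuple w times with unit weight."""
    resampled = []
    for example_id, y, w in examples:
        # Assumption: w is a positive integer giving the number of copies.
        resampled.extend([(example_id, y, 1)] * int(w))
    return resampled


initial = [(1, 0, 1), (2, 1, 10), (3, 0, 2), (4, 1, 20), (5, 0, 1)]
resampled = cost_proportionate_oversample(initial)
print(len(resampled))  # 1 + 10 + 2 + 20 + 1 = 34
```

The resampled set grows with the sum of the weights, which is one practical reason to consider the rejection-sampling variant for large cost ranges.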
Cost Proportionate Rejection Sampling*
Example    y_i    w_i    w_i/max(w_i)
1          0      1      0.05
2          1      10     0.5
3          0      2      0.1
4          1      20     1
5          0      1      0.05
*Zadrozny et al. (2003). Cost-sensitive learning by cost-proportionate example weighting.
Cost-proportionate dataset:
(2,1,1)
(4,1,1)
(4,1,1)
(5,0,1)
Initial dataset:
(1,0,1)
(2,1,10)
(3,0,2)
(4,1,20)
(5,0,1)
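A minimal sketch of rejection sampling on the same five examples follows. It assumes the usual scheme in which example i is kept with probability w_i / max(w_i) and kept examples get unit weight; each example is drawn at most once here, so it will not reproduce the repeated tuple shown on the slide. The function name and the seed are illustrative only.

```python
import random


def cost_proportionate_rejection_sample(examples, rng):
    """Keep each (example_id, y, w) with probability w / max(w)."""
    max_w = max(w for _, _, w in examples)
    kept = []
    for example_id, y, w in examples:
        # rng.random() is uniform on [0, 1), so the heaviest example is always kept.
        if rng.random() < w / max_w:
            kept.append((example_id, y, 1))
    return kept


initial = [(1, 0, 1), (2, 1, 10), (3, 0, 2), (4, 1, 20), (5, 0, 1)]
sample = cost_proportionate_rejection_sample(initial, random.Random(0))
print(sample)  # the result depends on the seed
```

Unlike over-sampling, the resulting dataset is never larger than the original, at the price of discarding most low-cost examples.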
Bayes Minimum Risk
Decision model based on quantifying tradeoffs between various decisions
using probabilities and the costs that accompany such decisions.
Risk of classification:
$$ R(c_i = 0 \mid x_i) = C_{TN_i} (1 - p_i) + C_{FN_i} \cdot p_i $$
$$ R(c_i = 1 \mid x_i) = C_{FP_i} (1 - p_i) + C_{TP_i} \cdot p_i $$
Using the different risks, the prediction is made based on the following
condition:
$$ c_i = \begin{cases} 0 & R(c_i = 0 \mid x_i) \le R(c_i = 1 \mid x_i) \\ 1 & \text{otherwise} \end{cases} $$
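The decision rule can be written out directly. The sketch below assumes that p_i is an estimated fraud probability produced by some upstream classifier and that the costs follow the cost matrix of the Financial Evaluation slide (C_TP = C_FP = C_a, C_FN = Amt_i, C_TN = 0); the function name and the numbers are hypothetical.

```python
def bayes_minimum_risk(p, amount, admin_cost):
    """Return the label (0 or 1) with the lower expected cost."""
    c_tp, c_fp, c_fn, c_tn = admin_cost, admin_cost, amount, 0.0
    risk_predict_0 = c_tn * (1 - p) + c_fn * p  # R(c_i = 0 | x_i)
    risk_predict_1 = c_fp * (1 - p) + c_tp * p  # R(c_i = 1 | x_i)
    return 0 if risk_predict_0 <= risk_predict_1 else 1


# Same estimated probability, different amounts at stake.
print(bayes_minimum_risk(p=0.1, amount=10.0, admin_cost=5.0))    # 0: risks are 1.0 vs 5.0
print(bayes_minimum_risk(p=0.1, amount=1000.0, admin_cost=5.0))  # 1: risks are 100.0 vs 5.0
```

With this cost matrix the rule reduces to flagging a transaction whenever Amt_i · p_i exceeds C_a, so the effective probability threshold differs for every transaction.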
Cost-Sensitive Logistic Regression
• Logistic Regression Model
• Cost Function
• Cost Analysis
• Actual Costs
• Cost-Sensitive Function
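Cost-Sensitive Decision Trees
Proposed cost-based impurity measure
$$ S^l = \{ x_i \mid x_i \in S \wedge x_i^j \le l_m^j \} $$
$$ S^r = \{ x_i \mid x_i \in S \wedge x_i^j > l_m^j \} $$
• The impurity of each leaf is calculated using:
$$ I_c(S) = \min \{ Cost(f_0(S)), Cost(f_1(S)) \} $$
$$ f(S) = \begin{cases} 0 & \text{if } Cost(f_0(S)) \le Cost(f_1(S)) \\ 1 & \text{otherwise} \end{cases} $$
• Afterwards the gain of applying a given rule to the set $S$ is:
$$ Gain_c(x^j, l_m^j) = I_c(\pi_1) - \big( I_c(\pi_1^l) + I_c(\pi_1^r) \big) $$
[Figure: node S split into two child nodes by the rule $(x^j, l_m^j)$]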
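To make the impurity and gain concrete, here is a small Python sketch for a single numeric feature. It assumes the impurities in the gain are evaluated on S and on the two subsets S^l and S^r, and that costs follow the earlier cost matrix (f_0 misses every fraud in the node, f_1 pays C_a for every transaction). All names and numbers are illustrative.

```python
def impurity(node, admin_cost):
    """I_c(S) = min(Cost(f_0(S)), Cost(f_1(S))) for rows (x, y, amount)."""
    cost_f0 = sum(amt for _, y, amt in node if y == 1)  # predict all legitimate
    cost_f1 = admin_cost * len(node)                    # predict all fraud
    return min(cost_f0, cost_f1)


def gain(node, threshold, admin_cost):
    """Cost reduction obtained by splitting the node at x <= threshold."""
    left = [row for row in node if row[0] <= threshold]
    right = [row for row in node if row[0] > threshold]
    return impurity(node, admin_cost) - (
        impurity(left, admin_cost) + impurity(right, admin_cost)
    )


# Hypothetical node: (feature value, label, amount).
node = [(1.0, 0, 20.0), (2.0, 0, 30.0), (3.0, 0, 10.0), (8.0, 1, 100.0), (9.0, 1, 200.0)]
print(impurity(node, admin_cost=5.0))             # min(300.0, 25.0) = 25.0
print(gain(node, threshold=5.0, admin_cost=5.0))  # 25.0 - (0 + 10.0) = 15.0
```

The split is worthwhile because the left child can be labelled legitimate at zero cost, leaving only two administrative costs on the right.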
Decision tree construction
• The rule that maximizes the gain is selected:
$$ (best_x, best_l) = \arg\max_{(j, m)} Gain(x^j, l_m^j) $$
• The process is repeated until a stopping criterion is met.
[Figure: tree grown level by level from the root node S]
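The search for the best rule can be sketched as an exhaustive loop over features j and candidate thresholds l_m^j. This sketch assumes the candidate thresholds are the distinct observed feature values and reuses the cost-based impurity under the earlier cost matrix; the function names and data are hypothetical, and no stopping criterion is implemented.

```python
def impurity(node, admin_cost):
    """Cost-based impurity for rows (features, y, amount)."""
    cost_f0 = sum(amt for _, y, amt in node if y == 1)
    cost_f1 = admin_cost * len(node)
    return min(cost_f0, cost_f1)


def best_split(node, admin_cost):
    """Return (feature index, threshold, gain) of the rule with maximal gain."""
    parent = impurity(node, admin_cost)
    best = (None, None, 0.0)  # no split is chosen unless the gain is positive
    for j in range(len(node[0][0])):
        for threshold in sorted({x[j] for x, _, _ in node}):
            left = [row for row in node if row[0][j] <= threshold]
            right = [row for row in node if row[0][j] > threshold]
            g = parent - (impurity(left, admin_cost) + impurity(right, admin_cost))
            if g > best[2]:
                best = (j, threshold, g)
    return best


# Hypothetical node with two features per transaction.
node = [
    ([1.0, 7.0], 0, 20.0),
    ([2.0, 3.0], 0, 30.0),
    ([8.0, 5.0], 1, 100.0),
    ([9.0, 1.0], 1, 200.0),
]
print(best_split(node, admin_cost=5.0))  # (0, 2.0, 10.0)
```

A full tree builder would apply `best_split` recursively to the two child nodes until the chosen stopping criterion holds.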
Proposed cost-sensitive pruning criterion
• Calculation of the tree savings and the pruned tree savings:
$$ PC_c = \frac{Cost(f(S, Tree)) - Cost(f(S, EB(Tree, branch)))}{|Tree| - |EB(Tree, branch)|} $$
• After calculating the pruning criterion for all possible trees, the maximum
improvement is selected and the tree is pruned.
• The process is then repeated until there is no further improvement.
[Figure: a tree pruned branch by branch over successive steps]
Ensemble Cost-Sensitive Decision Trees
A typical ensemble is made by combining T different base classifiers. Each
base classifier is trained by applying algorithm M to a random subset of the
training set:
$$ M_j \leftarrow M(S_j) \quad \forall j \in \{1, \dots, T\} $$
[Figure: a training set and the random subsets drawn from it under four schemes: Bagging, Pasting, Random forest, Random patches]
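The slide only names the four schemes, so the following sketch relies on their commonly used definitions, which is an assumption: bagging draws rows with replacement, pasting draws rows without replacement, and random patches subsample both rows and features. Random forests are left out because they pick random features at every split inside the tree rather than once per subset. The function name and the 50% fractions are arbitrary.

```python
import random


def draw_subset(n_rows, n_features, scheme, rng, row_frac=0.5, feat_frac=0.5):
    """Return (row indices, feature indices) of the subset S_j for one base classifier."""
    all_features = list(range(n_features))
    if scheme == "bagging":
        # Bootstrap: n_rows draws with replacement, all features.
        return [rng.randrange(n_rows) for _ in range(n_rows)], all_features
    if scheme == "pasting":
        # Fraction of the rows without replacement, all features.
        return rng.sample(range(n_rows), int(row_frac * n_rows)), all_features
    if scheme == "random_patches":
        # Random subset of rows and a random subset of features.
        rows = rng.sample(range(n_rows), int(row_frac * n_rows))
        feats = sorted(rng.sample(range(n_features), int(feat_frac * n_features)))
        return rows, feats
    raise ValueError(f"unknown scheme: {scheme}")


rng = random.Random(42)
for scheme in ("bagging", "pasting", "random_patches"):
    print(scheme, draw_subset(8, 4, scheme, rng))
```

Each call yields the indices of one subset S_j; training T base classifiers means drawing T such subsets.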
After the base classifiers are constructed, they are typically combined using
one of the following methods:
• Majority voting
$$ H(S) = f_{mv}(S, M) = \arg\max_{c \in \{0, 1\}} \sum_{j=1}^{T} \mathbf{1}_c(M_j(S)) $$
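Majority voting amounts to counting how many base classifiers predict each class. A minimal sketch for one example follows; breaking ties in favour of class 0 is an assumption, since the slide does not say how ties are resolved.

```python
def majority_vote(predictions):
    """Combine the 0/1 outputs M_j(S) of T base classifiers for one example."""
    votes_for_1 = sum(predictions)
    votes_for_0 = len(predictions) - votes_for_1
    # Assumption: ties go to class 0 (legitimate).
    return 1 if votes_for_1 > votes_for_0 else 0


print(majority_vote([1, 0, 1]))  # 1
print(majority_vote([0, 0, 1]))  # 0
```

Every base classifier counts equally here, regardless of how costly its mistakes are, which is the limitation the stacking approach addresses.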
• Proposed cost-sensitive stacking
$$ H(S) = f_s(S, M, \beta) = \frac{1}{1 + e^{-\sum_{j=1}^{T} \beta_j M_j(S)}} $$
Using the cost-sensitive logistic regression [Correa et al., 2014] model:
$$ J(S, M, \beta) = \sum_{i=1}^{N} \Big[ y_i \big( f_s(S, M, \beta) (C_{TP_i} - C_{FN_i}) + C_{FN_i} \big) + (1 - y_i) \big( f_s(S, M, \beta) (C_{FP_i} - C_{TN_i}) + C_{TN_i} \big) \Big] $$
Then the weights are estimated using
$$ \beta = \arg\min_{\beta} J(S, M, \beta) $$
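The stacking objective can be sketched in plain Python. The slide does not say which optimizer is used to minimize J, so plain gradient descent is swapped in here purely for illustration; the learning rate, iteration count, function names, and toy data are all assumptions, and costs follow the earlier cost matrix (C_TP = C_FP = C_a, C_FN = Amt_i, C_TN = 0).

```python
import math


def f_s(beta, m_i):
    """Sigmoid of the weighted base-classifier outputs for one example."""
    z = sum(b * m for b, m in zip(beta, m_i))
    z = max(-50.0, min(50.0, z))  # clamp to keep math.exp in range
    return 1.0 / (1.0 + math.exp(-z))


def stacking_cost(beta, base_preds, y, amounts, admin_cost):
    """J(S, M, beta): expected cost of the stacked prediction."""
    total = 0.0
    for m_i, y_i, amt in zip(base_preds, y, amounts):
        f = f_s(beta, m_i)
        total += y_i * (f * (admin_cost - amt) + amt) + (1 - y_i) * (f * admin_cost)
    return total


def fit_stacking(base_preds, y, amounts, admin_cost, lr=0.01, n_iter=500):
    """Estimate beta by gradient descent on J (illustrative optimizer choice)."""
    beta = [0.0] * len(base_preds[0])
    for _ in range(n_iter):
        grad = [0.0] * len(beta)
        for m_i, y_i, amt in zip(base_preds, y, amounts):
            f = f_s(beta, m_i)
            slope = y_i * (admin_cost - amt) + (1 - y_i) * admin_cost  # dJ_i/df
            for j, m in enumerate(m_i):
                grad[j] += slope * f * (1 - f) * m
        beta = [b - lr * g / len(y) for b, g in zip(beta, grad)]
    return beta


# Hypothetical outputs of T = 3 base classifiers on 6 transactions.
base_preds = [[1, 1, 0], [1, 0, 1], [0, 0, 0], [0, 1, 0], [1, 1, 1], [0, 0, 1]]
y = [1, 1, 0, 0, 1, 0]
amounts = [100.0, 50.0, 20.0, 10.0, 200.0, 30.0]
beta = fit_stacking(base_preds, y, amounts, admin_cost=5.0)
print(stacking_cost([0.0, 0.0, 0.0], base_preds, y, amounts, 5.0))  # 190.0
print(stacking_cost(beta, base_preds, y, amounts, 5.0))             # lower than the value above
```

Base classifiers whose positive votes coincide with high-amount frauds end up with larger weights, because those examples contribute the steepest slopes to the gradient.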
Results
[Figure: bar chart of % Savings and F1-Score (axis from 0% to 80%) for Expert Rules, Random Forests, RF CS Sampling, CS Logistic Regression, CS Decision Tree, Ensemble CSDT Majority, and Ensemble CSDT Stacking]
Costcla Library
Conclusions
• New framework for stacking of example-dependent cost-sensitive decision trees
• Models should be evaluated taking into account the real financial costs of the application
• Algorithms should be developed to incorporate those financial costs
Any questions or comments, please let me know.
Alejandro Correa Bahnsen, PhD
Chief Data Scientist & Head of Research
acorrea@easysol.net
Thank you!
