Online fraud costs the global economy more than $400 billion, with more than 800 million personal records stolen in 2013 alone. Increasingly, fraud has diversified to different digital channels, including mobile and online payments, creating new challenges as innovative fraud patterns emerge. Hence it is still a challenge to find effective methods to mitigate fraud. Existing solutions include simple if-then rules and classical machine learning algorithms.
From an academic perspective, credit card fraud detection is a standard classification problem in which historical transaction data is used to predict future frauds. However, practical aspects make the problem more complex. Existing comparison measures lack a realistic representation of monetary gains and losses, which is necessary for effective fraud detection. Moreover, only a tiny fraction of the enormous number of transactions are frauds, which implies a huge class imbalance. Additionally, a real fraud detection system is required to respond in milliseconds, a constraint that must be taken into account in the modeling process for the system to be successfully implemented. To address these problems, this presentation compares two recently proposed algorithms: Bayes minimum risk and the example-dependent cost-sensitive decision tree. These methods are compared with state-of-the-art algorithms and show significant improvements as measured by financial savings.
2. About me
• PhD in Machine Learning at Luxembourg University
• Data Scientist at Easy Solutions
• Worked for 8+ years as a data scientist at GE Money, Scotiabank and SIX Financial Services
• Bachelor and Master in Industrial Engineering
• Organizer of the Big Data & Data Science Bogota Meetup
3. About us
A leading global provider of electronic fraud prevention for financial institutions and enterprise customers
• 280+ customers in 26 countries
• 75 million users protected
• 22+ billion online connections monitored in the last 12 months
Industry recognition
9. Card Present vs. Card Not Present Fraud Rates
Year              2006   2007   2008   2009   2010   2011
Card Not Present  1.10%  1.30%  1.10%  0.90%  0.88%  0.87%
Card Present      0.09%  0.08%  0.08%  0.06%  0.05%  0.05%

US Online Banking (Billions of Transactions)
Year          2009  2010  2011  2012  2013
Transactions  23.3  26.8  30.0  33.3  35.0

US Mobile Banking (Billions of Transactions)
Year          2009  2010  2011  2012  2013
Transactions  1.2   3.0   5.6   9.4   14.0
10. Mobile banking continues to grow while traditional channels lose users
Which channels do you use for banking operations, balance inquiries, bill payments, tax payments, or other payments or purchases?
12. The main reason given by those who do NOT use the Internet for transactions or purchases is fear of electronic fraud
Why do you NOT USE the Internet for banking operations, payments, or purchases?
13. There is a need for
better fraud
detection strategies
19. Big data (Data Science) is like teenage sex:
everyone talks about it,
nobody really knows how to do it,
everyone thinks everyone else is doing it,
so everyone claims they are doing it...
22. Man on the Moon
Distance: 356,000 km
Never been there
before
Must return to Earth
Man on the Moon – Small Data!!
Apollo XI
Speed: 3,500 km/hour
Weight: 13,500kg
Lots of complex data
Computer Program
64Kb, 2Kb RAM,
Fortran
Must work the first
time
23. Apollo XI, 1969
64Kb, 2Kb RAM
Man on the Moon – Small Data!!
iPhone 6
128GB, 2GB RAM
25. Big Data Analytics is the use of methods and tools of Machine Learning and Artificial Intelligence with the objective of making data-driven decisions
27. Estimate the probability of a transaction being fraud based on analyzing
customer patterns and recent fraudulent behavior
Issues when constructing a fraud detection system:
• Skewness of the data
• Cost-sensitivity
• Short time response of the system
• Dimensionality of the search space
• Feature preprocessing
• Model selection
Credit card fraud detection
29. • Large European card processing company
• 2012 & 2013 card present transactions
• 20MM transactions
• 40,000 frauds
• 0.467% fraud rate
• ~2MM EUR lost due to fraud on the test dataset
[Figure: monthly timeline, Jan to Dec, showing how the data is split into Train and Test sets]
Data
30. • “Purpose is to use facts and rules, taken from the knowledge
of many human experts, to help make decisions.”
• Example of rules
• More than 4 ATM transactions in one hour?
• More than 2 transactions in 5 minutes?
• Magnetic stripe transaction then internet transaction?
If-Then rules (Expert rules)
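The first two example rules can be sketched as simple counting checks over a customer's recent transactions. This is only an illustrative sketch: the dictionary keys (`time`, `type`), the function names, and the choice to count the incoming transaction toward the totals are assumptions, not part of the original slides.

```python
from datetime import datetime, timedelta


def count_recent(transactions, now, window, channel=None):
    """Count transactions inside (now - window, now], optionally for one channel."""
    return sum(
        1
        for t in transactions
        if now - window < t["time"] <= now
        and (channel is None or t["type"] == channel)
    )


def flag_by_rules(trx, history):
    """Return the names of the (hypothetical) expert rules triggered by trx."""
    now = trx["time"]
    seen = history + [trx]  # assumption: the new transaction counts too
    triggered = []
    if count_recent(seen, now, timedelta(hours=1), channel="ATM") > 4:
        triggered.append("more_than_4_atm_in_1h")
    if count_recent(seen, now, timedelta(minutes=5)) > 2:
        triggered.append("more_than_2_trx_in_5min")
    return triggered


base = datetime(2013, 1, 1, 12, 0)
history = [{"time": base + timedelta(minutes=m), "type": "POS"} for m in (0, 1)]
new_trx = {"time": base + timedelta(minutes=2), "type": "POS"}
alerts = flag_by_rules(new_trx, history)
print(alerts)
```

Rules like these are easy to explain and audit, but each one has to be written and tuned by hand.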
32. Credit card fraud detection is a cost-sensitive problem, as the cost of a false positive differs from the cost of a false negative.
• False positives: when a transaction is predicted as fraudulent but is in fact legitimate, the financial institution incurs an administrative cost.
• False negatives: when a fraud goes undetected, the amount of that transaction is lost.
Moreover, it is not enough to assume a constant cost difference between false positives and false negatives, as transaction amounts vary quite significantly.
Financial evaluation
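A minimal sketch of this cost-based evaluation follows. It assumes that correct predictions cost nothing, that a false positive costs a fixed administrative amount (`admin_cost`, a hypothetical value), and that a false negative costs the transaction amount. The `savings` function is one plausible way to express financial savings, relative to a baseline that flags nothing; it is an assumption, not a definition taken from the slides.

```python
def total_cost(y_true, y_pred, amounts, admin_cost):
    """Sum the example-dependent cost of each prediction."""
    cost = 0.0
    for is_fraud, flagged, amount in zip(y_true, y_pred, amounts):
        if flagged and not is_fraud:
            cost += admin_cost  # false positive: administrative cost
        elif is_fraud and not flagged:
            cost += amount  # false negative: the transaction amount is lost
    return cost


def savings(y_true, y_pred, amounts, admin_cost):
    """Fraction of the 'flag nothing' cost that the predictions avoid."""
    baseline = total_cost(y_true, [0] * len(y_true), amounts, admin_cost)
    if baseline == 0:
        return 0.0  # no fraud in the data, so nothing to save
    return (baseline - total_cost(y_true, y_pred, amounts, admin_cost)) / baseline


y_true = [0, 0, 1, 1]
y_pred = [1, 0, 1, 0]
amounts = [50, 20, 1000, 30]
cost = total_cost(y_true, y_pred, amounts, admin_cost=5)
saved = savings(y_true, y_pred, amounts, admin_cost=5)
print(cost, round(saved, 3))
```

In this toy example the model raises one false alarm and misses one small fraud, yet it still avoids most of the monetary loss because it catches the large fraud.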
36. Fraud Analytics is the use of statistical
and mathematical techniques (Machine
Learning) to discover patterns in data in
order to make predictions
Fraud Analytics
37. Raw features
Attribute name    Description
Transaction ID    Transaction identification number
Time              Date and time of the transaction
Account number    Identification number of the customer
Card number       Identification of the credit card
Transaction type  e.g. Internet, ATM, POS, ...
Entry mode        e.g. Chip and PIN, magnetic stripe, ...
Amount            Amount of the transaction in Euros
Merchant code     Identification of the merchant type
Merchant group    Merchant group identification
Country           Country of the transaction
Country 2         Country of residence
Type of card      e.g. Visa debit, Mastercard, American Express, ...
Gender            Gender of the card holder
Age               Card holder age
Bank              Issuer bank of the card
Features
38. Transaction aggregation strategy
Raw Features
TrxId  Time       Type  Country  Amt
1      1/1 18:20  POS   Lux      250
2      1/1 20:35  POS   Lux      400
3      1/1 22:30  ATM   Lux      250
4      2/1 00:50  POS   Ger      50
5      2/1 19:18  POS   Ger      100
6      2/1 23:45  POS   Ger      150
7      3/1 06:00  POS   Lux      10

Aggregated Features
TrxId  No Trx last 24h  Amt last 24h  No Trx last 24h, same type and country  Amt last 24h, same type and country
1      0                0             0                                       0
2      1                250           1                                       250
3      2                650           0                                       0
4      3                900           0                                       0
5      3                700           1                                       50
6      2                150           2                                       150
7      3                400           0                                       0
Features
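The aggregation can be sketched as a look-back over each customer's earlier transactions. The sketch below assumes the dates in the table are day/month, picks 2013 as an arbitrary year, and treats the window as the 24 hours strictly before each transaction; the tuple layout and function name are illustrative only.

```python
from datetime import datetime, timedelta

# (trx_id, time, type, country, amount); year 2013 is an arbitrary assumption
RAW = [
    (1, datetime(2013, 1, 1, 18, 20), "POS", "Lux", 250),
    (2, datetime(2013, 1, 1, 20, 35), "POS", "Lux", 400),
    (3, datetime(2013, 1, 1, 22, 30), "ATM", "Lux", 250),
    (4, datetime(2013, 1, 2, 0, 50), "POS", "Ger", 50),
    (5, datetime(2013, 1, 2, 19, 18), "POS", "Ger", 100),
    (6, datetime(2013, 1, 2, 23, 45), "POS", "Ger", 150),
    (7, datetime(2013, 1, 3, 6, 0), "POS", "Lux", 10),
]


def aggregate(transactions, window=timedelta(hours=24)):
    """For each transaction, summarize the earlier ones inside the window."""
    rows = []
    for _, time, kind, country, _ in transactions:
        recent = [t for t in transactions if time - window <= t[1] < time]
        same = [t for t in recent if t[2] == kind and t[3] == country]
        rows.append(
            (
                len(recent),                # number of trx, last 24h
                sum(t[4] for t in recent),  # amount, last 24h
                len(same),                  # number of trx, same type and country
                sum(t[4] for t in same),    # amount, same type and country
            )
        )
    return rows


features = aggregate(RAW)
for row in features:
    print(row)
```

With a strict 24-hour window, the last row comes out as 2 transactions totalling 250, because transaction 4 is about 29 hours old by then; the slide's table lists 3 and 400 for that row.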
39. When is a customer expected to
make a new transaction?
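Considering a von Mises distribution with a period of 24 hours such that
$$P(\text{time}) \sim \text{vonmises}(\mu, \sigma) = \frac{e^{\sigma \cos(\text{time} - \mu)}}{2\pi I_0(\sigma)}$$
where $\mu$ is the mean, $\sigma$ controls the spread (as written, it acts as the concentration parameter, so larger values give a tighter peak), and $I_0$ is the modified Bessel function of the first kind of order 0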
Periodic features
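A small sketch of the periodic idea, using only the standard library: hours are mapped onto a 24-hour circle, the mean is computed on the circle, and the density follows the slide's formula with `sigma` multiplying the cosine. The function names, the sample hours, and the series used for the Bessel function are illustrative assumptions.

```python
import math

TWO_PI = 2 * math.pi


def to_angle(hour):
    """Map an hour of the day (0-24) onto the circle [0, 2*pi)."""
    return TWO_PI * (hour % 24) / 24


def circular_mean_hour(hours):
    """Mean time of day computed on the circle, so 23:00 and 01:00 average to midnight."""
    s = sum(math.sin(to_angle(h)) for h in hours)
    c = sum(math.cos(to_angle(h)) for h in hours)
    return (math.atan2(s, c) % TWO_PI) * 24 / TWO_PI


def bessel_i0(x, terms=50):
    """Modified Bessel function of the first kind, order 0, via its power series."""
    total, term = 1.0, 1.0
    for k in range(1, terms):
        term *= (x / 2) ** 2 / k ** 2
        total += term
    return total


def von_mises_pdf(hour, mu_hour, sigma):
    """Density (per radian) of the slide's formula at a given hour."""
    angle = to_angle(hour) - to_angle(mu_hour)
    return math.exp(sigma * math.cos(angle)) / (TWO_PI * bessel_i0(sigma))


night_owl = [23.0, 1.0, 0.5, 23.5]  # hypothetical transaction times
mu = circular_mean_hour(night_owl)
print(mu, von_mises_pdf(0.0, mu, 2.0), von_mises_pdf(12.0, mu, 2.0))
```

The arithmetic mean of these hours would be 12:00, the exact opposite of when this customer actually transacts, which is why a periodic model is used instead.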
45. Decision Trees
X1 = Amount of the transaction
X2 = Number of transactions last day
A decision tree is a classification model that iteratively creates binary
decision rules that maximize certain criteria (Gini, entropy, …).
[Figure: example decision tree; the initial node splits on X2 < 10 vs. X2 ≥ 10, with further splits on X1 (thresholds 50 and 100) and X2 (threshold 15)]
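How a single binary rule is chosen can be sketched with the Gini criterion: try each candidate threshold on a feature and keep the one with the lowest weighted impurity. The data and function names below are hypothetical, and a full tree would repeat this search recursively on each side of the split.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels: 2 * p * (1 - p)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_threshold(values, labels):
    """Return (threshold, weighted Gini) of the best rule 'value < threshold'."""
    n = len(labels)
    best = (None, gini(labels))  # no split: impurity of the parent node
    for threshold in sorted(set(values)):
        left = [y for x, y in zip(values, labels) if x < threshold]
        right = [y for x, y in zip(values, labels) if x >= threshold]
        if not left or not right:
            continue  # a rule must send examples to both sides
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        if weighted < best[1]:
            best = (threshold, weighted)
    return best


amounts = [20, 40, 60, 120, 150, 200]  # X1: amount of the transaction
is_fraud = [0, 0, 0, 1, 1, 1]
split = best_threshold(amounts, is_fraud)
print(split)
```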
46. A Random Forest is made by combining many different decision trees, each one trained on a random subset of the initial dataset
Random Forests
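The combination step can be sketched as bagging with a majority vote. To keep the example short, each "tree" here is only a one-split stump, and the random subset is drawn as a bootstrap sample (sampling with replacement); both are simplifying assumptions, as are the toy data and function names.

```python
import random
from collections import Counter


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def gini(labels):
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def fit_stump(X, y):
    """One-split tree: pick the (feature, threshold) with the lowest weighted Gini."""
    best = None  # (score, feature, threshold, left_label, right_label)
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[j] < threshold]
            right = [lab for row, lab in zip(X, y) if row[j] >= threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, j, threshold, majority(left), majority(right))
    if best is None:  # every sampled row is identical: predict a constant
        label = majority(y)
        return lambda row: label
    _, j, threshold, left_label, right_label = best
    return lambda row: left_label if row[j] < threshold else right_label


def fit_forest(X, y, n_trees=25, seed=0):
    """Train each stump on its own bootstrap sample of the data."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        trees.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return trees


def predict(trees, row):
    """Majority vote over all trees."""
    return majority([tree(row) for tree in trees])


# (X1 = amount, X2 = number of transactions last day); hypothetical data
X = [(20, 1), (35, 2), (50, 1), (60, 3), (400, 12), (520, 15), (610, 11), (700, 14)]
y = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(X, y)
print(predict(forest, (1000, 20)), predict(forest, (10, 0)))
```

Because every tree sees a slightly different sample, their individual errors tend to differ, and the vote smooths them out.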
48. Cost-Sensitive Decision Trees
• Standard decision trees create rules
that maximize either the Gini or the
entropy measures
• However this assumes that all
misclassification errors carry the same
cost
• Not true in fraud detection
• Instead the cost-sensitive decision tree
minimizes the cost of each rule
$\mathrm{Cost}(f(\mathrm{node}))$
[Figure: example decision tree with splits on X1 (thresholds 50 and 100) and X2 (thresholds 10 and 15)]
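One way to read "minimizes the cost of each rule" is to replace the impurity measure with a monetary one. The sketch below assumes the cost of a node is the cheaper of two options, flagging every transaction in it or flagging none, with a false positive costing a fixed `admin_cost` and a missed fraud costing its amount. This cost definition, the toy data, and the names are assumptions made for illustration.

```python
def node_cost(labels, amounts, admin_cost):
    """Cost of a node: the cheaper of 'flag nothing' and 'flag everything'."""
    cost_flag_none = sum(a for y, a in zip(labels, amounts) if y == 1)  # missed frauds
    cost_flag_all = admin_cost * sum(1 for y in labels if y == 0)  # false alarms
    return min(cost_flag_none, cost_flag_all)


def best_cost_split(values, labels, amounts, admin_cost):
    """Return (threshold, total cost) of the cheapest rule 'value < threshold'."""
    best = (None, node_cost(labels, amounts, admin_cost))  # cost without splitting
    for threshold in sorted(set(values)):
        left = [i for i, x in enumerate(values) if x < threshold]
        right = [i for i, x in enumerate(values) if x >= threshold]
        if not left or not right:
            continue
        cost = sum(
            node_cost([labels[i] for i in side], [amounts[i] for i in side], admin_cost)
            for side in (left, right)
        )
        if cost < best[1]:
            best = (threshold, cost)
    return best


n_trx_last_day = [1, 2, 3, 12, 14, 15]  # X2: number of transactions last day
is_fraud = [0, 0, 1, 0, 1, 1]
amounts = [30, 45, 5, 60, 900, 1200]
split = best_cost_split(n_trx_last_day, is_fraud, amounts, admin_cost=10)
print(split)
```

In this toy data the chosen rule deliberately lets the 5-euro fraud through, since catching it would cost more in false alarms than it saves; a Gini-based split has no way to make that trade-off.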
50. • Fraud Analytics (ML) models are significantly
better than expert rules
• Models should be evaluated taking into
account real financial costs of the application
• Algorithms should be developed to
incorporate those financial costs
Conclusions
In 2015, the Internet and mobile technology have solidified their status in Latin America as the most popular channels for banking operations, payments, and purchases. Bank branches continue to lose usage, and fewer than 30% of users regularly use traditional channels such as ATMs or interactive voice response systems.
Users who transact online clearly prefer to eliminate cash from their transactions as much as possible, and credit card use appears to follow this trend, as users increasingly prefer to handle their operations on computers and mobile devices.
Although the use of mobile devices for financial operations continues to grow, users are still reluctant to use these devices the same way they use their computers, even though mobile devices are more convenient because they can be carried everywhere.
The Internet remains the most frequently used channel, with an average of 3.8 uses per person per month.
Anecdote about banks in Colombia and their queues.
And users are right to be so cautious. A study conducted by Arxan Technologies says that 95% of the top financial mobile apps for Android (and 70% of those for iOS) have been hacked. In 2014, Trend Micro found that 77% of the 50 most downloaded free apps on Google Play had fake versions, making it very hard for users to tell which ones are authentic and which are fraudulent.
Analyzing the views and opinions of those who regularly use the Internet for banking and purchases is of great importance when designing a strategy that seeks to exploit the full potential of this channel. However, it is equally important to examine those users who, for a variety of reasons, do not use the Internet for financial or e-commerce purposes.
The main reason these users gave for not taking advantage of online banking services was fear of electronic fraud. Given that online banking portals offer users greater convenience and their lower operating costs benefit institutions, it is imperative that banks keep investigating ways to promote the adoption of electronic banking channels. Fraud prevention in these channels is not only a way to prevent financial losses and protect institutions' reputations; strong fraud protection can also give previously skeptical users the confidence they need to incorporate these channels into their normal banking routine, and give banks with higher adoption rates a wider competitive advantage.
Analytics at work. Davenport 2010.
http://tagul.com/
The famous French general didn’t even live in the information age, and yet he attributed most of his military success to having the right information. When you’re battling for a competitive advantage in business, analytics data can be equally important to your success.