GANs for Anti Money Laundering

Jim Dowling
CEO / Co-Founder
Logical Clocks
Associate Prof at KTH – Royal Institute of Technology
Anti Money Laundering and GANs
Berlin Meetup
@jim_dowling

Agenda
● Problem: Increase detection rate and reduce costs for AML.
● Solution: We used the Hopsworks platform to train GANs to classify transactions as suspected of money laundering or not. We worked with a large transaction dataset (~40 TB). The solution uses Spark for feature engineering and TensorFlow/GPUs to train a binary classifier that labels transactions as either clean or dirty. We use the open-source Hopsworks platform to manage features, scale out training, and manage models.
● Reference: Whitepaper

What is Money Laundering?
● Money laundering involves turning “dirty” money into “clean” money, either through an obscure sequence of banking transfers or through commercial transactions.
● The three broad stages of money laundering* are:
○ Placement (smurf it)
○ Layering (spread it out fast)
○ Integration (buy stuff)
*https://towardsdatascience.com/the-art-of-engineering-features-for-a-strong-machine-learning-model-a47a876e654c

Rules-Based AML vs Deep Learning AML

AML as a Supervised ML Problem
● Anti-money laundering (AML) is a pattern matching problem
● AML systems should automatically flag ‘suspect’ financial transactions
○ Followed by manual investigation
● Historical transaction datasets have massive data imbalance between the number of ‘clean’ transactions versus ‘dirty’ transactions
○ Clean transactions: millions or billions
○ Dirty transactions: 100s or 1000s

Implications of AML as a Binary Classification Problem
True Positive
Reality: A Money Laundering Transaction
Prediction: “Dirty” transaction predicted
Result: Good
False Positive
Reality: Not a Money Laundering Transaction
Prediction: “Dirty” transaction predicted
Result: Unnecessary work and cost!
False Negative
Reality: A Money Laundering Transaction
Prediction: “Clean” transaction predicted
Result: Fines/jail by authorities/regulator!
True Negative
Reality: Not a Money Laundering Transaction
Prediction: “Clean” transaction predicted
Result: Good
Confusion matrix of our Binary AML Classifier with all possible predictions and their consequences.
We use a variant of the F1 score to evaluate models (precision, recall, fallout should not be weighted equally).
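The slide does not say which F1 variant is used. One common way to weight recall more heavily than precision is the F-beta score, so the sketch below assumes that choice. The counts are made up for illustration, and `f_beta` and `fall_out` are hypothetical helper names.

```python
def f_beta(tp, fp, fn, beta=2.0):
    """F-beta score from confusion-matrix counts; beta > 1 weights recall more."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def fall_out(fp, tn):
    """False positive rate: share of clean transactions flagged as dirty."""
    return fp / (fp + tn) if (fp + tn) else 0.0


# Illustrative counts only (not results from the talk).
tp, fp, fn, tn = 80, 400, 20, 999_500

f1 = f_beta(tp, fp, fn, beta=1.0)  # precision and recall weighted equally
f2 = f_beta(tp, fp, fn, beta=2.0)  # missed laundering (FN) penalised more
fpr = fall_out(fp, tn)
```

With a highly imbalanced dataset, accuracy would be close to 1.0 even for a model that misses most dirty transactions, which is why a score built from precision and recall is more informative here.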

AML as an Anomaly Detection Problem
“Anomaly detection follows quite naturally from a good unsupervised model”
Alex Graves (DeepMind)
Traditional unsupervised approaches do not scale: k-means clustering and principal component analysis
[Image from Ruff et al, “Deep Semi-Supervised Anomaly Detection”, https://arxiv.org/pdf/1906.02694.pdf]

AML - Semi-Supervised Anomaly Detection
AML is not a classical use-case for anomaly
detection as we typically have labelled
datasets, albeit imbalanced.
“Semi-supervised learning is a class of machine
learning tasks and techniques that also make
use of unlabeled data for training – typically a
small amount of labeled data with a large
amount of unlabeled data.” Wikipedia

GANs and Other Methods for Anomaly Detection
● Variational Auto-Encoders for Anomaly Detection
○ Easier to train, performance not state-of-the-art
● Generative Adversarial Networks (GANs)
○ Learn the manifold of normal samples (what to do if anomaly-free
dataset is polluted)
○ One-Class Classifier for Novelty Detection GAN
○ BiGAN, BigGAN, BigBiGAN, GANOMALY, f-AnoGAN, GANs for Fraud
“[For GANs] the Convolutional Neural Network architecture is more important than how you
train them”, Marc Aurelio Ranzato (Facebook) at NeurIPS 2018.

GAN Discriminator-Based Anomaly Detection

GANs are hard to train
● Pick the right GAN Architecture
● Risk of mode-collapse
● Hard to tune Hyperparameters
Different Hyperparameter Tuning Strategies

GANs are hard to train
● Mode collapse
○ Transactional data distributions are multimodal: multiple types of transactional behaviour are perfectly normal.
○ The original GAN is based on a zero-sum non-cooperative game. In this setting, when the mini-max game reaches the Nash equilibrium too soon, the generator learns to produce only a limited number of modes and mode collapse occurs.
● GANs are highly sensitive to hyperparameters.
○ Finding good hyperparameters takes time, especially for GANs. A list of possible hyperparameters and tricks is available at https://github.com/soumith/ganhacks
○ It is essential to have a good optimization and hyperparameter tuning engine.

How to address the mode collapse problem
● MO-GAAL [Liu, et al] proposed using multiple generators, where different generators are in charge of learning different modes of the distribution.
● Schlegl, et al in f-AnoGAN proposed replacing DCGAN with WGAN-GP and introducing an encoder, trained sequentially, for image-to-latent-space mapping.
● Berg, et al improved f-AnoGAN by training the Generator and Encoder jointly, as well as employing a progressive growing GAN.
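Once a generator and an encoder are trained, encoder-based approaches of this kind typically score a sample by how badly it is reconstructed. The sketch below assumes a score of the form "reconstruction residual plus kappa times discriminator-feature residual", which is our reading of the f-AnoGAN style of scoring rather than the exact method used in the talk. The `toy_*` functions are hypothetical stand-ins for trained networks.

```python
def mean_squared_residual(a, b):
    """Mean squared difference between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def anomaly_score(x, encoder, generator, disc_features, kappa=1.0):
    """Score x by how poorly the encoder/generator pair reconstructs it."""
    x_hat = generator(encoder(x))  # map to latent space and back
    input_residual = mean_squared_residual(x, x_hat)
    feature_residual = mean_squared_residual(disc_features(x), disc_features(x_hat))
    return input_residual + kappa * feature_residual


# Toy stand-ins (hypothetical): the "normal" manifold is the set of
# constant 4-dimensional vectors, so only those reconstruct perfectly.
def toy_encoder(x):
    return [sum(x) / len(x)]


def toy_generator(z):
    return [z[0]] * 4


def toy_disc_features(x):
    return [sum(x), max(x) - min(x)]


normal_score = anomaly_score([1.0, 1.0, 1.0, 1.0], toy_encoder, toy_generator, toy_disc_features)
suspect_score = anomaly_score([1.0, 1.0, 1.0, 5.0], toy_encoder, toy_generator, toy_disc_features)
```

A transaction would then be flagged when its score exceeds a threshold chosen on validation data; the threshold trades false positives against false negatives.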

WGAN-Gradient-Penalty Based Anomaly Detection
[Image from Berg et Al - https://arxiv.org/pdf/1905.11034.pdf ]


Will GANs help improve AML predictions?
Expected results from using GANs (Anomaly Detection at Spark/AI EU Summit 2019)
“[In China] two commercial banks have reduced losses of about 10 million RMB in twelve weeks and significantly improved their business reputation”
GAN-based telecom fraud detection at the receiving bank

How can we manage the Features between Training/Serving?
[Diagram: Feature Store]
● Data sources: Real-time Data, Event Data, SQL, SQL DW, S3, HDFS
● Writers (DataFrame API): Streaming App pushes CDC data; Pandas App updates every hour; Batch PySpark App pushes updates every day
● Online Feature Store: low-latency features (<10ms), e.g. recent transaction counts (Streaming App)
● Offline Feature Store: high-latency features (TBs/PBs)
● Consumers: Online AML App, which also receives real-time features (cust IDs, amount, type, datetime); Train and Batch Apps

Create AML Training Datasets
[Diagram: HOPSWORKS]
● Offline FS: Apache Hive, HopsFS
● Online FS: MySQL Cluster
● (External) Spark Cluster: Read and Join Features
fs.get_features(["name", "Pclass", "Sex", "Balance", "Survived"])
● Storage (S3, HopsFS, HDFS, ADLS): .npy, .tfrecords, .csv
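Conceptually, "read and join features" means looking up features from several feature groups by a shared key and merging them into training rows. The sketch below illustrates only that idea in plain Python. It is not the Hopsworks API, and the feature names and the `join_feature_groups` helper are hypothetical.

```python
def join_feature_groups(left, right, key):
    """Inner-join two lists of feature rows (dicts) on a shared key column."""
    right_by_key = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            joined.append({**row, **match})
    return joined


# Hypothetical feature groups keyed by customer ID.
activity_features = [
    {"cust_id": "C1", "tx_count_7d": 12},
    {"cust_id": "C2", "tx_count_7d": 3},
]
balance_features = [
    {"cust_id": "C1", "avg_balance": 950.0},
    {"cust_id": "C3", "avg_balance": 10.0},
]

training_rows = join_feature_groups(activity_features, balance_features, key="cust_id")
```

In a feature store the same join runs on Spark over the offline store, and the result is materialised as a training dataset in a file format such as .tfrecords or .csv.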

Model Development Lifecycle in Hopsworks

Hopsworks Conventions
/training_datasets
/models
/logs
/notebooks
/featurestore
Conventions and Implicit Provenance in Hopsworks*
*https://www.usenix.org/conference/opml20/presentation/ormenisan
dataset = tf.data.Dataset.list_files("training_datasets/resnet/*.tfrecord")
tf.saved_model.save(model, 'models/ResNet')
maggy.lagom(....)

ML Model Development Lifecycle
[Diagram stages: Exploration, Experimentation, Model Training, Explainability, Validation, Serving, Feature Pipelines]

Hyperparameter Tuning for GANs

ML Model Dev Lifecycle is Iterative
[Diagram stages: Explore and Design; Experimentation: Tune and Search; Model Training (Distributed); Explainability and Ablation Studies]

Rewrite your code at each stage => Iteration is impossible!
[Diagram stages: Explore and Design; Experimentation: Tune and Search; Model Training (Distributed); Explainability and Ablation Studies]

Ablation Studies | EDA | HParam Tuning | Training (Dist)
It’s the Frameworks’ fault – they make us rewrite it!

Oblivious Training Function – Write Once, Reuse Many Times
Ablation Studies | EDA | HParam Tuning | Training (Dist)
OBLIVIOUS TRAINING FUNCTION
# RUNS ON THE WORKERS
def train():
    def input_fn():
        ...  # return dataset
    model = ...
    optimizer = ...
    model.compile(...)
    ...

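import numpy as np
import tensorflow as tf
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.optimizers import SGD

# Assumption: load_data is the Keras MNIST loader
# (28x28 images, 60000 training samples, 10 classes).
load_data = tf.keras.datasets.mnist.load_data

def dataset(batch_size):
    # The Keras loader returns (train, test) tuples; only the training split is used.
    (x_train, y_train), _ = load_data()
    x_train = x_train / np.float32(255)  # scale pixel values to [0, 1]
    y_train = y_train.astype(np.int64)
    train_dataset = (
        tf.data.Dataset.from_tensor_slices((x_train, y_train))
        .shuffle(60000)
        .repeat()
        .batch(batch_size)
    )
    return train_dataset

def build_and_compile_cnn_model(lr):
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28)),
        # Conv2D expects a channel axis, so reshape 28x28 to 28x28x1.
        tf.keras.layers.Reshape(target_shape=(28, 28, 1)),
        tf.keras.layers.Conv2D(32, 3, activation='relu'),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10)  # logits for 10 classes
    ])
    model.compile(
        loss=SparseCategoricalCrossentropy(from_logits=True),
        optimizer=SGD(learning_rate=lr))
    return model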
NO CHANGES!
What is Transparent Code in Practice?

Maggy for HParam Optimization
from maggy import experiment, Searchspace

def aml(kernel, pool, dropout, reporter):
    # This is your training iteration loop
    for i in range(number_iterations):
        ...
        # add the maggy reporter to report the metric to be optimized
        reporter.broadcast(metric=accuracy)
        ...
    # Return the same final metric
    return accuracy

# Every tuned argument of aml() needs a search space entry.
# The dropout range below is an assumed example.
sp = Searchspace(kernel=('INTEGER', [2, 8]), pool=('INTEGER', [2, 8]),
                 dropout=('DOUBLE', [0.01, 0.99]))
result = experiment.lagom(train=aml, searchspace=sp, optimizer='randomsearch',
                          direction='max', num_trials=15, name='MNIST')
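To make the 'randomsearch' optimizer concrete, here is a minimal single-process sketch of random search over an integer search space. It is not Maggy's implementation (Maggy distributes trials over Spark executors). The `random_search` and `toy_train` names, the dict-based search space, and the toy objective are all assumptions made for illustration.

```python
import random


def random_search(train_fn, searchspace, num_trials, direction="max", seed=0):
    """Sample integer hyperparameters uniformly and keep the best trial."""
    rng = random.Random(seed)
    best_params, best_metric = None, None
    for _ in range(num_trials):
        # Draw one value per hyperparameter from its inclusive [low, high] range.
        params = {name: rng.randint(low, high) for name, (low, high) in searchspace.items()}
        metric = train_fn(**params)
        if best_metric is None:
            improved = True
        elif direction == "max":
            improved = metric > best_metric
        else:
            improved = metric < best_metric
        if improved:
            best_params, best_metric = params, metric
    return best_params, best_metric


def toy_train(kernel, pool):
    """Stand-in for a real training function; returns a fake accuracy."""
    return 1.0 - 0.01 * ((kernel - 5) ** 2 + (pool - 3) ** 2)


space = {"kernel": (2, 8), "pool": (2, 8)}
best_params, best_metric = random_search(toy_train, space, num_trials=15)
```

The training function never needs to know which optimizer is driving it, which is the same property the oblivious training function relies on.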

Maggy is built on top of PySpark

Get Started: Paysim AML Dataset (Kaggle)
● Graph-based Candidate Features, Concatenated Features
○ Link the origin account, destination account, and transaction type to track smurfing and higher cash withdrawals
● Frequency Candidate Features
○ Learn how frequently the account is used
● Amount Features
○ Magnitude of the amount of transactions.
● Time-Since Features
○ Learn the speed of transactions
● Velocity-Change Features
○ Identify a sudden change in the behaviour of accounts
https://www.kaggle.com/ntnu-testimon/paysim1?select=PS_20174392719_1491204439457_log.csv
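The feature families above can be sketched on a few toy rows that use the Paysim column names `step`, `type`, `amount`, `nameOrig` and `nameDest`. The derived feature names and the `add_features` helper are hypothetical, and at the scale described in the talk the same logic would be written in Spark rather than plain Python.

```python
from collections import defaultdict

# Toy rows using Paysim column names; values are made up.
transactions = [
    {"step": 1, "type": "TRANSFER", "amount": 100.0, "nameOrig": "C1", "nameDest": "C9"},
    {"step": 2, "type": "TRANSFER", "amount": 120.0, "nameOrig": "C1", "nameDest": "C9"},
    {"step": 3, "type": "CASH_OUT", "amount": 5000.0, "nameOrig": "C1", "nameDest": "C7"},
    {"step": 3, "type": "PAYMENT", "amount": 40.0, "nameOrig": "C2", "nameDest": "M1"},
]


def add_features(rows):
    """Add per-account features computed only from earlier transactions."""
    history = defaultdict(list)  # nameOrig -> [(step, amount), ...]
    out = []
    for row in sorted(rows, key=lambda r: r["step"]):
        past = history[row["nameOrig"]]
        feats = dict(row)
        # Concatenated feature: origin, destination and transaction type.
        feats["link"] = "|".join([row["nameOrig"], row["nameDest"], row["type"]])
        # Frequency feature: how often the account has been used so far.
        feats["tx_count_so_far"] = len(past)
        # Time-since feature: steps elapsed since the account's last transaction.
        feats["steps_since_last"] = row["step"] - past[-1][0] if past else None
        # Velocity-change feature: amount relative to the account's past mean.
        mean_past = sum(a for _, a in past) / len(past) if past else None
        feats["amount_vs_mean"] = row["amount"] / mean_past if mean_past else None
        past.append((row["step"], row["amount"]))
        out.append(feats)
    return out


features = add_features(transactions)
```

Computing each feature only from earlier transactions avoids leaking future information into the training data.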

Project-Based Multi-Tenant Security
[Diagram: Hopsworks Cluster]
● User Login (LDAP, AD, OAuth2, 2FA)
● API Key, IAM Profile
● Users, Jobs
● Dev Feature Store, Staging Feature Store, Prod Feature Store
● External platforms and data sources: Databricks, SageMaker, Kubeflow, Amazon EMR, Delta Lake, Snowflake, Amazon S3, Amazon Redshift

Hopsworks – open-source or managed platform
● Hopsworks Community
○ Full Featured
○ AGPL-v3 License Model
● Hopsworks Enterprise
○ Kubernetes Support: Model Serving; other services for robustness (Jupyter, more coming)
○ Authentication (LDAP, Kerberos, OAuth2)
○ GitHub support
● Hopsworks.ai
○ Managed SaaS platform (currently only on AWS)

Thank you.
github.com/logicalclocks/hopsworks
@logicalclocks
www.logicalclocks.com
