GANs for Anti Money Laundering
1. Jim Dowling
CEO / Co-Founder
Logical Clocks
Associate Prof at KTH – Royal Institute of Technology
Anti Money Laundering and GANs
Berlin Meetup
@jim_dowling
2. ● Problem: Increase detection rate and reduce costs for AML.
● Solution: We used the Hopsworks platform to train GANs to classify
transactions as suspected of money laundering or not. We have worked with
a large transaction dataset (~40 TB) and the solution uses Spark for Feature
Engineering and TensorFlow/GPUs to train a binary classifier, classifying
transactions as either clean or dirty. We use the open-source Hopsworks
platform to manage features, scale-out training, and manage models.
● Reference: Whitepaper
Agenda
3. ● Money laundering involves turning “dirty” money into “clean” money either
through an obscure sequence of banking transfers or through commercial
transactions.
● The three broad stages of money laundering* are:
○ Placement (smurf it)
○ Layering (spread it out fast)
○ Integration (buy stuff)
What is Money Laundering?
*https://towardsdatascience.com/the-art-of-engineering-features-for-a-strong-machine-learning-model-a47a876e654c
5. AML as a Supervised ML Problem
● Anti-money laundering (AML) is a pattern
matching problem
● AML systems should automatically flag ‘suspect’
financial transactions
○ Followed by manual investigation
● Historical transaction datasets have massive
data imbalance between the number of ‘clean’
transactions versus ‘dirty’ transactions
Clean
Transactions
Dirty
Transactions
Millions or Billions
100s or 1000s
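A worked example of why this imbalance matters, assuming hypothetical counts picked from the ranges on the slide: a model that labels every transaction as clean reaches near-perfect accuracy while catching no money laundering at all.

```python
# Hypothetical counts chosen from the ranges on the slide (illustrative only).
n_clean = 10_000_000   # "millions or billions" of clean transactions
n_dirty = 1_000        # "100s or 1000s" of dirty transactions

# A trivial model that predicts "clean" for every transaction.
true_negatives = n_clean
false_negatives = n_dirty
true_positives = 0

accuracy = (true_positives + true_negatives) / (n_clean + n_dirty)
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy = {accuracy:.4%}")
print(f"recall   = {recall:.0%}")
```

Accuracy alone is therefore a poor metric for AML models.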
6. Implications of AML as a Binary Classification Problem
True Positive
Reality: A Money Laundering Transaction
Prediction: “Dirty” transaction predicted
Result: Good
False Positive
Reality: Not a Money Laundering Transaction
Prediction: “Dirty” transaction predicted
Result: Unnecessary work and cost!
False Negative
Reality: A Money Laundering Transaction
Prediction: “Clean” transaction predicted
Result: Fines/jail by authorities/regulator!
True Negative
Reality: Not a Money Laundering Transaction
Prediction: “Clean” transaction predicted
Result: Good
Confusion matrix of our Binary AML Classifier with all possible predictions and their consequences.
We use a variant of the F1 score to evaluate models (precision, recall, fallout should not be weighted equally).
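The slide does not say which F1 variant is used. As one possible illustration, the sketch below assumes the F-beta score, where beta > 1 weights recall above precision (a missed laundering case is costlier than a false alarm). The confusion-matrix counts for the two models are hypothetical.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """F-beta score; beta > 1 favours recall, beta < 1 favours precision."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical confusion-matrix counts for two candidate models.
model_a = dict(tp=90, fp=210, fn=10)  # high recall, many false alarms
model_b = dict(tp=50, fp=50, fn=50)   # balanced precision and recall

for name, counts in (("A", model_a), ("B", model_b)):
    f1 = f_beta(**counts, beta=1.0)
    f2 = f_beta(**counts, beta=2.0)
    print(f"model {name}: F1={f1:.3f}  F2={f2:.3f}")
```

With these counts, F1 prefers model B while F2 prefers model A, which shows how the choice of weighting changes model selection.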
7. AML as an Anomaly Detection Problem
“Anomaly detection follows
quite naturally from a good
unsupervised model”
Alex Graves (DeepMind)
Traditional unsupervised
approaches do not scale:
k-means clustering and
principal component
analysis
[Image from Ruff et al, “Deep Semi-Supervised Anomaly Detection”, https://arxiv.org/pdf/1906.02694.pdf]
8. AML - Semi-Supervised Anomaly Detection
AML is not a classical use-case for anomaly
detection as we typically have labelled
datasets, albeit imbalanced.
“Semi-supervised learning is a class of machine
learning tasks and techniques that also make
use of unlabeled data for training – typically a
small amount of labeled data with a large
amount of unlabeled data.” Wikipedia
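A minimal sketch of the semi-supervised idea, under strong simplifying assumptions: "normal" behaviour is modelled from unlabelled transaction amounts alone (a z-score stands in for a learned model), and the few labelled examples are used only to place the decision threshold. All data values are hypothetical.

```python
import statistics

# Hypothetical data: unlabelled amounts (assumed mostly clean) plus a few labels.
unlabelled = [12.0, 15.5, 9.8, 14.2, 11.1, 13.7, 10.4, 16.0, 12.9, 11.8]
labelled = [(13.0, 0), (10.9, 0), (15.1, 0), (48.0, 1), (95.0, 1)]  # (amount, is_dirty)

# Unsupervised step: model "normal" behaviour from the unlabelled data only.
mean = statistics.mean(unlabelled)
std = statistics.stdev(unlabelled)

def anomaly_score(amount: float) -> float:
    """Distance from normal behaviour, in standard deviations."""
    return abs(amount - mean) / std

# Supervised step: use the few labels only to place the decision threshold,
# halfway between the most anomalous clean and least anomalous dirty example.
clean_scores = [anomaly_score(a) for a, y in labelled if y == 0]
dirty_scores = [anomaly_score(a) for a, y in labelled if y == 1]
threshold = (max(clean_scores) + min(dirty_scores)) / 2

def is_suspect(amount: float) -> bool:
    return anomaly_score(amount) > threshold

print([is_suspect(a) for a, _ in labelled])
```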
9. GANs and Other Methods for Anomaly Detection
● Variational Auto-Encoders for Anomaly Detection
○ Easier to train, performance not state-of-the-art
● Generative Adversarial Networks (GANs)
○ Learn the manifold of normal samples (what to do if anomaly-free
dataset is polluted)
○ One-Class Classifier for Novelty Detection GAN
○ BiGAN, BigGAN, BigBiGAN, GANOMALY, f-AnoGAN, GANs for Fraud
“[For GANs] the Convolutional Neural Network architecture is more important than how you
train them”, Marc Aurelio Ranzato (Facebook) at NeurIPS 2018.
12. GANs are hard to train
● Pick the right GAN Architecture
● Risk of mode-collapse
● Hard to tune Hyperparameters
Different Hyperparameter Tuning Strategies
13. GANs are hard to train
● Mode collapse
○ Transactional data distributions are multimodal. There will be multiple types of
transactional behaviour that will be perfectly normal.
○ The original GAN is based on a zero-sum non-cooperative game. In this setting,
when the mini-max game reaches the Nash equilibrium too soon, the generator
learns to produce only a limited number of modes, and mode collapse occurs.
● GANs are highly sensitive to the hyperparameters.
○ Finding good hyperparameters takes time, especially for GANs. A list of possible
hyperparameters and tricks is available at https://github.com/soumith/ganhacks
○ It is essential to have a good optimization and hyperparameter tuning engine
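To make the tuning problem concrete, here is a minimal random-search sketch in plain Python. The search space (learning rate, batch size, critic steps) and the toy objective are hypothetical stand-ins; in practice the objective would train a GAN and return a validation metric.

```python
import math
import random

def random_search(objective, space, num_trials, seed=0):
    """Sample num_trials random configurations and keep the best (maximising)."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(num_trials):
        params = {name: sample(rng) for name, sample in space.items()}
        score = objective(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical search space: each entry draws one value from the given RNG.
space = {
    "learning_rate": lambda rng: 10 ** rng.uniform(-5, -2),  # log-uniform
    "batch_size": lambda rng: rng.choice([32, 64, 128, 256]),
    "n_critic": lambda rng: rng.randint(1, 5),
}

def toy_objective(learning_rate, batch_size, n_critic):
    # Stand-in for "train a GAN and return a validation metric" (higher is better).
    return (-abs(math.log10(learning_rate) + 3.5)
            - 0.1 * abs(n_critic - 5)
            - 0.001 * abs(batch_size - 64))

best_params, best_score = random_search(toy_objective, space, num_trials=15)
print(best_params, round(best_score, 3))
```

Each trial is independent, so the trials can be run in parallel across workers.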
14. How to address mode collapse problem
● MO-GAAL [Liu, et al] proposed using multiple generators, where different
generators will be in charge of learning different modes of distribution.
● Schlegl, et al in f-AnoGAN proposed replacing DCGAN with WGAN-GP and
introducing an encoder that was trained sequentially for image to latent
space mapping.
● Berg, et al improved f-AnoGAN by training Generator and Encoder jointly, as
well as employing progressive growing GAN.
16. [Image from Berg et al. - https://arxiv.org/pdf/1905.11034.pdf ]
WGAN-Gradient-Penalty Based Anomaly Detection
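A hedged sketch of how an encoder-based GAN can score anomalies, loosely following the f-AnoGAN idea: encode a sample, regenerate it, and combine the reconstruction error with the error in the critic's feature space. The `encoder`, `generator`, and `critic_features` functions below are toy stand-ins (normal data is assumed to lie on the line x2 = 2·x1), not trained networks, and `kappa` is an assumed weighting factor.

```python
def mean_squared_error(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

def anomaly_score(x, encoder, generator, critic_features, kappa=1.0):
    """Reconstruction error in input space plus kappa times the error in feature space."""
    x_hat = generator(encoder(x))
    residual = mean_squared_error(x, x_hat)
    feature_residual = mean_squared_error(critic_features(x), critic_features(x_hat))
    return residual + kappa * feature_residual

# Toy stand-ins for trained networks (assumption: normal data lies on x2 = 2 * x1).
def encoder(x):            # project onto the "normal" line
    return (x[0] + 2 * x[1]) / 5

def generator(z):          # map the latent value back to input space
    return [z, 2 * z]

def critic_features(x):    # stand-in for an intermediate critic layer
    return [x[0] + x[1], x[0] - x[1]]

print(anomaly_score([1.0, 2.0], encoder, generator, critic_features))   # on the line
print(anomaly_score([2.0, -1.0], encoder, generator, critic_features))  # off the line
```

Samples the generator can reproduce score near zero; samples far from the learned manifold score high.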
17. Will GANs help improve AML predictions?
Expected results from using GANs (Anomaly Detection at Spark/AI EU Summit 2019)
“[In China] two commercial
banks have reduced losses
of about 10 million RMB in
twelve weeks and
significantly improved their
business reputation”
GAN-based telecom fraud
detection at the receiving
bank
18. Online
Feature Store
Offline
Feature Store
Train,
Batch App
Feature Store
<10ms
TBs/PBs
How can we manage the Features between Training/Serving?
Recent transaction counts
(Streaming App)
Streaming App pushes CDC data
Pandas App updates every hour
Batch PySpark App pushes
updates every day
Low
Latency
Features
High
Latency
Features
Real-time features
(cust IDs, amount, type, datetime)
Real-time
Data
Event Data
SQL
S3, HDFS
Online AML
App
SQL DW
DataFrameAPI
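As an illustration of a low-latency feature such as "recent transaction counts", the sketch below keeps a sliding window of timestamps per account. It is a plain-Python stand-in for what a streaming app might compute before writing to an online feature store; the class name and the one-hour window are assumptions, and timestamps are assumed to arrive in order per account.

```python
from collections import defaultdict, deque

class RecentTransactionCounter:
    """Counts transactions per account within a sliding time window (in seconds)."""

    def __init__(self, window_seconds: int):
        self.window = window_seconds
        self.events = defaultdict(deque)  # account_id -> timestamps inside the window

    def update(self, account_id: str, timestamp: int) -> int:
        """Record a transaction and return the account's count within the window."""
        timestamps = self.events[account_id]
        timestamps.append(timestamp)
        # Evict timestamps that have fallen out of the window.
        while timestamps and timestamps[0] <= timestamp - self.window:
            timestamps.popleft()
        return len(timestamps)

counter = RecentTransactionCounter(window_seconds=3600)
for ts in (0, 1800, 3600, 9000):
    print(ts, counter.update("C1", ts))
```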
19. HOPSWORKS
Offline FS
Apache Hive
HopsFS
Read and Join Features
Online FS
MySQL Cluster
(External)
Spark Cluster
fs.get_features(["name", "Pclass",
"Sex", "Balance", "Survived"])
Storage
(S3, HopsFS, HDFS, ADLS)
.npy, .tfrecords, .csv
Create AML Training Datasets
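Conceptually, creating a training dataset means joining selected features from several feature groups on a shared key and materialising the result to a file format such as .csv. The sketch below shows that idea with plain Python; it is not the Hopsworks API, and the feature groups, column names, and `join_features` helper are hypothetical.

```python
import csv
import io

def join_features(feature_groups, key, selected):
    """Inner-join feature groups on `key` and keep only the `selected` columns."""
    joined = {}
    for group in feature_groups:
        for row in group:
            joined.setdefault(row[key], {}).update(row)
    return [
        {name: row[name] for name in selected}
        for row in joined.values()
        if all(name in row for name in selected)  # drop entities missing a feature
    ]

# Hypothetical feature groups keyed by account_id.
account_features = [{"account_id": "C1", "balance": 250.0},
                    {"account_id": "C2", "balance": 90.0},
                    {"account_id": "C3", "balance": 10.0}]
activity_features = [{"account_id": "C1", "tx_count_1h": 14},
                     {"account_id": "C2", "tx_count_1h": 2}]
labels = [{"account_id": "C1", "is_dirty": 1},
          {"account_id": "C2", "is_dirty": 0}]

selected = ["balance", "tx_count_1h", "is_dirty"]
rows = join_features([account_features, activity_features, labels], "account_id", selected)

# Materialise the training dataset as CSV (in memory here, for illustration).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=selected)
writer.writeheader()
writer.writerows(rows)
training_csv = buffer.getvalue()
print(training_csv)
```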
26. Explore
and Design
Experimentation:
Tune and Search
Model Training
(Distributed)
Explainability and
Ablation Studies
Rewrite your code at each stage => Iteration is impossible!
28. OBLIVIOUS
TRAINING
FUNCTION
# RUNS ON THE WORKERS
def train():
    def input_fn(): # return dataset
    model = …
    optimizer = …
    model.compile(…)
    …
Ablation Studies
EDA
HParam Tuning
Training (Dist)
Oblivious Training Function – Write Once, Reuse Many Times
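A minimal sketch of the write-once idea, assuming a toy linear model in place of a real network: the same `train` function is called unchanged by a single run, a hyperparameter search, and an ablation study. Only the arguments differ. The dataset, parameter names, and drivers are hypothetical.

```python
# Toy dataset: y = 2 * x0 + 1 * x1 (hypothetical stand-in for real features).
DATA = [((1.0, 0.0), 2.0), ((0.0, 1.0), 1.0), ((1.0, 1.0), 3.0), ((2.0, 1.0), 5.0)]

def train(learning_rate=0.05, epochs=200, features=(0, 1)):
    """Training function that is oblivious to who calls it; returns final MSE."""
    weights = {i: 0.0 for i in features}

    def predict(x):
        return sum(weights[i] * x[i] for i in features)

    for _ in range(epochs):
        for x, y in DATA:
            error = predict(x) - y
            for i in features:
                weights[i] -= learning_rate * error * x[i]  # SGD step
    return sum((predict(x) - y) ** 2 for x, y in DATA) / len(DATA)

# Driver 1: a single training run.
baseline = train()

# Driver 2: hyperparameter search over the learning rate.
search = {lr: train(learning_rate=lr) for lr in (0.001, 0.01, 0.1)}

# Driver 3: ablation study, leaving out one feature at a time.
ablation = {f"without feature {i}": train(features=tuple(j for j in (0, 1) if j != i))
            for i in (0, 1)}

print(baseline, search, ablation)
```

Because the training logic lives in one function, moving between stages changes only the driver code, not the model code.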
30.
def aml(kernel, pool, dropout, reporter):
    # This is your training iteration loop
    for i in range(number_iterations):
        ...
        # add the maggy reporter to report the metric to be optimized
        reporter.broadcast(metric=accuracy)
        ...
    # Return the same final metric
    return accuracy

from maggy import experiment, Searchspace
# Note: aml() also takes `dropout`, so the search space needs a `dropout` entry as well.
sp = Searchspace(kernel=('INTEGER', [2, 8]), pool=('INTEGER', [2, 8]))
result = experiment.lagom(train=aml, searchspace=sp, optimizer='randomsearch',
                          direction='max', num_trials=15, name='MNIST')
Maggy for HParam Optimization
32. Get Started: Paysim AML Dataset (Kaggle)
● Graph-based Candidate Features, Concatenated Features
○ Link the origin account, destination account, and transaction type to track
the problem of smurfing and the higher cash withdrawals
● Frequency Candidate Features
○ Learn how frequently the account is used
● Amount Features
○ Magnitude of the amount of transactions.
● Time-Since Features
○ Learn the speed of transactions
● Velocity-Change Features
○ Identify a sudden change in the behaviour of accounts
https://www.kaggle.com/ntnu-testimon/paysim1?select=PS_20174392719_1491204439457_log.csv
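A small sketch of how some of these candidate features could be derived, assuming PaySim-style columns (`step`, `type`, `amount`, `nameOrig`, `nameDest`). The sample transactions and the output feature names are hypothetical; a real pipeline would compute these with Spark over the full dataset.

```python
from collections import defaultdict

# Hypothetical transactions using PaySim-style column names.
transactions = [
    {"step": 1, "type": "CASH_OUT", "amount": 100.0, "nameOrig": "C1", "nameDest": "C9"},
    {"step": 2, "type": "TRANSFER", "amount": 120.0, "nameOrig": "C1", "nameDest": "C9"},
    {"step": 2, "type": "PAYMENT", "amount": 40.0, "nameOrig": "C2", "nameDest": "M1"},
    {"step": 3, "type": "TRANSFER", "amount": 5000.0, "nameOrig": "C1", "nameDest": "C7"},
]

def engineer_features(txs):
    """Derive per-transaction features from each origin account's history."""
    last_step = {}
    count = defaultdict(int)
    total = defaultdict(float)
    features = []
    for tx in sorted(txs, key=lambda t: t["step"]):
        orig = tx["nameOrig"]
        prev_mean = total[orig] / count[orig] if count[orig] else None
        features.append({
            # Concatenated feature: origin, destination, and transaction type.
            "link_key": f'{orig}|{tx["nameDest"]}|{tx["type"]}',
            # Frequency feature: how often the account has been used so far.
            "tx_count_so_far": count[orig],
            # Amount feature.
            "amount": tx["amount"],
            # Time-since feature: steps since the account's previous transaction.
            "steps_since_last": tx["step"] - last_step[orig] if orig in last_step else None,
            # Velocity-change feature: amount relative to the account's running mean.
            "amount_vs_mean": tx["amount"] / prev_mean if prev_mean else None,
        })
        last_step[orig] = tx["step"]
        count[orig] += 1
        total[orig] += tx["amount"]
    return features

features = engineer_features(transactions)
for row in features:
    print(row)
```

In this toy data, the last transaction stands out: its amount is roughly 45 times the account's previous average.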
34. Hopsworks Cluster
Project-Based Multi-Tenant Security
API
KEY
IAM Profile
Users
Jobs
Dev Feature Store
Staging Feature Store
Prod Feature Store
User
Login
(LDAP, AD,
OAuth2, 2FA)
databricks
SageMaker
Kubeflow
Amazon EMR
Delta Lake
Snowflake
Amazon S3
Amazon
Redshift
35. Full Featured
AGPL-v3 License Model
Hopsworks Community
Kubernetes Support
• Model Serving
• Other services for robustness (Jupyter, more coming)
Authentication (LDAP, Kerberos, OAuth2)
GitHub support
Hopsworks Enterprise
Managed SaaS platform (currently only on AWS)
Hopsworks.ai
Hopsworks – open-source or managed platform