3. Cover w/ Image
Agenda
● Introductions
● Data Science Process
● Model Operationalization
● Introduction to MADlib Flow
● Case Study: AI for transaction fraud
● Live demo!
● Q&A
4. Data Science Process
Process stages (diagram):
● Problem Definition
● Setup
● Data Review
● Feature Engineering
● Model Building
● Model Evaluation
● Operationalization
● User Feedback
5. Model Operationalization
Model Operationalization is the process of deploying data science models to production for ongoing use by other software
6. Where Most AI & ML Projects Fail
Model Operationalization is where the majority of artificial intelligence initiatives fail
7. Common Challenges With Operationalizing Models
Common challenges with model operationalization:
● Handling production data
● Engineering for scale and performance
● Model transportation
● Managing and orchestrating deployed models
● Data Scientists are not developers or platform experts
8. Patterns For Operationalizing Models
BATCH TRAINING, BATCH INFERENCE
● ~40% of today’s use cases
● Example: Tax Return Fraud: Score a database of tax returns on a nightly basis to flag likely fraudulent returns for audit
● PostgreSQL/Greenplum with MADlib supports this pattern
BATCH TRAINING, EVENT DRIVEN INFERENCE
● ~55% of today’s use cases (growing)
● Example: Real Time Transaction Fraud: Train an ML model on historical data to classify, in real time, whether new credit/debit transactions are likely to be fraudulent
● PostgreSQL/Greenplum with MADlib & MADlib Flow supports this pattern
EVENT DRIVEN TRAINING, EVENT DRIVEN INFERENCE
● <5% of today’s use cases
● Example: Online Advertising: Maximize click-through rate by algorithmically selecting and testing advertisement placement in real time
● Highly specialized; low number of enterprise use cases
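The difference between batch inference and event driven inference can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the historical rows, the one-feature threshold "model", and the function names (`train_batch`, `batch_inference`, `on_event`) are all hypothetical and are not part of MADlib or MADlib Flow.

```python
import math

# Hypothetical historical data: (transaction amount, was_fraud) pairs.
HISTORY = [(12.0, 0), (25.0, 0), (40.0, 0), (900.0, 1), (1500.0, 1), (2200.0, 1)]


def train_batch(rows):
    """Batch training: fit a tiny one-feature threshold model on historical rows."""
    fraud = [math.log(amount + 1) for amount, label in rows if label == 1]
    legit = [math.log(amount + 1) for amount, label in rows if label == 0]
    # Threshold is the midpoint between the two class means in log space.
    return (sum(fraud) / len(fraud) + sum(legit) / len(legit)) / 2


def score(threshold, amount):
    """Return True when the transaction looks fraudulent under the toy model."""
    return math.log(amount + 1) > threshold


def batch_inference(threshold, amounts):
    """Batch inference: score a whole table in one scheduled pass (e.g. nightly)."""
    return [score(threshold, amount) for amount in amounts]


def on_event(threshold, event):
    """Event driven inference: score a single message as soon as it arrives."""
    return {**event, "flagged": score(threshold, event["transaction_amt"])}


threshold = train_batch(HISTORY)
nightly_flags = batch_inference(threshold, [10.0, 3000.0])
live_result = on_event(threshold, {"transaction_amt": 5000.0})
```

In both cases the model is trained once on historical data. What changes is whether scoring runs over a stored table on a schedule or per message at arrival time, which is why the second pattern needs a low latency serving layer.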
9. Existing Approaches To Model Operationalization Have Failed
Data science, Engineering, Production
Model persistence
Approaches
1. Rewrite the code
2. Universal markup model language (PMML and PFA)
Most models never make it to production
11. Containerized Deployment Of Models
$ madlibflow --deploy --target kubernetes --type model
Key benefits of MADlib Flow
● Easy to deploy & lightweight
● Highly scalable REST and Streaming
● End-to-end SQL workflow
● Low latency inference/predictions
● Feature Transformations
Single command to deploy a MADlib trained model from GPDB/Postgres to Docker, PCF or Kubernetes
Containerized deployment of Apache MADlib Machine Learning workflows for low latency event driven inference and scale
12. AI For The PostgreSQL Community
Standardized end-to-end Data Science in SQL with the Greenplum/Postgres stack
Experimentation
Initial code development and testing, model experimentation on samples.
Modeling at Scale
Heavy compute tasks such as model training across big data
Deployment
Production deployment of models to feed downstream applications and reports
Artificial Intelligence: Closed Loop Machine Learning
13. Model Deployment With MADlib Flow
User Operations
1. ML Training: Train ML model in Postgres or Greenplum using Apache MADlib
2. madlibflow --deploy: Set configs in .yml and deploy model from Greenplum to Docker, PCF or Kubernetes
Automated Backend Operations
3. Docker pull: Pull docker containers with optimized Postgres and MADlib
4. Pull Model: Extract model and feature table schema layout from Greenplum database
5. Load Model: Load model and feature table schema into optimized Postgres
6. Deploy: Deploy docker container to target environment
15. Demo: Deploy A Model
High-level steps
● Connect to Greenplum
● Load data
● Build and train a model
● Deploy and test model on Greenplum
● Deploy model using MADlib Flow to GKE
● Test
16. Case Study: Credit Card Transaction Fraud model
Transactions Topic
Greenplum
Scored Transactions Topic
PCF [PKS]
17. Event Scoring with Greenplum and Containerized Postgres
Components (diagram):
● MADlib Flow (Orchestrator)
● REST API
● Feature Engine
● Cache Manager
● Pivotal Cloud Cache
● MADlib REST (Scoring)
Flow labels (diagram):
● Bootstrap MADlib Model
● Read Features from DB (scheduled)
● Read Features
● Cache Features
● New Transaction Event (JSON)
● Load New Transaction Event
● Update DB w/ New Event
● Join Event & Feature Data
● Scoring Decision (JSON)
● Update DB w/ Scoring Decision
● MADlib Flow Returns Scoring Decision to GPDB
18. MADlib Flow
Greenplum Database
Feature Engine
Cache Loader and Feature Engine Services
Credit/Debit Card Transaction (Input) Message
{
  "transaction_ts": ,
  "credit_card_number": ,
  "transaction_amt": ,
  "merchant_id":
}
Approved Credit/Debit Card Transaction (Output) Message
{
  "transaction_ts": ,
  "transaction_amt": ,
  "credit_card_number": ,
  "num_transactions_30days": ,
  "max_transactions_30days": ,
  "merchant_id": ,
  "num_fraud_cases": ,
  "avg_transaction_amount_30days": ,
  "fraud_risk_score": 0.92,
  "approved": True
}
Accounts: credit_card_number, num_transactions_30days, max_transactions_30days
Merchants: merchant_id, num_fraud_cases, avg_transaction_amount_30days
Cache (GemFire, PCC, Redis, etc.)
Cache Abstraction
SELECT mch.*
      ,acct.*
      ,log(msg.transaction_amt + 1) AS log_transaction_amt
FROM message msg
JOIN merchants mch ON msg.merchant_id = mch.merchant_id
JOIN accounts acct ON msg.credit_card_number = acct.credit_card_number;
MADlib REST
Cache Loader
● Automated deployment of scalable low latency end-to-end ML pipelines (“Data Science Ops.”)
● No code conversion - engineer features and populate cache in SQL
● Join data from the incoming message with cached data
Accounts: SELECT create_accounts();
Merchants: SELECT create_merchants();
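To make the feature join on this slide concrete, here is a minimal Python sketch of the same idea: look up the cached account and merchant rows for an incoming message and add the log-transformed amount. The dictionaries, their values, and the function name `join_features` are made up for illustration; in MADlib Flow the join is expressed in SQL as shown on the slide. The sketch uses `math.log10` because PostgreSQL's one-argument `log()` is base 10.

```python
import math

# Hypothetical stand-ins for the cached Accounts and Merchants tables,
# keyed by the join columns from the slide. All values are invented.
ACCOUNTS = {
    "4111-0000": {"num_transactions_30days": 42, "max_transactions_30days": 310.0},
}
MERCHANTS = {
    "m-17": {"num_fraud_cases": 3, "avg_transaction_amount_30days": 58.5},
}


def join_features(msg, accounts=ACCOUNTS, merchants=MERCHANTS):
    """Join one incoming message with cached account and merchant features.

    Raises KeyError when a key is not cached; the SQL inner join would
    instead return no row for that message.
    """
    acct = accounts[msg["credit_card_number"]]
    mch = merchants[msg["merchant_id"]]
    features = {**msg, **mch, **acct}
    # PostgreSQL's one-argument log() is base 10, so log10 mirrors the SQL.
    features["log_transaction_amt"] = math.log10(msg["transaction_amt"] + 1)
    return features


message = {
    "transaction_ts": "2019-01-01T00:00:00",
    "credit_card_number": "4111-0000",
    "transaction_amt": 99.0,
    "merchant_id": "m-17",
}
features = join_features(message)
```

The resulting dictionary carries the fields of the output message on the slide except `fraud_risk_score` and `approved`, which come from the scoring step.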
20. Event Scoring with Greenplum and Containerized Postgres
1. Credit Fraud Model Building On Greenplum
2. Credit Fraud Model Deployment with MADlib Flow
a. Flow contains:
i. model
ii. feature engine
iii. features cache (refreshable via REST)
3. Kafka Stream Processing for Event Scoring.
a. One Kafka producer
b. One Kafka streams consumer
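The producer and consumer roles in step 3 can be sketched without a broker. The example below is an assumption-heavy stand-in: it uses Python's in-memory `queue.Queue` in place of the Kafka topics, and `score_event` is a placeholder rule rather than a call to the deployed MADlib model. None of the names come from the Kafka or MADlib Flow APIs.

```python
from queue import Queue

# In-memory stand-ins for the "Transactions" and "Scored Transactions" topics.
# A real deployment would use Kafka topics instead of these queues.
transactions_topic = Queue()
scored_topic = Queue()


def produce(events):
    """Producer role: publish new transaction events to the input topic."""
    for event in events:
        transactions_topic.put(event)


def score_event(event):
    """Placeholder for the call to the deployed model (hypothetical rule)."""
    risk = 1.0 if event["transaction_amt"] > 1000 else 0.0
    return {**event, "fraud_risk_score": risk}


def consume_and_score():
    """Consumer role: read each event, score it, publish to the output topic."""
    handled = 0
    while not transactions_topic.empty():
        scored_topic.put(score_event(transactions_topic.get()))
        handled += 1
    return handled


produce([{"transaction_amt": 20.0}, {"transaction_amt": 5000.0}])
handled = consume_and_score()
scored = [scored_topic.get() for _ in range(handled)]
```

Events are scored in arrival order, one at a time, which is the event driven inference pattern from the earlier slide.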
21. Event Scoring with Greenplum and Containerized Postgres
Show me the Demo!
22. Roadmap
● Container and service endpoint security
● Support for Python model deployments via PL/Python
● Support for R model deployments via PL/R
● Support for deep learning modules (TensorFlow, PyTorch)
● A comprehensive model management UI
○ Model versioning and updates
○ Champion / challenger testing
● Target GA is late Spring