Standard Bank is a leading South African bank with a vision to be the leading financial services organization in and for Africa. We will share our vision, greatest challenges, and most valuable lessons learned on our journey towards enterprise adoption of a big data strategy.
This includes our implementation of: a multi-tenant enterprise data lake, a real-time streaming capability, appropriate data management and governance principles, a data science workbench, and a process for model productionisation to support data science teams across the Group and across Africa and Europe.
Speakers
Zakeera Mahomen, Standard Bank, Big Data Practice Lead
Kristel Sampson, Standard Bank, Platform Lead
2. Big Data Journey: Lake (2016–2018)
• Set up POC environment for data science exploration; begin the Data Lake journey
• Get critical mass of enterprise data ingested into the Lake
• Create a multi-tenant Lake environment
• Start onboarding tenants (data science teams) onto the Lake; each team gets a 'sandbox' environment to start prototyping models
• Set up the Data Science Workbench on the Lake
• Finalise the model productionisation process
• Implement a real-time streaming capability
• Productionise the first data science model (real-time machine learning) on the Lake
• Establish the Data Science Guild
• Establish security policies for data access
• Integrate Lake metadata into the enterprise metadata repository
• Move to use-case-driven prioritisation for DS model productionisation
3. Successes
• Security
• Getting critical mass of enterprise data into the Lake
• Establishment of our Data Science Workbench
• Defining the model productionisation standards
4. Challenges
• Security
  • Integrating with other systems (Kerberos)
  • Open-source connectors (Kerberos)
  • Proprietary connectors (SAS/SAP/Actimize/IBM)
• Skills gap
• Data Lake vs Hadoop, and strategy
• Demand swinging from none to oversubscription
6. 5 Pillars of Enterprise Security on Hadoop

Pillar          | Intent                  | Tool
Administration  | How do I set policy?    | Ambari/Ranger
Authentication  | Who am I?               | Kerberos/LDAP
Authorization   | What can I do?          | Ranger/Centrify
Audit           | What did I do?          | Ranger/Centrify
Data Protection | How can I encrypt data? | Ranger KMS/SSL
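The Authorization pillar ("What can I do?") can be pictured as a policy lookup. The sketch below is a toy stdlib illustration only, not Ranger's actual policy model or API; the users, paths and actions are made up.

```python
# Toy sketch of the Authorization pillar: a Ranger-style policy lookup.
# Illustration only -- not Ranger's actual API; names are invented.

# Policies map (user, resource) to the set of allowed actions.
POLICIES = {
    ("ds_team_a", "/lake/sandbox/team_a"): {"read", "write"},
    ("ds_team_a", "/lake/enterprise"): {"read"},
}

def is_authorized(user: str, resource: str, action: str) -> bool:
    """Answer the pillar's question 'What can I do?' for one request."""
    return action in POLICIES.get((user, resource), set())

print(is_authorized("ds_team_a", "/lake/enterprise", "read"))   # True
print(is_authorized("ds_team_a", "/lake/enterprise", "write"))  # False
```

In a real deployment the policies live in Ranger's policy store and are enforced by plugins in each Hadoop service, not in application code.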
9. Setting up multiple tenants

Two edge nodes (Edge Node 1 and Edge Node 2) front the Enterprise Lake, its CIA Data Services, and the proprietary Data Science Workbench. Each edge node runs KVM (active/passive) hosting load-balanced virtual machines for application development and test, with a common OS and an approved list of commonly used open-source apps; a production workbench sits alongside. Tenants consume managed queues and managed storage on the Lake, and packages (e.g. R) come from a repo in the DMZ.
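The managed queues above partition shared cluster capacity between tenants. A minimal sketch of the idea, assuming made-up queue names and shares (real deployments configure this in the YARN capacity scheduler, not in application code):

```python
# Toy sketch of tenant capacity partitioning on shared managed queues.
# Illustrative only: real multi-tenant Hadoop setups configure this in
# the YARN capacity scheduler; queue names and shares are invented.

QUEUE_SHARES = {"team_a": 0.4, "team_b": 0.4, "default": 0.2}

def allocate(total_vcores: int) -> dict:
    """Split total cluster vcores across tenant queues by configured share."""
    return {q: int(total_vcores * share) for q, share in QUEUE_SHARES.items()}

print(allocate(100))  # {'team_a': 40, 'team_b': 40, 'default': 20}
```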
10. Data Science Workbench: self-service on the edge node

Each data science team is set up with its own edge node (virtual machines on Edge Node 1), backed by Data Services, managed queues and managed storage on the Enterprise Lake.

2017 Q1: Team set up with its own edge node. Ability to source data onto the edge node: up to 10 TB, with a copy into their own Dev folder (1 TB) on HDFS to run distributed. R Studio + Jupyter Notebooks + SparkR provide sufficient tooling to build models.
Feb 2018: R Studio Enterprise Server; ability to install applications on the edge node.
2018: Ability to build streaming data pipelines; enable Power BI?
11. Private Cloud: Africa Regions
Step 1: Near RT and Batch

Regional Systems feed the South Africa Data Lake through:
• Internal: CDC via HDF (Kafka/NiFi/Storm)
• External: streaming
• Internal/external: file-based EL (Ab Initio/Spark)

On the SA Lake, data is persisted (HDFS/HBase/Elastic/etc.) and flows through feature extraction into batch model training, producing a Spark model. The model is exposed (PMML) through continuous integration, deployed/replaced, and served via API; model results are returned to regional systems via API.

The same pipeline (feature extraction, batch model training, Spark model, exposed model (PMML), continuous integration) also runs in-country against the Regional Reservoir and Data Warehouse.
12. Private Cloud: Africa Regions (same flow as slide 11)
• Model training happens off the SA Lake
• Africa regional data is ingested into the SA Lake for data science
• Results can be made available to Africa Region systems via API
• Can accommodate batch and near real-time data science models
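The batch-train, near-real-time-score pattern above can be sketched in miniature. Everything here is illustrative: the real pipeline trains Spark models, exposes them as PMML, and serves results over an API; the features, weights and event fields below are invented.

```python
# Minimal sketch of the near-RT scoring pattern: feature extraction
# feeding a trained model exposed behind an API-style function.
# Illustrative only -- the real pipeline uses Spark models and PMML.

def extract_features(event: dict) -> list:
    """Turn a raw transaction event into a numeric feature vector."""
    return [event["amount"], 1.0 if event["channel"] == "digital" else 0.0]

# Stand-in for a trained model: a fixed linear scorer.
WEIGHTS = [0.01, 0.5]

def score(features: list) -> float:
    return sum(w * x for w, x in zip(WEIGHTS, features))

def model_api(event: dict) -> dict:
    """API-style entry point: regional systems send an event, get a score."""
    return {"score": score(extract_features(event))}

print(model_api({"amount": 100.0, "channel": "digital"}))  # {'score': 1.5}
```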
14. Roles and Responsibilities: Data Lake Interactions

Data Scientist (DSC):
• Business requirements: idea
• Data: data source requirements
• Model: design predictive model
• Model testing: processing and visualisation
• Optimise model: model optimisation
• Launch to production: trigger model productionisation
• Monitor production: monitor model production

Data Engineer (DEV):
• Data: data source requirements; ETL development and productionisation
• Launch to production: serialise model for production; move to production platform

Infra Engineer (OPS):
• Business requirements: set up project for data
• Existing data: subscribe to existing data
• New data source: new data ingestion pipeline, deployment and subscription
• Monitor resources: access tools and quota resources
• Launch to production: new project deployment
• Monitor production: job performances; execution/queues capacity
15. Data Science to BDE ratio

The right ratio of data scientists (DS) to big data engineers (BDE) on a team depends on: the number of sources, the variety of data types, the number and complexity of use cases, the percentage of workflow and automation, data science technical skill, the backlog of productionisation work, velocity, and streaming requirements.
16. DRIVING PRINCIPLES
We are a community of like-minded professionals
and enthusiasts who share the common goal of
teaching, learning and shaping the future of Data
Science within Standard Bank Group.
Our focus is on building a local community of
practitioners that can effectively share knowledge,
best practices, and provide opportunities for
collaboration across business units and functional
areas. We seek to consolidate needs and
preferences for Data Science technologies across
individuals and teams, bringing a unified vision for
Data Science to our Big Data environments.
We share our thoughts and ideas. We work with
openness with the understanding that
advancement depends on collaboration and shared
learning.
OBJECTIVES
• To advance Data Science principles across lines of business, using a common practice definition
• To provide guidance and direction to practitioners
• To establish policies, standards and processes for the application of Data Science use-cases on shared production environments
• To socialise Data Science use-cases, success stories, and stumbling blocks for shared knowledge
• To provide professional education standards and training pathways

MISSION STATEMENT
The Data Science Guild is a technical and data-savvy group of practitioners discussing the application of artificial intelligence and machine learning across the Standard Bank Group.
We aim to guide, assist, and improve the development and productionisation of machine learning algorithms and statistical models on our Data Lake and Data Reservoir environments.

MEMBERSHIP PROPOSITION
Join the Data Science Guild to advance your individual capabilities and to get exposure to an array of use-cases and methodologies. Join one of the working groups to directly contribute to shaping the practice of Data Science within Standard Bank Group.
Get exposure to the developing curriculum of Data Science training opportunities targeted to your individual level of practice and expected business application.
We run monthly meetings that include:
• Overviews from working groups (education, tooling, and productionisation standards)
• Demonstrations of use-cases from teams across business units
• Connect sessions for practitioners to network across teams

TERMS OF REFERENCE
The Data Science Guild is mandated by the Enterprise Data Committee and forms part of the Data Community of Practice. The Guild has the responsibility of representing the Data Science professionals in the Group and ensuring that they are equipped with the education, tools and means by which Data Science assets can be defined, controlled, used and communicated for the benefit of the Group and its component business entities.

FUNCTIONAL SCOPE
Supported Capability                 | Owned Toolset
Executive education                  | Presentation materials/collateral
Knowledge sharing                    | Code repository
Data Science tools                   | Data Science Workbench
Model productionisation              | Defined productionisation standards
Education                            | Grad Training Programme
Service offerings: Business Data Science, Technical Data Science, Operational Data Science.

2017 OFFICE HOLDERS
Michelle Gervais | Chair
Kristel Sampson | Deputy Chair
TBD | Membership
TBD | General Professional Development / Events
TBD | Working Group Lead: Productionisation
TBD | Working Group Lead: Training Programme
18. Model Serialisation

Candidate approaches: Spark serialisation, deep-learning model serialisation, Python serialisation, JSON.

Serialisation of the different types of models needs to be investigated. One of the main goals of model serialisation is the ability to embed a model into a production system, so some of the available options, such as Python serialisation, may not be viable. Having one standard like JSON may not be possible either, as it may not cater for complex models. Model serialisation requires unpacking and some prototyping.
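The trade-off can be seen with a hypothetical toy model: Python's native pickle round-trips the whole object but ties you to Python, while JSON carries only plain parameters, which is why a single JSON standard may not cater for complex models.

```python
# Sketch of the serialisation trade-off, using a made-up toy model.
# pickle round-trips arbitrary Python objects (but is Python-only and
# unsafe to load from untrusted sources); JSON is portable but only
# carries plain parameters, so complex models may not fit.
import json
import pickle

class ToyModel:
    def __init__(self, weights):
        self.weights = weights
    def predict(self, x):
        return sum(w * xi for w, xi in zip(self.weights, x))

model = ToyModel([0.5, 2.0])

# Python-native serialisation: the whole object, including behaviour.
restored = pickle.loads(pickle.dumps(model))
print(restored.predict([2.0, 1.0]))  # 3.0

# JSON: parameters only -- the consuming system must rebuild the model.
payload = json.dumps({"weights": model.weights})
rebuilt = ToyModel(json.loads(payload)["weights"])
print(rebuilt.predict([2.0, 1.0]))  # 3.0
```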
19. What are the AI Use Cases?

One Size Fits All
The Data Services team is building a suite of anomaly detection models which will solve for domains such as:
• Software testing
• Customer behaviour monitoring
• Trading patterns
• Price formation
• System performance (servers, networks, OS, software)
• Fraud detection
• Customer support

Who Are You?
The Security team has sponsored a facial recognition engine to enhance security for digital channels. Leading research has lacked an adequate corpus of African faces, hence the need for a custom solution.
The Price is Right!
Markets have become more interconnected and data-driven. Using AI, the Global Markets team is becoming more efficient and competitive in our market-making, risk management, pricing and execution. We use our data to understand the actions of market participants and to change the way we react to our trading environments.
20. What are the AI Use Cases?

Shap! Eish! Hujambo!
Modern research on sentiment analysis and natural language processing techniques has focused on the English language. Our approach is to build models based on vernacular languages in the regions within which we operate.

Work Smarter, Not Harder
The Intelligence Automation team has been automating a number of business processes, including account origination. The next generation of business processes in the automation pipeline will include embedded artificial cognition.
Show Your Money Who's Boss
Standard Bank is making your financial management personal. Using the latest machine learning techniques, we have developed a prototype, commissioned by the PBB SA Digital team, which produces an accurate and customised forecast of a customer's upcoming transactions.

We Are All Connected
Using the power of distributed computing, we are building graph database capabilities to connect our customer records across independent databases and systems. The Nigeria Ecosystem model, based on graph, is helping to generate leads for CIB.
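Linking customer records across systems is essentially finding connected components in a match graph. A stdlib sketch of that idea (illustrative only; the actual capability is built on a distributed graph database, and the record IDs below are made up):

```python
# Toy sketch of cross-system customer record linking as connected
# components (union-find). Illustrative only: real record IDs and
# matching logic would come from the graph database capability.

def link_records(pairs):
    """Group record IDs into components given 'same customer' pairs."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for a, b in pairs:
        parent[find(a)] = find(b)  # union the two components
    groups = {}
    for x in parent:
        groups.setdefault(find(x), set()).add(x)
    return sorted(sorted(g) for g in groups.values())

# Matches found between records in three independent systems:
matches = [("crm:101", "core:9"), ("core:9", "cards:77"), ("crm:102", "core:10")]
print(link_records(matches))
# [['cards:77', 'core:9', 'crm:101'], ['core:10', 'crm:102']]
```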
22. Our Next: Model Productionisation

• Python/Spark (machine learning): analytical; Data Science Workbench set up with an Anaconda repo
• R (statistical modelling): set up with SparkR, but an update to Python is required, plus a request for sparklyr and others
• TensorFlow on CPU (deep learning): cluster set up, linked to HDFS, running on CPU not GPU
• Graph (entity linking): tactical; Neo4j installation per user on the edge node
23. Our Next
• Cloud
• Microservices Architecture
• Self Service (Business units enabled)
• Data Tokenisation
Editor's notes
• Have separate Git directories: Prod will only have access to the Prod Git folder; Dev will only have access to the Dev Git directory
• Black Duck to automate package refreshes