Technological possibilities and GDPR: a continuous clash? And what about the ethical (re)use of data? In this session, learn from Rabobank’s Big Data journey and gain insight into organizational choices, the Data Lab technology vision and the data strategy, as enabler and accelerator of digital innovation and transformation.
3. Why is getting value from data so hot?
There’s something about data …
Three main ingredients each grow exponentially:
- Exponential growth in computing power
- Exponential growth in annual data created
- Exponential growth in data scientists
The same goes for:
- Regulatory and Compliance impact
- Public discourse concerning ethical reuse of data
4. IMPACT
Exploration Lab phase 1 Lab phase 2 Lab phase 3 Production
Feasibility Scalability Manageability Opportunity
Idea
Potential
PoC project team
• 3 Data Scientists (3 external)
• 0 Business Consultants
• 1 Data Engineer (0 external)
PoV project team
• 8 Data Scientists (8 external)
• 0 Business Consultants
• 2 Data Engineers (0 external)
Advanced Data Analytics team
• 10 Data Scientists (10 external)
• 3 Business Consultants (2 external)
• 3 Data Engineers (1 external)
Advanced Data Analytics team
• 10 Data Scientists (8 external)
• 6 Business Consultants (3 external)
• 3 Data Engineers (1 external)
DTO DS&BC team
• 15 Data Scientists (8 external)
• 6 Business Consultants (1 external)
• 3 Data Engineers (0 external)
DTO DS&BC team
• 20+ Data Scientists (0 external)
• 7 Business Consultants (1 external)
• 4 Data Engineers (0 external)
“Gartner says we should have Hadoop.”
Exploration with basic Cloudera stack (10 datanodes)
Building capabilities on ad hoc Research & Customer Journey analysis (17 datanodes)
Innovation in technology & building predictive applications
Multi tenant Lab. First deployments to production (23 datanodes)
Auditable organization & governance
DTO Data Lab
Evolution
2011 … 2012/2013 … 2014 … 2016 … 2017 2018 - 2020
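The team compositions listed on this slide can be turned into two simple indicators: the share of external staff and the number of Data Scientists per Data Engineer. The sketch below is only an illustration of that arithmetic. The class and field names are hypothetical, the teams are kept in the order in which they appear on the slide, and "20+" is entered as its lower bound of 20.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TeamSnapshot:
    """One team composition as listed on the slide (names are illustrative)."""
    name: str
    data_scientists: int
    external_scientists: int
    business_consultants: int
    external_consultants: int
    data_engineers: int
    external_engineers: int

    @property
    def headcount(self) -> int:
        return self.data_scientists + self.business_consultants + self.data_engineers

    @property
    def external_share(self) -> float:
        external = (self.external_scientists + self.external_consultants
                    + self.external_engineers)
        return external / self.headcount

    @property
    def scientists_per_engineer(self) -> float:
        return self.data_scientists / self.data_engineers


# Figures copied from the slide, in slide order; "20+" is entered as 20.
SNAPSHOTS = [
    TeamSnapshot("PoC project team", 3, 3, 0, 0, 1, 0),
    TeamSnapshot("PoV project team", 8, 8, 0, 0, 2, 0),
    TeamSnapshot("Advanced Data Analytics team (first listing)", 10, 10, 3, 2, 3, 1),
    TeamSnapshot("Advanced Data Analytics team (second listing)", 10, 8, 6, 3, 3, 1),
    TeamSnapshot("DTO DS&BC team (first listing)", 15, 8, 6, 1, 3, 0),
    TeamSnapshot("DTO DS&BC team (second listing)", 20, 0, 7, 1, 4, 0),
]

for s in SNAPSHOTS:
    print(f"{s.name}: {s.external_share:.0%} external, "
          f"{s.scientists_per_engineer:.1f} Data Scientists per Data Engineer")
```

Read this way, the listings show the external share falling as the team grows, which is the staffing trend the slide appears to illustrate.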
5. The Data Lab is the Rabobank-wide analytics accelerator
for transforming data into business value by:
• providing a state-of-the-art data analytics infrastructure
• sharing expertise
• discovering new business patterns & predictions
• applying new data & analytics techniques in complex data projects
The Data Lab:
• Is in the production environment
• Uses Production data from inside or outside Rabobank
• Contains at least Data Factory-like technologies
… and thus provides Rabobank with:
• A safe sandbox environment for:
• Experiments with data & new technologies
• Challenging and applying ethical considerations
• Determining Legal boundaries (GDPR, PCI DSS, etc.)
• Production-like development, test and acceptance
DTO Data Lab
Why & How
Mission statement
Working context
Data Research and Innovative development of information products,
in a safe sandbox environment, with no limitations to data or technologies.
Bonus: Decrease of time to market from idea to production
Goals & added values
6. Scale out because of maturity level
DTO Data Lab
2014 2015 2016 2017 2018 2019
Team size: 5 Data Scientists, 10 Data Scientists, 20+ Data Scientists
Project size: 1 Data Scientist / project, Multiple Data Scientists / project
Projects: Experiments, One-off insights, Proof of concepts predictive models, Deep learning model development, Models in production
7. Advanced Data Analytics – Maturity model
Framework
Currently publicly available maturity models
• Focus on enterprise adoption of Big Data and Advanced Analytics
• Lack of possibility to simply assess maturity of Advanced Data Analytics (ADA) teams
• Loosely related to privacy regulations, ethics, governance and compliance
Distinguishing ADA team-factors*
• Scalability IT stack
• IT Adoption
• Data
• Workforce Capability
• Permanent staffing
• Data Scientist-Data Engineer ratio
• Way of Working
• Governance
• Attractive to work
(*) These 9 factors are a combination of experience in setting up an advanced analytics department and relevant information found in:
• DAMA DMBOK (version 2, chapters 14-15), which is partly based on the Data Management Maturity Model (DMM)
• MIKE2.0 Information Maturity Model
• CMMI People Capability Maturity Model (P-CMM v2.0)
Department maturity
8. Maturity Assessment DTO Data Lab [1/9]
Scalability IT stack
Preferred situation: The ADA team is working on a Massively Parallel Processing platform (like Hadoop), with a Shared Nothing database technology underneath, and can take advantage of the most mature advanced analytics services in the cloud or experiment with new technologies and/or services. Because of the privacy and security measures required for sensitive data, a full cloud migration is expected to be preceded by a hybrid implementation of on premise services combined with cloud services. Cost-wise, services and applications are expected to migrate to the cloud first, followed by data.
Maturity levels (year marked for the DTO Data Lab in parentheses):
5 (2020) Cloud based (or hybrid with on premise) MPP/SN* platform, native cloud analytic services.
4 (2018) On premise MPP/SN* platform, standardized science/analytics platform.
3 (2014) On premise MPP/SN* platform, no standardization of the science/analytics workbench.
2 (2012) On premise stand alone desktop science/analytics workbench.
1 (2011) On premise working with Excel.
(*) MPP/SN: Massively Parallel Processing Shared Nothing database technology
9. Maturity Assessment DTO Data Lab [3/9]
Data
Preferred situation: There is only one important interface set up for the ADA team to get all quality-controlled historic data through the Data Lake into the (advanced) analytics environment. Historic datasets are defined according to the Data Preparation model and can originate from Rabobank applications and/or (purchased) external (open) data.
Maturity levels (year marked for the DTO Data Lab in parentheses):
5 (2020) Most historic datasets available in the organization are provisioned through a Data Lake, with quality control implemented.
4 (2019) A Data Lake is set up in the organization for storage of historic raw and defined data; the ADA team has full access to the Data Lake to get data to the (advanced) analytics environment.
3 (2018) Data provisioning to the (advanced) analytics environment is fully approved, automated provisioning is scalable, and the impact of data quality issues is predictable.
2 (2017) Data availability is low; data provisioning to the (advanced) analytics environment is automated with an ELT tool. Impact of data quality issues is growing in the organization.
1 (2014) Data availability is low; provisioning to the (advanced) analytics environment is on an ad hoc basis. Data quality issues cannot be addressed.
10. Maturity Assessment DTO Data Lab [9/9]
Attractive to work
Preferred situation: The assignment portfolio contains a good mix of project types, which makes it interesting for Data Scientists and Data Engineers to work in a Rabobank ADA team. Although employees can be expected to leave a company after a few years, it is worth putting effort into maintaining a diverse and innovative project portfolio. This is one of the key factors that make working in an ADA team attractive enough for Rabobank to benefit from its investment in, for instance, onboarding and education.
Maturity levels (year marked for the DTO Data Lab in parentheses):
5 (2020) Significant part of projects are of type Prescriptive Analytics (How can we make it happen?).
4 (2019) Significant part of projects are of type Predictive Analytics (What will happen?).
3 (2018) Significant part of projects are of type Diagnostic Analytics (Why did it happen?).
2 (2014) Significant part of projects are of type Data Research and Descriptive Analytics (What happened?).
1 (2012) Projects are simple enhancements of traditional BI.
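As a rough illustration of how such an assessment could be recorded and queried, the sketch below stores, per factor, the year in which each level is marked for the DTO Data Lab. It covers only the three factors shown in this deck. The pairing of years to levels is an assumption read off the slide layout (five years listed next to five levels), and the function names are hypothetical.

```python
from typing import Dict, Optional

# factor -> {year: level}. Assumption: the years on each slide line up
# one-to-one with the five levels, the latest year at the highest level.
ASSESSMENTS: Dict[str, Dict[int, int]] = {
    "Scalability IT stack": {2011: 1, 2012: 2, 2014: 3, 2018: 4, 2020: 5},
    "Data": {2014: 1, 2017: 2, 2018: 3, 2019: 4, 2020: 5},
    "Attractive to work": {2012: 1, 2014: 2, 2018: 3, 2019: 4, 2020: 5},
}


def level_in(factor: str, year: int) -> Optional[int]:
    """Return the most recently marked level for `factor` at or before `year`."""
    marked_years = [y for y in ASSESSMENTS[factor] if y <= year]
    if not marked_years:
        return None  # no level marked yet for this factor
    return ASSESSMENTS[factor][max(marked_years)]


def profile(year: int) -> Dict[str, Optional[int]]:
    """Maturity profile across all recorded factors for one year."""
    return {factor: level_in(factor, year) for factor in ASSESSMENTS}


print(profile(2018))
```

A profile like this makes uneven maturity visible at a glance, for example a mature IT stack combined with less mature data provisioning.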
12. High privilege = High responsibility
Data Lab working context
DTO Data Lab
13. 360˚
[Diagram labels: Users; Data; Security & Risks; Privacy & Ethics; Data Lab High Level Solution; Data Lab Architecture; Information product; Data Transformation code; Data Lab platform hosting; Advanced Research & Analytics; Model development; Data Lab; Data Analytics Architecture; Policies & Standards; Services; Products; API; (Containerized) Applications]
15. Functional representation of the Enterprise Data Architecture
Data Analytics Architecture
Diagram components:
• Source, Events
• Data Lake: History raw data; Archive of Defined data (Standardized & enhanced); Archive of Defined Information product data
• Transient Data store
• Definition Factory (①, ②)
• Information Factory: Information product (③)
• Code- & Model Repository
Data Lab
• Advanced Analytics & Intelligence (Data Research)
• End-to-end development of data transformations: ①/② and Information products: ③
16. High Level Solution (HLS)
DTO Data Lab
IT
DTO DAI&A: Ownership & TAM*
COO IT Infrastructure: HW & OS support
Data Lab storage
• HDFS: 0.66 PB raw storage
• SQL: 14 TB raw storage
• 26 physical servers: RHEL7 / 758 cores / 3.5 TB RAM (DC Boxtel)
• 8 virtual servers: Windows Server 2016 / 96 cores / 0.5 TB RAM (DC Best & Boxtel)
Data Lab
DTO DAI&A: Ownership & FAM*
• PoC: Linux
• Advanced Analytics cluster: Hadoop / Linux
• Collaboration platform: Dataiku
• Cloudera stack
• Elasticsearch
• Microsoft BI stack
• Visualization tools: Alteryx / Tableau / Shiny
• Analytics cluster: SQL / Windows
• R server
• Microsoft BI stack
• Visualization tools: Alteryx / Tableau / Shiny
• SPSS
(*) FAM = Functional Application Management / TAM = Technical Application Management
17. Data Lab Architecture vision and roadmap
DTO Data Lab Architecture
[Diagram labels: Data Lab; Data Engineering; Data Science & Advanced Analytics; Customer Analytics; Business Intelligence; BDM]
2018-2019
• Expansion of the on premise cluster
• Migration to Informatica BDM for better orchestration of data preparation processes
• Implementation of Dataiku to increase efficiency of collaboration in complex Data Science projects
• Mid 2019: exploration of migration to cloud
2020
• Finish migration to cloud services
• End state: hybrid architecture; depending on data sensitivity, part of the work is done on premise and the rest is done in the cloud
18. Policies & Standards
• Security context DTO Data Lab
• Secure Data transfer policy
• Way of working: Data Ingestion & Provisioning
• Data Management
• Data Strategy
• Data Preparation standard
• Way of working: Historic Data preparation
• Definitions of done
• Users & role based access
• Privacy & Ethics
19. Security & Risk context DTO Data Lab
Security
ICT systems
• Outside - In: e.g. outsider(s) try to get (Wifi) access to data in Data Lab systems
• Inside - Out: e.g. an employee knowingly or unknowingly leaks data from Data Lab systems outside the Lab domain
Physical workspace
• Outside - In: e.g. outsider(s) try to get access to the Rabobank office, or personal workspace
• Inside - Out: e.g. an employee knowingly or unknowingly provides unauthorized access to the Rabobank office
20. Data Strategy
For optimizing Data Science & Analytics productivity …
Data Management
Ideal data environment for Data Scientists and Analysts:
• A Data Lake filled with lots of historic datasets
• Preferably built up in event driven time series
• Completeness and quality are checked
• Insight into the purpose and meaning of datasets is at hand
Currently we can only provide datasets on which the following is applied:
1. Filtering 2. Selection 3. Aggregation 4. Combination 5. Enrichment
Resulting in 70% inefficiency, caused by fixing data issues. Over and over again …
This calls for a simple Data Strategy, where:
• We will never throw data away (… of course within Record Keeping boundaries)
• Historic datasets are created according to one Data Preparation standard
Which comes with the need for a transition from planned design to adaptive design for Data Management.
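To make the five operations named on this slide concrete, the sketch below applies them in sequence to a tiny, made-up dataset. All records, field names and the threshold are hypothetical. Reading "filtering" as dropping rows and "selection" as dropping columns is an assumption, since the slide does not define the terms.

```python
from collections import defaultdict

# Hypothetical raw events, as they might arrive from a source system.
raw_events = [
    {"client": "c1", "product": "savings", "amount": 100.0, "month": "2018-01"},
    {"client": "c1", "product": "mortgage", "amount": 900.0, "month": "2018-01"},
    {"client": "c2", "product": "savings", "amount": 50.0, "month": "2018-01"},
    {"client": "c2", "product": "savings", "amount": 70.0, "month": "2018-02"},
]
segments = {"c1": "retail", "c2": "business"}  # hypothetical reference data

# 1. Filtering: rows are dropped (only savings events survive).
filtered = [e for e in raw_events if e["product"] == "savings"]

# 2. Selection: columns are dropped (product and month disappear).
selected = [{"client": e["client"], "amount": e["amount"]} for e in filtered]

# 3. Aggregation: rows are collapsed to one total per client.
totals = defaultdict(float)
for e in selected:
    totals[e["client"]] += e["amount"]

# 4. Combination: the result is joined with another dataset.
combined = [{"client": c, "total": t, "segment": segments[c]}
            for c, t in sorted(totals.items())]

# 5. Enrichment: a derived attribute is added (threshold is illustrative).
enriched = [{**row, "high_value": row["total"] >= 110.0} for row in combined]

print(enriched)
```

The resulting dataset can no longer answer a question such as "savings per month", because the month column and the individual events are gone. That loss is the kind of data issue that keeping untouched historic data is meant to avoid.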
21. Data & Information foundations
Data Preparation standard
[Diagram: multiple Sources feeding the Data Lake]
Data foundation
• Source driven
• Grouped by Data domain
• No cross domain/source joins
• No selection, no filtering, no aggregations
• All rows and all columns
Information foundation
• Analytics/target/usage driven
• Grouped by product/service
• Cross domain/source joins allowed
• Selections, filtering, aggregations allowed
• No infinite number of precomputed aggregations (BI cubes), rather subassemblies enabling any possibly required aggregation
Data layers
• Single source single table data sets
• Produced according to Data Preparation standard
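The difference between the two foundations can be sketched as follows. Source tables stay complete and unjoined, while the information foundation holds one fine-grained "subassembly" from which coarser aggregates are derived on demand, instead of a precomputed cube per question. The data, function names and choice of dimensions are all hypothetical. A subassembly can only serve aggregations over the dimensions it retains.

```python
from collections import defaultdict

# Data foundation (hypothetical sample): one table per source, all rows and columns.
payments = [  # source: payments system
    {"client": "c1", "date": "2018-01-03", "amount": 20.0, "channel": "app"},
    {"client": "c1", "date": "2018-01-15", "amount": 30.0, "channel": "web"},
    {"client": "c2", "date": "2018-01-20", "amount": 50.0, "channel": "app"},
    {"client": "c2", "date": "2018-02-01", "amount": 10.0, "channel": "app"},
]
clients = [  # source: client administration
    {"client": "c1", "segment": "retail"},
    {"client": "c2", "segment": "business"},
]

SEGMENT, MONTH, CHANNEL = 0, 1, 2  # positions in the subassembly key


def build_subassembly(payments, clients):
    """Information foundation: cross-source join, kept at (segment, month, channel)."""
    segment_of = {c["client"]: c["segment"] for c in clients}
    cells = defaultdict(lambda: {"count": 0, "total": 0.0})
    for p in payments:
        key = (segment_of[p["client"]], p["date"][:7], p["channel"])
        cells[key]["count"] += 1
        cells[key]["total"] += p["amount"]
    return dict(cells)


def rollup(subassembly, *dims):
    """Derive a coarser aggregate by keeping only the given key positions."""
    out = defaultdict(lambda: {"count": 0, "total": 0.0})
    for key, cell in subassembly.items():
        reduced = tuple(key[d] for d in dims)
        out[reduced]["count"] += cell["count"]
        out[reduced]["total"] += cell["total"]
    return dict(out)


subassembly = build_subassembly(payments, clients)
per_segment = rollup(subassembly, SEGMENT)
per_month = rollup(subassembly, MONTH)
print(per_segment)
print(per_month)
```

Storing additive measures (count and total) is a deliberate choice in this sketch: sums and averages can then be recomputed at any coarser grain without going back to the source tables.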
22. Data Preparation Way of Working
Raw Archive
A (preferably untouched) copy of the original source data is ingested into the Data Lake. Data provisioning is tightly governed by Data Delivery Agreements (DDA).
Definition of Done:
• Signed DDA from Source to Lake
• Approved extraction code (EL)
• Data profiling report
• Data Quality report
• Updated Data Catalogue
• Up-to-date Data Lineage information
Defined per source
Brown data is technically standardized. Conventions are synchronized with the Data Catalogue, and enhancement of data is defined (e.g. adjustment of missing/null values is described).
Definition of Done:
• Standardization requirements documented
• Standardization design documented
• Standardization code saved in code repo
• Test- & Acceptance report
• Data profiling report
• Data Quality report
• Updated Data Catalogue
• Up-to-date Data Lineage information
Data product
Blue data is enriched into meaningful information by applying business rules or advanced modelling results.
Example: Historic Client-Product mapping
Definition of Done:
• Signed DDA from Lake to Factory
• Combination/enrichment design documented
• Combination/enrichment code saved in code repo
• Test- & Acceptance report
• Data profiling report
• Data Quality report
• Updated Data Catalogue
• Up-to-date Data Lineage information
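The Definitions of Done above lend themselves to an automated checklist. The sketch below is a minimal illustration, assuming deliverables are tracked as plain strings. The checklist items are copied from the slide, while the function names and the way deliverables are recorded are hypothetical.

```python
# Items shared by all three layers, as listed on the slide.
COMMON = [
    "Data profiling report",
    "Data Quality report",
    "Updated Data Catalogue",
    "Up-to-date Data Lineage information",
]

DEFINITION_OF_DONE = {
    "Raw Archive": [
        "Signed DDA from Source to Lake",
        "Approved extraction code (EL)",
    ] + COMMON,
    "Defined per source": [
        "Standardization requirements documented",
        "Standardization design documented",
        "Standardization code saved in code repo",
        "Test- & Acceptance report",
    ] + COMMON,
    "Data product": [
        "Signed DDA from Lake to Factory",
        "Combination/enrichment design documented",
        "Combination/enrichment code saved in code repo",
        "Test- & Acceptance report",
    ] + COMMON,
}


def missing_artifacts(layer, delivered):
    """Return the checklist items for `layer` that are not yet delivered."""
    return [item for item in DEFINITION_OF_DONE[layer] if item not in delivered]


def is_done(layer, delivered):
    """A dataset is done for a layer only when nothing on its checklist is missing."""
    return not missing_artifacts(layer, delivered)


delivered = {"Signed DDA from Source to Lake", "Approved extraction code (EL)"}
print(missing_artifacts("Raw Archive", delivered))
```

A check like this could gate promotion of a dataset from one layer to the next, so that lineage and catalogue updates are not skipped under time pressure.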
23. Data Lab
Environment definitions (1/2)
The DTO Data Lab environment contains two types of technical environments.
To service different roles, projects and experiments in the Data Lab, a logical segregation can be set up to create silos based on authorization profiles and sub-silos with data access control lists.
DTO Data Lab platform
[Diagram labels]
• Data Engineering: Ingestion, Preparation, Provisioning
• Data Science & Analytics P/A environment: Controlled run of code & applications by Data Engineering team
• Teams: DS&BC team, NBA team, EWS team, Risk team, AR team, HRA&I team, RR team, …
• Experiment 1, Experiment 2
• Advanced Analytics cluster, Analytics cluster
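The segregation into silos and sub-silos can be pictured as a two-step access check: first the authorization profile must be admitted to the silo, then the requested dataset must be on the access control list of the sub-silo. The sketch below is an illustration only. The class, the profile names and the dataset names are hypothetical and do not describe the actual Data Lab implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Silo:
    """A team silo with sub-silos (e.g. experiments), each with its own data ACL."""
    name: str
    authorized_profiles: Set[str]
    # sub-silo name -> dataset identifiers that may be read inside it
    sub_silo_acl: Dict[str, Set[str]] = field(default_factory=dict)


def can_read(silo: Silo, profile: str, sub_silo: str, dataset: str) -> bool:
    """Allow access only if both the silo check and the sub-silo ACL check pass."""
    if profile not in silo.authorized_profiles:
        return False  # step 1: the authorization profile is not admitted to the silo
    return dataset in silo.sub_silo_acl.get(sub_silo, set())  # step 2: data ACL


# Hypothetical configuration for one team with two experiments.
risk_silo = Silo(
    name="Risk team",
    authorized_profiles={"risk-data-scientist"},
    sub_silo_acl={
        "Experiment 1": {"loan_history"},
        "Experiment 2": {"loan_history", "payment_events"},
    },
)

print(can_read(risk_silo, "risk-data-scientist", "Experiment 1", "payment_events"))
```

Denying by default (an unknown sub-silo yields an empty ACL) keeps a misconfigured experiment from silently gaining access to data.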
24. Introduction of the Business Data Committee
Data Governance Board
Driving & Developing
• The BDC actively drives evolution in the use of business data.
• The major business initiatives involving data & analytics are added to the agenda of the BDC.
• Progress of these initiatives is monitored, requirements for data quality and architecture are defined, and synergy in data tooling for business is discussed.
• New business opportunities in data are initiated, as are possibilities of translating Rabobank business ambitions into data needs and data requirements.
Governing
• The BDC defines clear guidelines on the usage of data outside of the original purpose for which the data was collected.
• For such usage, the BDC acts as a decision-making board, providing a clear body that actively defines evolving boundaries on the extent of data usage (for business).
• The BDC acts as a forum for guarding, challenging and evolving the company's ethical, privacy and legal appetite for data usage in business.
25. Privacy Assessment
Transparent and auditable reuse of data in the DTO Data Lab
Primary goals for the Data Privacy Assessment procedure are:
• Structured archiving of:
  - all considerations made to decide if data can be processed
  - relevant purpose limitations
  - who was consulted
  - date when the assessment took place
  - etc.
• Increase awareness among all stakeholders concerning assessment and guidelines for processing (privacy) sensitive data
• Create a knowledge base of conditions for re-using Rabobank data, possibly in combination with public data
Privacy policy
[Figure: Process flow DTO Data Lab guideline for Big Data projects]
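The "structured archiving" goal can be illustrated with a small record type that captures the items named on this slide: considerations, purpose limitations, who was consulted and the assessment date. This is a sketch under assumptions. The field names, the `approved` flag and the JSON archive format are hypothetical and not taken from the Rabobank procedure.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class PrivacyAssessment:
    """One archived privacy assessment for a Data Lab project (illustrative)."""
    project: str
    assessed_on: date
    considerations: List[str]
    purpose_limitations: List[str]
    consulted: List[str]
    approved: bool  # hypothetical outcome flag

    def to_archive_json(self) -> str:
        """Serialize to a stable JSON string suitable for an audit archive."""
        record = asdict(self)
        record["assessed_on"] = self.assessed_on.isoformat()
        return json.dumps(record, sort_keys=True)


# Hypothetical example record.
assessment = PrivacyAssessment(
    project="Example churn experiment",
    assessed_on=date(2018, 5, 25),
    considerations=["Data is pseudonymized before analysis"],
    purpose_limitations=["Results may not be used for individual marketing"],
    consulted=["Privacy Officer"],
    approved=True,
)

print(assessment.to_archive_json())
```

Keeping each assessment as an immutable, serializable record is one way to make later audits and the intended knowledge base of re-use conditions straightforward to build.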
26. Ethics Assessment
Ethics: “The discipline dealing with what is good and bad and with moral duty and obligation.”
• There is no single “right” ethical approach
Rabobank needs to come to some collective understanding of common principles and their application.
• Basic Rabobank Ethics principles to start with:
• We do not actively harm humans, their property, reputation, or employment by false or malicious action.
• We do not watch bad things happen without helping.
• We avoid real or perceived conflicts of interest whenever possible, and disclose them to affected parties when they do exist.
• We are honest and realistic in stating claims or estimates based on available data.
• We reject bribery in all its forms.
• We treat all persons fairly and do not engage in acts of discrimination based on race, religion, gender, disability, age, national origin, sexual orientation, gender identity, or gender expression.
• Ethics principles can and will change over time
Recommendation: Closely watch results coming from the EU-funded SATORI project.
Rabobank Ethics principles need periodic refinement in the Ethics Committee, with input from the BDC and other relevant stakeholders.
Ethics guideline
28. Innovation Funnel approach
Wrap up …
Stages: Exploration, Lab, Pilot, Pre Production, Production
Funnel labels: IMPACT, Feasibility, Scalability, Realisability, Opportunity, Idea, Potential
Progress: 5% 15% 25% 55% 75% 100%
• The backlog for ideas and projects.
• Exploring how to approach the project. First impact scans are available.
• Building the case: data is collected and processed, research and analysis is done. As an example, models are built and validated.
• Testing in a limited setting, to determine whether the project deliverable can be taken to production.
• Scale out of the pilot, to arrange hand over, training of the business and close down of the project.
• The deliverable is released to the business owner, including all documentation.
Data Lineage: First draft, Update, Update, Finish
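A funnel like this can be modelled as an ordered list of stages with a progress percentage, where a project may only move to the next stage. The sketch below is illustrative. The pairing of stage names with percentages is an assumption, because the slide extraction does not preserve which percentage belongs to which stage, and "Idea" is assumed to be the backlog stage before Exploration.

```python
from typing import List, Tuple

# Assumed pairing of stage and progress percentage (not confirmed by the slide).
STAGES: List[Tuple[str, int]] = [
    ("Idea", 5),
    ("Exploration", 15),
    ("Lab", 25),
    ("Pilot", 55),
    ("Pre Production", 75),
    ("Production", 100),
]
_ORDER = [name for name, _ in STAGES]


def progress(stage: str) -> int:
    """Progress percentage associated with a stage."""
    return dict(STAGES)[stage]


def advance(stage: str) -> str:
    """Return the next stage; projects cannot skip stages or move past Production."""
    position = _ORDER.index(stage)
    if position == len(_ORDER) - 1:
        raise ValueError("Production is the final stage")
    return _ORDER[position + 1]


stage = "Lab"
stage = advance(stage)
print(stage, progress(stage))
```

Forcing projects through every stage mirrors the gate-by-gate character of the funnel, in which each stage must be passed before the next one starts.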
29. The journey continues ...
Deep learning
Artificial Intelligence
Anonymization & Pseudonymization
Thank you!
Editor's notes
Bert van Rest: Do you also do anything with geodata/GIS tooling?
Dirk-Jan Broer: Area of interest is mainly accounts receivable management in combination with payment systems.
Our objectives should also state that we are maximally transparent in everything we do (including in the area of data processing).
If something is permitted by law, these questions must also be asked: Do we actually want it? Does it fit within Rabobank's objectives, and is it ethically justifiable?
Over the past 10 minutes I have given you a bird's-eye tour of the world of Robotics, its applicability and its developments within Operations.
I hope I have kindled a nice fire in you that we as an organization will benefit from. The urgency to improve is high. See this topic as a great opportunity to take your own team further; now you have the chance to join in.
As a team, we are very aware that we badly need you to make this a success. I therefore warmly invite you to share your questions, ideas and remarks. You can do so with me or with one of the other colleagues in the block on the right, who are already involved.
Thank you for your attention!!