Advanced Analytics in Banking, CITI

Advanced Analytics in Banking
Juan M. Huerta
Global Decision Management
VP Advanced Analytics
Citibank

I will talk about…
• Big Data Adoption process at Citi
• Realizing the Technical Value of Big Data
• Global Solutions

140 countries
200 million accounts

Citi: A Customer Centered Organization
As a customer-centered bank, the goal of our Big Data strategy is to shift
the focus from independent vertical silos to Common Horizontal Solutions
focused on Citi’s 200 million customer accounts

Big Data Adoption Stakeholders
• Lines of Business
• Strategy & Decision Management Organizations: cross LOB & Geo,
global
• Data Innovation Office: Governance & Regulatory
• CitiData – Big Data & Analytics Engineering

Big Data Adoption Roadmap
Adoption will not occur at once. The level of capability maturity across the
organization will vary significantly.
In theory, we think in terms of the Staged Competencies of a Big Data
Maturity Model.
In practice, a hybrid process, which fits the level of maturity of
participants, is needed.
• Common Data
• Common Analytic Platform
• Common Tools & Techniques
• Common Solutions
• Common Focus Strategy

Big Data Adoption Hybrid Participation Model
• Novice: Proof of Concept
• Expert: R&D Environment
• Shadowed

End-to-end Analytic Process for a POC Project
This is one component of the hybrid model
R&D process
Ideas and Hypotheses
• Pipeline of ideas to use data for competitive advantage
Information Asset Inventory Navigator (“IAIN”)
• Robust, comprehensive ontology allowing analysts and economists to search, sort, and select data for analysis
R&D Project Approval
• Preliminary assessment for business value, data safekeeping and alignment to business practices
Data Acquisition
• Where necessary, acquire new data sets to support R&D project
Data Set Preparation & Provisioning
• Basic preparation of data set (e.g., consolidation, conformation)
• Permission-based provisioning of data set into a Big Data Analytics environment
Analytics Execution
• Advanced analytic tools mine business insight from large volumes of data
Analytics Peer Review
• Data scientist peers review model findings and results
Analytics Knowledge Management
• Robust, comprehensive ontology allowing analysts and economists to search, sort, and select data for analysis
Engineering / Production process
Product Approval
• Formal approval process through Business Steering Committee based on understanding expected use of production data
Data Transformation & Provisioning
• Transformation rules executed to normalize and conform production data
• Conformed data set made available in production environment
Production Model Development
• Develop scalable, productizable analytics
Model Deployment
• Exploit insights and analyses across the enterprise to maximize value
• Models measured for quality / usage

Advanced Global Solutions
• A global solution is a tested algorithm or analytic model that carries
out a particular business analysis and which is leveraged at a global
scale
• A big data global solution enables the interplay of complex algorithms
and large datasets
• When a global solution is built upon big data approaches, a delivery
roadmap should be considered
• In the exploratory process, a Global Solution is developed in the
Innovation R&D environment and validated through a POC process
• Alignment with Innovation, UAT, PRD environments

Technical Value of Big Data:
Benchmarks and Analysis

The Boom Driving Big Data is Technological
Heebyung Koh, Christopher L. Magee
A functional approach for studying technological progress:
Extension to energy technology
Technological Forecasting and Social Change, Volume 75, Issue 6,
July 2008, Pages 735–758

The Quadrant Of Analytic Opportunity
Run Time is affected by Data Size and Algorithmic Complexity
[Figure: data sets and algorithms plotted on two axes, Algorithmic Complexity and Data Size (10^5 to 10^10)]
• Data sets: Mtg+Cards+Banking; Accounts Transaction features; Accounts Transactions; Branches Transactions; Accounts Summary Stats.; Employees Summary Stats.; GL-GOCS GL-Entries; Branches Summary Stats.
• Categories: Database Interaction; Traditional Statistical; Machine Learning; Big Data/Pattern Mining
• Algorithms: Sequence Mining; Predictive filtering; Latent Dirichlet Allocation; HMM Baum-Welch, O(ns nf nt); CART, O(nf ns log ns); Iterative SVD-CF; K-means; Logistic Regression; PCA; PageRank; Self-Org. Maps; Neural Nets; Collaborative Filtering (CF); Vector based Approaches; HMM; Conditional Random Fields; Support Vector Machines
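To make the interplay of data size and algorithmic complexity concrete, the following minimal sketch evaluates the CART cost expression O(nf ns log ns) quoted on the slide across the data sizes on the chart. It assumes that ns is the number of samples and nf the number of features (the slide does not define the symbols), and the feature count of 100 is a hypothetical value chosen only for illustration. The numbers are abstract operation counts, not measured run times.

```python
import math


def cart_ops(n_samples: int, n_features: int) -> float:
    """Operation-count estimate for CART: nf * ns * log2(ns).

    Assumption: ns = number of samples, nf = number of features.
    """
    return n_features * n_samples * math.log2(n_samples)


def linear_scan_ops(n_samples: int, n_features: int) -> int:
    """Operation-count estimate for one pass over the data: nf * ns."""
    return n_features * n_samples


if __name__ == "__main__":
    N_FEATURES = 100  # hypothetical value, not taken from the slides
    print("data size    linear scan      CART")
    for exponent in range(5, 11):  # 10^5 .. 10^10, the range on the chart
        n = 10 ** exponent
        print(f"10^{exponent:<2}      "
              f"{linear_scan_ops(n, N_FEATURES):>12.2e}  "
              f"{cart_ops(n, N_FEATURES):>10.2e}")
```

Even for a log-linear algorithm, each tenfold increase in data size costs slightly more than ten times the work, which is why the more expensive algorithms on the chart become hard to run at the largest data sizes without parallelization.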

Breaking down the gains of P13n:
A Controlled Incremental Benchmark on a
Workstation grade processor (x500)
Implemented an incremental-SVD (Netflix Cup) predictive model that
runs on midsize datasets…
• Compiled code (vs. interpreted): x30
• In memory (vs. disk access): x4
• Multithread (vs. single thread): x3.12
• Workstation grade processor: x1.3
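The x500 headline figure can be checked against the four individual factors. The sketch below multiplies them together, under the assumption (not stated on the slide) that the gains compose multiplicatively because each was measured incrementally on top of the previous one.

```python
from math import prod

# Speedup factors as listed on the slide.
speedups = {
    "compiled code (vs. interpreted)": 30.0,
    "in memory (vs. disk access)": 4.0,
    "multithread (vs. single thread)": 3.12,
    "workstation grade processor": 1.3,
}

# Assumption: the incremental gains multiply.
total = prod(speedups.values())

for name, factor in speedups.items():
    print(f"{name:<35} x{factor:g}")
print(f"{'combined':<35} x{total:.1f}")
```

The product comes to roughly x487, which is consistent with the rounded x500 in the slide title.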

Basic Map Reduce Benchmarks
Impact of overhead as a function of input volume
Relative Map Throughput as a function of # Mappers
[Charts: a series plotted over input steps 1 to 6 (vertical axis 0 to 0.7); Relative Map CPU time speedup vs. Number of Maps (0 to 20), with linear fit; Tokens per Wall Clock Second vs. Number of Maps (0 to 20), with linear fit. Series values shown in the legend: 0.003351955, 0.032258065, 0.319148936, 1, 2.631578947, 21.12676056]
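The slides do not say which job was benchmarked. The "tokens per wall clock second" axis suggests a token-processing workload, so the following is a minimal single-process sketch of the map and reduce steps for token counting, written in plain Python rather than against the Hadoop API. The function names are hypothetical and the throughput helper only illustrates how such a metric could be measured.

```python
import time
from collections import Counter
from functools import reduce


def map_tokens(chunk: str) -> Counter:
    """Map step: count whitespace-separated tokens in one input chunk."""
    return Counter(chunk.split())


def reduce_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial counts."""
    return left + right


def count_tokens(chunks) -> Counter:
    """Apply the map step to every chunk, then fold the partial results."""
    return reduce(reduce_counts, map(map_tokens, chunks), Counter())


def tokens_per_second(chunks) -> float:
    """Tokens processed per wall-clock second for one run (illustrative)."""
    start = time.perf_counter()
    counts = count_tokens(chunks)
    elapsed = time.perf_counter() - start
    return sum(counts.values()) / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    sample = ["big data is about parallelization", "big data solutions"]
    print(count_tokens(sample).most_common(2))
```

In a cluster each chunk would be handled by a separate map task, and the fixed cost of starting tasks is the kind of overhead the first chart relates to input volume.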

HAMSTER: Hadoop Multi-signature Search
for Text-based Entity Retrieval
• Core algorithm: String Edit Distance, O(mnk^2)
• Baseline runs at 100 matches per day
• HAMSTER speedup: 33x (5-node speedup) × 60x (Java speedup) ≈ 2000x faster
Source Items | Target Items | Source items per target | Input Size | MAP Records | Cluster Max Map Tasks | Effective Map Tasks | CPU map (secs) | Wall time
34k          | 618k         | 100                     | 4.40GB     | 345         | 33                    | 33                  | 196k           | 2h 14 secs
34k          | 618k         | 50                      | 8.8GB      | 690         | 40                    | 66                  | 196k           | 1h 47min
34k          | 618k         | 30                      | 14.6GB     | 1,149       | 40                    | 110                 | 199k           | 1h 39 min
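As an illustration of the core algorithm, here is a minimal sketch of string edit distance (Levenshtein distance) and a brute-force match of every source item against every target item. It is not the HAMSTER implementation, which the slides do not show. The cost reading assumed here is that m and n are the numbers of source and target items and k is the typical string length, so that each pairwise comparison costs O(k^2). The helper name `best_matches` and the sample names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming, O(len(a) * len(b))."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def best_matches(sources, targets):
    """Brute force: for each source, the target with the smallest distance."""
    return {s: min(targets, key=lambda t: edit_distance(s, t))
            for s in sources}


if __name__ == "__main__":
    print(best_matches(["Jon Smith"], ["John Smith", "Jane Smyth"]))
```

Because every source is compared with every target independently, the pairwise comparisons can be split across map tasks, which is what makes this workload a natural fit for a Hadoop cluster.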

Leveraging Big Data Global Solutions

Creating Global Big Data solutions
Our goal is to evolve from Big Data algorithms to Big Data
Solutions

Example of Advanced Global Solution Matrix
Outlier Detection
Multivariate Segmentation
Sequence Matching
Network Analysis
Structured Prediction
K-Medoids Clustering
Customer Contextual Clickstream
Action Marketing Risk/Fraud Digital

Example: Transactional Time Series
Anomalous Behavior

On Demand Simulation: Generate Branches’ DNA
• Case Scenario: Unusual number of cash advances by 2 tellers.
[Figure panels: Original branch (August); Single day fraud; Multi day fraud]

Creating Regions of Interest based on On-Demand-Simulation
• Minimum-Spanning-Tree based branch association for region of interest generation
• Numbers shown are randomized indices
[Figure labels: Multi-day fraud simulation; Original branch; Region of interest]
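The slide names a minimum spanning tree as the basis for associating branches but does not give details. The sketch below is one possible reading, under stated assumptions: each branch is represented by a numeric feature vector (hypothetical), branches are linked with Prim's algorithm using Euclidean distance, and a region of interest is the set of branches reachable from a seed branch through tree edges no longer than a chosen threshold. The threshold rule and the function names are assumptions for illustration, not the author's method.

```python
import math


def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph.

    Returns a list of (i, j, distance) tree edges over point indices.
    """
    if not points:
        return []
    # best[j] = (shortest known distance from the tree to j, tree endpoint)
    best = {j: (math.dist(points[0], points[j]), 0)
            for j in range(1, len(points))}
    edges = []
    while best:
        j = min(best, key=lambda k: best[k][0])
        dist_j, i = best.pop(j)
        edges.append((i, j, dist_j))
        for k in best:  # relax distances through the newly added point
            d = math.dist(points[j], points[k])
            if d < best[k][0]:
                best[k] = (d, j)
    return edges


def region_of_interest(points, seed, max_edge):
    """Indices reachable from `seed` via MST edges of length <= max_edge."""
    adjacency = {i: [] for i in range(len(points))}
    for i, j, d in minimum_spanning_tree(points):
        if d <= max_edge:
            adjacency[i].append(j)
            adjacency[j].append(i)
    region, stack = {seed}, [seed]
    while stack:
        for neighbor in adjacency[stack.pop()]:
            if neighbor not in region:
                region.add(neighbor)
                stack.append(neighbor)
    return region


if __name__ == "__main__":
    # Hypothetical two-dimensional branch profiles.
    branches = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0)]
    print(region_of_interest(branches, seed=0, max_edge=2.0))
```

With the sample data, the long tree edge between the two groups is cut, so the region around branch 0 contains only its near neighbors.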

Conclusion: Lessons Learned
• One Size does not fit all
• Follow a Hybrid Approach
• Leverage Analytic patterns: Global Solutions
• Big Data is about Parallelization
• The future: expensive Algorithms applied to large datasets
• Global Solutions are the combination of algorithmic building blocks
applied to specific business problems
