In this presentation, Juan M. Huerta discusses the big data adoption process at Citi, realising the technical value of big data, and global solutions. He goes on to describe the hybrid approach Citi follows and the future of analytics: expensive algorithms applied to large datasets. Citi is using these approaches in the hope of gaining even wider global recognition.
4. Citi: A Customer Centered Organization
As a customer-centered bank, the goal of our Big Data strategy is to shift
the focus from independent vertical silos to Common Horizontal Solutions
focused around Citi’s 200-million customer accounts
5. Big Data Adoption Stakeholders
• Lines of Business
• Strategy & Decision Management Organizations: cross LOB & Geo,
global
• Data innovation Office: Governance & Regulatory
• CitiData – Big Data & Analytics Engineering
6. Big Data Adoption Roadmap
Adoption will not occur at once. The level of capability maturity across the
organization will vary significantly.
In theory, we think in terms of Staged Competencies of a Big Data
Maturity Model.
In practice, a hybrid process, which fits the level of maturity of
participants, is needed.
[Figure: Common Data; Common Analytic Platform; Common Tools & Techniques; Common Solutions; Common Focus Strategy]
7. Big Data Adoption Hybrid Participation Model
• Novice: Proof of Concept
• Expert: R&D Environment
• Shadowed
8. End-to-end Analytic Process for a POC Project
This is one component of the hybrid model
R&D process
Ideas and Hypotheses
• Pipeline of ideas to use data for competitive advantage
Information Asset Inventory Navigator (“IAIN”)
• Robust, comprehensive ontology allowing analysts and economists to search, sort, and select data for analysis
R&D Project Approval
• Preliminary assessment for business value, data safekeeping and alignment to business practices
Data Acquisition
• Where necessary, acquire new data sets to support R&D project
Data Set Preparation & Provisioning
• Basic preparation of data set (e.g., consolidation, conformation)
• Permission-based provisioning of data set into a Big Data Analytics environment
Analytics Execution
• Advanced analytic tools mine business insight from large volumes of data
Analytics Peer Review
• Data scientist peers review model findings and results
Engineering / Production process
Product Approval
• Formal approval process through Business Steering Committee based on understanding expected use of production data
Data Transformation & Provisioning
• Transformation rules executed to normalize and conform production data
• Conformed data set made available in production environment
Production Model Development
• Develop scalable, productizable analytics
Model Deployment
• Exploit insights and analyses across the enterprise to maximize value
• Models measured for quality / usage
Analytics Knowledge Management
• Robust, comprehensive ontology allowing analysts and economists to search, sort, and select data for analysis
9. Advanced Global Solutions
• A global solution is a tested algorithm or analytic model that carries
out a particular business analysis and which is leveraged at a global
scale
• A big data global solution enables the interplay of complex algorithms
and large datasets
• When a global solution is built upon big data approaches, a delivery
roadmap should be considered
• In the exploratory process, a Global Solution is developed in the
Innovation R&D environment and validated through a POC process
• Alignment with Innovation, UAT, PRD environments
11. The Boom Driving Big Data is Technological
Source: Heebyung Koh, Christopher L. Magee, “A functional approach for studying technological progress: Extension to energy technology,” Technological Forecasting and Social Change, Volume 75, Issue 6, July 2008, Pages 735–758
12. The Quadrant Of Analytic Opportunity
Run Time is affected by Data Size and Algorithmic Complexity
[Figure: data sets and algorithms placed on two axes, Data Size (ticks from 10^5 to 10^10) and Algorithmic Complexity]
• Data sets shown: Mtg+Cards+Banking Accounts Transaction features; Accounts Transactions; Branches Transactions; Accounts Summary Stats.; Employees Summary Stats.; GL-GOCS GL-Entries; Branches Summary Stats.
• Algorithm groups shown: Database Interaction; Traditional Statistical; Machine Learning; Big Data/Pattern Mining; Vector based Approaches; Sequence Mining; Predictive filtering
• Algorithms shown: Latent Dirichlet Allocation; HMM Baum-Welch, O(ns nf nt); CART, O(nf ns log ns); Iterative SVD-CF; K-means; Logistic Regression; PCA; PageRank; Self-Org. Maps; Neural Nets; Collaborative Filtering (CF); HMM; Conditional Random Fields; Support Vector Machines
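To make the interplay of data size and algorithmic complexity concrete, the sketch below compares how an operation count grows across the data sizes on the slide for a linear-cost pass and for the O(nf ns log ns) cost quoted for CART. It is only an illustration: the function names, the feature count of 100, and the use of a raw operation count as a stand-in for run time are assumptions, not figures from the presentation.

```python
import math


def linear_cost(n_samples: int, n_features: int) -> float:
    """Operation count proportional to O(nf * ns), e.g. one pass for summary stats."""
    return float(n_features * n_samples)


def cart_cost(n_samples: int, n_features: int) -> float:
    """Operation count proportional to O(nf * ns * log ns), as quoted for CART."""
    return n_features * n_samples * math.log2(n_samples)


def growth(cost, n_samples: int, base_samples: int, n_features: int) -> float:
    """How many times more work n_samples needs compared with base_samples."""
    return cost(n_samples, n_features) / cost(base_samples, n_features)


if __name__ == "__main__":
    n_features = 100  # hypothetical feature count, chosen only for illustration
    for exponent in range(5, 11):  # data sizes 10^5 .. 10^10, as on the slide axis
        n = 10 ** exponent
        print(
            f"10^{exponent}: linear x{growth(linear_cost, n, 10**5, n_features):,.0f}, "
            f"CART x{growth(cart_cost, n, 10**5, n_features):,.0f}"
        )
```

Even this mild log factor means that going from 10^5 to 10^10 records costs twice the extra work of a purely linear pass, which is why the more complex algorithms in the figure become expensive on the largest data sets.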
13. Breaking down the gains of P13n:
A Controlled Incremental Benchmark on a
Workstation grade processor (x500)
Implemented an incremental-SVD (Netflix Cup) predictive model that
runs on midsize datasets…
• x30: Compiled code (vs. interpreted)
• x4: In memory (vs. disk access)
• x3.12: Multithread (vs. single thread)
• x1.3: Workstation grade processor
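The x500 headline is consistent with the four factors compounding multiplicatively. The slide does not state this explicitly, so treat it as an assumption; the dictionary keys and the function name below are likewise illustrative.

```python
from math import prod

# Speedup factors as listed on the slide (labels paraphrased).
SPEEDUP_FACTORS = {
    "compiled code (vs. interpreted)": 30.0,
    "in memory (vs. disk access)": 4.0,
    "multithread (vs. single thread)": 3.12,
    "workstation grade processor": 1.3,
}


def combined_speedup(factors: dict) -> float:
    """Multiply the individual factors, assuming they are independent."""
    return prod(factors.values())


if __name__ == "__main__":
    total = combined_speedup(SPEEDUP_FACTORS)
    print(f"Combined speedup: x{total:.2f}")
```

The product is about 487, which rounds to the x500 quoted in the slide title.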
14. Basic Map Reduce Benchmarks
[Figure: Impact of overhead as a function of input volume (Series1; horizontal axis 1 to 6, vertical axis 0 to 0.7)]
[Figure: Relative Map Throughput as a function of # Mappers. Panel 1: Relative Map CPU time speedup (0 to 25) vs. Number of Maps (0 to 20), with a linear trend line; series legend as extracted: 0.003351955, 0.032258065, 0.319148936, 1, 2.631578947, 21.12676056. Panel 2: Tokens per Wall Clock Second (0 to 1600) vs. Number of Maps (0 to 20), Series1 with a linear trend line]
15. HAMSTER: Hadoop Multi-signature Search
for Text-based Entity Retrieval
• Core algorithm: String Edit Distance O(mnk2)
• Baseline runs at 100 matches per day
• HAMSTER speedup: 33x (5-node speedup) × 60x (Java speedup) ≈
2000x faster
Source Items | Target Items | Source items per target | Input Size | MAP Records | Cluster Max Map Tasks | Effective Map Tasks | CPU map (secs) | Wall time
34k | 618k | 100 | 4.40GB | 345 | 33 | 33 | 196k | 2h 14 secs
34k | 618k | 50 | 8.8GB | 690 | 40 | 66 | 196k | 1h 47min
34k | 618k | 30 | 14.6GB | 1,149 | 40 | 110 | 199k | 1h 39 min
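For readers unfamiliar with the core algorithm named on this slide, here is a minimal sketch of string edit distance. It is the textbook Levenshtein dynamic program with O(mn) time for strings of length m and n, not the HAMSTER multi-signature variant, whose details the slides do not give. The `best_match` helper and the sample strings are hypothetical, added only to show how a source item could be compared against candidate targets.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    # Keep the shorter string in the inner loop so the rows stay small.
    if len(source) < len(target):
        source, target = target, source
    previous = list(range(len(target) + 1))  # distances from the empty prefix
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def best_match(query: str, candidates: list) -> str:
    """Hypothetical helper: return the candidate closest to the query."""
    return min(candidates, key=lambda candidate: edit_distance(query, candidate))


if __name__ == "__main__":
    targets = ["Citibank N.A.", "Chase Bank", "Citigroup"]
    print(best_match("Citibank NA", targets))
```

Because every source item must be scored against many target items, the work is naturally split across map tasks, which is the kind of parallelism the table above measures.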
20. On Demand Simulation: Generate Branches’ DNA
• Case Scenario: Unusual number of cash advances by 2 tellers.
[Figure panels: Original branch (August); Single day fraud; Multi day fraud]
21. Creating Regions of Interest based on
On-Demand-Simulation
[Figure: Minimum-Spanning-Tree based branch association for region of interest generation. Labels: Multi-day fraud simulation; Original branch; Region of interest]
• Numbers shown are randomized indices
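One possible reading of Minimum-Spanning-Tree based branch association is sketched below: each branch is a point in some feature space, Prim's algorithm builds the tree, and a region of interest is the set of branches reachable from a seed branch through short tree edges. The slides do not describe the actual procedure, so the feature vectors, the distance threshold `max_edge`, and both function names are assumptions made for illustration.

```python
import math
from collections import defaultdict


def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph; returns (i, j, dist) edges."""
    if not points:
        return []
    # best[j] = (distance, i): cheapest known edge linking node j to the tree.
    best = {j: (math.dist(points[0], points[j]), 0) for j in range(1, len(points))}
    edges = []
    while best:
        j = min(best, key=lambda node: best[node][0])
        distance, i = best.pop(j)
        edges.append((i, j, distance))
        # The newly added node may offer a cheaper link to the remaining nodes.
        for k in best:
            candidate = math.dist(points[j], points[k])
            if candidate < best[k][0]:
                best[k] = (candidate, j)
    return edges


def region_of_interest(edges, seed, max_edge):
    """Branches connected to `seed` through tree edges no longer than `max_edge`."""
    adjacency = defaultdict(set)
    for i, j, distance in edges:
        if distance <= max_edge:
            adjacency[i].add(j)
            adjacency[j].add(i)
    region, stack = {seed}, [seed]
    while stack:
        node = stack.pop()
        for neighbour in adjacency[node] - region:
            region.add(neighbour)
            stack.append(neighbour)
    return region


if __name__ == "__main__":
    # Made-up 2-D branch features; real branch profiles would have more dimensions.
    branches = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10)]
    tree = minimum_spanning_tree(branches)
    print(sorted(region_of_interest(tree, seed=3, max_edge=2.0)))
```

Cutting the long tree edges is a simple way to turn one spanning tree into several groups of similar branches; other association rules would fit the slide title equally well.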
22. Conclusion: Lessons Learned
• One Size does not fit all
• Follow a Hybrid Approach
• Leverage Analytic patterns: Global Solutions
• Big Data is about Parallelization
• The future: expensive Algorithms applied to large datasets
• Global Solutions are the combination of algorithmic building blocks
applied to specific business problems