This talk details the HSBC Big Data journey to date, walking through the genesis of the Big Data initiative, which was triggered by continual challenges in delivering data-driven products. The global scale, diversity and legacy of an organization like HSBC present challenges for Hadoop adoption not typically faced by younger companies. Big Data technologies are by their very nature disruptive to the established enterprise IT environment. Hadoop and the peripheral toolsets in the big data ecosystem do not fit comfortably into an enterprise data centre or its IT operational processes, and can even prove disruptive to current organizational structures. Alasdair will focus on the steps that HSBC has taken to mitigate concerns about Hadoop and to raise awareness of the game-changing benefits a successful adoption of the technology will bring. HSBC has taken an innovative approach to proving out the value of the technology: engaging developers with a brakes-off opportunity to use the platform, and placing Hadoop in a competitive scenario with traditional technologies. The Hadoop journey in HSBC was initiated in Scotland, blessed in London and proved out in China.
3. Business Context: HSBC (HSS), a business with a lot of data…
Global Business
Global outsourcer of investment operations
Active in 40+ countries & jurisdictions
Over 150 operational technology systems
Outsourcing is a diverse and incrementally complex business
3 PUBLIC
4. Challenges in building Big Data Environments
Design Time
– ETL is a brittle one-shot at success
– One version of the truth…
– Tight coupling to the relational model
– Any significant change initiates data migration
[Diagram: traditional data pipeline across the stages Source, Integration, Warehouse, Division Marts and Channels, linked by ETL. Labels shown: Ops, Trades, Position, Corp Actions, External, Market Data, Client, Exchange; ODS, ETL Staging; Enterprise Logical Model; Strategic Marts; CMF; Product and Function marts; Read, eCommerce, Analytical Tools, Reporting.]
Run Time
– Vertical-scale RDBMS struggle with scale out
– Multi-marts increase duplication
– Big batch appliances are uneconomic
– Cost increases with proliferation
Time to Market: Months for any given slice, years in total
Total Cost: Any high-volume or low-latency environment requires annual spend in the millions to tens of millions
5. Building Big Data platforms has been an unhappy experience
Slow time to market has driven proliferation, not consolidation
Delivery risk is high, as witnessed in industry-wide failure rates
Ultimate customer satisfaction is low; we often end up answering yesterday's questions tomorrow
The economics of traditional technologies are against proliferation of analytical platforms
– Costs increase with addition of data sources
– Costs of change increase with addition of data sources
Processing ceilings are reached quickly when adding newer sources of data to traditional platforms
6. Crisis of Supply and Demand, we need a new approach
High-level requirements…
A single data platform that can provide 360-degree views of clients, operations and products
– Functionally the platform should support:
  – Continual development, integration and deployment
  – Parallel development streams
  – Integration of poly-structured datasets
  – Multi-views on single data sets
  – …act as an ENABLER of change
– Non-functionally the platform should:
  – Support a low-cost economic model for analytical platforms
  – Scale to terabytes with high-throughput ingest and integration
  – Co-exist with our current estate
  – Be accessible to business and technology teams
Enter Hadoop!
7. Introducing any new technology to an enterprise
Adoption Lifecycle: Hadoop
Learn: Proof of Concept
Plan: Business Value
Build: Pilot Projects, Strategic Stack
What have we done?
What's left, what's next?
9. Big Data Vision: The Agile Information Lifecycle
[Diagram: the agile information lifecycle. Labels shown: Data, Events, Ingest, Map Reduce, Processing, Discovery, Analytical Application, Blotters.]
Insights rarely happen on the first query or build; they are more likely to occur after several iterations on a dataset
10. Hadoop Proof of Concept Scope: Guangzhou China
Using a vendor Hadoop package
– Time to install
– Ease of maintaining the cluster
– Performance comparison
Developing on Hadoop
– Integration of existing databases
– Building applications on the cluster
– Porting existing code to Hadoop
Advanced Analytics on Hadoop
– Development skills levels
– Build out a new modelling service
– Enhance an existing analytics package
11. Proof of Concept Results
Hadoop was installed and operational in a week
18 RDBMS warehouse and mart databases were ported to Hadoop in 4 weeks
An existing batch that currently takes 3 hours was re-engineered on Hadoop: run time 10 minutes
A current Java-based analytics routine was ported onto Hadoop, increasing data coverage and reducing execution time
We lost the namenode and had to rebuild the cluster…
12. Hadoop Code Day: Guangzhou China
We sponsored a 24-hour code competition to allow the off-shore teams to show their stuff
We had over 50 volunteers for the event
The volunteers were split into teams of 3 and given 24 hours to develop an application using the Proof of Concept cluster
One week's training was offered to the participants on a casual basis
All the teams delivered…
13. Next Step: Planning
Adoption Lifecycle
Learn: Proof of Concept
Plan: Business Value
Build: Pilot Projects, Strategic Stack
14. Big Data Plan: Big Data Economics (names removed to protect the innocent)
15. Hadoop Economics: Technology for Austerity
REVENUE
MARGIN
COST
Hadoop speaks to the economics of today
Growing product and capacity at the same time as increasing margin
16. Generic HSBC Big Data Use Cases
Volume File Processing
Characteristics
• High volume, high throughput processing of legacy flat files, XML or other structured and semi-structured data
Current challenges
• Cost: High-volume processing predominantly still resides on the mainframe, making low-complexity processing expensive
• Scale: The ability to grow out mainframe capacity quickly is limited; the ability to scale on distributed platforms is limited

Big Warehouse
Characteristics
• Multi-source warehouse analytics environment providing a single data platform across multiple business lines
• Integration of polystructured data
Current challenges
• Time to Market: Data Warehouse / MI projects have proved extremely challenging to implement in HSBC and in the finance industry in general
• Complexity: Data integration of even group standard systems has proved difficult due to the variety of data structures and content
• Latency: Real-time MI is still only available via reporting from source directly

Advanced Analytics
Characteristics
• Statistical modeling and what-if analysis on group-wide data across multiple business lines
• Production of data-derived products
Current challenges
• Scale: Traditional analytic data platforms have only been able to scale on the vertical
• Cost: The amount of compute power required to perform volume statistical operations is cost prohibitive
• Fidelity: Analytical calculations are typically run on aggregate totals, leading to a disconnect between events and the derived conclusions or decisions

Day 1 Value
Strategic Value
20. Remaining Challenges: Big Data Operations
Big Data Operations
– Is Hadoop anti-virtualisation?
– High availability / disaster recovery needs to improve
– Security and data privacy concerns
– Data federation

Big Data Organisation
– Segregation of duties
– Big Data doesn’t want a separate app, database, OS & storage team. The platform demands skilled generalists

Hype / Cynicism
– USE IT AS A POSITIVE!!!
– Place Big Data into a competitive situation against your existing Information Management technologies; if you can’t get the job done better/faster/cheaper then alter your decision tree?
21. The art of the possible in 24 hours…..
Hadoop excites…… (and tires)
Hadoop on iPad & Android
The Winners….
Hadoop on HTML5 & Flex
Hadoop & R for Portfolio Optimisation
Editor's notes
In essence: We are a processor of other people's data
Challenges
Nobody does data the same way, even in the same systems
Differences are in
– Definitions
– Formats
– Content
Dedicated ETL is an expensive way of doing things
Big RDBMS or dedicated appliances are expensive
Marts, marts everywhere
CONCLUSION: high volume and/or low latency is very expensive to run
RESULT: People are becoming reluctant to invest in these platforms and are looking for a service that can start small and grow
The road to Damascus…
Vision is HSS only at this point in time
The search for an alternate way of doing things has led us to Hadoop
Hadoop lowers the barrier to entry for compute-style solutions to data problems
CONCLUSION: We view Hadoop as THE future technology for data platforms
RESULT: We have begun the tech adoption process in the bank
Today's biggest business challenge: Information management currently represents
– Agility in delivering data integration
– Flexibility to present multi-views of data
Biggest business opportunity: Analytics
– Scenario modelling
– Portfolio efficiency measurement
These all require big compute
…here’s what it looks like
Walk left to right
Explain Map Reduce
Contrast with the old way, our vision of the new way
EDW will be around for some time to come but will be gradually superseded
Map Reduce will be implemented via high-level languages
A single warehouse becomes achievable
Marts are demised in favour of views onto the base data
The value add will come via data discovery… iterative ETL… hypothesis testing
CONCLUSION: Hadoop brings massive compute levels to bear on these problems, affordably
This is the next-generation ETL
ETL processes become truly iterative
Accept that you will get it wrong the first time round; Hadoop makes the penalty for failure minimal
The value add will come via data discovery… iterative ETL… hypothesis testing
CONCLUSION: ETL moves from brittle to bend-don’t-break
RESULT: In building your Big Warehouse, adding additional data/systems/perspectives is a low-tax operation
Where we’ve got to
Go through the key challenges
CONCLUSION: It’s a journey, and we’re walking through it just now
RESULT: first 2 have been addressed, challenges remain
…our experience was
A vendor Hadoop package makes sense to an organisation like us
Data loads took days not months
We were quickly able to automate the loads
Used Apache tools only
BONUS: Calypso data… new for HSS
HACKATHON
Open invite to all markets staff
Objective: to use Hadoop against the business use case
Set judging criteria
Straight 24 hours over a weekend
Competition prizes
Attended by nearly 60 staff, equal to 20% of our China office
18 teams, 17 delivered
Winning application was stunning
CONCLUSION: Hadoop is a great functional fit for our business demand
RESULT: High level of confidence around the technology