Page 1
DataWorks Summit - Breakout Session
Disaster Recovery Experience at CACIB:
Hardening Hadoop for Critical Financial Applications
March 21st, 2019
Abdelkrim HADJIDJ – Cloudera
Mohamed Mehdi BEN AISSA – CA-GIP
Page 2
Speakers
Mohamed Mehdi BEN AISSA
Big Data Technical Architect at CA-GIP
Big Data Infrastructure Technical Owner for CA-CIB
Abdelkrim HADJIDJ
Solution Engineer at Cloudera
Page 3
Agenda
Big Data at CA-GIP & CA-CIB
Disaster Recovery Strategies
Stretch Cluster : Architecture & Configuration
Questions & Answers
Page 4
Big Data at CA-GIP & CA-CIB
Page 5
Big Data at CA-GIP & CA-CIB
CA-GIP (Infrastructure B&R):
• Creation date : 2019
• 80% of CA Group infrastructure
• 1500 collaborators
• 17 sites in France
CA-CIB:
• The world's n°13 bank (in 2017, measured by Tier One Capital)
• 8000 collaborators
• 36 locations around the world
Big Data platform:
• 15 Big Data experts (Run Team & Build Team)
• 8PB of Big Data storage
• 36TB of memory
• 4000 cores
Page 6
Big Data at CA-GIP & CA-CIB : Use Cases
Risk
Management
Decision
Making
Cash
Management
Regulations
Page 7
Big Data at CA-GIP & CA-CIB : Principal Use Cases
Risk Management/ Regulation
• Aims to replace the current market risk ecosystem and phase out the legacy system
(over 10 applications to decommission) to provide the bank with a golden source on
deal & risk indicators across business lines and worldwide
• Address ongoing and future regulations (LBF/Volcker rules, FRTB, BCBS239, Initial
Margin, Stress EBA/AQR …)
• 3PB of Data on Production to date
Cash Management Transformation
• Strategic program for CA-CIB new business
• Real time Transaction Processing
• Redesign the payment information system (SI) for CACIB and international deployment
• Target : 800 million transactions/day (8 TB/day)
Data-Lake
Real Time
Processing
Page 8
Big Data at CA-GIP & CA-CIB : Service Offer Architecture
INGESTION  PROCESSING  ACCESS
Scheduling, Security, Monitoring & Administration
STORAGE &
MESSAGING
DATA SOURCES
Data storage
Messaging
Batch processing
Stream
Processing
App 1
App 2
…
App n
Records
Documents
Files
Messages
Streams
Dataviz Data Governance
APPLICATIONS
Batch Mode
Stream Mode
Data query (SQL)
NoSQL Database
Indexed Data
OLAP
RAW DATA  /  ENHANCED DATA  /  OPTIMIZED DATA
Page 9
Big Data at CA-GIP & CA-CIB : Service Level Agreements
Disaster Recovery Performance Security
Resiliency
Service Availability
24/24 7/7
Zero Data Loss
Distributed Systems
Scalability
Data Locality
In-Memory Processing
Authentication
Authorization
Data Protection
Audit
Page 10
Disaster Recovery Strategies
Page 11
Disaster Recovery vs Backup vs Archive
Disaster Recovery (DR)
• Protects from the complete outage of a data center (e.g. natural disaster)
• Disaster Recovery includes replication, but also incorporates failover and failback
• Disaster Recovery Site can be an on-premise or cloud cluster
Backup / Restore
• Protects against logical errors (e.g. accidental deletion, corruption of data, etc)
• Incremental/full backup mechanisms are required to restore data to a previous Point
In Time (PIT) version. This usually involves a snapshot mechanism for PIT protection.
• Backups/snapshots are kept for a relatively short time (from days to months)
Archive
• A single static copy of data for long-term preservation (several years)
• This is required by some regulations
Page 12
Objective of a Disaster Recovery plan
• SLA (Service-Level Agreement) : Particular aspects of the service (quality, availability,
responsibilities) :
• RTO (Recovery Time Objective) : acceptable service interruption measured in time
• RPO (Recovery Point Objective) : maximum acceptable amount of data loss measured in
time
Goals : minimize service interruption (RTO), minimize data loss (RPO), reduce costs (€), guarantee consistency, optimize performance
Page 13
DR options
Three options, each spanning DC1 and DC2 (the multi-DC option adds a third DC) :
• Dual ingest : Low RPO/RTO
• Mirroring : High RPO/RTO
• Multiple DC : Low RPO/RTO
Page 14
Dual ingest
DR Cluster
PROD Cluster
Synchronicity Checks / Checksums
Pub-sub/
Streaming / Batch
Routing
Data sources Global
Traffic
Manager
Local Traffic
Manager
Local Traffic
Manager
End Applications/
Users
• Significant investment
• Might meet RPO=0 (in sync)
• Active/active site
Page 15
Dual ingest pros and cons
Pros
• Very low RPO/RTO (almost 0)
• Dual run makes failover and failback
easier
• Easy to implement from an infrastructure standpoint; tools like NiFi or Kafka make implementation easier
• Helps detect application bugs/errors (except ML)
Cons
• Requires two clusters with preferably
iso-resources
• Requires dual configuration injection (and automation)
• Impact on applications makes implementation complex (e.g. self-service)
• Requires a cluster diff implementation
• Data export should be run once
Page 16
Mirroring
Raw Data Ingest
Replicated Data
PROD Cluster DR Cluster
Pub-sub/
Streaming / Batch
Routing
Global
Traffic
Manager
Local Traffic
Manager
Local Traffic
Manager
End Applications/
Users
• Can meet RPO = 1h to 24 hrs
• Active/passive site
Data sources
Page 17
Mirroring pros and cons
Pros
• Loose requirements, easy to
implement
• Big Data technologies are designed
for this architecture
• Better performance (throughput,
network, latency)
• Can support other use cases
(isolation, geo-locality, legal, etc)
Cons
• Requires two clusters
• High RPO: potential data loss (async replication) that could be recovered from the source
• Requires a replication layer
• Need to define fail-over/fail-back logic and processes that go beyond just data
Page 18
Things to consider for mirroring
Applications
(Spark jobs, Hive queries, Zeppelin
notebooks, etc)
Data
(HDFS Files, Hive tables, Kafka
msgs, etc)
Infrastructure
(network, hardware, etc)
Configurations
(OS, Binaries, Ambari, Agents, RPM,
etc)
Process
(SLAs, Business
Continuity, Dev, etc)
Metadata
(Atlas, Ranger, Topics, etc)
Client configurations
(BI tools, HBase client, REST API, etc)
Infrastructure
services
(LDAP, AD, LB, etc)
Page 19
Replication tools
Diagram : two HDFS clusters (NameNode + DataNodes each); HDFS data is copied from the source cluster to the DR cluster, with inotify events from the source NameNode signalling namespace changes.
Page 20
What RPO can we realistically target?
We can achieve a smaller replication frequency and a better RPO (e.g. 10 mins) – but
this depends on several parameters:
Data : data volume, data bursts, # of partitions/files/tables, insert vs update ratio
Infrastructure : internal/external bandwidth, latency, dedicated/shared (day/time), CPU **
Software : synchronicity*, incremental replication, latency (snapshots, compression, encryption, integrity)
* Synchronous: very low RPO by throttling writes (impact on performance)
** Asynchronous: RPO = F( max(data_generation_rate), available_bandwidth )
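As a rough illustration of the asynchronous formula above (hypothetical numbers, not CACIB figures): if ingestion bursts to 1 TB/hour while the replication link can only drain 0.5 TB/hour, the DR copy falls behind by 0.5 TB for every hour the burst lasts, and the effective RPO stretches to however long it takes to absorb that backlog once the burst ends. Only the ratio between the peak data generation rate and the available bandwidth matters, which is exactly what the formula captures.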
Page 21
Spanning Multiple Data Centers
Data sources
Raw Data Ingest
DC1 DC2
DN
NN1
ZK1 JN1
DN
NN2
ZK2 JN2
Traffic
Manager
End Applications/
Users
DC3 (witness)
ZK3 JN3
• Restricted to data centers within a geographic region (a few km).
• Strong constraints: 3 DCs, single digit ms latency,
guaranteed bandwidth *
• Multi-DC is not native in Hadoop
* https://www.cloudera.com/documentation/other/reference-architecture/PDF/cloudera_ref_arch_metal.pdf
Page 22
Multiple Data Centers pros and cons
Pros
• Better RPO (synch replication)
• Cheaper, it’s just one cluster
• Simpler for applications
• No need for fail-over/fail-back
Cons
• Strong constraints: nearby 3 DCs, single
digit ms latency, guaranteed bandwidth *
• Advanced configurations: replica placement strategy, YARN labels, etc
• Performance impact from the inter-DC network
• Not suited for all the animals in the Zoo (e.g. streaming)
* https://www.cloudera.com/documentation/other/reference-architecture/PDF/cloudera_ref_arch_metal.pdf
Page 23
Stretch Cluster : Architecture & Configuration
Page 24
Stretch Cluster : Why !?
• SLA (Service-Level Agreement) : Particular aspects of the service (quality, availability,
responsibilities) :
• RTO (Recovery Time Objective) : The targeted duration of time and a service level within
which a business process must be restored after a disaster
• RPO (Recovery Point Objective) : The maximum targeted period in which data might be
lost
• Goals :
RTO -> 0 (24/7), RPO = 0, Reduce Costs (€), Consistency, Performance
Page 25
Stretch Cluster : Why !?
• SLA (Service-Level Agreement) : Particular aspects of the service (quality, availability,
responsibilities) :
• RTO (Recovery Time Objective) : The targeted duration of time and a service level within
which a business process must be restored after a disaster
• RPO (Recovery Point Objective) : The maximum targeted period in which data might be
lost
• Goals :
RTO -> 0 (24/7), RPO = 0, Reduce Costs (€), Consistency, Performance
Financial Context
Page 26
Stretch Cluster : Architecture
Control Nodes
Gateway Node
Witness Nodes
Master Nodes
Worker Nodes
Gateway Node
DC1 DC2
DC3
Page 27
Stretch Cluster for HDFS: Architecture & Configuration
Page 28
Stretch Cluster : HDFS Architecture
DC1 DC2
DC3
Rack 1 Rack 2 Rack 3 Rack 4
Datanode 1
Datanode 2
Datanode 3
Datanode 4
Datanode 5
Datanode 6
Datanode 7
Datanode 8
Datanode 9
Datanode 10
Datanode 11
Datanode 12
ZK + JN + NN ZK + JN ZK + JN + NN ZK + JN
ZK : Zookeeper
JN : JournalNode
NN : NameNode
ZK + JN
Inter-DCs Link
Bandwidth : 100 Gbits/s, Latency < 1ms
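One consequence of this layout is worth spelling out: with five ZooKeeper servers and five JournalNodes spread 2 + 2 + 1 across the three sites, the loss of any single data center leaves at least three of the five members running, so both the ZooKeeper ensemble and the JournalNode quorum keep their majority and the NameNode in the surviving data center can remain (or become) Active.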
Page 29
Stretch Cluster : HDFS Architecture – Before Rack (One-Layer) Awareness
DC1 DC2
DC3
Rack 1 Rack 2 Rack 3 Rack 4
Datanode 1
Datanode 2
Datanode 3
Datanode 4
Datanode 5
Datanode 6
Datanode 7
Datanode 8
Datanode 9
Datanode 10
Datanode 11
Datanode 12
ZK + JN + NN ZK + JN ZK + JN + NN ZK + JN
ZK + JN
B1
B1
B1
B1
2 replicas per DC / 1 replica per Rack
Inter-DCs Link
Bandwidth : 100 Gbits/s, Latency < 1ms
ZK : Zookeeper
JN : JournalNode
NN : NameNode
Page 30
Stretch Cluster : HDFS Architecture – After Rack (One-Layer) Awareness
DC1 DC2
DC3
Rack 1 Rack 2 Rack 3 Rack 4
Datanode 1
Datanode 2
Datanode 3
Datanode 4
Datanode 5
Datanode 6
Datanode 7
Datanode 8
Datanode 9
Datanode 10
Datanode 11
Datanode 12
ZK + JN + NN ZK + JN ZK + JN + NN ZK + JN
ZK + JN
B1
B1
B1B1
Rack Awareness Configuration
/dc1/rack1 /dc1/rack2 /dc2/rack3 /dc2/rack4
1
Inter-DCs Link
Bandwidth : 100 Gbits/s, Latency < 1ms
2 replicas per DC / 1 replica per Rack
ZK : Zookeeper
JN : JournalNode
NN : NameNode
Page 31
Stretch Cluster : HDFS Architecture – After Rack (One-Layer) Awareness
DC1 DC2
DC3
Rack 1 Rack 2 Rack 3 Rack 4
Datanode 1
Datanode 2
Datanode 3
Datanode 4
Datanode 5
Datanode 6
Datanode 7
Datanode 8
Datanode 9
Datanode 10
Datanode 11
Datanode 12
ZK + JN + NN ZK + JN ZK + JN + NN ZK + JN
Inter-DCs Link
Bandwidth : 100 Gbits/s, Latency < 1ms
ZK + JN
B1
B1
B1B1
Rack Awareness Configuration
/dc1/rack1 /dc1/rack2 /dc2/rack3 /dc2/rack4
1
2 replicas per DC / 1 replica per Rack
ZK : Zookeeper
JN : JournalNode
NN : NameNode
Page 32
Stretch Cluster : HDFS Architecture – After Rack (One-Layer) Awareness
DC1 DC2
DC3
Rack 1 Rack 2 Rack 3 Rack 4
Datanode 1
Datanode 2
Datanode 3
Datanode 4
Datanode 5
Datanode 6
Datanode 7
Datanode 8
Datanode 9
Datanode 10
Datanode 11
Datanode 12
ZK + JN + NN ZK + JN ZK + JN + NN ZK + JN
ZK + JN
B1
B1
B1B1
Rack Awareness Configuration
/dc1/rack1 /dc1/rack2 /dc2/rack3 /dc2/rack4
1
Inter-DCs Link
Bandwidth : 100 Gbits/s, Latency < 1ms
2 replicas per DC / 1 replica per Rack
HDFS (Default) Block Placement Strategy :
• One replica on local Node
• Second replica on a remote Rack
• Third replica on same remote Rack
• Additional replicas are randomly placed
ZK : Zookeeper
JN : JournalNode
NN : NameNode
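A minimal sketch of how the /dcX/rackY paths above could be declared, assuming the static TableMapping resolver is used rather than a topology script (the hostnames, file path and mapping file are illustrative, not taken from the CACIB deployment):

    <!-- core-site.xml (sketch) -->
    <property>
      <name>net.topology.node.switch.mapping.impl</name>
      <value>org.apache.hadoop.net.TableMapping</value>
    </property>
    <property>
      <name>net.topology.table.file.name</name>
      <value>/etc/hadoop/conf/topology.data</value>
    </property>
    <!-- /etc/hadoop/conf/topology.data maps each DataNode host to a
         DC-prefixed rack path, one "hostname rack" pair per line, e.g.:
         datanode1.example.com   /dc1/rack1
         datanode4.example.com   /dc1/rack2
         datanode7.example.com   /dc2/rack3
         datanode10.example.com  /dc2/rack4 -->

Ambari deployments would more commonly set the rack of each host through the UI or API; whichever mechanism is used, the important point is that the rack string carries the data-center prefix so the placement policy can distinguish the two sites.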
Page 33
Stretch Cluster : HDFS Architecture – Advanced Configuration (Two-Layers Awareness)
Topology (Data Center) Awareness & advanced Replicator
• core-site.xml
• net.topology.impl -> org.apache.hadoop.net.NetworkTopologyWithNodeGroup
• net.topology.nodegroup.aware -> true
• dfs.block.replicator.classname -> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyWithNodeGroup
Adjust Timeouts (RTO -> 0)
• core-site.xml
• dfs.heartbeat.interval
• dfs.namenode.heartbeat.recheck-interval
Recovery from Close Failure (DFSOutputStream)
• hdfs-site.xml
• dfs.client.block.write.replace-datanode-on-failure.enable -> true
• dfs.client.block.write.replace-datanode-on-failure.best-effort -> true
2
3 4
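A hedged sketch of how these properties might be laid out in the corresponding *-site.xml files (values are illustrative, not the tuned production settings; the deck lists some of them under core-site.xml, which also works since Hadoop daemons load both files):

    <!-- core-site.xml (sketch) : two-layer (node group / data center) topology -->
    <property>
      <name>net.topology.impl</name>
      <value>org.apache.hadoop.net.NetworkTopologyWithNodeGroup</value>
    </property>
    <property>
      <name>net.topology.nodegroup.aware</name>
      <value>true</value>
    </property>

    <!-- hdfs-site.xml (sketch) : node-group aware placement, faster failure
         detection and tolerant write-pipeline recovery -->
    <property>
      <name>dfs.block.replicator.classname</name>
      <value>org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyWithNodeGroup</value>
    </property>
    <property>
      <name>dfs.heartbeat.interval</name>
      <value>3</value>       <!-- seconds between DataNode heartbeats -->
    </property>
    <property>
      <name>dfs.namenode.heartbeat.recheck-interval</name>
      <value>45000</value>   <!-- ms; lowering it shortens the ~10 min default dead-DataNode detection delay -->
    </property>
    <property>
      <name>dfs.client.block.write.replace-datanode-on-failure.enable</name>
      <value>true</value>
    </property>
    <property>
      <name>dfs.client.block.write.replace-datanode-on-failure.best-effort</name>
      <value>true</value>
    </property>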
Page 34
Stretch Cluster : HDFS Architecture – After Rack Awareness & Advanced Configuration
DC1 DC2
DC3
Rack 1 Rack 2 Rack 3 Rack 4
Datanode 1
Datanode 2
Datanode 3
Datanode 4
Datanode 5
Datanode 6
Datanode 7
Datanode 8
Datanode 9
Datanode 10
Datanode 11
Datanode 12
ZK + JN + NN ZK + JN ZK + JN + NN ZK + JN
ZK + JN
B1 B1 B1 B1
Inter-DCs Link
Bandwidth : 100 Gbits/s, Latency < 1ms
1 replica per Rack / 2 replicas per DC
ZK : Zookeeper
JN : JournalNode
NN : NameNode
Page 35
Stretch Cluster : HDFS Architecture – Failover Management
DC1 DC2
DC3
Rack 1 Rack 2 Rack 3 Rack 4
Datanode 1
Datanode 2
Datanode 3
Datanode 4
Datanode 5
Datanode 6
Datanode 7
Datanode 8
Datanode 9
Datanode 10
Datanode 11
Datanode 12
ZK + JN + NN ZK + JN ZK + JN + NN ZK + JN
ZK + JN
B1 B1 B1 B1
Inter-DCs Link
Bandwidth : 100 Gbits/s, Latency < 1ms
ZK : Zookeeper
JN : JournalNode
NN : NameNode
Page 36
Stretch Cluster : HDFS Architecture – Failover Management
DC1 DC2
DC3
Rack 1 Rack 2 Rack 3 Rack 4
Datanode 1
Datanode 2
Datanode 3
Datanode 4
Datanode 5
Datanode 6
Datanode 7
Datanode 8
Datanode 9
Datanode 10
Datanode 11
Datanode 12
ZK + JN + NN ZK + JN ZK + JN + NN ZK + JN
ZK + JN
B1 B1 B1 B1
Keep Only 2 replicas per DC
Inter-DCs Link
Bandwidth : 100 Gbits/s, Latency < 1ms
ZK : Zookeeper
JN : JournalNode
NN : NameNode
Page 37
Stretch Cluster for YARN: Architecture & Configuration
Page 38
Stretch Cluster : YARN Architecture
DC1 DC2
DC3
Rack 1 Rack 2 Rack 3 Rack 4
NodeManager 1
NodeManager 2
NodeManager 3
NodeManager 4
NodeManager 5
NodeManager 6
NodeManager 7
NodeManager 8
NodeManager 9
NodeManager 10
NodeManager 11
NodeManager 12
ZK + RM ZK ZK + RM ZK
ZK : Zookeeper
JN : JournalNode
RM : ResourceManager
ZK
Inter-DCs Link
Bandwidth : 100 Gbits/s, Latency < 1ms
Page 39
Stretch Cluster : YARN Architecture – Advanced Configuration
Topology (Data Center) Awareness : additional layer with node & Rack
• yarn-site.xml
• org.apache.hadoop.mapreduce.v2.app.rm.ScheduledRequestsWithNodeGroup ->
net.topology.with.nodegroup
• yarn.resourcemanager.scheduler.elements.factory.impl ->
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerElementsFactoryWithNodeGroup
Adjust Timeouts (RTO -> 0)
• core-site.xml
• ipc.client.connection.maxidletime
• yarn-site.xml
• yarn.nodemanager.health-checker.interval-ms
• yarn.nm.liveness-monitor.expiry-interval-ms
1
2
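The node-group scheduler classes above rely on the same two-layer topology extension as on the HDFS side; the timeout part, by contrast, uses standard YARN/IPC properties, and a hedged sketch could look like this (values are illustrative, not the CACIB production settings):

    <!-- core-site.xml (sketch) -->
    <property>
      <name>ipc.client.connection.maxidletime</name>
      <value>10000</value>    <!-- ms before an idle IPC connection is dropped -->
    </property>

    <!-- yarn-site.xml (sketch) : faster detection of lost NodeManagers -->
    <property>
      <name>yarn.nodemanager.health-checker.interval-ms</name>
      <value>60000</value>
    </property>
    <property>
      <name>yarn.nm.liveness-monitor.expiry-interval-ms</name>
      <value>120000</value>   <!-- default is 600000 ms (10 min) -->
    </property>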
Page 40
Stretch Cluster : YARN Architecture – Before Node Labels
DC1 DC2
DC3
Rack 1 Rack 2 Rack 3 Rack 4
NodeManager 1
NodeManager 2
NodeManager 3
NodeManager 4
NodeManager 5
NodeManager 6
NodeManager 7
NodeManager 8
NodeManager 9
NodeManager 10
NodeManager 11
NodeManager 12
ZK + RM ZK ZK + RM ZK
ZK
A1
A1A1
A1
Inter-DCs exchange Optimization
Inter-DCs Link
Bandwidth : 100 Gbits/s, Latency < 1ms
ZK : Zookeeper
JN : JournalNode
RM : ResourceManager
Page 41
Stretch Cluster : YARN Architecture – After Node Labels
DC1 DC2
DC3
Rack 1 Rack 2 Rack 3 Rack 4
NodeManager 1
NodeManager 2
NodeManager 3
NodeManager 4
NodeManager 5
NodeManager 6
NodeManager 7
NodeManager 8
NodeManager 9
NodeManager 10
NodeManager 11
NodeManager 12
ZK + RM ZK ZK + RM ZK
ZK
Node Labels Configuration
Node.label: dc1 Node.label: dc2
A1
A1
A1
A1
3
Inter-DCs exchange Optimization
Inter-DCs Link
Bandwidth : 100 Gbits/s, Latency < 1ms
ZK : Zookeeper
JN : JournalNode
RM : ResourceManager
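The deck does not detail how the dc1/dc2 labels are assigned to the NodeManagers; one common approach, sketched here with centralized, non-exclusive node labels and illustrative hostnames and paths, is:

    <!-- yarn-site.xml (sketch) : enable node labels -->
    <property>
      <name>yarn.node-labels.enabled</name>
      <value>true</value>
    </property>
    <property>
      <name>yarn.node-labels.fs-store.root-dir</name>
      <value>hdfs:///system/yarn/node-labels</value>
    </property>
    <!-- With centralized label management, the dc1/dc2 labels would then be
         created and attached to NodeManagers from the command line, e.g.:
         yarn rmadmin -addToClusterNodeLabels "dc1(exclusive=false),dc2(exclusive=false)"
         yarn rmadmin -replaceLabelsOnNode "nodemanager1.example.com=dc1 nodemanager7.example.com=dc2"
         Queues are finally granted access to the labels in capacity-scheduler.xml. -->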
Page 42
Stretch Cluster : YARN Architecture – Failover
DC1 DC2
DC3
Rack 1 Rack 2 Rack 3 Rack 4
NodeManager 1
NodeManager 2
NodeManager 3
NodeManager 4
NodeManager 5
NodeManager 6
NodeManager 7
NodeManager 8
NodeManager 9
NodeManager 10
NodeManager 11
NodeManager 12
ZK + RM ZK ZK + RM ZK
ZK
Node.label: dc1 Node.label: dc2
A1
A1
A1
A1
Inter-DCs Link
Bandwidth : 100 Gbits/s, Latency < 1ms
ZK : Zookeeper
JN : JournalNode
RM : ResourceManager
Page 43
Stretch Cluster : YARN Architecture – Failover
DC1 DC2
DC3
Rack 1 Rack 2 Rack 3 Rack 4
NodeManager 1
NodeManager 2
NodeManager 3
NodeManager 4
NodeManager 5
NodeManager 6
NodeManager 7
NodeManager 8
NodeManager 9
NodeManager 10
NodeManager 11
NodeManager 12
ZK + RM ZK ZK + RM ZK
ZK
Node.label: dc1 Node.label: dc2
A1
A1
A1
A1
Automatic Failover Management
Inter-DCs Link
Bandwidth : 100 Gbits/s, Latency < 1ms
ZK : Zookeeper
JN : JournalNode
RM : ResourceManager
Page 44
Conclusion
Page 45
Conclusion
• DRP Tests & Concept Validation (including Infrastructures & Applications) :
• Disk Failure
• Node Failure
• Rack Failure
• DC Failure
• Inter-DCs Link Failure (avoid Split-Brain scenario)
• The Stretch Cluster is implemented and validated for all HDP components :
Ambari, Kafka, Storm, AMS, HBase, Ranger, etc.
• SLAs Validation : Performance, RPO=0, RTO -> 0, Consistency, etc.
• Advanced Monitoring : Infrastructure, Inter-DCs Link, Applications, etc.
Page 46
Questions ?
Más contenido relacionado

La actualidad más candente

서버리스 IoT 백엔드 개발 및 구현 사례 : 윤석찬 (AWS 테크에반젤리스트)
서버리스 IoT 백엔드 개발 및 구현 사례 : 윤석찬 (AWS 테크에반젤리스트)서버리스 IoT 백엔드 개발 및 구현 사례 : 윤석찬 (AWS 테크에반젤리스트)
서버리스 IoT 백엔드 개발 및 구현 사례 : 윤석찬 (AWS 테크에반젤리스트)Amazon Web Services Korea
 
클라우드 마이그레이션을 통한 비지니스 성공 사례- AWS Summit Seoul 2017
클라우드 마이그레이션을 통한 비지니스 성공 사례- AWS Summit Seoul 2017클라우드 마이그레이션을 통한 비지니스 성공 사례- AWS Summit Seoul 2017
클라우드 마이그레이션을 통한 비지니스 성공 사례- AWS Summit Seoul 2017Amazon Web Services Korea
 
AWS 12월 웨비나 │성공적인 마이그레이션을 위한 클라우드 아키텍처 및 운영 고도화
AWS 12월 웨비나 │성공적인 마이그레이션을 위한 클라우드 아키텍처 및 운영 고도화AWS 12월 웨비나 │성공적인 마이그레이션을 위한 클라우드 아키텍처 및 운영 고도화
AWS 12월 웨비나 │성공적인 마이그레이션을 위한 클라우드 아키텍처 및 운영 고도화Amazon Web Services Korea
 
Terraform in deployment pipeline
Terraform in deployment pipelineTerraform in deployment pipeline
Terraform in deployment pipelineAnton Babenko
 
Terraform modules and some of best-practices - March 2019
Terraform modules and some of best-practices - March 2019Terraform modules and some of best-practices - March 2019
Terraform modules and some of best-practices - March 2019Anton Babenko
 
Drupalによる大規模サイトの設計・実装 において何に気をつけるべきか
Drupalによる大規模サイトの設計・実装において何に気をつけるべきかDrupalによる大規模サイトの設計・実装において何に気をつけるべきか
Drupalによる大規模サイトの設計・実装 において何に気をつけるべきかdgcircus
 
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon RedshiftBest Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon RedshiftAmazon Web Services
 
[AWS Dev Day] 앱 현대화 | 코드 기반 인프라(IaC)를 활용한 현대 애플리케이션 개발 가속화, 우리도 할 수 있어요 - 김필중...
[AWS Dev Day] 앱 현대화 | 코드 기반 인프라(IaC)를 활용한 현대 애플리케이션 개발 가속화, 우리도 할 수 있어요 - 김필중...[AWS Dev Day] 앱 현대화 | 코드 기반 인프라(IaC)를 활용한 현대 애플리케이션 개발 가속화, 우리도 할 수 있어요 - 김필중...
[AWS Dev Day] 앱 현대화 | 코드 기반 인프라(IaC)를 활용한 현대 애플리케이션 개발 가속화, 우리도 할 수 있어요 - 김필중...Amazon Web Services Korea
 
Hashicorp-Certified-Terraform-Associate-v3-edited.pptx
Hashicorp-Certified-Terraform-Associate-v3-edited.pptxHashicorp-Certified-Terraform-Associate-v3-edited.pptx
Hashicorp-Certified-Terraform-Associate-v3-edited.pptxssuser0d6c88
 
Kubernetes Docker Container Implementation Ppt PowerPoint Presentation Slide ...
Kubernetes Docker Container Implementation Ppt PowerPoint Presentation Slide ...Kubernetes Docker Container Implementation Ppt PowerPoint Presentation Slide ...
Kubernetes Docker Container Implementation Ppt PowerPoint Presentation Slide ...SlideTeam
 
오토스케일링 제대로 활용하기 (김일호) - AWS 웨비나 시리즈 2015
오토스케일링 제대로 활용하기 (김일호) - AWS 웨비나 시리즈 2015오토스케일링 제대로 활용하기 (김일호) - AWS 웨비나 시리즈 2015
오토스케일링 제대로 활용하기 (김일호) - AWS 웨비나 시리즈 2015Amazon Web Services Korea
 
FIWARE Global Summit - The Scorpio NGSI-LD Broker: Features and Supported Arc...
FIWARE Global Summit - The Scorpio NGSI-LD Broker: Features and Supported Arc...FIWARE Global Summit - The Scorpio NGSI-LD Broker: Features and Supported Arc...
FIWARE Global Summit - The Scorpio NGSI-LD Broker: Features and Supported Arc...FIWARE
 
DSpace-CRIS, anticipating innovation
DSpace-CRIS, anticipating innovationDSpace-CRIS, anticipating innovation
DSpace-CRIS, anticipating innovation4Science
 
Observability and Management on OCI - Logging and Monitoring
Observability and Management on OCI - Logging and MonitoringObservability and Management on OCI - Logging and Monitoring
Observability and Management on OCI - Logging and MonitoringKnoldus Inc.
 
Getting started with DSpace 7 REST API
Getting started with DSpace 7 REST APIGetting started with DSpace 7 REST API
Getting started with DSpace 7 REST API4Science
 
Comprehensive Terraform Training
Comprehensive Terraform TrainingComprehensive Terraform Training
Comprehensive Terraform TrainingYevgeniy Brikman
 

La actualidad más candente (20)

서버리스 IoT 백엔드 개발 및 구현 사례 : 윤석찬 (AWS 테크에반젤리스트)
서버리스 IoT 백엔드 개발 및 구현 사례 : 윤석찬 (AWS 테크에반젤리스트)서버리스 IoT 백엔드 개발 및 구현 사례 : 윤석찬 (AWS 테크에반젤리스트)
서버리스 IoT 백엔드 개발 및 구현 사례 : 윤석찬 (AWS 테크에반젤리스트)
 
클라우드 마이그레이션을 통한 비지니스 성공 사례- AWS Summit Seoul 2017
클라우드 마이그레이션을 통한 비지니스 성공 사례- AWS Summit Seoul 2017클라우드 마이그레이션을 통한 비지니스 성공 사례- AWS Summit Seoul 2017
클라우드 마이그레이션을 통한 비지니스 성공 사례- AWS Summit Seoul 2017
 
AWS 12월 웨비나 │성공적인 마이그레이션을 위한 클라우드 아키텍처 및 운영 고도화
AWS 12월 웨비나 │성공적인 마이그레이션을 위한 클라우드 아키텍처 및 운영 고도화AWS 12월 웨비나 │성공적인 마이그레이션을 위한 클라우드 아키텍처 및 운영 고도화
AWS 12월 웨비나 │성공적인 마이그레이션을 위한 클라우드 아키텍처 및 운영 고도화
 
Terraform in deployment pipeline
Terraform in deployment pipelineTerraform in deployment pipeline
Terraform in deployment pipeline
 
Terraform modules and some of best-practices - March 2019
Terraform modules and some of best-practices - March 2019Terraform modules and some of best-practices - March 2019
Terraform modules and some of best-practices - March 2019
 
NiFi 시작하기
NiFi 시작하기NiFi 시작하기
NiFi 시작하기
 
Drupalによる大規模サイトの設計・実装 において何に気をつけるべきか
Drupalによる大規模サイトの設計・実装において何に気をつけるべきかDrupalによる大規模サイトの設計・実装において何に気をつけるべきか
Drupalによる大規模サイトの設計・実装 において何に気をつけるべきか
 
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon RedshiftBest Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift
 
[AWS Dev Day] 앱 현대화 | 코드 기반 인프라(IaC)를 활용한 현대 애플리케이션 개발 가속화, 우리도 할 수 있어요 - 김필중...
[AWS Dev Day] 앱 현대화 | 코드 기반 인프라(IaC)를 활용한 현대 애플리케이션 개발 가속화, 우리도 할 수 있어요 - 김필중...[AWS Dev Day] 앱 현대화 | 코드 기반 인프라(IaC)를 활용한 현대 애플리케이션 개발 가속화, 우리도 할 수 있어요 - 김필중...
[AWS Dev Day] 앱 현대화 | 코드 기반 인프라(IaC)를 활용한 현대 애플리케이션 개발 가속화, 우리도 할 수 있어요 - 김필중...
 
Anthos
AnthosAnthos
Anthos
 
02 terraform core concepts
02 terraform core concepts02 terraform core concepts
02 terraform core concepts
 
Hashicorp-Certified-Terraform-Associate-v3-edited.pptx
Hashicorp-Certified-Terraform-Associate-v3-edited.pptxHashicorp-Certified-Terraform-Associate-v3-edited.pptx
Hashicorp-Certified-Terraform-Associate-v3-edited.pptx
 
Kubernetes Docker Container Implementation Ppt PowerPoint Presentation Slide ...
Kubernetes Docker Container Implementation Ppt PowerPoint Presentation Slide ...Kubernetes Docker Container Implementation Ppt PowerPoint Presentation Slide ...
Kubernetes Docker Container Implementation Ppt PowerPoint Presentation Slide ...
 
오토스케일링 제대로 활용하기 (김일호) - AWS 웨비나 시리즈 2015
오토스케일링 제대로 활용하기 (김일호) - AWS 웨비나 시리즈 2015오토스케일링 제대로 활용하기 (김일호) - AWS 웨비나 시리즈 2015
오토스케일링 제대로 활용하기 (김일호) - AWS 웨비나 시리즈 2015
 
FIWARE Global Summit - The Scorpio NGSI-LD Broker: Features and Supported Arc...
FIWARE Global Summit - The Scorpio NGSI-LD Broker: Features and Supported Arc...FIWARE Global Summit - The Scorpio NGSI-LD Broker: Features and Supported Arc...
FIWARE Global Summit - The Scorpio NGSI-LD Broker: Features and Supported Arc...
 
Terraform on Azure
Terraform on AzureTerraform on Azure
Terraform on Azure
 
DSpace-CRIS, anticipating innovation
DSpace-CRIS, anticipating innovationDSpace-CRIS, anticipating innovation
DSpace-CRIS, anticipating innovation
 
Observability and Management on OCI - Logging and Monitoring
Observability and Management on OCI - Logging and MonitoringObservability and Management on OCI - Logging and Monitoring
Observability and Management on OCI - Logging and Monitoring
 
Getting started with DSpace 7 REST API
Getting started with DSpace 7 REST APIGetting started with DSpace 7 REST API
Getting started with DSpace 7 REST API
 
Comprehensive Terraform Training
Comprehensive Terraform TrainingComprehensive Terraform Training
Comprehensive Terraform Training
 

Similar a Disaster Recovery Experience at CACIB: Hardening Hadoop for Critical Financial Applications

Audax Group: CIO Perspectives - Managing The Copy Data Explosion
Audax Group: CIO Perspectives - Managing The Copy Data ExplosionAudax Group: CIO Perspectives - Managing The Copy Data Explosion
Audax Group: CIO Perspectives - Managing The Copy Data Explosionactifio
 
Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...
Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...
Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...Sumeet Singh
 
Architecting a Scalable Hadoop Platform: Top 10 considerations for success
Architecting a Scalable Hadoop Platform: Top 10 considerations for successArchitecting a Scalable Hadoop Platform: Top 10 considerations for success
Architecting a Scalable Hadoop Platform: Top 10 considerations for successDataWorks Summit
 
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...Sumeet Singh
 
SQL PASS Taiwan 七月份聚會-1
SQL PASS Taiwan 七月份聚會-1SQL PASS Taiwan 七月份聚會-1
SQL PASS Taiwan 七月份聚會-1SQLPASSTW
 
AquaQ Analytics Kx Event - Data Direct Networks Presentation
AquaQ Analytics Kx Event - Data Direct Networks PresentationAquaQ Analytics Kx Event - Data Direct Networks Presentation
AquaQ Analytics Kx Event - Data Direct Networks PresentationAquaQ Analytics
 
Cloudifying High Availability: The Case for Elastic Disaster Recovery
Cloudifying High Availability: The Case for Elastic Disaster RecoveryCloudifying High Availability: The Case for Elastic Disaster Recovery
Cloudifying High Availability: The Case for Elastic Disaster RecoveryAli Hodroj
 
Compare Clustering Methods for MS SQL Server
Compare Clustering Methods for MS SQL ServerCompare Clustering Methods for MS SQL Server
Compare Clustering Methods for MS SQL ServerAlexDepo
 
From Disaster to Recovery: Preparing Your IT for the Unexpected
From Disaster to Recovery: Preparing Your IT for the UnexpectedFrom Disaster to Recovery: Preparing Your IT for the Unexpected
From Disaster to Recovery: Preparing Your IT for the UnexpectedDataCore Software
 
Architecting and Tuning IIB/eXtreme Scale for Maximum Performance and Reliabi...
Architecting and Tuning IIB/eXtreme Scale for Maximum Performance and Reliabi...Architecting and Tuning IIB/eXtreme Scale for Maximum Performance and Reliabi...
Architecting and Tuning IIB/eXtreme Scale for Maximum Performance and Reliabi...Prolifics
 
Cloud-Native Data: What data questions to ask when building cloud-native apps
Cloud-Native Data: What data questions to ask when building cloud-native appsCloud-Native Data: What data questions to ask when building cloud-native apps
Cloud-Native Data: What data questions to ask when building cloud-native appsVMware Tanzu
 
Data core overview - haluk-final
Data core overview - haluk-finalData core overview - haluk-final
Data core overview - haluk-finalHaluk Ulubay
 
Tổng quan công nghệ Net backup - Phần 1
Tổng quan công nghệ Net backup - Phần 1Tổng quan công nghệ Net backup - Phần 1
Tổng quan công nghệ Net backup - Phần 1NguyenDat Quoc
 
Denodo Platform 7.0: Redefine Analytics with In-Memory Parallel Processing an...
Denodo Platform 7.0: Redefine Analytics with In-Memory Parallel Processing an...Denodo Platform 7.0: Redefine Analytics with In-Memory Parallel Processing an...
Denodo Platform 7.0: Redefine Analytics with In-Memory Parallel Processing an...Denodo
 
First in Class: Optimizing the Data Lake for Tighter Integration
First in Class: Optimizing the Data Lake for Tighter IntegrationFirst in Class: Optimizing the Data Lake for Tighter Integration
First in Class: Optimizing the Data Lake for Tighter IntegrationInside Analysis
 
CtrlS: Cloud Solutions for Retail & eCommerce
CtrlS: Cloud Solutions for Retail & eCommerceCtrlS: Cloud Solutions for Retail & eCommerce
CtrlS: Cloud Solutions for Retail & eCommerceeTailing India
 
Data center disaster recovery.ppt
Data center disaster recovery.ppt Data center disaster recovery.ppt
Data center disaster recovery.ppt omalreda
 

Similar a Disaster Recovery Experience at CACIB: Hardening Hadoop for Critical Financial Applications (20)

Audax Group: CIO Perspectives - Managing The Copy Data Explosion
Audax Group: CIO Perspectives - Managing The Copy Data ExplosionAudax Group: CIO Perspectives - Managing The Copy Data Explosion
Audax Group: CIO Perspectives - Managing The Copy Data Explosion
 
Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...
Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...
Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...
 
Architecting a Scalable Hadoop Platform: Top 10 considerations for success
Architecting a Scalable Hadoop Platform: Top 10 considerations for successArchitecting a Scalable Hadoop Platform: Top 10 considerations for success
Architecting a Scalable Hadoop Platform: Top 10 considerations for success
 
Greenplum feature
Greenplum featureGreenplum feature
Greenplum feature
 
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
 
SQL PASS Taiwan 七月份聚會-1
SQL PASS Taiwan 七月份聚會-1SQL PASS Taiwan 七月份聚會-1
SQL PASS Taiwan 七月份聚會-1
 
Greenplum Architecture
Greenplum ArchitectureGreenplum Architecture
Greenplum Architecture
 
AquaQ Analytics Kx Event - Data Direct Networks Presentation
AquaQ Analytics Kx Event - Data Direct Networks PresentationAquaQ Analytics Kx Event - Data Direct Networks Presentation
AquaQ Analytics Kx Event - Data Direct Networks Presentation
 
Cloudifying High Availability: The Case for Elastic Disaster Recovery
Cloudifying High Availability: The Case for Elastic Disaster RecoveryCloudifying High Availability: The Case for Elastic Disaster Recovery
Cloudifying High Availability: The Case for Elastic Disaster Recovery
 
Compare Clustering Methods for MS SQL Server
Compare Clustering Methods for MS SQL ServerCompare Clustering Methods for MS SQL Server
Compare Clustering Methods for MS SQL Server
 
From Disaster to Recovery: Preparing Your IT for the Unexpected
From Disaster to Recovery: Preparing Your IT for the UnexpectedFrom Disaster to Recovery: Preparing Your IT for the Unexpected
From Disaster to Recovery: Preparing Your IT for the Unexpected
 
Architecting and Tuning IIB/eXtreme Scale for Maximum Performance and Reliabi...
Architecting and Tuning IIB/eXtreme Scale for Maximum Performance and Reliabi...Architecting and Tuning IIB/eXtreme Scale for Maximum Performance and Reliabi...
Architecting and Tuning IIB/eXtreme Scale for Maximum Performance and Reliabi...
 
Infrastructure Strategies 2007
Infrastructure Strategies 2007Infrastructure Strategies 2007
Infrastructure Strategies 2007
 
Cloud-Native Data: What data questions to ask when building cloud-native apps
Cloud-Native Data: What data questions to ask when building cloud-native appsCloud-Native Data: What data questions to ask when building cloud-native apps
Cloud-Native Data: What data questions to ask when building cloud-native apps
 
Data core overview - haluk-final
Data core overview - haluk-finalData core overview - haluk-final
Data core overview - haluk-final
 
Tổng quan công nghệ Net backup - Phần 1
Tổng quan công nghệ Net backup - Phần 1Tổng quan công nghệ Net backup - Phần 1
Tổng quan công nghệ Net backup - Phần 1
 
Denodo Platform 7.0: Redefine Analytics with In-Memory Parallel Processing an...
Denodo Platform 7.0: Redefine Analytics with In-Memory Parallel Processing an...Denodo Platform 7.0: Redefine Analytics with In-Memory Parallel Processing an...
Denodo Platform 7.0: Redefine Analytics with In-Memory Parallel Processing an...
 
First in Class: Optimizing the Data Lake for Tighter Integration
First in Class: Optimizing the Data Lake for Tighter IntegrationFirst in Class: Optimizing the Data Lake for Tighter Integration
First in Class: Optimizing the Data Lake for Tighter Integration
 
CtrlS: Cloud Solutions for Retail & eCommerce
CtrlS: Cloud Solutions for Retail & eCommerceCtrlS: Cloud Solutions for Retail & eCommerce
CtrlS: Cloud Solutions for Retail & eCommerce
 
Data center disaster recovery.ppt
Data center disaster recovery.ppt Data center disaster recovery.ppt
Data center disaster recovery.ppt
 

Más de DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Más de DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Último

2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 

Último (20)

2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 

Disaster Recovery Experience at CACIB: Hardening Hadoop for Critical Financial Applications

  • 1. Page 1 DataWorks Summit - Breakout Session Disaster Recovery Experience at CACIB: Hardening Hadoop for Critical Financial Applications March 21st, 2019 Abdelkrim HADJIDJ – Cloudera Mohamed Mehdi BEN AISSA – CA-GIP
  • 2. Page 2 Speakers Mohamed Mehdi BEN AISSA Big Data Technical Architect at CA-GIP Big Data Infrastructure Technical Owner for CA-CIB Abdelkrim HADJIDJ Solution Engineer at Cloudera
  • 3. Page 3 Agenda Big Data at CA-GIP & CA-CIB Disaster Recovery Strategies Stretch Cluster : Architecture & Configuration Questions & Answers
  • 4. Page 4 Big Data at CA-GIP & CA-CIB
  • 5. Page 5 Big Data at CA-GIP & CA-CIB Big Data at CA-GIP & CA-CIB 15 Infrastructure B&R Big Data Experts Big Data Run Team Big Data BuildTeam Big Data Storage 8PB 2019 80% 1500 CA Group Infrastructure Collaborators Sites in FranceCreation Date 17 8000 CollaboratorsThe world's n°13 bank * 13 36 Locations around World * In 2017, measured by Tier One Capital 36TB of Memory 4000 Cores
  • 6. Page 6 Big Data at CA-GIP & CA-CIB : Use Cases Risk Management Decision Making Cash Management Regulations
  • 7. Page 7 Big Data at CA-GIP & CA-CIB : Principal Use Cases Risk Management/ Regulation • Aims to replace the current market risk eco-system and phase out the legacy system (over 10 applications to decommission) to provide the bank with a golden source on deal & risk indicators across business lines and worldwide • Address ongoing and future regulations (LBF/Volker rules, FRTB, BCBS239, Initial Margin, Stress EBA/AQR …) • 3PB of Data on Production to date Cash Management Transformation • Strategic program for CA-CIB new business • Real time Transaction Processing • Redesign the SI payment for CACIB and international deployment • Target : 800 millions transactions/day (8 TB/day) Data-Lake Real Time Processing
  • 8. Page 8 Big Data at CA-GIP & CA-CIB : Service Offer Architecture ACCESSPROCESSINGINGESTION Scheduling, Security, Monitoring & Administration STORAGE & MESSAGING DATA SOURCES Data storage Messaging Batch processing Stream Processing App 1 App 2 … App n Records Documents Files Messages Streams Dataviz Data Governance APPLICATIONS Batch Mode Stream Mode Data query (SQL) NoSQL Database Indexed Data OLAP RAW DATA ENHANCED DATA OPTIMIZED DATA RAW DATA ENHANCED DATA OPTIMIZED DATA
  • 9. Page 9 Big Data at CA-GIP & CA-CIB : Service Level Agreements Disaster Recovery Performance Security Resiliency Service Availability 24/24 7/7 Zero Data Loss Distributed Systems Scalability Data Locality In-Memory Processing Authentication Authorization Data Protection Audit
  • 11. Page 11 Disaster Recovery vs Backup vs Archive Disaster Recovery (DR) • Protects from the complete outage of a data center (eg. Natural disaster) • Disaster Recovery includes replication, but also incorporates failover and failback • Disaster Recovery Site can be an on-premise or cloud cluster Backup / Restore • Protects against the logical errors (e.g. accidental deletion, corruption of data, etc) • Incremental/full backup mechanisms are required to restore data from previous Point In Time version (PIT). This usually involves a snapshot mechanism for PIT protection. • Backups/Snapshots are kept for relatively short time (from days to months) Archive • A single static copy of data for long-term preservation (several years) • This is required by some regulations
  • 12. Page 12 Objective of a Disaster Recovery plan • SLA (Service-Level Agreement): the agreed aspects of the service (quality, availability, responsibilities), expressed through: • RTO (Recovery Time Objective): the acceptable service interruption, measured in time • RPO (Recovery Point Objective): the maximum acceptable amount of data loss, measured in time • Goals: minimize service interruption (RTO), minimize data loss (RPO), reduce costs, guarantee consistency, optimize performance
  • 13. Page 13 DR options [Three diagrams comparing the options across data centers] • Dual ingest: data is sent to both DC1 and DC2 – low RPO/RTO • Mirroring: data lands in DC1 and is replicated to DC2 – high RPO/RTO • Multiple DC: a single cluster spans DC1, DC2 and DC3 – low RPO/RTO
  • 14. Page 14 Dual ingest [Diagram: data sources are routed by a global traffic manager to both the PROD cluster and the DR cluster via pub-sub / streaming / batch ingestion, with synchronicity checks / checksums between the two; end applications and users reach either site through local traffic managers] • Significant investment • Might meet RPO=0 (in sync) • Active/active site
  • 15. Page 15 Dual ingest pros and cons Pros • Very low RPO/RTO (almost 0) • Dual run makes failover and failback easier • Easy to implement from an infrastructure standpoint; tools like NiFi or Kafka make implementation easier • Helps detect application bugs/errors (except ML) Cons • Requires two clusters, preferably with identical resources • Requires injecting configurations twice (and automation) • Impact on applications makes implementation complex (e.g. self-service) • Requires a cluster diff implementation • Data exports should be run only once
  • 16. Page 16 Mirroring [Diagram: raw data is ingested into the PROD cluster and replicated to the DR cluster; data sources are routed by a global traffic manager via pub-sub / streaming / batch ingestion, and end applications/users reach the sites through local traffic managers] • Can meet RPO = 1 h to 24 h • Active/passive site
  • 17. Page 17 Mirroring pros and cons Pros • Loose requirements, easy to implement • Big Data technologies are designed for this architecture • Better performance (throughput, network, latency) • Can support other use cases (isolation, geo-locality, legal, etc.) Cons • Requires two clusters • High RPO: potential data loss (async replication) that could be recovered from the source • Requires a replication layer • Need to define a fail-over/fail-back logic and process that goes beyond just data
  • 18. Page 18 Things to consider for mirroring Applications (Spark jobs, Hive queries, Zeppelin notebooks, etc.) Data (HDFS files, Hive tables, Kafka messages, etc.) Infrastructure (network, hardware, etc.) Configurations (OS, binaries, Ambari, agents, RPMs, etc.) Process (SLAs, business continuity, dev, etc.) Metadata (Atlas, Ranger, topics, etc.) Client configurations (BI tools, HBase client, REST API, etc.) Infrastructure services (LDAP, AD, LB, etc.)
  • 19. Page 19 Replication tools [Diagram: two HDFS clusters, each with a NameNode and DataNodes; inotify events on the source NameNode drive the replication of HDFS data to the target cluster]
  • 20. Page 20 What RPO can we realistically target? We can achieve a smaller replication frequency and a better RPO (e.g. 10 minutes), but this depends on several parameters: • Data: data volume, data bursts, number of partitions/files/tables, insert vs update ratio • Infrastructure: internal/external bandwidth, latency, dedicated/shared links (day/time), CPU • Software: synchronicity*, incremental replication, latency (snapshots, compression, encryption, integrity) * Synchronous: very low RPO, but only by throttling writes (impact on performance) ** Asynchronous: RPO = F( max(data_generation_rate), available_bandwidth )
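To make the asynchronous case concrete, here is a rough back-of-the-envelope bound (an illustration of ours, not a formula from the deck), valid only when the available replication bandwidth exceeds the peak data generation rate:

    \mathrm{RPO} \approx T_{\mathrm{cycle}} + \frac{r_{\max} \cdot T_{\mathrm{cycle}}}{BW_{\mathrm{avail}}}

For example, with a 10-minute replication cycle, a burst of 100 GB generated during that cycle and a dedicated 10 Gbit/s link (about 1.25 GB/s), the backlog drains in roughly 80 seconds, so the effective RPO stays around 11–12 minutes; if the generation rate exceeds the available bandwidth, the backlog grows and the RPO degrades without bound.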
  • 21. Page 21 Spanning Multiple Data Centers [Diagram: one cluster stretched over DC1 (DN, NN1, ZK1, JN1) and DC2 (DN, NN2, ZK2, JN2), with a witness DC3 hosting ZK3 and JN3; data sources feed raw data ingest and end applications/users are routed by a traffic manager] • Restricted to data centers within a geographic region (a few km) • Strong constraints: 3 DCs, single-digit ms latency, guaranteed bandwidth * • Multi-DC is not native in Hadoop * https://www.cloudera.com/documentation/other/reference-architecture/PDF/cloudera_ref_arch_metal.pdf
  • 22. Page 22 Multiple Data Centers pros and cons Pros • Better RPO (sync replication) • Cheaper: it's just one cluster • Simpler for applications • No need for fail-over/fail-back Cons • Strong constraints: 3 nearby DCs, single-digit ms latency, guaranteed bandwidth * • Advanced configurations: replica placement strategy, YARN labels, etc. • Performance impact from the inter-DC network • Not suited for all the animals in the zoo (e.g. streaming) * https://www.cloudera.com/documentation/other/reference-architecture/PDF/cloudera_ref_arch_metal.pdf
  • 23. Page 23 Stretch Cluster : Architecture & Configuration
  • 24. Page 24 Stretch Cluster : Why !? • SLA (Service-Level Agreement): the agreed aspects of the service (quality, availability, responsibilities): • RTO (Recovery Time Objective): the targeted duration of time and service level within which a business process must be restored after a disaster • RPO (Recovery Point Objective): the maximum targeted period in which data might be lost • Goals: 24/7 availability, RTO -> 0, RPO = 0, reduce costs, consistency, performance
  • 25. Page 25 Stretch Cluster : Why !? Same SLA definitions and goals as above, framed by the financial context: 24/7 availability, RTO -> 0, RPO = 0, reduce costs, consistency, performance
  • 26. Page 26 Stretch Cluster : Architecture [Diagram: the cluster spans three data centers – DC1 and DC2 each host control nodes, master nodes, worker nodes and a gateway node; DC3 hosts the witness nodes]
  • 27. Page 27 Stretch Cluster for HDFS: Architecture & Configuration
  • 28. Page 28 Stretch Cluster : HDFS Architecture [Diagram: DC1 (Rack 1, Rack 2) and DC2 (Rack 3, Rack 4) each host 6 DataNodes (12 in total); DC1 and DC2 each run a ZK + JN + NN node plus a ZK + JN node; DC3 runs a witness ZK + JN node] ZK : ZooKeeper, JN : JournalNode, NN : NameNode. Inter-DC link: bandwidth 100 Gbit/s, latency < 1 ms
  • 29. Page 29 Stretch Cluster : HDFS Architecture – Before Rack (One-Layer) Awareness [Diagram: placement of the four replicas of block B1 without rack/DC awareness – the desired placement of 2 replicas per DC / 1 replica per rack is not guaranteed] Inter-DC link: bandwidth 100 Gbit/s, latency < 1 ms
  • 30. Page 30 Stretch Cluster : HDFS Architecture – After Rack (One-Layer) Awareness • Step 1 – Rack awareness configuration: racks are mapped to /dc1/rack1, /dc1/rack2, /dc2/rack3 and /dc2/rack4 (see the configuration sketch below) [Diagram: replicas of B1 spread across the racks – 2 replicas per DC / 1 replica per rack]
  • 31. Page 31 Stretch Cluster : HDFS Architecture – After Rack (One-Layer) Awareness [Same diagram: resulting placement of B1's replicas with the rack awareness configuration applied]
  • 32. Page 32 Stretch Cluster : HDFS Architecture – After Rack (One-Layer) Awareness HDFS (default) block placement strategy: • One replica on the local node • Second replica on a remote rack • Third replica on the same remote rack • Additional replicas are randomly placed
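By way of illustration (a sketch of ours, not the production configuration): rack awareness in HDFS is typically fed by a topology script declared in core-site.xml; the script path and host names below are hypothetical, only the /dcX/rackY paths come from the slide.

    <!-- core-site.xml : declare a topology script (hypothetical path) -->
    <property>
      <name>net.topology.script.file.name</name>
      <value>/etc/hadoop/conf/topology.sh</value>
    </property>

The script simply echoes the /dcX/rackY location for each host name or IP passed to it, e.g. /dc1/rack1 for the first DataNode and /dc2/rack3 for a DataNode in the second data center.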
  • 33. Page 33 Stretch Cluster : HDFS Architecture – Advanced Configuration (Two-Layer Awareness) • Step 2 – Topology (data center) awareness & advanced replicator (core-site.xml): net.topology.impl -> org.apache.hadoop.net.NetworkTopologyWithNodeGroup; net.topology.nodegroup.aware -> true; dfs.block.replicator.classname -> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyWithNodeGroup • Step 3 – Adjust timeouts (RTO -> 0): dfs.heartbeat.interval; dfs.namenode.heartbeat.recheck-interval • Step 4 – Recovery from close failure (DFSOutputStream) (hdfs-site.xml): dfs.client.block.write.replace-datanode-on-failure.enable -> true; dfs.client.block.write.replace-datanode-on-failure.best-effort -> true
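A minimal sketch of these properties as they might appear in the config files (illustration only, not CA-GIP's exact files; the dfs.* properties are shown in hdfs-site.xml here, where they conventionally live, and the timeout values are placeholders to be tuned per cluster):

    <!-- core-site.xml : two-layer (data center + rack) topology -->
    <property>
      <name>net.topology.impl</name>
      <value>org.apache.hadoop.net.NetworkTopologyWithNodeGroup</value>
    </property>
    <property>
      <name>net.topology.nodegroup.aware</name>
      <value>true</value>
    </property>

    <!-- hdfs-site.xml : nodegroup-aware block placement and faster failure handling -->
    <property>
      <name>dfs.block.replicator.classname</name>
      <value>org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyWithNodeGroup</value>
    </property>
    <property>
      <name>dfs.heartbeat.interval</name>
      <value>3</value>        <!-- seconds; placeholder, tune together with the recheck interval -->
    </property>
    <property>
      <name>dfs.namenode.heartbeat.recheck-interval</name>
      <value>45000</value>    <!-- milliseconds; placeholder, lowers dead-DataNode detection time for RTO -> 0 -->
    </property>
    <property>
      <name>dfs.client.block.write.replace-datanode-on-failure.enable</name>
      <value>true</value>
    </property>
    <property>
      <name>dfs.client.block.write.replace-datanode-on-failure.best-effort</name>
      <value>true</value>     <!-- lets a client close a file even if a replacement DataNode cannot be found -->
    </property>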
  • 34. Page 34 Stretch Cluster : HDFS Architecture – After Rack Awareness & Advanced Configuration [Diagram: with the two-layer awareness, the four replicas of B1 are placed 2 per DC / 1 per rack across DC1 and DC2] Inter-DC link: bandwidth 100 Gbit/s, latency < 1 ms
  • 35. Page 35 Stretch Cluster : HDFS Architecture – Failover Management [Diagram: same topology, illustrating the failover scenario] Inter-DC link: bandwidth 100 Gbit/s, latency < 1 ms
  • 36. Page 36 Stretch Cluster : HDFS Architecture – Failover Management [Diagram: same topology; during failover, only 2 replicas per DC are kept]
  • 37. Page 37 Stretch Cluster for YARN: Architecture & Configuration
  • 38. Page 38 Stretch Cluster : YARN Architecture [Diagram: DC1 (Rack 1, Rack 2) and DC2 (Rack 3, Rack 4) each host 6 NodeManagers (12 in total); DC1 and DC2 each run a ZK + RM node plus a ZK node; DC3 runs a witness ZK node] ZK : ZooKeeper, RM : ResourceManager. Inter-DC link: bandwidth 100 Gbit/s, latency < 1 ms
  • 39. Page 39 Stretch Cluster : YARN Architecture – Advanced Configuration • Step 1 – Topology (data center) awareness, an additional layer on top of node & rack (yarn-site.xml): org.apache.hadoop.mapreduce.v2.app.rm.ScheduledRequestsWithNodeGroup -> net.topology.with.nodegroup; yarn.resourcemanager.scheduler.elements.factory.impl -> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerElementsFactoryWithNodeGroup • Step 2 – Adjust timeouts (RTO -> 0): core-site.xml: ipc.client.connection.maxidletime; yarn-site.xml: yarn.nodemanager.health-checker.interval-ms, yarn.nm.liveness-monitor.expiry-interval-ms
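A minimal, illustrative sketch of the timeout side only (the values are placeholders, not CA-GIP's production settings; the nodegroup-aware scheduler classes above appear to be custom extensions and are therefore omitted here):

    <!-- yarn-site.xml : detect lost NodeManagers faster after a DC failure -->
    <property>
      <name>yarn.nodemanager.health-checker.interval-ms</name>
      <value>60000</value>    <!-- placeholder: run the node health check every minute -->
    </property>
    <property>
      <name>yarn.nm.liveness-monitor.expiry-interval-ms</name>
      <value>120000</value>   <!-- placeholder: declare a NodeManager dead after 2 minutes without heartbeats (default 10 minutes) -->
    </property>

    <!-- core-site.xml : drop idle IPC connections sooner -->
    <property>
      <name>ipc.client.connection.maxidletime</name>
      <value>10000</value>    <!-- placeholder, in milliseconds -->
    </property>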
  • 40. Page 40 Stretch Cluster : YARN Architecture – Before Node Labels [Diagram: the containers of application A1 are spread across DC1 and DC2, generating inter-DC exchanges that need to be optimized] Inter-DC link: bandwidth 100 Gbit/s, latency < 1 ms
  • 41. Page 41 Stretch Cluster : YARN Architecture – After Node Labels • Step 3 – Node labels configuration: NodeManagers in DC1 carry the node label dc1, NodeManagers in DC2 carry dc2 [Diagram: the containers of application A1 stay within a single DC, optimizing inter-DC exchanges]
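For illustration, a sketch of the standard YARN node-labels mechanism (the label names dc1/dc2 come from the slide; the store path and host names are hypothetical, and this is not necessarily the exact procedure used at CA-GIP):

    <!-- yarn-site.xml : enable node labels -->
    <property>
      <name>yarn.node-labels.enabled</name>
      <value>true</value>
    </property>
    <property>
      <name>yarn.node-labels.fs-store.root-dir</name>
      <value>hdfs:///yarn/node-labels</value>  <!-- hypothetical store location -->
    </property>

    <!-- Labels are then created and assigned with yarn rmadmin, e.g.:
         yarn rmadmin -addToClusterNodeLabels "dc1,dc2"
         yarn rmadmin -replaceLabelsOnNode "nodemanager1=dc1 nodemanager7=dc2"   (host names hypothetical) -->

Queues in the capacity scheduler must also be given access to the labels so that applications can request containers on dc1 or dc2 only.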
  • 42. Page 42 Stretch Cluster : YARN Architecture – Failover [Diagram: same topology with node labels dc1/dc2, illustrating the failover scenario] Inter-DC link: bandwidth 100 Gbit/s, latency < 1 ms
  • 43. Page 43 Stretch Cluster : YARN Architecture – Failover [Diagram: automatic failover management across the node-labelled DCs]
  • 45. Page 45 Conclusion • DRP tests & concept validation (covering infrastructure & applications): • Disk failure • Node failure • Rack failure • DC failure • Inter-DC link failure (avoiding the split-brain scenario) • The stretch cluster is implemented and validated for all HDP components: Ambari, Kafka, Storm, AMS, HBase, Ranger, etc. • SLA validation: performance, RPO = 0, RTO -> 0, consistency, etc. • Advanced monitoring: infrastructure, inter-DC link, applications, etc.

Editor's notes

  1. DR: be able to keep data, services and applications available even if a disaster causes the failure of a complete data center. A separate site (or sites) is used to recover from the disaster; it can be < 100 km away (dark fiber) or > 100 km away (WAN). Synchronous replication is desired (RPO is almost 0) but hard at large scale. Backup: a consistent backup occurs when the database is in a consistent state, meaning you can restore the backup and open it without performing media recovery. When a database is restored from an inconsistent backup, the database must perform media recovery before it can be opened, applying any pending changes from the redo logs.