Continuity and Resilience (CORE)
ISO 22301 BCM Consulting Firm
Presentations by our partners and
extended team of industry experts
Our Contact Details:
INDIA
Continuity and Resilience
Level 15, Eros Corporate Tower
Nehru Place, New Delhi-110019
Tel: +91 11 41055534 / +91 11 41613033
Fax: +91 11 41055535
Email: neha@continuityandresilience.com
UAE
Continuity and Resilience
P. O. Box 127557
Abu Dhabi, United Arab Emirates
Mobile: +971 50 8460530
Tel: +971 2 8152831
Fax: +971 2 8152888
Email: info@continuityandresilience.com
HA & DR Design Concepts
S Seshadri
Head – IT DR & Service Management
Continuity and Resilience
10th Feb, 2014
Dubai
2
Outage Categorization
• Service failures that should/need not be known to end users
need ‘fault protection’ – the operation of such services will be
continuous despite failure scenarios
• Short interruptions (within a few hours) are referred to as
‘minor outages’
• Longer interruptions, when end users’ business services get
delayed for longer durations, are termed as disaster situations
or ‘major outages’
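The three categories above can be sketched as a small classifier. This is a minimal illustration only: the four-hour limit for a minor outage, the enum and the function name are assumptions made for the example, since the slide says only "within a few hours".

```python
from enum import Enum


class OutageCategory(Enum):
    FAULT_PROTECTED = "fault protected"  # end users never notice the failure
    MINOR = "minor outage"               # short interruption
    MAJOR = "major outage"               # disaster situation


# Hypothetical threshold: the slide says only "within a few hours".
MINOR_OUTAGE_LIMIT_HOURS = 4.0


def categorize_outage(user_visible_hours: float) -> OutageCategory:
    """Classify an interruption by how long end users were affected."""
    if user_visible_hours < 0:
        raise ValueError("duration cannot be negative")
    if user_visible_hours == 0:
        # The failure was handled transparently, so no outage occurred.
        return OutageCategory.FAULT_PROTECTED
    if user_visible_hours <= MINOR_OUTAGE_LIMIT_HOURS:
        return OutageCategory.MINOR
    return OutageCategory.MAJOR
```

In practice the limit would come from the business (BIA, RTO) rather than from a constant in code.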
3
Key Questions
1. Which systems should ‘never’ fail – we may need Fault Tolerant
systems in their place
2. What failures should be handled transparently, where an outage
must not occur? Against such failures we need fault protection.
3. How long may a short-term interruption be that happens once a day,
once a week, or once a month? Such interruptions are called minor
outages.
4. How long may a long-term interruption be that happens very seldom
and is related to serious damage to the IT system? For instance, when
will this cause a big business impact, also called a major outage or
disaster?
5. How much data may be lost during a major outage? And in which state
– persistent or ephemeral…
6. What failures are deemed so improbable that they will not be
handled, or what failures are beyond the scope of a project?
4
Business Issues & Cost of IT Outage
• IT Fault Protection has to be driven by business
considerations
• Business Continuity is the overall goal
• Business imperatives manifest through BIA/RA and
MTPoD/RTO/RPO
• IT Outage is not the real issue, but the business
consequences are
• IT Outage affects revenues & costs adversely
• Direct Costs – repairs, penalties, lost revenue
• Indirect Costs – lost & additional work hours
5
Cost Vs Benefit
• IT Recovery has extensive cost implications – both in terms of
Capex and Opex
• Strategies developed should be cost effective
• ‘Technology for the sake of Technology’ approach should be
completely avoided
• Strategies should, as far as possible, be able to address
disruptions and impacts collectively
• Organizational objectives and risk appetite should direct
recovery strategies
• Legal, contractual and regulatory aspects play a major role
(SOX, SAS 70, BASEL II/III…..)
6
IT Service Outage
• Importance of IT Services depends on
– Business relevance
– Revenues
– Functionality that they enable
– Amount of damage due to the outage
– Any regulatory aspect that demands the service
• Outage Categorization is dictated by the importance of the
service and hence the significance of its failure
7
High Availability
• High availability is the characteristic of a system to protect
against or recover from minor outages in a short time frame
with largely automated means.
• HA has 3 essential features
– Outage categorization is ‘minor’- we need to envisage
potential failure scenarios for the service and the minor
outage requirements for them - robustness
– System category should involve Mission Critical & Business
Important and Business Foundation processes which need
to be recovered within a very short time – RTO/RPO
– Component (SPoF) level protection which will facilitate
automatic recovery – redundancy
• HA features are normally built within the primary data center
and data replication is synchronous
8
Continuous Availability
• Continuous Availability is the highest point of High Availability,
wherein, every component failure is protected against, and no ‘after
failure recovery’ takes place
• These are known as Fault Tolerant systems, that provide automatic,
high-speed ‘failover’ in the case of h/w or s/w failures
• They have an ‘internal multi-computer systems architecture’ with
no shared central components, including memory
• Tandem’s ‘non-stop’ systems and Stratus’s fault tolerant computers
are examples of this
• These are used by the leading stock exchanges globally (NSE in India
uses Stratus and BSE, Tandem), and by banks for their ATM related
transaction processing
• These systems scale extremely well to the largest commercial
workloads
• These systems were originally introduced by Airbus for the on-board
flight controls of its A-320 planes on long-duration flights
HA Components
Essential ingredients of High Availability are:
• Availability
• Reliability
• Serviceability
We will discuss the above three in the following
slides.
10
Availability & Metrics
• Availability – How long a service or system component is
available for use and the features that help the system to stay
operational despite failures, e.g. NIC, Mirrored
Disks, Redundant Power Supply
• Availability = uptime / (uptime + downtime)
• Downtime will include scheduled downtime also
• Elapsed time can be measured as wall clock time
• Availability can be expressed in absolute numbers (79 hrs out
of 80 hrs) or as a percentage (98.75%)
• Availability = MTBF / (MTBF + MTTR)
– MTBF: Mean Time Between Failures
– MTTR: Mean Time To Repair
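The two availability formulas on this slide can be checked with a few lines of Python. This is a minimal sketch; the function names and the 8,760-hour year used for the downtime budget are assumptions made for the example.

```python
def availability_from_times(uptime_hours: float, downtime_hours: float) -> float:
    """Availability = uptime / (uptime + downtime)."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("total elapsed time must be positive")
    return uptime_hours / total


def availability_from_mtbf(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    return availability_from_times(mtbf_hours, mttr_hours)


def downtime_budget_hours(availability: float, period_hours: float = 8760.0) -> float:
    """Downtime allowed in a period (default: one 365-day year)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * period_hours
```

For instance, availability_from_times(79, 1) evaluates the slide's example of 79 hours of service out of 80, and downtime_budget_hours turns a target percentage into the hours of outage it permits per year.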
11
Reliability & Metrics
• Reliability is a measure of ‘fault avoidance’
• Refers to the ‘probability that a system will be available over a
time interval T’
• MTBF is a measure of Reliability
• Annual Failure Rate (AFR) is the inverse of MTBF expressed in years
• Reliability features help to ‘prevent’ and ‘detect’ failures
• H/w reliability has improved tremendously over the last 30
years, and components are highly resilient nowadays
Component    | MTBF (Hours) | MTBF (Years) | AFR (per year)
Disk Drive   | 300,000      | 34           | 0.0292
Power Supply | 150,000      | 17           | 0.0584
Fan          | 250,000      | 29           | 0.0350
NIC          | 200,000      | 23           | 0.0438
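The relationship between the table columns can be reproduced as follows. This is a minimal sketch that assumes a 365-day year (8,760 hours); the function names are made up for the example, and the MTBF values are the ones from the table.

```python
HOURS_PER_YEAR = 8760  # 365 days x 24 hours; leap years are ignored


def mtbf_years(mtbf_hours: float) -> float:
    """Convert an MTBF given in hours to years."""
    return mtbf_hours / HOURS_PER_YEAR


def annual_failure_rate(mtbf_hours: float) -> float:
    """Expected failures per year: the inverse of the MTBF in years."""
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    return HOURS_PER_YEAR / mtbf_hours


# MTBF figures in hours, taken from the table on this slide.
COMPONENT_MTBF_HOURS = {
    "Disk Drive": 300_000,
    "Power Supply": 150_000,
    "Fan": 250_000,
    "NIC": 200_000,
}

if __name__ == "__main__":
    for name, hours in COMPONENT_MTBF_HOURS.items():
        print(f"{name}: {mtbf_years(hours):.1f} years, AFR {annual_failure_rate(hours):.4f}")
```

Note that the AFR computed this way is an expected failure count per year; for small values it is close to, but not the same as, the probability of at least one failure in a year.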
12
Serviceability
• Measurement that expresses how easily and quickly
a system is serviced and repaired
• The lower the planned service time, the higher is the
availability
• Planned serviceability goes into the architecture as a
design objective
• Actual service time should be lower than the planned
service time
• These clauses have to be carefully built into the
Service Level Agreements with IT vendors
• Murphy’s Law: Anything that can possibly go wrong,
does
13
HA/DR Strategy - Aspects
• Data – what is the architecture concerned with
• Function – how is the data worked with
• Location – where is the data worked with
• People – who works with the data and achieves the
functionality
• Time – when is the data processed
Each of the above aspects is run through 3 levels of abstraction
• Objectives – What will this achieve vis-à-vis organizational objectives
• Conceptual Model – Realization of the objectives on a
business process level
• System Model – Logical data model and the application
functions that must be implemented to realize the business
concepts
14
HA/DR Framework (Zachman)
Aspect | Objectives | Conceptual Model | System Model
Data (What) | Business Continuity / IT Service Continuity | Availability of mission-critical and important business services | ICT categories, dependency diagrams
Function (How) | Map biz processes to IT services, RTO, RPO, SLA | ITIL processes, IT processes, projects | Design patterns – RAS, redundancy, backup, replication, virtualization
Location (Where) | Internal (IT), Outsourced | Data Center, Disaster Recovery Center | All systems, all categories
People (Who) | Biz process owner | CIO/IT dept | IT PM, Architect, System Engineers, System Administrators
Time (When) | Implementation Plan | Outage scenarios, categories | Failure/Change/Incident/Problem/Disaster
15
HA/DR System Design
• System Model discussed earlier is the core of this activity
• ‘What’ and ‘How’ of the System Model will lay the foundation
for HA/DR System Design
• Protection against outages of computers, systems and
databases is in scope for HA
• Protection against infrastructure/building/city outages and
user/administrative errors is in scope for DR
• Sound processes, solid architecture, careful engineering and
an eye for detail are the hallmarks of a good HA/DR system
design
16
HA/DR Touch Points
• User Environment
• Administration Environment
• Application
• Middleware
• Network Infrastructure
• Operating System
• Hardware (Servers, Storage, Backups etc)
• Physical Environment (Power, Fire, Floods etc)
17
HA/DR Scoping
• Take into account regulatory aspects (SOX, SAS 70, Basel II)
• Identify the key applications (from business BIAs)
• Check out the various ICT environments required by these
applications (IT BIA)
• Identify the dependencies
• Carefully identify and document the component categories
that are not required – scope exclusions
• Prepare preliminary system scope – list of component
categories required for HA/DR
• Identify failure scenarios for each of these component
categories
• Document the failure scenarios that are outside the scope
• The component categories and the failure scenarios will
constitute the scope of HA/DR
18
Redundancy & Replication
• Redundancy is the ability to continue operations in the case of
component failures
• Recovery is done through ‘managed component repetition’
• Eliminating ‘single points of failure’ is the goal
• Just adding a second component is not enough
• Replicated component has to be ‘managed’ to take over in
case the original component fails (failover)
• This ‘management’ can be automated or manual
• Replication of the ‘state’ of the component is crucial
• Replication may be a duplicate part, an alternate system (HA)
or an alternate location (DR)
• 100% redundancy through replication is very expensive and
difficult to achieve
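The effect of removing a single point of failure can be estimated with the standard series/parallel availability formulas. This is a minimal sketch under strong assumptions that the slides do not state: component failures are independent, and failover to the redundant copy always works and takes no time. The function names are made up for the example.

```python
from math import prod
from typing import Iterable


def series_availability(availabilities: Iterable[float]) -> float:
    """All components are needed, so each one is a single point of failure."""
    return prod(availabilities)


def redundant_availability(component_availability: float, copies: int) -> float:
    """The service survives as long as at least one of the copies is up.

    Assumes independent failures and perfect, instantaneous failover.
    """
    if copies < 1:
        raise ValueError("at least one copy is required")
    return 1.0 - (1.0 - component_availability) ** copies
```

Under these assumptions, two copies of a 99% available component yield 99.99%. Real figures are lower, because the copy has to be managed, its state replicated, and the failover itself can fail, which is why just adding a second component is not enough.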
19
Data Replication
• Redundancy for Disk Drives means ‘data replication’ and is hence
crucial
• Redundant disks provide multiple storage of data and/or OS
• Data disks carry one of the highest risks
• OS disks usually house the root file system and swap space
• Data Replication can be ‘synchronous’ or ‘asynchronous’
• RPO considerations should dictate data replication approach
• For very low or nil RPO, latency in data replication may not be
tolerated (synchronous vs asynchronous)
• Bandwidth considerations also impact replication
• In recent times, Data Deduplication technology combined with data
compression has eased many of the headaches involved in
data replication
• Two main types of data replication
– Host based/Storage based
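How the RPO drives the choice between synchronous and asynchronous replication can be sketched as below. This is an illustration only: the function, its parameters and the 5 ms round-trip limit are hypothetical and would have to be replaced by measured values for a real link.

```python
from dataclasses import dataclass


@dataclass
class ReplicationPlan:
    mode: str        # "synchronous" or "asynchronous"
    meets_rpo: bool  # whether the expected data loss stays within the RPO
    note: str


def plan_replication(rpo_seconds: float,
                     expected_async_lag_seconds: float,
                     round_trip_ms: float,
                     max_sync_round_trip_ms: float = 5.0) -> ReplicationPlan:
    """Pick a replication mode from the RPO, then flag link problems."""
    if rpo_seconds < 0:
        raise ValueError("RPO cannot be negative")
    if rpo_seconds == 0:
        # Nil RPO: every write must be acknowledged by the replica first.
        slow = round_trip_ms > max_sync_round_trip_ms
        note = "link latency will slow every write" if slow else "ok"
        return ReplicationPlan("synchronous", True, note)
    # A non-zero RPO tolerates a replica that trails the primary.
    meets = expected_async_lag_seconds <= rpo_seconds
    note = "ok" if meets else "replica lag exceeds RPO; review bandwidth"
    return ReplicationPlan("asynchronous", meets, note)
```

The sketch treats bandwidth only indirectly, through the expected replica lag; a fuller model would derive that lag from the change rate and the link capacity.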
20
Virtualization
• Virtualization, as a concept, was demonstrated in the 1960s,
when IBM’s Thomas J Watson Research Center simulated
‘multiple pseudo machines’ on a single 7044 MX Mainframe
• Virtualization allows multiple operating system (OS) instances
to run concurrently on a single computer.
• It is a means of separating hardware from a single OS, by
“inserting an abstraction layer” into the software stack.
• Each ‘Guest’ OS is managed by a Virtual Machine Monitor.
• Virtualization Software can also collect a number of separate
resources and “pool” them, even if the devices or resources
remain in separate physical locations.
• The end goal is sharing the resources and capabilities flexibly,
under software control.
• The part of the virtualization package that lets administrators
interact with and control the VMs is referred to as the Virtual Machine
Monitor (VMM) or Hypervisor software.
21
Virtualization of Resources
• Virtualized resources are supplied to application programs in logical
units, freeing them from reliance on specific hardware
• Virtualization of Servers allows a business to consolidate the workloads
running on multiple servers onto just a few
• Storage Virtualization hides the physical storage from applications on host
systems, and presents a simplified (logical) view to the applications and
allows them to reference the storage resource by its common name
whereas the actual storage could be on a complex, multilayered,
multipath storage network.
• RAID is an early example of storage virtualization.
• Virtual CPU is one of the oldest concepts, which has enabled
multiprocessing capability, handled by OS
• Virtual Memory is as old as Virtual CPU – again handled by the OS as part
of Virtual Memory Management
• Working within a virtualized environment may add some options and new
flexibility to your HA and DR plans.
22
Storage Virtualization
• With regard to storage, the objective is to bring together multiple
storage devices under unified command, whether they are from the
same manufacturer or not, and without regard for their physical
locations.
• Once accomplished, the now-unified band of storage systems can
be treated as a single, huge storage capacity that can be
provisioned, managed, backed up to tape, and even replicated to
offsite disaster recovery (DR) or high availability (HA) sites, with
greater visibility, synchronized automation, and reduced
management labour.
• Even archiving, multi-level storage, and information lifecycle
management (ILM) efforts can be made simpler, with older, slower,
or cheaper storage units provisioned to handle the near-line or
archival storage while newer, faster devices handle the current
production processes.
23
Host Clustering
• Increasing availability through redundancy on the host level
by taking several hosts and using them to supply a bunch of
services, where each service is not strictly associated with a
specific computer
• Host Clustering addresses
– Hardware errors
– OS errors
– Application errors
• Failover clusters allow a service to migrate from one
host to another in the case of an error. They are the most
used technology for high availability.
• Load-balancing clusters run a service on multiple hosts
from the start and handle outages of a host; they are more relevant
for performance than HA.
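The failover idea, in which a service is not tied to a specific computer, can be illustrated with a toy model. This is a sketch under simplifying assumptions: the class, its methods and the least-loaded placement rule are invented for the example, and real cluster software also handles failure detection, fencing and state takeover.

```python
from typing import Dict, Iterable, List, Set, Tuple


class FailoverCluster:
    """Toy failover cluster: services move off a host when it fails."""

    def __init__(self, hosts: Iterable[str]) -> None:
        self.healthy: Set[str] = set(hosts)
        self.placement: Dict[str, str] = {}  # service name -> host name

    def start(self, service: str, host: str) -> None:
        if host not in self.healthy:
            raise ValueError(f"{host} is not a healthy cluster member")
        self.placement[service] = host

    def _load(self, host: str) -> int:
        return sum(1 for h in self.placement.values() if h == host)

    def fail(self, host: str) -> List[Tuple[str, str]]:
        """Mark a host as failed and migrate its services elsewhere."""
        self.healthy.discard(host)
        if not self.healthy:
            raise RuntimeError("no healthy host left: a major outage")
        moved = []
        for service, current in sorted(self.placement.items()):
            if current == host:
                # Hypothetical rule: pick the least loaded healthy host.
                target = min(sorted(self.healthy), key=self._load)
                self.placement[service] = target
                moved.append((service, target))
        return moved
```

A short usage example: after FailoverCluster(["node-a", "node-b"]) starts "db" on node-a and "web" on node-b, calling fail("node-a") moves "db" to node-b. Once the last host fails there is nowhere left to migrate to, which is where HA ends and DR begins.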
24
Middleware
• Generally considered to be the layer between the OS and the
applications
• They are independent of applications but carry application-specific
configuration and are used by multiple applications
• Database Servers, Web Servers, Application Servers,
Messaging Servers are some examples
• HA for these will include product specific clustering, data
replication, and even session state replication
• A properly configured failover cluster, sufficiently integrated
with the DB Server, provides HA
• Redo log file shipping (asynchronous) with commits delayed
by the RPO will provide the best DR
• HA for Web Servers and Messaging Servers is achieved
mostly through Load-balancing Clusters (stateless)
25
HA for Applications
• Application HA is the eventual goal
• Application categories – Off the Shelf, Bought & Customized,
In-house Built
• Failover cluster is an approach most commonly adopted for all
categories of applications
• Applications touch the nerve center of all the following
systems:
– Development
– Acceptance/Integration Test
– Staging & Release
– Production
– Disaster Recovery
• Suitable precautions must be taken during the coding/testing
stages to ensure HA
26
Networks
• Network is the backbone of ICT as it provides the linkages and
ability to communicate between component categories
• Various types of networks are
– LAN, VLAN, MAN, WAN, VPN, Intranet, Extranet, Internet
• And there are n/w components that help build and run the
networks – NIC, switches, routers, hubs, firewalls etc.
• Connectivity is the most important element of networks
• Data management on the network is done through encoding, data
compression & encryption/decryption
• Power supply, Heating, Ventilating & Air Conditioning (HVAC) are
two other important considerations
• It is absolutely essential to provide redundancies at both the
network and component levels for network HA
• Generally, there is no pay-load based state for any of these – hence
two or more devices would ensure HA
27
Data Backup and Restoration
• A major requisite for HA & DR
• Management of backed up data is equally important
• Restoration of data must work effectively
• Automated mechanisms exist
• System/file/database backups are the key
• Full or incremental backup
• Consistency of the data state is crucial
• Checkpoint functionality is useful in this context
• Storage and handling of backup media is very significant
• Remote (including at the DR site) storage of backups including
Tape Vaulting should be institutionalized
• Testing/recycling and proper maintenance of backup media
• Backup on failover clusters should distinguish between
physical and logical hosts in the cluster
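Restoring from a mix of full and incremental backups means finding the latest full backup before the restore point and replaying the incrementals that follow it. The sketch below illustrates that selection; the Backup type, its fields and the assumption that each incremental holds the changes since the previous backup of any kind are made up for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Backup:
    taken_at: int  # any sortable timestamp, e.g. a day number
    kind: str      # "full" or "incremental"


def restore_chain(backups: Iterable[Backup], restore_point: int) -> List[Backup]:
    """Return the backups to apply, in order, to reach the restore point."""
    candidates = sorted(
        (b for b in backups if b.taken_at <= restore_point),
        key=lambda b: b.taken_at,
    )
    last_full = None
    for index, backup in enumerate(candidates):
        if backup.kind == "full":
            last_full = index
    if last_full is None:
        raise LookupError("no full backup at or before the restore point")
    # The latest full backup plus every incremental taken after it.
    return candidates[last_full:]
```

If any backup in the returned chain is unreadable, the restore stops there, which is one reason the testing and maintenance of backup media matters.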
28
HA & DR – Positioning
• HA and DR are two sides of the same coin
• Redundancy, Replication and Robustness are the key
characteristics of both HA & DR
• HA focuses on fault protection and is built on mostly
automated recovery techniques for minor outages
• HA is not built for environmental disasters like floods, fire,
earthquake and manmade incidents like terrorist attacks,
human errors of huge magnitude
• The above additional scenarios and major outages lead to the
need for DR, which focuses only on recovery
• DR is also associated with a large part of manual recovery in
terms of Emergency Management and Damage Assessment &
Recovery apart from IT Recovery
• When the primary data center is unavailable, migration to DR
site will be the only option
29
Disaster Recovery
• Disaster recovery is the ability to continue with services in the
case of major outages, often with reduced capabilities or
performance.
• Disaster recovery handles the disaster when either a single
point of failure is the defect or when many components are
damaged and the whole system is rendered non-functional.
• Operations cannot be resumed on the same system or at the
same site. Instead, a replacement or backup system, usually
located at another place is activated and operations continue
from there.
• Disaster recovery often restores only restricted resources and
thus restricted service levels.
• Continuation of service also does not happen instantly, but
will happen after some outage time.
30
DR in Context
• IT DR is activated when the likely recovery time is above the
least RTO and there is expected data loss
• IT recovery will be limited only by the agreed levels of service
by the business owners
• IT DR activities will be carried out from the DR site, which should
be fully equipped to handle IT services up to agreed levels
• Scaling up the IT services in due course of time will generally
be outside the purview of DR Planning
• Agreed levels of IT services are resumed at the DR site using
the infrastructure and backup data/tapes there
• The roles of primary and DR sites are interchangeable but not
in the strict sense of HA
• In the above scenario, both primary and DR sites will be
functional, even though they may cater to different business
activities/IT services
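The activation rule in the first bullet of this slide can be written as a simple check. This is a minimal sketch that follows the slide's wording literally (recovery time above the least RTO, and data loss expected); the function name and the example services are hypothetical.

```python
from typing import Mapping


def should_activate_dr(estimated_recovery_hours: float,
                       service_rto_hours: Mapping[str, float],
                       data_loss_expected: bool) -> bool:
    """True when recovery at the primary site would breach the least RTO
    and data loss is expected, as stated on the slide."""
    if not service_rto_hours:
        raise ValueError("at least one service RTO is required")
    least_rto = min(service_rto_hours.values())
    return estimated_recovery_hours > least_rto and data_loss_expected
```

For example, with hypothetical RTOs of 4 hours for payments and 24 hours for reporting, an estimated recovery time of 6 hours with expected data loss would trigger DR, while a 2-hour estimate would not. An actual invocation decision also involves the emergency management and damage assessment steps that the slides describe as largely manual.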
31
DR and the Cloud
• Cloud is the latest buzzword in the outsourced business model
• Leveraging cloud model can optimize DR procedures
• Reduces the high cost of maintaining stand-by sites
• Cloud service providers normally have state of the art systems
and infrastructure, huge bandwidth, exacting security setup,
apart from complying with relevant ISO guidelines and
industry best standards.
• According to a recent Aberdeen study, DR is the leading
‘use case’ for cloud
• The key advantages are recovery times, virtualization and
multi-site availability
• Concerns regarding security, identity and compliance with
various regulations do exist as the cloud model matures
• With data volumes growing at the rate of 10 times every 5
years, cloud computing is likely to see a huge growth
32
DR in the Supply Chain
• Supply Chain is basically a delineation of dependencies
depicting the various actors in the chain of a product or
service from a vendor till reaching a consumer
• IT DR dependencies are manifold – internal customers, ICT
equipment, external vendors and service providers, IT staff,
etc.
• DR planning should judiciously take into account the inherent
risks in the supply chain and provision suitable mechanisms to
handle them effectively, so that the DR goal is not derailed
• Typically, if Data Center support is outsourced, there is a huge
dependence on the Service Provider – timely availability of
people, spares, replacements etc.
• Supply chain glitches can emerge from as innocuous a thing as
consumables supplies
33
Thank You
S Seshadri

Más contenido relacionado

La actualidad más candente

Migration scenarios RISE with SAP S4HANA Cloud, Private Edition - Version #1....
Migration scenarios RISE with SAP S4HANA Cloud, Private Edition - Version #1....Migration scenarios RISE with SAP S4HANA Cloud, Private Edition - Version #1....
Migration scenarios RISE with SAP S4HANA Cloud, Private Edition - Version #1....
Yevilina Rizka
 

La actualidad más candente (20)

Audit of it infrastructure
Audit of it infrastructureAudit of it infrastructure
Audit of it infrastructure
 
BUSINESS CONTINUITY PLANNING AND RISK MANAGEMENT
BUSINESS CONTINUITY PLANNING AND RISK MANAGEMENTBUSINESS CONTINUITY PLANNING AND RISK MANAGEMENT
BUSINESS CONTINUITY PLANNING AND RISK MANAGEMENT
 
Bcp drp
Bcp drpBcp drp
Bcp drp
 
Business Continuity & Disaster Recovery
Business Continuity & Disaster RecoveryBusiness Continuity & Disaster Recovery
Business Continuity & Disaster Recovery
 
Marlabs Capabilities Overview: IT Services
Marlabs Capabilities Overview: IT ServicesMarlabs Capabilities Overview: IT Services
Marlabs Capabilities Overview: IT Services
 
Enterprise Identity and Access Management Use Cases
Enterprise Identity and Access Management Use CasesEnterprise Identity and Access Management Use Cases
Enterprise Identity and Access Management Use Cases
 
Enterprise Asset Management
Enterprise Asset ManagementEnterprise Asset Management
Enterprise Asset Management
 
Concepts of cutover planning and management
Concepts of cutover planning and managementConcepts of cutover planning and management
Concepts of cutover planning and management
 
ERP Training
ERP TrainingERP Training
ERP Training
 
EAM Overview
EAM OverviewEAM Overview
EAM Overview
 
SAP ERP Overview for Laymen
SAP ERP Overview for LaymenSAP ERP Overview for Laymen
SAP ERP Overview for Laymen
 
SAP Basis Overview
SAP Basis OverviewSAP Basis Overview
SAP Basis Overview
 
Migration scenarios RISE with SAP S4HANA Cloud, Private Edition - Version #1....
Migration scenarios RISE with SAP S4HANA Cloud, Private Edition - Version #1....Migration scenarios RISE with SAP S4HANA Cloud, Private Edition - Version #1....
Migration scenarios RISE with SAP S4HANA Cloud, Private Edition - Version #1....
 
ERP Implementation Life Cycle
ERP Implementation Life CycleERP Implementation Life Cycle
ERP Implementation Life Cycle
 
Day1 Sap Basis Overview V1 1
Day1 Sap Basis Overview V1 1Day1 Sap Basis Overview V1 1
Day1 Sap Basis Overview V1 1
 
Presentation introduction to sap
Presentation introduction to sapPresentation introduction to sap
Presentation introduction to sap
 
Privacy Trends: Key practical steps on ISO/IEC 27701:2019 implementation
Privacy Trends: Key practical steps on ISO/IEC 27701:2019 implementationPrivacy Trends: Key practical steps on ISO/IEC 27701:2019 implementation
Privacy Trends: Key practical steps on ISO/IEC 27701:2019 implementation
 
Understanding oracle fusion accounting hub
Understanding oracle fusion accounting hubUnderstanding oracle fusion accounting hub
Understanding oracle fusion accounting hub
 
Disaster Recovery Plan
Disaster Recovery PlanDisaster Recovery Plan
Disaster Recovery Plan
 
Sap s 4 hana client strategy
Sap s 4 hana client strategySap s 4 hana client strategy
Sap s 4 hana client strategy
 

Destacado

High Availability and Disaster Recovery
High Availability and Disaster RecoveryHigh Availability and Disaster Recovery
High Availability and Disaster Recovery
Akelios
 
Deep dive into highly available open stack architecture openstack summit va...
Deep dive into highly available open stack architecture   openstack summit va...Deep dive into highly available open stack architecture   openstack summit va...
Deep dive into highly available open stack architecture openstack summit va...
Arthur Berezin
 

Destacado (20)

High Availability and Disaster Recovery
High Availability and Disaster RecoveryHigh Availability and Disaster Recovery
High Availability and Disaster Recovery
 
Disaster recovery plan (DRP)
Disaster recovery plan (DRP)Disaster recovery plan (DRP)
Disaster recovery plan (DRP)
 
High Availability in 37 Easy Steps
High Availability in 37 Easy StepsHigh Availability in 37 Easy Steps
High Availability in 37 Easy Steps
 
План аварийного восстановления данных
План аварийного восстановления данныхПлан аварийного восстановления данных
План аварийного восстановления данных
 
Linux-HA with Pacemaker
Linux-HA with PacemakerLinux-HA with Pacemaker
Linux-HA with Pacemaker
 
Обеспечение непрерывности бизнеса и создание планов восстановления после аварии
Обеспечение непрерывности бизнеса и создание планов восстановления после аварииОбеспечение непрерывности бизнеса и создание планов восстановления после аварии
Обеспечение непрерывности бизнеса и создание планов восстановления после аварии
 
High Availability (HA) Explained
High Availability (HA) ExplainedHigh Availability (HA) Explained
High Availability (HA) Explained
 
Architecting for High Availability
Architecting for High AvailabilityArchitecting for High Availability
Architecting for High Availability
 
High Availability for OpenStack
High Availability for OpenStackHigh Availability for OpenStack
High Availability for OpenStack
 
Deep dive into highly available open stack architecture openstack summit va...
Deep dive into highly available open stack architecture   openstack summit va...Deep dive into highly available open stack architecture   openstack summit va...
Deep dive into highly available open stack architecture openstack summit va...
 
The A to Z Guide to Business Continuity and Disaster Recovery
The A to Z Guide to Business Continuity and Disaster RecoveryThe A to Z Guide to Business Continuity and Disaster Recovery
The A to Z Guide to Business Continuity and Disaster Recovery
 
Business continuity & disaster recovery planning (BCP & DRP)
Business continuity & disaster recovery planning (BCP & DRP)Business continuity & disaster recovery planning (BCP & DRP)
Business continuity & disaster recovery planning (BCP & DRP)
 
Drp International Brochure Version 5.5[1]
Drp International Brochure Version 5.5[1]Drp International Brochure Version 5.5[1]
Drp International Brochure Version 5.5[1]
 
MENORA
MENORAMENORA
MENORA
 
Top 10 DB2 Support Nightmares #9
Top 10 DB2 Support Nightmares  #9Top 10 DB2 Support Nightmares  #9
Top 10 DB2 Support Nightmares #9
 
Linux Disaster Recovery Solutions
Linux Disaster Recovery SolutionsLinux Disaster Recovery Solutions
Linux Disaster Recovery Solutions
 
A05
A05A05
A05
 
DB2 High Availability für IBM Connections, Sametime oder Traveler
DB2 High Availability für IBM Connections, Sametime oder TravelerDB2 High Availability für IBM Connections, Sametime oder Traveler
DB2 High Availability für IBM Connections, Sametime oder Traveler
 
Design patterns and plan for developing high available azure applications
Design patterns and plan for developing high available azure applicationsDesign patterns and plan for developing high available azure applications
Design patterns and plan for developing high available azure applications
 
High availability solutions bakostech
High availability solutions bakostechHigh availability solutions bakostech
High availability solutions bakostech
 

Similar a HA & DR System Design - Concepts and Solution

UnitOnePresentationSlides.pptx
UnitOnePresentationSlides.pptxUnitOnePresentationSlides.pptx
UnitOnePresentationSlides.pptx
BLACKSPAROW
 
CMGT410 v19Business Requirements TemplateCMGT410 v19Page 2.docx
CMGT410 v19Business Requirements TemplateCMGT410 v19Page 2.docxCMGT410 v19Business Requirements TemplateCMGT410 v19Page 2.docx
CMGT410 v19Business Requirements TemplateCMGT410 v19Page 2.docx
mary772
 
How much does it cost to be Secure?
How much does it cost to be Secure?How much does it cost to be Secure?
How much does it cost to be Secure?
mbmobile
 

Similar a HA & DR System Design - Concepts and Solution (20)

Best practices in networks and infrastructure
Best practices in networks and infrastructureBest practices in networks and infrastructure
Best practices in networks and infrastructure
 
MGT3342BUS - Architecting Data Protection with Rubrik - VMworld 2017
MGT3342BUS - Architecting Data Protection with Rubrik - VMworld 2017MGT3342BUS - Architecting Data Protection with Rubrik - VMworld 2017
MGT3342BUS - Architecting Data Protection with Rubrik - VMworld 2017
 
Impact 2013 2963 - IBM Business Process Manager Top Practices
Impact 2013 2963 - IBM Business Process Manager Top PracticesImpact 2013 2963 - IBM Business Process Manager Top Practices
Impact 2013 2963 - IBM Business Process Manager Top Practices
 
RIMS: Remote Infrastructure Management Services
RIMS: Remote Infrastructure Management Services RIMS: Remote Infrastructure Management Services
RIMS: Remote Infrastructure Management Services
 
NZS-4555 - IT Analytics Keynote - IT Analytics for the Enterprise
NZS-4555 - IT Analytics Keynote - IT Analytics for the EnterpriseNZS-4555 - IT Analytics Keynote - IT Analytics for the Enterprise
NZS-4555 - IT Analytics Keynote - IT Analytics for the Enterprise
 
MIRAI - Managing Industry Restructuring and Adoptions Inquisitively
MIRAI - Managing Industry Restructuring and Adoptions InquisitivelyMIRAI - Managing Industry Restructuring and Adoptions Inquisitively
MIRAI - Managing Industry Restructuring and Adoptions Inquisitively
 
VMworld 2013: SDDC IT Operations Transformation: Multi-customer Lessons Learned
VMworld 2013: SDDC IT Operations Transformation:  Multi-customer Lessons LearnedVMworld 2013: SDDC IT Operations Transformation:  Multi-customer Lessons Learned
HA & DR System Design - Concepts and Solution

  • 1. Continuity and Resilience (CORE): ISO 22301 BCM Consulting Firm
      Presentations by our partners and extended team of industry experts
      Our Contact Details:
      • INDIA: Continuity and Resilience, Level 15, Eros Corporate Tower, Nehru Place, New Delhi-110019. Tel: +91 11 41055534 / +91 11 41613033. Fax: +91 11 41055535. Email: neha@continuityandresilience.com
      • UAE: Continuity and Resilience, P. O. Box 127557, Abu Dhabi, United Arab Emirates. Mobile: +971 50 8460530. Tel: +971 2 8152831. Fax: +971 2 8152888. Email: info@continuityandresilience.com
  • 2. HA & DR Design Concepts
      S Seshadri, Head – IT DR & Service Management, Continuity and Resilience
      10th Feb, 2014, Dubai
  • 3. Outage Categorization
      • Service failures that should not, or need not, be known to end users need ‘fault protection’: such services operate continuously despite failure scenarios
      • Short interruptions (within a few hours) are referred to as ‘minor outages’
      • Longer interruptions, when end users’ business services are delayed for longer durations, are termed disaster situations or ‘major outages’
  • 4. Key Questions
      1. Which systems should ‘never’ fail? We may need Fault Tolerant systems in their place.
      2. What failures should be handled transparently, where an outage must not occur? Against such failures we need fault protection.
      3. How long may a short-term interruption be that happens once a day, once a week, or once a month? Such interruptions are called minor outages.
      4. How long may a long-term interruption be that happens very seldom and is related to serious damage to the IT system? For instance, when will this cause a big business impact, also called a major outage or disaster?
      5. How much data may be lost during a major outage? And in which state: persistent or ephemeral?
      6. What failures are deemed so improbable that they will not be handled, or what failures are beyond the scope of a project?
  • 5. Business Issues & Cost of IT Outage
      • IT Fault Protection has to be driven by business considerations
      • Business Continuity is the overall goal
      • Business imperatives manifest through BIA/RA and MTPoD/RTO/RPO
      • IT Outage is not the real issue, but the business consequences are
      • IT Outage affects revenues & costs adversely
      • Direct Costs: repairs, penalties, lost revenue
      • Indirect Costs: lost & additional work hours
  • 6. Cost Vs Benefit
      • IT Recovery has extensive cost implications, both in terms of Capex and Opex
      • Strategies developed should be cost effective
      • A ‘Technology for the sake of Technology’ approach should be completely avoided
      • Strategies should, as far as possible, be able to address disruptions and impacts collectively
      • Organizational objectives and risk appetite should direct recovery strategies
      • Legal, contractual and regulatory aspects play a major role (SOX, SAS 70, Basel II/III…)
  • 7. IT Service Outage
      • Importance of IT Services depends on
          – Business relevance
          – Revenues
          – Functionality that they enable
          – Amount of damage due to the outage
          – Any regulatory aspect that demands the service
      • Outage Categorization is dictated by the importance of the service and hence the significance of its failure
  • 8. High Availability
      • High availability is the characteristic of a system to protect against or recover from minor outages in a short time frame with largely automated means.
      • HA has 3 essential features
          – Robustness: outage categorization is ‘minor’; we need to envisage potential failure scenarios for the service and the minor outage requirements for them
          – RTO/RPO: the system category should involve Mission Critical, Business Important and Business Foundation processes which need to be recovered within a very short time
          – Redundancy: component (SPoF) level protection which will facilitate automatic recovery
      • HA features are normally built within the primary data center and data replication is synchronous
  • 9. Continuous Availability
      • Continuous Availability is the highest point of High Availability, wherein every component failure is protected against, and no ‘after failure recovery’ takes place
      • These are known as Fault Tolerant systems, which provide automatic, high-speed ‘failover’ in the case of h/w or s/w failures
      • They have an ‘internal multi-computer systems architecture’ with no shared central components, including memory
      • Tandem’s ‘non-stop’ systems and Stratus’s fault tolerant computers are examples of this
      • These are used by the leading stock exchanges globally (NSE in India uses Stratus and BSE uses Tandem), and by banks for their ATM related transaction processing
      • These systems scale extremely well to the largest commercial workloads
      • Fault tolerant designs of this kind are also used outside commercial IT, for example in the on-board flight controls of the Airbus A-320
  • 10. HA Components
      Essential ingredients of High Availability are:
      • Availability
      • Reliability
      • Serviceability
      We will discuss the above three in the following slides.
  • 11. Availability & Metrics
      • Availability: how long a service or system component is available for use, and the features that help the system stay operational despite the occurrence of failures, e.g. redundant NICs, Mirrored Disks, Redundant Power Supply
      • Availability = uptime / (uptime + downtime)
      • Downtime includes scheduled downtime as well
      • Elapsed time can be measured as wall clock time
      • Availability can be expressed in absolute numbers (e.g. 79 hrs out of 80 hrs) or as a percentage (e.g. 99.89%)
      • Availability = MTBF / (MTBF + MTTR)
          – MTBF: Mean Time Between Failures
          – MTTR: Mean Time To Repair
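The two availability formulas on slide 11 can be sketched in a few lines of Python. This is a minimal illustration only: the function names and the sample inputs are assumptions made for the example and do not come from the slides.

```python
HOURS_PER_YEAR = 8760.0  # assumption: a non-leap year of wall clock time


def availability_from_times(uptime_hours: float, downtime_hours: float) -> float:
    """Availability = uptime / (uptime + downtime), downtime including scheduled downtime."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("total elapsed time must be positive")
    return uptime_hours / total


def availability_from_mtbf(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_hours(availability: float) -> float:
    """Expected downtime per year implied by an availability figure."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    # 79 hours of service in an 80 hour window
    print(f"{availability_from_times(79, 1):.4%}")
    # hypothetical component: fails every 1,000 h on average, 2 h to repair
    a = availability_from_mtbf(1000, 2)
    print(f"{a:.4%}, about {annual_downtime_hours(a):.1f} h downtime per year")
```

Expressing the result as expected downtime per year is often easier to discuss with business owners than a percentage.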
  • 12. Reliability & Metrics
      • Reliability is a measure of ‘fault avoidance’
      • Refers to the ‘probability that a system will operate without failure over a time interval T’
      • MTBF is a measure of Reliability
      • Annual Failure Rate (AFR) is the inverse of MTBF expressed in years (AFR = 8,760 / MTBF in hours)
      • Reliability features help to ‘prevent’ and ‘detect’ failures
      • H/w reliability has improved tremendously over the last 30 years and hardware is highly resilient nowadays

      Component       MTBF (Hours)   MTBF (Years, approx.)   AFR (per year)
      Disk Drive      300,000        34                      0.0292
      Power Supply    150,000        17                      0.0584
      Fan             250,000        28                      0.0350
      NIC             200,000        23                      0.0438
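The AFR column on slide 12 follows from the MTBF column by dividing the hours in a year by the MTBF in hours. A small sketch that reproduces the figures, assuming 8,760 hours per year; the dictionary and function names are illustrative only.

```python
HOURS_PER_YEAR = 8760  # assumption: non-leap year

# MTBF values in hours, taken from the slide's table
COMPONENT_MTBF_HOURS = {
    "Disk Drive": 300_000,
    "Power Supply": 150_000,
    "Fan": 250_000,
    "NIC": 200_000,
}


def mtbf_years(mtbf_hours: float) -> float:
    """Convert an MTBF given in hours to years."""
    return mtbf_hours / HOURS_PER_YEAR


def annual_failure_rate(mtbf_hours: float) -> float:
    """AFR is the inverse of MTBF expressed in years."""
    return HOURS_PER_YEAR / mtbf_hours


if __name__ == "__main__":
    for name, hours in COMPONENT_MTBF_HOURS.items():
        print(f"{name:<13} {mtbf_years(hours):6.1f} years  AFR {annual_failure_rate(hours):.4f}")
```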
  • 13. Serviceability
      • A measurement that expresses how easily and quickly a system is serviced and repaired
      • The lower the planned service time, the higher the availability
      • Planned serviceability goes into the architecture as a design objective
      • Actual service time should not exceed the planned service time
      • These clauses have to be carefully built into the Service Level Agreements with IT vendors
      • Murphy’s Law: Anything that can possibly go wrong, does
  • 14. HA/DR Strategy - Aspects
      • Data: what is the architecture concerned with
      • Function: how is the data worked with
      • Location: where is the data worked with
      • People: who works with the data and achieves the functionality
      • Time: when is the data processed
      Each of the above aspects is run through 3 levels of abstraction
      • Objectives: what will this achieve vis-a-vis organizational objectives
      • Conceptual Model: realization of the objectives on a business process level
      • System Model: logical data model and the application functions that must be implemented to realize the business concepts
  • 15. HA/DR Framework (Zachman)
      Columns: Objectives | Conceptual Model | System Model
      • Data (What): Business Continuity / IT Service Continuity | Availability of mission-critical and important business services | ICT categories, dependency diagrams
      • Function (How): Map biz processes to IT services, RTO, RPO, SLA | ITIL processes, IT processes, projects | Design patterns: RAS, redundancy, backup, replication, virtualization
      • Location (Where): Internal (IT), Outsourced | Data Center, Disaster Recovery Center | All systems, all categories
      • People (Who): Biz process owner | CIO/IT dept | IT PM, Architect, System Engineers, System Administrators
      • Time (When): Implementation Plan | Outage scenarios, categories | Failure/Change/Incident/Problem/Disaster
  • 16. HA/DR System Design
      • The System Model discussed earlier is the core of this activity
      • The ‘What’ and ‘How’ of the System Model lay the foundation for HA/DR System Design
      • Protection against outages of computers, systems and databases is in scope for HA
      • Protection against infrastructure, building or city-wide outages and against user/administrative errors is in scope for DR
      • Sound processes, solid architecture, careful engineering and an eye for detail are the hallmarks of a good HA/DR system design
  • 17. HA/DR Touch Points
      • User Environment
      • Administration Environment
      • Application
      • Middleware
      • Network Infrastructure
      • Operating System
      • Hardware (Servers, Storage, Backups etc.)
      • Physical Environment (Power, Fire, Floods etc.)
  • 18. HA/DR Scoping
      • Take into account regulatory aspects (SOX, SAS, Basel II)
      • Identify the key applications (from business BIAs)
      • Check out the various ICT environments required by these applications (IT BIA)
      • Identify the dependencies
      • Carefully identify and document the component categories that are not required (scope exclusions)
      • Prepare the preliminary system scope: the list of component categories required for HA/DR
      • Identify failure scenarios for each of these component categories
      • Document the failure scenarios that are outside the scope
      • The component categories and the failure scenarios will constitute the scope of HA/DR
  • 19. Redundancy & Replication
      • Redundancy is the ability to continue operations in the case of component failures
      • Recovery is done through ‘managed component repetition’
      • Eliminating ‘single points of failure’ is the goal
      • Just adding a second component is not enough
      • The replicated component has to be ‘managed’ to take over in case the original component fails (failover)
      • This ‘management’ can be automated or manual
      • Replication of the ‘state’ of the component is crucial
      • Replication may be a duplicate part, an alternate system (HA) or an alternate location (DR)
      • 100% redundancy through replication is very expensive and difficult to achieve
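To give a feel for why eliminating single points of failure matters, the following sketch uses the standard series/parallel availability arithmetic. It rests on simplifying assumptions that are not in the slides: component failures are independent, and failover to a redundant component is perfect and instantaneous. As the slide warns, a real second component only helps if the takeover is actually managed. Function names are illustrative.

```python
from math import prod


def series_availability(availabilities: list[float]) -> float:
    """All components are needed: each one is a single point of failure."""
    return prod(availabilities)


def parallel_availability(availability: float, copies: int) -> float:
    """Service survives as long as at least one of `copies` identical components works.

    Assumes independent failures and perfect, instantaneous failover.
    """
    if copies < 1:
        raise ValueError("need at least one component")
    return 1.0 - (1.0 - availability) ** copies


if __name__ == "__main__":
    single = 0.99  # hypothetical availability of one component
    print(f"two in series:   {series_availability([single, single]):.4%}")
    print(f"two in parallel: {parallel_availability(single, 2):.4%}")
```

Under these assumptions a chain of two 99% components is worse than either alone, while a redundant pair is markedly better, which is the quantitative case for redundancy at each single point of failure.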
  • 20. Data Replication
      • Redundancy for Disk Drives means ‘data replication’ and is hence very crucial
      • Redundant disks provide multiple storage of data and/or OS
      • Data disks carry one of the highest risks
      • OS disks usually house the root file system and swap space
      • Data Replication can be ‘synchronous’ or ‘asynchronous’
      • RPO considerations should dictate the data replication approach
      • For a very low or nil RPO, latency in data replication may not be tolerated (synchronous vs asynchronous)
      • Bandwidth considerations also impact replication
      • Data Deduplication technology, along with data compression, has in recent times removed many of the headaches involved in data replication
      • Two main types of data replication: Host based and Storage based
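The link between replication mode and RPO can be shown with a toy in-memory model. This is a sketch only: the class, its methods and the record-count view of data loss are assumptions for illustration, and no real replication product works this simply.

```python
from collections import deque
from typing import Optional


class ReplicatedStore:
    """Toy primary/replica pair illustrating synchronous vs asynchronous replication."""

    def __init__(self, synchronous: bool) -> None:
        self.synchronous = synchronous
        self.primary: list[str] = []
        self.replica: list[str] = []
        self.pending: deque[str] = deque()  # written on primary, not yet shipped

    def write(self, record: str) -> None:
        self.primary.append(record)
        if self.synchronous:
            # the write is acknowledged only once the replica also holds it
            self.replica.append(record)
        else:
            # acknowledged immediately; shipped to the replica later
            self.pending.append(record)

    def ship(self, max_records: Optional[int] = None) -> int:
        """Ship queued records to the replica; returns how many were shipped."""
        shipped = 0
        while self.pending and (max_records is None or shipped < max_records):
            self.replica.append(self.pending.popleft())
            shipped += 1
        return shipped

    def records_lost_on_primary_failure(self) -> int:
        """Records that exist only on the primary, i.e. the data-loss exposure."""
        return len(self.primary) - len(self.replica)


if __name__ == "__main__":
    sync_store = ReplicatedStore(synchronous=True)
    async_store = ReplicatedStore(synchronous=False)
    for i in range(3):
        sync_store.write(f"txn-{i}")
        async_store.write(f"txn-{i}")
    async_store.ship(max_records=1)
    print(sync_store.records_lost_on_primary_failure())
    print(async_store.records_lost_on_primary_failure())
```

The synchronous store never has records that exist only on the primary, at the price of waiting for the replica on every write; the asynchronous store acknowledges faster but exposes whatever is still queued.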
  • 21. Virtualization
      • Virtualization, as a concept, was demonstrated in the 1960s, when IBM’s Thomas J Watson Research Center simulated ‘multiple pseudo machines’ on a single IBM 7044 mainframe (the M44/44X project)
      • Virtualization allows multiple operating system (OS) instances to run concurrently on a single computer.
      • It is a means of separating hardware from a single OS, by “inserting an abstraction layer” into the software stack.
      • Each ‘Guest’ OS is managed by a Virtual Machine Monitor.
      • Virtualization Software can also collect a number of separate resources and “pool” them, even if the devices or resources remain in separate physical locations.
      • The end goal is sharing the resources and capabilities flexibly, under software control.
      • The part of the virtualization package that lets administrators interact with and control the VMs is referred to as the Virtual Machine Monitor (VMM) or Hypervisor software.
  • 22. Virtualization of Resources
      • Virtualized resources are supplied in logical units to application programs, freeing them from reliance on specific hardware
      • Virtualization of Servers allows a business to consolidate the workloads running on multiple servers onto just a few
      • Storage Virtualization hides the physical storage from applications on host systems and presents a simplified (logical) view, allowing them to reference the storage resource by its common name, whereas the actual storage could be on complex, multilayered, multipath storage networks.
      • RAID is an early example of storage virtualization.
      • Virtual CPU is one of the oldest concepts; it has enabled multiprocessing capability, handled by the OS
      • Virtual Memory is as old as Virtual CPU, and is again handled by the OS as part of Virtual Memory Management
      • Working within a virtualized environment may add some options and new flexibility to your HA and DR plans.
  • 23. Storage Virtualization
      • With regard to storage, the objective is to bring together multiple storage devices under unified command, whether they are from the same manufacturer or not, and without regard for their physical locations.
      • Once accomplished, the now-unified band of storage systems can be treated as a single, huge storage capacity that can be provisioned, managed, backed up to tape, and even replicated to offsite disaster recovery (DR) or high availability (HA) sites, with greater visibility, synchronized automation, and reduced management labour.
      • Even archiving, multi-level storage, and information lifecycle management (ILM) efforts can be made simpler, with older, slower, or cheaper storage units provisioned to handle the near-line or archival storage while newer, faster devices handle the current production processes.
  • 24. Host Clustering
      • Host clustering increases availability through redundancy on the host level: several hosts are used together to supply a set of services, where each service is not strictly associated with a specific computer
      • Host Clustering addresses
          – Hardware errors
          – OS errors
          – Application errors
      • Failover clusters allow a service to migrate from one host to another in the case of an error. They are the most used technology for high availability.
      • Load-balancing clusters run a service on multiple hosts from the start and handle outages of a host; they are more relevant for performance than for HA.
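The failover idea on slide 24, where a service is not tied to one host and migrates when its host fails, can be sketched as a small placement table. This is a deliberately simplified model under stated assumptions: the class and method names are hypothetical, failure detection is an explicit call, and the state transfer that real cluster software must perform is ignored.

```python
class FailoverCluster:
    """Toy active/passive failover cluster: services migrate off failed hosts."""

    def __init__(self, hosts: list[str]) -> None:
        if not hosts:
            raise ValueError("a cluster needs at least one host")
        # dicts keep insertion order, so earlier hosts are preferred
        self.healthy: dict[str, bool] = {host: True for host in hosts}
        self.placement: dict[str, str] = {}  # service name -> host name

    def _first_healthy_host(self) -> str:
        for host, ok in self.healthy.items():
            if ok:
                return host
        raise RuntimeError("no healthy host left in the cluster")

    def start_service(self, service: str) -> str:
        """Place a service on the first healthy host and return that host."""
        host = self._first_healthy_host()
        self.placement[service] = host
        return host

    def mark_failed(self, host: str) -> dict[str, str]:
        """Record a host failure and fail over its services.

        Returns a mapping of each moved service to its new host.
        """
        self.healthy[host] = False
        moved: dict[str, str] = {}
        for service, current in list(self.placement.items()):
            if current == host:
                target = self._first_healthy_host()
                self.placement[service] = target
                moved[service] = target
        return moved


if __name__ == "__main__":
    cluster = FailoverCluster(["node-a", "node-b"])
    cluster.start_service("database")
    print(cluster.mark_failed("node-a"))
```

A load-balancing cluster would differ in that every healthy host runs the service from the start, so a host failure only reduces capacity instead of triggering a migration.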
  • 25. Middleware
      • Generally considered to be the layer between the OS and the applications
      • Middleware components are independent of applications but carry application-specific configuration and are used by multiple applications
      • Database Servers, Web Servers, Application Servers and Messaging Servers are some examples
      • HA for these will include product specific clustering, data replication, and even session state replication
      • A properly configured failover cluster, sufficiently integrated with the DB Server, provides HA
      • Redo log file shipping (asynchronous), with commits delayed by the RPO, will provide the best DR
      • HA for Web Servers and Messaging Servers is achieved mostly through Load-balancing Clusters (stateless)
  • 26. HA for Applications
      • Application HA is the eventual goal
      • Application categories: Off the Shelf, Bought & Customized, In-house Built
      • A failover cluster is the approach most commonly adopted for all categories of applications
      • Applications touch the nerve center of all the following systems:
          – Development
          – Acceptance/Integration Test
          – Staging & Release
          – Production
          – Disaster Recovery
      • Suitable precautions must be taken during the coding and testing stages to ensure HA
  • 27. Networks
      • The network is the backbone of ICT as it provides the linkages and the ability to communicate between component categories
      • Various types of networks are LAN, VLAN, MAN, WAN, VPN, Intranet, Extranet, Internet
      • There are also n/w components that help build and run the networks: NICs, switches, routers, hubs, firewalls etc.
      • Connectivity is the most important element of networks
      • Data management on the network is done through encoding, data compression & encryption/decryption
      • Power supply and Heating, Ventilating & Air Conditioning (HVAC) are two other important considerations
      • It is absolutely essential to provide redundancy at both the network and the component level for network HA
      • Generally, there is no pay-load based state for any of these; hence two or more devices would ensure HA
  • 28. Data Backup and Restoration
      • A major requisite for HA & DR
      • Management of backed up data is equally important
      • Restoration of data must work effectively
      • Automated mechanisms exist
      • System/file/database backups are the key
      • Full or incremental backup
      • Consistency of the data state is crucial
      • Checkpoint functionality is useful in this context
      • Storage and handling of backup media is very significant
      • Remote (including at the DR site) storage of backups, including Tape Vaulting, should be institutionalized
      • Testing/recycling and proper maintenance of backup media
      • Backup on failover clusters should distinguish between physical and logical hosts in the cluster
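The difference between a full and an incremental backup comes down to which files are selected. The sketch below illustrates one common selection rule, based on modification time; the function name, the file names and the use of plain numeric timestamps are assumptions for the example, and real backup tools may instead track archive bits, change journals or block-level changes.

```python
from typing import Optional


def select_for_backup(
    file_mtimes: dict[str, float], last_backup_time: Optional[float]
) -> list[str]:
    """Return the files to copy in this backup run.

    With no previous backup (None) everything is selected: a full backup.
    Otherwise only files modified after the last backup are selected:
    an incremental backup.
    """
    if last_backup_time is None:
        return sorted(file_mtimes)
    return sorted(
        name for name, mtime in file_mtimes.items() if mtime > last_backup_time
    )


if __name__ == "__main__":
    # hypothetical files with modification times (seconds since some epoch)
    files = {"orders.db": 100.0, "app.log": 250.0, "settings.cfg": 300.0}
    print(select_for_backup(files, None))   # full backup
    print(select_for_backup(files, 200.0))  # incremental since t=200
```

Restoring from incrementals means replaying the last full backup followed by every incremental taken since, which is one reason the slide stresses that restoration must be tested and not just the backup.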
  • 29. HA & DR – Positioning
      • HA and DR are two sides of the same coin
      • Redundancy, Replication and Robustness are the key characteristics of both HA & DR
      • HA focuses on fault protection and is built mostly on automated recovery techniques for minor outages
      • HA is not built for environmental disasters like floods, fire and earthquakes, or for manmade incidents like terrorist attacks and human errors of huge magnitude
      • These additional scenarios and major outages lead to the need for DR, which focuses only on recovery
      • DR also involves a large part of manual recovery in terms of Emergency Management and Damage Assessment & Recovery, apart from IT Recovery
      • When the primary data center is unavailable, migration to the DR site will be the only option
  • 30. Disaster Recovery
      • Disaster recovery is the ability to continue with services in the case of major outages, often with reduced capabilities or performance.
      • Disaster recovery handles the disaster when either a single point of failure is the defect or when many components are damaged and the whole system is rendered non-functional.
      • Operations cannot be resumed on the same system or at the same site. Instead, a replacement or backup system, usually located at another place, is activated and operations continue from there.
      • Disaster recovery often restores only restricted resources and thus restricted service levels.
      • Continuation of service also does not happen instantly, but only after some outage time.
  • 31. DR in Context
      • IT DR is activated when the likely recovery time is above the least RTO and there is expected data loss
      • IT recovery will be limited only by the levels of service agreed by the business owners
      • IT DR activities will be carried out from the DR site, which should be fully equipped to handle IT services up to the agreed levels
      • Scaling up the IT services in due course of time will generally be outside the purview of DR Planning
      • Agreed levels of IT services are resumed at the DR site using the infrastructure and backup data/tapes there
      • The roles of primary and DR sites are interchangeable, but not in the strict sense of HA
      • In the above scenario, both primary and DR sites will be functional, even though they may cater to different business activities/IT services
  • 32. DR and the Cloud • Cloud is the latest buzz word in outsourced business model • Leveraging cloud model can optimize DR procedures • Reduces the high cost of maintaining stand-by sites • Cloud service providers normally have state of the art systems and infrastructure, huge bandwidth, exacting security setup, apart from complying with relevant ISO guidelines and industry best standards. • According to recent Aberdeen study report, DR is the leading ‘use case’ for cloud • The key advantages are recovery times, virtualization and multi-site availability • Concerns regarding security, identity and compliance to various regulations do exist as the cloud model matures • With data volumes growing at the rate of 10 times every 5 years, cloud computing is likely to see a huge growth 32
  • 33. DR in the Supply Chain • A supply chain is basically a delineation of dependencies, depicting the various actors in the chain of a product or service from the vendor until it reaches the consumer • IT DR dependencies are manifold – internal customers, ICT equipment, external vendors and service providers, IT staff, etc. • DR planning should judiciously take into account the inherent risks in the supply chain and provision suitable mechanisms to handle them effectively, so that the DR goal is not derailed • Typically, if Data Center support is outsourced, there is a huge dependence on the Service Provider – timely availability of people, spares, replacements, etc. • Supply chain glitches can emerge from something as innocuous as the supply of consumables
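
The activation rule on the ‘DR in Context’ slide (31) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Service` class, the `should_activate_dr` function and the example RTO values are hypothetical and do not come from the presentation. The sketch reads the slide literally, so DR is invoked only when both conditions hold.

```python
from dataclasses import dataclass


@dataclass
class Service:
    """Hypothetical record for one IT service covered by the DR plan."""
    name: str
    rto_hours: float  # Recovery Time Objective agreed with the business owner


def should_activate_dr(services, estimated_recovery_hours, data_loss_expected):
    """Return True when both conditions from the slide hold:
    the likely recovery time exceeds the lowest agreed RTO,
    and data loss is expected."""
    if not services:
        return False  # nothing in scope, so nothing to recover
    least_rto = min(s.rto_hours for s in services)
    return estimated_recovery_hours > least_rto and data_loss_expected


# Example values are assumptions chosen for illustration only.
services = [Service("payments", 4.0), Service("reporting", 24.0)]
print(should_activate_dr(services, estimated_recovery_hours=6.0,
                         data_loss_expected=True))
```

In practice the decision is taken by people following the DR plan; a rule like this would at most feed a checklist or an alerting dashboard.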

Editor's notes

  1. Dividing the complete problem into the above layers enables us to think about potential problems and their solutions separately. This separation forms the basis for HA/DR scoping.
  2. Eliminations have to be documented and sign-off obtained. All the ‘scope exclusions’ must be recognized during risk management and might need to be handled in separate projects.
  3. ‘State’ does not just refer to data. Data state in DR situations generally differs due to the accepted RPO. ‘State’ also refers to files or registry entries in the case of s/w components, and to firmware releases in the case of h/w. Disks are redundant via the volume manager. Primary and secondary databases are redundant via the system administrator. The NIC is redundant through the OS (multipath configuration). In the above cases, the volume manager, system administrator or multipath configuration (VM/SA/MC) could itself be the SPoF.
  4. Through virtualization, Hewlett Packard consolidated no fewer than 86 of its own data centers to just three. The actual server counts and consolidation ratios vary; a ratio of 10:1 is not uncommon.
  5. H/w: no redundancy; the redundant component had an error; redundancy activation did not work. OS: process scheduling errors; hanging processes; memory management deficiencies; network traffic glitches; file system corruption. Apps: memory leaks due to applications getting into endless loops; deadlocks in communication processes; other software errors. Failover (F/O) clusters: active/active or active/passive; suitable for applications with stateful data.
  6. E.g.: No application must use machine-specific configuration of the physical host (as it will not recognize a virtualized host or a cluster node). Apps must provide start/stop/restart actions for exceptional conditions. Long batch jobs should have checkpoints that are validated at restart. Applications need to be designed for a cluster environment. Use a tiered development approach for applications: UI (front end), business logic (middleware), database (back end). Applications built from scratch should implement fault-tolerance requirements. Code quality is of paramount importance. Testing should cover function points, non-functional properties and end-to-end behavior.
  7. The Open Systems Interconnection Reference Model provides seven layers of abstraction for networks: Physical, Data Link, Network, Transport, Session, Presentation, Application. Popular network protocols are Ethernet, TCP/IP, Token Ring, Frame Relay, ATM, FC, etc. A network outage is generally considered a major outage. WAN outages are major and almost impossible to prevent entirely; even with multiple connections, the question remains whether they are independent or share some SPoF. Typical examples are the ‘last mile’ and the proverbial ‘digger’ syndrome. Other concerns include WAN virtualization dangers, ISPs, SLAs and penalties; the list goes on. Other network-based services such as DHCP, DNS, LDAP, AD, email and print have to be made redundant depending on the need.
  8. Detailed SOPs should be in place for all backup and restoration processes. Specific individuals should be assigned responsibility for backup duties.
  9. Internal and external cloud. Private and public cloud.
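
Note 6 recommends that long batch jobs keep checkpoints that are validated at restart. A minimal Python sketch of that idea follows. The function names, the JSON checkpoint layout and the `every` interval are assumptions made for illustration, not something the presentation prescribes.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the index of the next unprocessed item, or 0 if the
    checkpoint is missing, unreadable, or fails validation."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            next_index = json.load(f)["next_index"]
    except (FileNotFoundError, json.JSONDecodeError, KeyError, TypeError):
        return 0
    # Validate the stored value before trusting it at restart.
    if not isinstance(next_index, int) or next_index < 0:
        return 0
    return next_index


def save_checkpoint(path, next_index):
    """Write the checkpoint to a temp file, then rename it into place,
    so a crash mid-write cannot leave a half-written checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump({"next_index": next_index}, f)
    os.replace(tmp_path, path)


def run_batch(items, process, checkpoint_path, every=100):
    """Process items in order, resuming from the last valid checkpoint.
    Returns the index at which this run started."""
    start = min(load_checkpoint(checkpoint_path), len(items))
    for i in range(start, len(items)):
        process(items[i])
        if (i + 1) % every == 0:
            save_checkpoint(checkpoint_path, i + 1)
    save_checkpoint(checkpoint_path, len(items))
    return start
```

Items handled after the last checkpoint but before a failure are processed again on restart, so `process` should be idempotent, or the checkpoint interval should be small enough that the repeated work is acceptable.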