AITP July 2012 Presentation - Disaster Recovery - Business + Technology
1. Disaster Recovery
Business & Technology
AITP Charleston
July 19, 2012
Andrew Miller
Senior Technical Consultant
t: @andriven w: www.thinkmeta.net
Varrow
2. One Big Reason to Do This
Expectations for Disaster Recovery ≠ IT Capabilities for Disaster Recovery
3. What is a Disaster?
• Disaster: An event that affects a service or system such
that significant effort is required to restore the original
performance level.
» IT Service Management Forum
• But what does that look like IN OUR ENVIRONMENT?
• What disaster and recovery scenarios should we plan for?
• Where do we begin?
• How do we do it?
5. Disaster Recovery vs. Operational Recovery
• Disaster Recovery
– To cope with & recover from an IT crisis that moves work to an
alternative system in a non-routine way.
– A real “disaster” is large in scope and impact
– DR typically implies failure of the primary data center and recovery to an
alternate site
• Operational Recovery
– Addresses more “routine” types of failures (server, network, storage,
etc.)
– Events are smaller in scope and impact than a full “disaster”
– Typically implies recovering to alternate equipment within the primary
data center
• Business expectations for the recovery timeframe are typically
shorter for “operational recovery” issues than for a true “disaster”
• Each should have its own clearly defined objectives
6. Risks, Threats and Vulnerabilities
Risk is a function of the likelihood of a given threat
acting upon a particular potential vulnerability,
and the resulting impact of that adverse event on
the organization.
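The definition above can be turned into a rough ranking exercise. The slide gives no formula, so the sketch below assumes a common qualitative convention: score likelihood and impact on 1 to 5 scales and multiply them. The `Threat` class, the scales, and the example scores are all hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threat:
    name: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int      # assumed scale: 1 (minor) to 5 (severe)


def risk_score(threat: Threat) -> int:
    """Likelihood x impact: an assumed scoring convention, not the slide's formula."""
    return threat.likelihood * threat.impact


def rank_threats(threats: list[Threat]) -> list[Threat]:
    """Return threats ordered from highest to lowest risk score."""
    return sorted(threats, key=risk_score, reverse=True)


# Illustrative scores only; a real assessment would come from the BIA interviews.
threats = [
    Threat("Extended power outage", likelihood=2, impact=4),
    Threat("Human error", likelihood=4, impact=3),
    Threat("Earthquake / Volcano", likelihood=1, impact=5),
]

for threat in rank_threats(threats):
    print(f"{threat.name}: {risk_score(threat)}")
```

A ranking like this is only a starting point for discussion; the scales and weights would need to be agreed with the business before they mean anything.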
7. Some threats that can cause Disasters…
• Human Error
• Localized IT systems / network failure
• Extended power outage
• Telecommunications outage
• Storm / Weather damage
• Earthquake / Volcano
• Fire in the facility
• Facility flooding
• Local evacuation
• Cyber attack
• Sabotage
8. (Varrow) Disaster Recovery Approach
• Interviews with key personnel to understand Business Process priorities
and establish Business Impact Analysis (BIA).
• Review existing IT production infrastructure, including applications,
servers, storage, network, and external connectivity. Identify Risks and
Gaps.
• Establish Disaster Impact Scenarios and Disaster Recovery strategies to
meet requirements.
• Recommend Roadmap for establishing recovery capabilities and
documenting plans.
• Implement required recovery capabilities.
• Develop framework and content for IT DR Plan.
• Develop maintenance and test procedures for IT DR Plan.
• Address Business Continuity requirements and planning as appropriate.
9. What is the Business Impact Analysis?
• A conversation between IT and key stakeholders to
understand:
– What are the most time-critical and information-critical
business processes?
– How does the business REALLY rely upon IT Service and
Application availability?
– What are the Student, Financial, Regulatory, Reputational,
and other impacts of IT Service and Application
unavailability?
– What availability or recoverability capabilities are justifiable
based on these requirements, potential impact, and costs?
10. Disaster Recovery: Key Measures
• Recovery Point Objective (RPO): Amount of data lost from failure, measured as the amount of time from a disaster event
• Recovery Time Objective (RTO): Targeted amount of time to restart a business service after a disaster event
[Figure: timeline from 5 a.m. to 7 p.m. with DECLARE DISASTER marked at 10 a.m.; RPO is labeled on the side before the disaster, RTO on the side after it.]
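As a rough illustration of the two measures, the sketch below computes the data-loss window and the downtime from three timestamps and compares them with target objectives. The function names, the example date, and the example times are hypothetical; only the 10 a.m. disaster time echoes the timeline.

```python
from datetime import datetime, timedelta


def data_loss_window(last_recovery_point: datetime, disaster: datetime) -> timedelta:
    """Time between the last recoverable copy of the data and the disaster."""
    return disaster - last_recovery_point


def downtime(disaster: datetime, service_restored: datetime) -> timedelta:
    """Time between the disaster and the restart of the business service."""
    return service_restored - disaster


def meets_objectives(last_recovery_point: datetime, disaster: datetime,
                     service_restored: datetime,
                     rpo: timedelta, rto: timedelta) -> bool:
    """True when both the data-loss window and the downtime are within target."""
    return (data_loss_window(last_recovery_point, disaster) <= rpo
            and downtime(disaster, service_restored) <= rto)


# Hypothetical event: last good copy at 6 a.m., disaster at 10 a.m., service back at 3 p.m.
last_copy = datetime(2024, 1, 1, 6, 0)
disaster = datetime(2024, 1, 1, 10, 0)
restored = datetime(2024, 1, 1, 15, 0)

print(data_loss_window(last_copy, disaster))  # 4:00:00
print(downtime(disaster, restored))           # 5:00:00
print(meets_objectives(last_copy, disaster, restored,
                       rpo=timedelta(hours=4), rto=timedelta(hours=4)))  # False
```

In this example the 4-hour RPO is met but the 4-hour RTO is missed, which is the kind of gap between expectation and capability the BIA is meant to surface.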
11. Disaster Recovery: Key Measures
• Recovery Time Objective (RTO)
Maximum duration of disruption of service
• Recovery Point Objective (RPO)
Point in time to which application data is recovered / Maximum data loss
[Figure: Recovery Point and Recovery Time scales, each running Weeks, Days, Hours, Minutes, Seconds toward Real Time at the center, plotted against Cost.]
12. BIA - Example Priority Tiers
• Priority 1: High Availability / Immediate Recovery
– Services whose unavailability for more than a brief period can have a severe impact on customers or time-critical business operations.
• Priority 2: 1-2 day recovery
– Services whose unavailability significantly impacts customers or business operations.
• Priority 3: 3-5 day recovery
– Services which can tolerate up to five days of disruption in a disaster.
• Priority 4: 6-10 day recovery
– Services which can tolerate up to ten days of disruption in a disaster.
– Priority 3 and 4 systems may be restored in less time, depending on the situation. However, higher priority functions will be restored first.
• Priority 5: “Best effort” recovery
– Non-critical services which can tolerate two weeks or more of disruption in a disaster. These systems will be restored on a best-effort basis, after other more critical systems have been restored and ongoing operations have resumed.
– Priority 5 systems may be restored in less time, depending on the situation. However, higher priority functions will be restored first. In some cases, systems deemed to not be required for continued operations may not be restored.
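The example tiers can be sketched as a small classification function. Two boundaries are assumptions, because the slide does not pin them down: Priority 1 is treated here as anything that cannot tolerate a full day of disruption, and services tolerating more than ten days fall into Priority 5. The service names and tolerances are hypothetical.

```python
def priority_tier(max_tolerable_days: float) -> int:
    """Map the longest tolerable disruption (in days) to an example priority tier."""
    if max_tolerable_days < 1:    # assumed cutoff for "more than a brief period"
        return 1
    if max_tolerable_days <= 2:   # 1-2 day recovery
        return 2
    if max_tolerable_days <= 5:   # 3-5 day recovery
        return 3
    if max_tolerable_days <= 10:  # 6-10 day recovery
        return 4
    return 5                      # best-effort recovery (assumed for anything above 10 days)


def restoration_order(services: dict[str, float]) -> list[str]:
    """Order services so that higher-priority tiers are restored first."""
    return sorted(services, key=lambda name: priority_tier(services[name]))


# Hypothetical services with their maximum tolerable disruption in days.
services = {"intranet wiki": 14, "payroll": 4, "order entry": 0.25, "email": 2}

for name in restoration_order(services):
    print(f"Priority {priority_tier(services[name])}: {name}")
```

The real inputs would be the tolerances agreed in the BIA conversation, not figures chosen by IT alone.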
13. What does it take to RECOVER from an IT Disaster?
• Data Protection
– Backups, Replication
• Recovery Facility
– Location to rebuild IT infrastructure or provision services
• Data Recovery & Storage
– Get Data into a form that is usable
• Servers / Compute Capacity
– Sufficient servers or virtual compute capacity to actually run the applications
• Network, Voice, and Data Communications
– Connect servers, storage and workers
– Connect the recovery site to work sites
– Communicate with customers
– Includes network, telecom, demarcation equipment; cabling; telecom provisioning
• DR Plan
– Documented and tested procedures for what to do, and how to do it
• People
14. Example Disaster Recovery Strategies
• Priority 1: 4 hour RTO or less
– Disaster Recovery Strategy: Establish hot site for systems and data in a secondary data center at a remote location that is unlikely to be impacted by a local or regional event.
– Data Protection Approach: Replicate / remote mirror / short interval remote disk-to-disk backup
• Priority 2: 24-48 hour RTO
– Disaster Recovery Strategy: Maintain sufficient remote physical or virtual infrastructure for restoration. Ensure sufficient space/power in recovery facility.
– Data Protection Approach: Remote disk-to-disk backup
• Priority 3: 72 hour RTO
– Disaster Recovery Strategy: Ensure ability to quickly acquire infrastructure for restoration. Ensure sufficient space/power in recovery facility.
– Data Protection Approach: Tape (with sufficient off-site rotation) or remote disk-to-disk backup
• Priority 4: 1-2 week RTO
– Disaster Recovery Strategy: Ensure ability to quickly acquire infrastructure for restoration. Ensure sufficient space/power in recovery facility.
– Data Protection Approach: Tape (with sufficient off-site rotation) or remote disk-to-disk backup
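One way to use the example strategies is as a lookup: given the RTO a service requires, pick the least demanding strategy that still delivers it. The sketch below assumes that each strategy delivers the upper bound of its RTO band (with 1-2 weeks taken as 336 hours) and that anything tighter than the slowest qualifying band falls back to Priority 1. The `Strategy` type and `strategy_for` function are hypothetical names.

```python
from typing import NamedTuple


class Strategy(NamedTuple):
    priority: int
    max_rto_hours: float  # assumed: upper bound of the tier's RTO band
    recovery: str
    data_protection: str


STRATEGIES = [
    Strategy(1, 4, "Hot site in a remote secondary data center",
             "Replicate / remote mirror / short interval remote disk-to-disk backup"),
    Strategy(2, 48, "Maintain sufficient remote physical or virtual infrastructure",
             "Remote disk-to-disk backup"),
    Strategy(3, 72, "Ensure ability to quickly acquire infrastructure",
             "Tape (with off-site rotation) or remote disk-to-disk backup"),
    Strategy(4, 336, "Ensure ability to quickly acquire infrastructure",
             "Tape (with off-site rotation) or remote disk-to-disk backup"),
]


def strategy_for(required_rto_hours: float) -> Strategy:
    """Return the slowest strategy whose RTO bound still fits the requirement."""
    for strategy in reversed(STRATEGIES):
        if strategy.max_rto_hours <= required_rto_hours:
            return strategy
    # Requirement is tighter than 4 hours: only the "4 hour RTO or less" tier applies.
    return STRATEGIES[0]


chosen = strategy_for(10)
print(chosen.priority, "-", chosen.recovery)
```

A 10-hour requirement lands on Priority 1 here because the 24-48 hour strategy cannot be relied on to meet it; relaxing the requirement to 48 hours would select Priority 2 instead.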
15. Storage Arrays + Replication
[Diagram: RecoverPoint bi-directional replication/recovery between a PRODUCTION SITE and an OPTIONAL DISASTER RECOVERY SITE, linked over Fibre Channel/WAN.]
• Production site: application servers, RecoverPoint appliance, SAN, storage arrays holding the production LUNs, the local copy, and the production and local journals
• Disaster recovery site: standby servers, RecoverPoint appliance, SAN, storage arrays holding the remote copy and the remote journal
• Write splitter options:
– Host-based write splitter
– Fabric-based write splitter
– Symmetrix VMAXe, VNX-, and CLARiiON-based write splitter
16. Site A (Primary) and Site B (Recovery)
[Diagram: each site runs vCenter Server, Site Recovery Manager, and vSphere; the two sites are linked by vSphere Replication and by storage-based replication.]
• vSphere Replication
– Simple, cost-efficient replication for Tier 2 applications and smaller sites
• Storage-based Replication
– High-performance replication for business-critical applications in larger sites
Note to Presenter: When EMC or its partners talk about remote replication, they usually mean replication between storage at two locations. The source and target are physically separated to reduce the risks associated with co-location. Remote replicated systems could be across a campus, across a town, or across the globe. The physical distance and the technology selected can affect how quickly you recover from a disruption and how much data is lost.

Organizations normally set requirements for how much lost data and how much time to come back online is acceptable. The recovery point objective (RPO) is the amount of data that can be lost, measured in terms of time, without being catastrophic to the business. The recovery time objective (RTO) is the amount of time that it takes to recover the data and restart your business services from the recovered data. Remote replication provides much lower RPOs (at or close to zero) and very small RTOs, depending on implementation. The bottom line is that replication is appropriate for all types of data, and the RPO and RTO you target are going to affect your implementation.

For multiple RPOs, and for remote replication with either zero or low RPO and near-instant to instant recovery with DVR-like technology, EMC offers the RecoverPoint family.