VMworld 2013
Jeff Hunter, VMware
Ken Werneburg, VMware
5. 5
Disaster Recovery vs. Business Continuity
Example: Tuesday, August 23, 2011 at 1:51 PM EDT - Magnitude 5.8 earthquake near Mineral, Virginia
Disaster recovery required?
No
Interruption to business continuance?
YES!
6. 6
Fault Tolerance vs. High Availability
Fault tolerance
• Ability to recover from component loss
• Example: Hard drive failure
High availability
Uptime percentage in one year    Downtime in one year
99                               3.65 days
99.9                             8.76 hours
99.99                            52 minutes
99.999 (“five nines”)            5 minutes
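The downtime figures above follow directly from the uptime percentage. A minimal Python sketch of that arithmetic, assuming a 365-day year (the function name is illustrative and not from the slides):

```python
from datetime import timedelta


def downtime_per_year(uptime_percent: float, days_per_year: float = 365.0) -> timedelta:
    """Return the downtime allowed per year at a given uptime percentage."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    # Fraction of the year the service is allowed to be unavailable.
    unavailable_fraction = (100.0 - uptime_percent) / 100.0
    return timedelta(days=days_per_year * unavailable_fraction)


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999):
        print(f"{pct}% uptime allows {downtime_per_year(pct)} of downtime per year")
```

The slide rounds the last two rows down: 99.99% works out to roughly 52.6 minutes and 99.999% to roughly 5.3 minutes.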
7. 7
RTO, RPO, and MTD
Recovery Time Objective (RTO)
• How long it should take to recover
Recovery Point Objective (RPO)
• Amount of data loss that can be incurred
Maximum Tolerable Downtime (MTD)
• Downtime that can occur before significant loss is incurred
• Examples: Financial, reputation
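One way to make the three terms concrete is to compare an outage against them. The sketch below is only an illustration: it assumes RPO is expressed as a span of time (the age of the last recoverable copy), and the class name, function name, and sample values are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Objectives:
    rto: timedelta  # how long recovery should take
    rpo: timedelta  # how much data loss, measured in time, can be incurred
    mtd: timedelta  # downtime that can occur before significant loss


def evaluate_recovery(obj: Objectives, downtime: timedelta, data_loss: timedelta) -> dict:
    """Compare an actual (or tested) recovery against the stated objectives."""
    return {
        "rto_met": downtime <= obj.rto,
        "rpo_met": data_loss <= obj.rpo,
        "within_mtd": downtime <= obj.mtd,
    }


if __name__ == "__main__":
    # Hypothetical objectives for one application.
    objectives = Objectives(
        rto=timedelta(hours=4), rpo=timedelta(hours=1), mtd=timedelta(hours=24)
    )
    # Hypothetical outage: 6 hours down, 30 minutes of data lost.
    print(evaluate_recovery(objectives, timedelta(hours=6), timedelta(minutes=30)))
```

In this sample the RTO is missed while the outage is still inside the MTD, which is the gap between "slower than planned" and "significant loss".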
9. 9
VMware vFabric™ tc Server
vSphere App HA (new)
Policy-based
Protect off-the-shelf apps
10. 10
vSphere App HA
[Diagram: a vSphere HA cluster of vSphere hosts managed by vCenter Server, with the vFabric Hyperic virtual appliance, the vSphere App HA virtual appliance (new), and Hyperic agents running in VMs]
12. 12
vSphere HA – Keep In Mind…
RTO – measured in minutes (not seconds)
Requires shared storage
Best practices
• Use admission control – percentage policy
• Test post-failure performance with host maintenance mode
• Isolation response – leave powered on
• Network and storage redundancy
• Also see BCO5047
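For the percentage-based admission control policy, a common starting point is to reserve the share of the cluster that the tolerated host failures represent. The sketch below assumes equally sized hosts; the function name is illustrative, and real clusters with mixed host sizes need a more careful calculation.

```python
import math


def failover_capacity_percent(total_hosts: int, host_failures_to_tolerate: int = 1) -> int:
    """Percentage of cluster CPU/memory to reserve, assuming equally sized hosts."""
    if total_hosts < 2:
        raise ValueError("an HA cluster needs at least two hosts")
    if not 0 < host_failures_to_tolerate < total_hosts:
        raise ValueError("host_failures_to_tolerate must be between 1 and total_hosts - 1")
    # Round up so the reservation never falls short of a whole host.
    return math.ceil(100 * host_failures_to_tolerate / total_hosts)


if __name__ == "__main__":
    for hosts in (3, 4, 8):
        print(f"{hosts} hosts, 1 failure tolerated: reserve {failover_capacity_percent(hosts)}%")
```

Putting a host into maintenance mode, as the second bullet suggests, is a practical way to check that the remaining hosts perform acceptably with that reservation.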
13. 13
vSphere Fault Tolerance (FT)
Zero recovery time, zero data loss
• Host hardware failure only
• Does not protect against OS and application failure
Works fine with HA, App HA
Why not FT?
• Resource requirements – does workload really need it?
• VM has multiple CPUs – see BCO5065
• No VM snapshots – backups require agent
14. 14
Data Protection (Backup and Restore)
Agents? No Agents? – Both!
• No agents for majority of workloads – keep it simple
• Agents for certain apps
vSphere Data Protection (VDP) Advanced
• Backup and recovery for VMware, from VMware
• Based on proven, mature EMC Avamar™
• Agent-less VM backup and restore
• Agents for granular tier-1 application protection
16. 16
VDP Advanced – Keep In Mind…
Engineered for SMB environments
Uses VADP – VM snapshots, CBT
Utilizes Windows VSS in VMware Tools
Works fine with HA, not with FT
RDM – virtual yes, physical no
Is it DR?
• Maybe – depends on RTO, RPO
• Needs replication offsite, right? – see BCO5041
17. 17
VDP Advanced – Keep In Mind…
Best Practices
• Prepopulate DNS, always use FQDN
• Manage VM snapshots
• Avoid deploying to slow storage
• Do not power-off, always shut down gracefully
• Do not schedule backups during maintenance window
• Also see BCO4756 and BCO5041
18. 18
vCenter Availability
Run vCenter Server application in a VM
Run vCenter Server database in a VM
Run both in same VM?
Protect with vSphere HA
• vCenter and DB VM restart priority set to High
• Enable guest OS and App monitoring
App HA can protect SQL Server database
19. 19
vCenter Availability
Back up vCenter Server VM and database
• Image-level backup for vCenter Server VM
• App-level backup using agent for database backup
Why not FT for vCenter Server?
• vCenter Server requires minimum of 2 vCPUs
• FT does not protect against application failure
Replicate vCenter Server, database VMs?
20. 20
vCenter Availability – vCenter Server Heartbeat
Pros
• Better RTO and RPO – typically ~5 minutes
• Protects against host and guest OS failure
• Checks network connectivity
• Monitors application services and performance
Cons
• Complexity
• Requires double the resources
• Licensing cost
21. 21
vSphere Replication – DR
Native tool built into the platform
Per-VM hypervisor replication, managed in vCenter
Selectable RPO from 15 minutes up to 24 hours
Selectable destination datastore (disk-type agnostic)
22. 22
Replication Across Sites
[Diagram: at each site, a vCenter Server manages ESXi hosts, each with a VR agent (VRA) and NFC service, alongside a VR appliance and storage; VMDK1 is replicated from storage at one site to storage at the other]
23. 23
Four Steps for Full Recovery
1. Right-click, select “Recover”
2. Select a target folder
3. Select a target resource
4. Click Finish
Choices are validated as you go
24. 24
New Feature – Retain Historical Replicas
Retention of multiple points in time allows reversion to earlier known good states
After recovery, use the snapshot manager to revert to earlier points
25. 25
Multiple Points in Time (MPIT) Presented as VM Snapshots after Failover
Use the snapshot manager to revert to earlier points, an interface all administrators have been comfortable with for many years.
26. 26
vSphere Replication – Interoperability
Fault tolerance – doesn’t work with VR
• FT conflicts at the vSCSI disk filter level
VDP
• Mostly no problem!
• If using VSS… ensure you are using 5.5!
HA, vMotion, DRS
Storage vMotion and Storage DRS
• Now supported!
27. 27
vSphere Replication – Best Practices
RPO
• Only what is necessary!
• Just because you can…
RTO
• Don’t set one! No testing, no automation, manual process.
VSS – Only if necessary!
What about bandwidth?
• Very hard to determine. Do a local loopback first.
RDMs?
• Don’t use them. If you must, use virtual compatibility mode.
Don’t mix array-based replication (ABR) and VR!
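Bandwidth is hard to determine, but a measured change rate (for example from a local loopback replication) gives a rough lower bound: the data changed during one RPO window has to cross the link within that window. The sketch below is a back-of-the-envelope estimate only, not VMware's sizing method; it ignores compression, write bursts, and blocks that are overwritten several times in a window. The function name and sample figures are hypothetical.

```python
from datetime import timedelta


def min_average_bandwidth_mbps(changed_gb_per_window: float, rpo: timedelta) -> float:
    """Average Mbit/s needed to ship one RPO window's changed data within that window."""
    if changed_gb_per_window < 0:
        raise ValueError("changed_gb_per_window must not be negative")
    window_seconds = rpo.total_seconds()
    if window_seconds <= 0:
        raise ValueError("rpo must be a positive duration")
    changed_bits = changed_gb_per_window * 1e9 * 8  # decimal gigabytes to bits
    return changed_bits / window_seconds / 1e6


if __name__ == "__main__":
    # Hypothetical measurement: 10 GB of changed blocks per window.
    for rpo in (timedelta(minutes=15), timedelta(hours=4)):
        mbps = min_average_bandwidth_mbps(10, rpo)
        print(f"RPO {rpo}: at least {mbps:.1f} Mbit/s on average")
```

The two sample rows illustrate the "only what is necessary" advice: for the same amount of changed data per window, a tighter RPO requires a proportionally faster link.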
28. 28
SRM
What is it?
• A Disaster Recovery engine
• A tool that uses externally replicated data (VR or array based) to shorten the RTO of a business continuity plan (BCP)
• A product that allows for DR to be tested, automated, planned, repeatable and customizable
What is it not?
• A replication engine
• A tool for systems that need near-instant RPO
• A disaster avoidance stretched cluster
29. 29
Key Components of SRM
• Replication
• vCenter Server: one (Windows or VCVA) per site, same versions
• SRM Server: one per site, same versions
• vSphere hosts: same versions recommended per site (pre-vSphere 5.x only if using array replication)
vSphere Essentials Plus and higher editions supported
30. 30
SRM Replication Options
SRM can utilize BOTH array-based AND vSphere Replication
SRM will “see” existing standalone vSphere Replication protected VMs
SRM can install vSphere Replication from scratch if needed
[Diagram: multi-tier apps (Web, App, DB) protected by vSphere Replication and by storage-based replication (LUN 1, LUN 2)]
31. 31
Recovery Workflows
Failover Automation
• User-defined recovery plan
• Minimize errors
Non-disruptive Failover Testing
• Isolated test environment
• Increase confidence in DR process
Planned Migration
• Zero data loss
• Operational migration
Failback Automation
• Re-protect VMs, migrate back
32. 32
SRM Interoperability
Works with VR and ABR
Backups, VADP or other, are fine
HA is no problem at all
vMotion and DRS are fine
Storage vMotion and Storage DRS – Sort of…
• Replication dependent
FT is “yellow”
• Array replicated only and the FT status is not recovered
Web vs vSphere Client
33. 33
SRM – A Few Best Practices
Not exhaustive
How long is VMworld?
Big ones:
• Storage layout
• Test network configuration
• Test often!
• Size vCenter correctly
Biggest one: Do a Business Impact Analysis
• RPO, RTO, cost of downtime, interdependencies, criticality of applications, priorities, units of failover, overlooked externalities, executive buy-in, …
34. 34
SRM Further Detail at VMworld
• BCO5733 - vCenter Site Recovery Manager – Solution Overview and Lessons from a Fortune 500 Health Care Company Implementation
• BCO5129 - Protection for All - vSphere Replication & SRM Technical Update
• BCO5170 - DR to The Cloud with VMware Site Recovery Manager and Rackspace Disaster Recovery Planning Services
• BCO5652 - Three Quirky Ways to Simplify DR with Site Recovery Manager
• BCO4905 - Disaster Recovery Solution with Oracle Data Guard and Site Recovery Manager
35. 35
Protection Groups (PGs)
More PGs = more granular testing/failover
• DR testing is easier – fewer resource requirements
• Fail over only what is needed
• More configuration/complexity
Fewer PGs = less complex
• Fewer LUNs, PGs, recovery plans
• Less flexibility
Find a good balance between flexibility and simplicity
Fewer LUNs/PGs: less complexity, less flexibility
More LUNs/PGs: more complexity, more flexibility
Right combination of complexity and flexibility varies by customer
Majority of outages are partial (not entire data center) – design accordingly
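The trade-off can be seen by listing which VMs would fail over together under a given storage layout. The sketch below assumes array-based replication, where VMs that share a replicated LUN move as one unit; the VM and LUN names are hypothetical and no SRM API is involved.

```python
from collections import defaultdict


def units_of_failover(vm_to_lun: dict) -> dict:
    """Group VMs by the LUN they live on; each group fails over as one unit."""
    groups = defaultdict(list)
    for vm, lun in sorted(vm_to_lun.items()):
        groups[lun].append(vm)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical layout A: everything on one LUN (simple, inflexible).
    consolidated = {"web01": "lun1", "app01": "lun1", "db01": "lun1", "hr01": "lun1"}
    # Hypothetical layout B: one LUN per application (more PGs, more flexible).
    per_app = {"web01": "lun1", "app01": "lun1", "db01": "lun1", "hr01": "lun2"}
    for name, layout in (("consolidated", consolidated), ("per-app", per_app)):
        print(name, units_of_failover(layout))
```

With the per-app layout a partial outage affecting only the HR system can be handled without failing over the multi-tier application, at the cost of one more LUN, protection group, and recovery plan to maintain.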
36. 36
Test Network
• Use VLAN or isolated network for test environment
• Default “Auto” setting does not allow VM communication between hosts
• Different vSwitch can be specified in SRM for test versus run
• Specified in Recovery Plan
40. 40
VMware – Multiple Levels of Protection
[Diagram: SQL VMs at Site A and Site B, protected by vSphere HA/FT, VDPA, and VR/SRM]
41. 45
Other VMware Activities Related to This Session
HOL: HOL-SDC-1305, Business Continuity and Disaster Recovery In Action
VMworld Session: BCO-5160, Implementing a Holistic BC/DR Strategy – Part 1