2. Database HA & DR
Experience…
Work with business to determine HA or DR
requirements for applications and data?
Design HA or DR solutions?
Administer HA or DR process?
Still learning MS SQL Server HA & DR
capabilities?
2
3. Scope of this Presentation
Data Availability
Data recovery
High availability
Disaster recovery
Technology Focus
MS SQL Server
Physical servers
SANs
In-depth how-to
(available elsewhere)
Partitioned views (federated)
Advanced DBA techniques
Custom application logic
3rd-party software solutions
Alternate DBMS engines
(e.g. Oracle; DB2)
HA on virtual machines
Complex scenarios & solutions
Load balancing
Presentation Focus Beyond Scope of Presentation
3
4. So, you need to make your
production database bulletproof…
Introduction to Data
Availability
4
5. Data Availability Continuum
Degrees of protection for information systems:
Business Risk Solution
Data Recovery Data loss Redundant data
High Availability Downtime of
database service
Redundant
system
components
Disaster
Recovery
Downtime of
business
operations
Redundant
systems
and facilities
5
6. Business Case for Availability
Keep business-
critical applications
available
Secondary:
Server maintenance
Protect against loss
of data center
Secondary:
Application
upgrades
Infrastructure
upgrades
High Availability Disaster Recovery
6
7. Service Level Agreement (SLA)
Permitted downtime (planned vs. unplanned?)
Acceptable data/transaction loss
Application response times
Mean time to recovery
Note: Database uptime is not equivalent to application availability
Failures of other application services
Network outages
Uptime SLA Downtime
per Year
Downtime
per Month
99.9% 8.76 hours 43.8 minutes
99.99% 52.6 minutes 4.38 minutes
99.999% 5.26 minutes 0.438 minutes
7
8. Protect What?
Application data stores
Databases
Files
Other data repositories
Database services
DBMS availability for applications
Application services
Application availability for users and external systems
Databases are the heart of most information
systems;
they deserve the highest affordable protection.
8
9. Database Failure Scenarios
Storage subsystem
Disk
Controller
Network
Server
Power
Operator errors
DBMS interruption
Drops / deletes
Application defects
DBMS defects
Data corruption
Physical Infrastructure
Failures
Logical Data Failures
9
10. Service Recovery Strategies
Standb
y
Mode
Failover Behavior SQL Server
Feature
Cold
standb
y
• Manual intervention
required to restore offline
data copy
• Backup and
restore
Warm
standb
y
• Data copy online and
ready
• Manual failover required
• Transaction log
shipping
• Database
mirroring
Hot
standb
y
• Automatic failover • Database
mirroring
• Failover clustering
10
11. Data Recovery—Terminology
Terminology varies for source vs. copy
High Availability
Strategy
Data Source Data Copy
Backup and Restore Database Backup
Log Shipping Primary Secondary
Standby
Database Mirroring Principal Mirror
Failover Clustering Primary
Active
Secondary
Passive
Standby
Inactive
11
13. Database Backups
Traditional backup types
Full backup
Differential backup
Transaction log backup
Disk is better than tape
First backup to disk (separate physical disk volume)
Detect exceptions encountered during backup
Verify backup files
Copy backup files to tape or remote disk
Data retention policy for backup files
13
14. Database Backup Strategy
Backup of user databases not sufficient for
recovery
System database
Master database
MSDB database
Model database
External data stores…
14
15. Synch with External Data
Stores
Synchronize recovered database with external
data stores:
Identity column seeds
Full-text indexes
(SQL Server 2000)
LDAP entries
File system objects
Other databases
15
16. Backup Retention Policy
Location of backup files
Duration of retention
Protection of sensitive data
Sarbanes/Oxley (SOX)
HIPAA
Internal policies for data management and
protection
Access to backups from offsite data storage
16
17. Data Recovery Process
Backup file sets
Full baseline, differential,
and transaction logs
Retrieving backup files
Offsite storage
Tape
Network copy
Dependency on multiple
people to get access to
backup files
Recovery strategy
depends on failure
scenario
Create comprehensive
failure matrix
Devise recovery strategy
for each scenario
Does worst-case recovery
scenario fit within SLA
parameters?
Recovery time; SLA
Include future data growth
in recovery plan
Fully test recovery
strategies—practice is
essential
17
19. High Availability
Minimize or avoid service downtime
Whether planned or unplanned
When components fail,
service interruption is brief or non-existent
Automatic failover
Eliminate single points of failure (as affordable)
Redundant components
Fault-tolerant servers
19
20. Redundant Components
Objective: Avoid single points of failure (where affordable)
Approach: Use redundant components for database service
Database server nodes
Server components
ECC RAM; failure-tolerant HW & OS
DBMS instance
User databases
Storage devices
Storage unit components
MPIO: Interfaces; paths; switches; controllers
RAID: Disks
Networking
MPIO: Interfaces; paths; switches
Data copies
E.g. Recovering torn page from mirror in SQL Server 2008
20
21. Transaction Log Shipping
Warm standby solution
Duplicate user database
Copy transaction logs to standby server & restore
Database available for read-only access
Users must disconnect for logs to be applied
Two database licenses required if querying
standby
Manual application failover
Supported on standard hardware
Possible data loss (unapplied transactions)
21
22. Database Mirroring
Redundancy at user database level
Duplicate copy of user database
Independent storage devices
Multiple copies of instance databases
Mirrored over private network channel
Mirror always redoing transactions from principal
Negligible impact on transaction throughput
Multiple mirroring modes:
High-availability: commit @ log on mirror; automatic
failover
High-protection: commit @ log on mirror; manual failover
High-performance: commit when logged on principal
Very fast automatic failover—seconds
Requires witness server
Mirror-aware application client connection
Provided by client library
Database connection string must specify both servers
Mirror may be available for read-only access
(snapshots)
Works with standard hardware
Local Storage
· local sys DBs
· mirror user DB
Local Storage
· local sys DBs
· source user DB
node A node B
witness
(optional)
22
23. Mirror Witness
With mirroring, more than one server is required
to decide on failover
Witness automates failover from primary to mirror
Watches database availability
Reports observations back to principal and mirror
Runs in separate SQL Server instance (Express is
OK)
Prevents “split brain” scenario
Very low resource consumption
Can be witness for multiple databases
Not a single point of failure
23
24. SQL Server Failover Clustering
Shared Storage
· system DBs
· user DBs
· quorum
node A node B
Two clustered nodes
Active/Passive config
MS SQL services
Running on virtual
server
Shared storage device
User databases
System databases
Quorum drive
Redundant internal
components
24
25. Active/Passive Failover
Clustering
Redundancy at database instance
level
All databases fail over together
Shared copy of system databases
Single data copy on shared storage
device
No I/O overhead reducing throughput
Storage unit is single point of failure
for cluster
All database services are clustered
SQL Agent; Analysis Services; Full-
Text engine, MS DTC
Automatic failover (up to minutes)
DBMS accessed over virtual IP
Database not available from inactive
node for DB client connections
Storage is controlled by one cluster
node at a time
Requires hardware certified by
Microsoft for Microsoft Cluster
Service
Shared Storage
· system DBs
· user DBs
· quorum
node A node B
25
26. HA Comparison
Scope: user DB
Standard hardware
One SQL license
(unless querying snapshots
on mirror)
Very fast failover (seconds)
OS flexible (e.g. 32/64)
Independent storage
Independent services
Reporting on mirror
Geographic separation OK
Scope: DBMS instance
Certified hardware
One SQL license
(only one node can access
database)
Automatic failover (up to
minutes)
Enterprise OS
Shared storage
Clustered services
Standby not available
Servers are usually co-
located
Database Mirroring Failover Clustering
26
27. Considerations for HA
HA complements backup and recovery strategy
Does not replace data recovery plan
Application service availability is often determined
by a network of interdependent services
Availability can be difficult to define (e.g. partial
failures)
Failure probability difficult to measure or compute
Increased system complexity could lead to lower
service availability!
Operator error a leading cause of availability issues
Increased number/types of system components
More complex to configure and administer
27
30. Disaster Recovery
Minimize downtime of business operations
Redundant systems and facilities
SQL Server features:
Transaction log shipping
Database mirroring
Failover clustering
Other technologies
Storage-based mirroring
30
31. Disaster Recovery Planning
Data security requirements
Clarify SLA, data loss allowance
Evaluate system cost vs. data protection
Failure analysis
System redundancy
Process validation
Training for personnel
Prevention practices
Executing disaster recovery and business continuity
Practice, practice, practice
31
32. Business Continuity Facility
System redundancy
Systems: Web servers app servers; database,
etc.
Data: Databases; data files on OS; security info,
etc.
Networking: Domain, routing, subnet, VIPs, etc.
Alternate facilities
Network bandwidth
Physical or network access by operations staff
Failover
Often a deliberate decision, using manual failover
32
33. Data Redundancy
Synchronous redundancy
Network bandwidth cost
Network latency and application performance
Network reliability
Asynchronous redundancy
Risk of data loss
More cost-effective
Resilient to network latency issues
Candidate Technologies
SQL Server database mirroring
Failover clustering with SAN-based mirroring
33
34. DR Using Database Mirroring
Two sites: Primary and DR location
Separate failover clusters at each site
SQL Server database mirroring between sites
witness
(optional)
Shared Storage B
· local sys DBs
· local quorum
· mirror user DB
node B1 node B2
Shared Storage A
· local sys DBs
· local quorum
· source user DB
node A1 node A2
failover cluster at site A failover cluster at site B
database
mirroring
34
35. DR Using SAN-Based Mirroring
Two sites: Primary and DR location
Four-node failover cluster; one virtual IP
address
SAN-based mirroring between sites
Manual cluster failover
Shared Storage B
· system DBs
· quorum
· user DBs
node B1 node B2
Shared Storage A
· system DBs
· quorum
· user DBs
node A1 node A2
failover cluster nodes at site A failover cluster nodes at site B
storage-
based
mirroring
35
36. [Skip if time is running short.]
Complimentary Technologies36
37. SAN-Based Data Mirroring
Data blocks duplicated at storage level
Similar to transaction log shipping
Copy performed in sequence and coordinated
with database checkpoint
Ensures consistency of mirrored data files
Synchronous or asynchronous mirroring
Co-located or geographically dispersed—both are
OK
SAN link bandwidth must support database I/O rate
May require extra feature support from SAN
vendor
Could rely on Failover Clustering for HA
37
38. SQL Server Database
Snapshots
Read-only point-in-time database snapshot
No data is copied—instantaneous
Historical snapshot pages tracked separately
from changing pages
Snapshots can be maintained indefinitely
Limited only by available storage
Snapshot copy can be used for reporting
Read-only, so no locking issues
38
39. SQL Server Replication
Transactional
replication
High transaction volume
Low data latency
required
Mixed technologies:
Integrates with other
DBMS
Merge replication
Bi-directional data
changes
Typically server-to-client
Snapshot replication
Large, infrequent data
changes
Data change latency OK
Subscriber databases
available for reporting
Replicate data subsets
Some data loss is
possible
Periodically validate
replicated data
39
41. Considerations for App
Developers
App services tolerant to database service interruptions
Application transactions must be handled in code—data
consistency
Exception handling for transaction retry, connection recovery
Requires coding standards, code reviews, and testing
Bulk data operations
Transaction volume impacts rollback time during failover
Batch jobs must be run on alternate nodes
Don’t bypass transaction logging
Synchronization with external data sources?
Be aware of database recovery model
Mirroring uses FailoverPartner in connection string
Use TCP/IP as client protocol
41
42. Considerations for Admins
Use identical server hardware, when possible
Design network redundancies, when feasible
Consider network latency for geographic separation
Always manage through virtual cluster, not individual cluster nodes
Retest failover/failback after HA maintenance
Diagnose after failover
Repair alternate node
Resynchronize data, as necessary
Be aware of primary/secondary locations
Ensure application services are connected and functioning properly
Keep server node configurations synchronized:
Service pack and patch levels
Duplicate non-redundant resources
Jobs; logins and permissions; OS & sys objects
42
43. HA Risks
System performance degradation
HA system complexity leads to availability
issues
Some system failures not planned for
Backup and recovery planning incomplete
Administrators not fully trained or informed
User databases not synchronized with other
data sources
43
44. Common Admin Use Cases
Maintain HA nodes
Hardware maintenance
Rolling upgrades and software patches
Resynchronize the redundant copy
Re-synch mirror
Restart log shipping
Diagnose and repair
Diagnose cause of failover
Repair failed node and restore failover
capabilities
Test failover and failback
44
45. Common Admin Actions
Train and practice administrators to:
Initiate a database mirror
Manually failover mirror database or cluster
node
Add/remove passive node from mirror or
cluster
Upgrade/patch servers nodes
Restart or redirect application services
45
47. References—Books
Microsoft SQL Server 2008 High
Availability with Clustering &
Database Mirroring
by Michael Otey, 2009.
Microsoft SQL Server High
Availability
by Paul Bertucci, 2004.
Pro SQL Server 2005 High
Availability
by Allan Hirt, 2007.
Pro SQL Server 2005 Replication
by Sujoy Paul, 2006.
Pro SQL Server 2005 Service
Broker
by Klaus Aschenbrenner, 2007.
The Rational Guide to SQL
Server 2005 Service Broker
by Roger Wolter, 2006.
High Availability Related Topics
47
48. References—Presentations
48
Microsoft Load Balancing and Clustering
http://ce.sharif.edu/courses/84-85/2/ce317/resources/root/lecture%20slides/
14.%20Microsoft%20Load%20Balancing%20and%20Clustering.ppt
SQL Server 2005 High Availability
http://www.atlantamdf.com/Presentations/AtlantaMDF_111207HA.ppt
High Availability Technologies In SQL Server 2000 And SQL Server 2005
http://202.181.238.2/hk/teched2004/ppt/Day_2_Rm407/DAT431(1330-1445).ppt
Meeting the Availability Challenge
http://download.microsoft.com/download/E/D/C/EDCF54DB-19CD-4882-9FC4-
4F7D46FCEAA6/HighAvailability.ppt
Disaster Recovery Mistakes
http://www.sqlsig.org/Oct%2011%20DASSUG%20-%20Jason%20Hall%2010-11-
07%20MM.ppt
SQL Server 2005 High Availability
http://blogs.msdn.com/sql2005event/attachment/564303.ashx
Effective Usage of SQL Server 2005 Database Mirroring
http://www.sqlserver-qa.net/SSQA-
Effective%20Usage%20of%20SQL%20Server%202005%20Database%20Mirroring_show
.ppt
49. References—Articles
Achieve High Availability for SQL Server
http://technet.microsoft.com/en-us/magazine/cc162477.aspx
Geographically Dispersed Clusters in Windows
Server 2003
http://www.microsoft.com/windowsserver2003/techinfo/overview/clustergeo.
mspx
Restoring file and filegroup backups
http://support.microsoft.com/kb/281122/en-us
Restoring specific tables or rows from backups
http://support.microsoft.com/kb/321836/en-us
Maintaining Availability During Upgrades
http://msdn.microsoft.com/en-us/library/ms191449.aspx
49