If SQL Server is the heart of our environment, its health should be very important, right? And if SQL Server is important, its availability for our businesses (internal and external) is important too. Our customers, and especially our managers, do not care where the data is stored, how it is stored, or what we do with it. The data must be available on demand, on time, at the moment of the request. High Availability (HA) is our responsibility. How can we prepare our environment for HA? How is HA connected with the SLA? And why are Service Level Agreements important for us? In this session I want to discuss the HA options for SQL Server (2008, 2012), our different customers, and Service Level Agreements (formal or not).
2. SELECT {BIO}
• Polish SQL Server User Group Leader
• Microsoft Certified Trainer
• MCP, MCSA, MLSS, MLSBS, MCTS, MCITP, MCT
• SQL Server MVP since 2010
• Friends of RedGate PLUS
• PASS SQL Azure Virtual Chapter Co-Founder
• Blogger, Influencer, Technical Writer
• Last 7 years (living) in a data center in Wrocław
• Generally about 12 years in IT/banking area
• GITCA Technical Lead & Vice-Chair EMEA Board
• Speaker at SQL Server Community Launch, Time for SharePoint, CodeCamps, SharePoint Community Launch, CISSP Day, InfoTRAMS, SQLSaturday, SQLBits, CarreerCon
• Author of a few articles on TechNet (PL) and the WSS.pl portal
• Deep Dives Co-Author: High availability of SQL Server in the context of Service Level Agreements (Chapter 18)
• Working for the MS Subject Matter Expert and MS Terminology communities (Windows 7, 8 & Visual Studio 2010, 2011)
3. Agenda
• Back to school:
− What is High Availability
− What is Service Level Agreement
• Using HA in SQL Server 2008
• HA solutions in SQL Server 2008, which means: Enterprise, Enterprise
• The SLA and the DBA
• How SLA and HA depend on each other
• Case Studies
• Q&A
4. What is High Availability?
• High Availability (HA) means ensuring the continued operation of equipment and systems, usually in an enterprise production environment.
• It is designed to prevent data loss caused by:
− software bugs
− manufacturing defects
− hardware failure
− natural disasters
− human error
− other unforeseen events
7. Two kinds of monster:
PSO > USO > SLA
• PSO (Planned System Outages): planned system unavailability
− Minimal planned unavailability, needed to carry out modernization work, install patches, or replace / extend hardware
− Agreed with and accepted by the client, and not affecting the provisions of the HA and SLA, until...
• ...USO (Unplanned System Outages): unplanned system unavailability
− A failure that partially or totally prevents work in the environment, in a way that is tangible and measurable for the customer
− Results in high repair costs, as well as penalty payments for not meeting the SLA
8. Performance metrics (HA)
• What does availability of the order of 99.99% really mean?
• Availability of 99.99% means 0.01% UNAVAILABILITY in a given period (e.g. a year), which ...
• How much is that in terms of the unavailability of the server / environment / database:
Availability = MTBF / (MTBF + MTTR)
− MTBF -> Mean Time Between Failures
− MTTR -> Mean Time To Repair
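The formula above can be sketched in a few lines of Python. This is only an illustration: the function name `availability` and the sample figures (one failure every 1000 hours, one hour to repair) are assumptions made up for the example, not values from the presentation.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Return availability as a fraction: MTBF / (MTBF + MTTR)."""
    if mtbf_hours < 0 or mttr_hours < 0 or (mtbf_hours + mttr_hours) == 0:
        raise ValueError("MTBF and MTTR must be non-negative and not both zero")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures: a failure every 1000 hours, 1 hour to repair each time.
example = availability(mtbf_hours=1000.0, mttr_hours=1.0)
print(f"Availability: {example:.4%}")
```

Note that the same availability can come from rare but long outages or from frequent but short ones, which is why MTBF and MTTR are worth tracking separately.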
9. Unavailability in minutes, hours, days, weeks...
Availability %             Downtime per year   Downtime per month*   Downtime per week
90%                        36.5 days           72 hours              16.8 hours
95%                        18.25 days          36 hours              8.4 hours
98%                        7.30 days           14.4 hours            3.36 hours
99%                        3.65 days           7.20 hours            1.68 hours
99.5%                      1.83 days           3.60 hours            50.4 min
99.8%                      17.52 hours         86.4 min              20.16 min
99.9% ("three nines")      8.76 hours          43.2 min              10.1 min
99.95%                     4.38 hours          21.6 min              5.04 min
99.99% ("four nines")      52.6 min            4.32 min              1.01 min
99.999% ("five nines")     5.26 min            25.9 s                6.05 s
99.9999% ("six nines")     31.5 s              2.59 s                0.605 s
* Monthly figures assume a 30-day month.
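The figures in this table follow from simple arithmetic, which a short Python sketch can reproduce. The names `PERIOD_SECONDS` and `downtime_seconds` are made up for this example, and the period lengths (a 365-day year, a 30-day month) are assumptions chosen because they match most rows of the table.

```python
# Assumed period lengths: 365-day year, 30-day month, 7-day week.
PERIOD_SECONDS = {
    "year": 365 * 24 * 3600,
    "month": 30 * 24 * 3600,
    "week": 7 * 24 * 3600,
}


def downtime_seconds(availability_percent: float, period: str) -> float:
    """Allowed downtime in seconds for a given availability and period."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    return (1.0 - availability_percent / 100.0) * PERIOD_SECONDS[period]


for percent in (99.0, 99.9, 99.99, 99.999):
    per_year_min = downtime_seconds(percent, "year") / 60
    print(f"{percent}% -> {per_year_min:.2f} minutes of downtime per year")
```

Running the same function with "month" or "week" gives the other two columns.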
10. What is SLA?
• SLA - Service Level Agreement.
• Its origins date back to the 1980s and the agreements between operators and end customers.
• A mutually negotiated contract for the provision of services (not just IT services, but these in particular)
• It should be concluded formally, although an informal agreement is legally permissible
• It defines the level and scope of the services provided by means of measurable indicators (availability, usability, performance)
• The contract should specify a minimum and maximum level for each service it covers
11. Metrics of SLA
There is no meaningful SLA measurement WITHOUT indicators!
SAMPLE CALL CENTER / SERVICE DESK:
• ABA (Abandonment Rate): Percentage of calls abandoned while waiting for a response.
• ASA (Average Speed to Answer): Average time (usually in seconds) it takes for a call to be answered by the service desk.
• TSF (Time Service Factor): Percentage of calls answered within a defined time frame, such as 80% in 20 seconds.
• FCR (First Call Resolution): Percentage of calls where the problem was solved without having to transfer the caller to another expert.
• TAT (Turn Around Time): The time it takes to complete certain tasks.
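To make the indicators above concrete, here is a minimal Python sketch that computes ABA, ASA, TSF and FCR from a list of call records. The `Call` record, the function name and the sample data are all hypothetical, and the exact denominators are assumptions (here TSF is measured against all calls, FCR against answered calls); a real SLA would have to define them explicitly.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Call:
    wait_seconds: float          # time the caller waited before answer or hang-up
    answered: bool               # False means the caller abandoned the call
    resolved_first_call: bool = False


def service_desk_metrics(calls: List[Call], tsf_threshold_seconds: float = 20.0) -> dict:
    """Compute the sample indicators; denominators are assumptions, see lead-in."""
    if not calls:
        raise ValueError("need at least one call")
    answered = [c for c in calls if c.answered]
    in_time = sum(1 for c in answered if c.wait_seconds <= tsf_threshold_seconds)
    return {
        "ABA": (len(calls) - len(answered)) / len(calls),
        "ASA": sum(c.wait_seconds for c in answered) / len(answered) if answered else None,
        "TSF": in_time / len(calls),
        "FCR": sum(1 for c in answered if c.resolved_first_call) / len(answered) if answered else None,
    }


# Made-up sample: four answered calls and one abandoned call.
sample_calls = [
    Call(5, True, True),
    Call(10, True, True),
    Call(30, True, False),
    Call(15, True, True),
    Call(40, False),
]
print(service_desk_metrics(sample_calls))
```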
12. High Availability in SQL Server 2008
Microsoft SQL Server 2008/2008R2/2012:
• Database Mirroring
• Database Snapshots
• Windows Clustering
• SQL Server Replication
• Hot-add memory and CPU
• Online Index Operations
• Table and Index Partitioning
• Failover Clustering
• Peer-To-Peer Replication
• AlwaysOn (SQL Server 2012)
13. Solutions for HA for SQL Server
AREA                   | DATABASE MIRRORING        | FAILOVER CLUSTERING                   | TRANSACTIONAL REPLICATION | LOG SHIPPING
Data Loss              | no data loss              | no data loss                          | some data loss possible   | some data loss possible
Automatic Failover     | YES (in HA mode)          | YES                                   | no                        | no
Transparent To Client  | YES, autodirect           | YES, connect to same IP               | no, NLB helps             | no, NLB helps
Downtime               | < 3 seconds               | 20 seconds or more + time to recovery | seconds                   | seconds plus time to recovery
Standby Read Access    | Yes, with db snapshots    | no data loss                          | YES                       | (not given)
Data Granularity       | DB only                   | all systems and db's                  | table or view             | DB only
Masking of HDD failure | YES                       | No, shared disk                       | YES                       | YES
Special hardware       | NO, duplicate recommended | Cluster HCL                           | NO, duplicate recommended | NO, duplicate recommended
Complexity             | Some                      | More                                  | More                      | More
14. Why High Availability?
MSFT SLIDE
• Businesses need to work around the clock to meet customer demands
• When systems are not running, businesses are losing revenue, opportunities, customers and reputation
• High availability reduces the impact of required maintenance on day-to-day operations and helps recover quickly from disasters
• Businesses need flexibility to easily build high availability solutions that meet business and technology needs
[Diagram: features that prevent unplanned downtime and reduce planned downtime: online operations, multiple-instance clustering, Live Migration, automatic page repair with database mirroring, hot-add CPU and RAM, database snapshots, peer-to-peer replication]
15. Prevent Unplanned Downtime
MSFT SLIDE
• Multiple-Instance Database Clustering
− More than one passive node is available to host instances from multiple failovers on active nodes
− Having multiple failover nodes provides greater availability
− Multiple instances can share the same failover node, which reduces hardware costs
− Simplified setup reduces administrative costs
[Diagram: Applications & Business Logic connected to cluster nodes labeled Active, Failover, Offline, Active, Active]
Because of the critical nature of the G4S application, CASON sets up the servers in a
failover cluster to ensure high availability.
− —CASON Case Study
16. Enhanced Database Mirroring
MSFT SLIDE
• High Performance Mirroring
− Increase performance through asynchronous mirroring
• Automatic Page Repair
− Automatically detects page corruption and retrieves data from the mirror
− Reduces downtime and management costs
− Minimizes application changes to correctly handle I/O errors
• Reporting from Mirror
− Increase utilization of mirror server
− Reduce need for reporting servers
[Diagram: Applications & Business Logic, Principal, Mirror]
“This is a really powerful enhancement because prior to this… you would have to run
DBCC CHECKDB... and that would likely mean taking downtime… With SQL Server
2008 Database Mirroring you can avoid the effort and downtime.”
— Glenn Berry, Database Architect, NewsGator Technologies
17. Help Recover From User Errors
MSFT SLIDE
• Database Snapshots
− Provide a read-only static view of the database at a point in time
− Revert to a point in time before user error
− Data loss is limited to changes after the snapshot
− Run reports from a snapshot created on the mirror server to better utilize resources
[Diagram: Applications & Business Logic, Source, Snapshot]
“Database snapshots allow you to create read-only databases for reporting and can
also be useful in your data recovery efforts in the event of a disaster.”
—Tim Chapman, SQL Server Database Administrator
18. Maintain Databases Without Downtime
MSFT SLIDE
• Online Operations
− Allow routine maintenance without corresponding downtime
− Online index operations
− Online page and file restoration
− Online configuration of peer-to-peer nodes
− Users and applications can access data while the table, key, or index is being updated
[Diagram: Applications & Business Logic, Table, Index]
We recommend performing online index operations for business environments that
operate 24 hours a day, seven days a week, in which the need for concurrent user
activity during index operations is vital.
— SQL Server Books Online
19. Minimize Planned Downtime and Increase Efficiency
MSFT SLIDE
• Live Migration
− Move running instances of VMs between host servers
− Virtual machines can be moved for maintenance or to balance workload on host servers
− Perform maintenance on physical machines without any downtime
− Requires Windows Server 2008 R2 Hyper-V
[Diagram: Applications & Business Logic]
“This server already runs on our cluster solution with high
availability, but after we have tested live migration on the new
hardware, we’ll move it over to ensure optimal performance and
reliability”
20. Minimize Planned Downtime
MSFT SLIDE
• Hot-Add CPU and RAM
− Dynamically add memory and processors to servers without incurring downtime
− Requires hardware support for either physical or virtual hardware
[Diagram: Applications & Business Logic]
Hot-add CPU is the ability to dynamically add CPUs to a running system. Adding CPUs
can occur physically by adding new hardware, logically by online hardware partitioning, or
virtually through a virtualization layer.
—SQL Server Books Online
21. Access Data Seamlessly Across Servers
MSFT SLIDE
• Peer-to-Peer Replication
− Increases reliability by replicating data to multiple servers
− Provides higher availability in case of failure or to allow maintenance at any of the participating nodes
− Offers improved performance for each node with geo-scale architecture
− Add and remove servers easily without taking replication offline, by using the new topology wizard
[Diagram: Applications & Business Logic]
“[Microsoft] SQL Server 2008 replication proved to be very predictable and reliable in our testing.
This helps us to create flexible and scalable replication solutions. Reliability must be at the
foundation of all that we do.”
— Sergey Elchinsky, Leading System Engineer, Baltika Breweries
22. Database Mirroring
• Mirroring means keeping a mirror image of the data
− Involves exactly two databases (principal and mirror)
− A witness role is optional but recommended
• Requirements:
− principal, mirror: SQL Server Standard or Enterprise (asynchronous high-performance mode requires Enterprise)
− witness: can be SQL Server Express
• Availability for the database:
− A copy of the database on a different physical and/or virtual server
• Availability for the system:
− A copy of the entire environment on a different physical and/or virtual server
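23. Database Mirroring Refresher: Synchronous Mode
KEY POINT: the mirror database is an EXACT copy of the principal
[Diagram: principal (DB, Log) and mirror (DB, Log), with the numbered steps below]
(1) Commit
(2) Write to local log and transmit to mirror (both labeled 2 in the diagram)
(3) Committed in local log
(4) Write to remote log
(5) Committed in remote log
(6) Acknowledge
(7) Acknowledge
The mirror is constantly redoing the log.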
24. Hot-add memory and CPU
• SQL Server 2005 added the ability to use memory added "on the fly"
• SQL Server 2008 extends these dynamic capabilities, allowing you to hot-add CPUs
• "Hot-add" is the ability to add RAM / CPUs to the computer while it is running, and then have SQL Server use the new hardware ONLINE after a reconfiguration
• The hardware must support hot-add (of course!)
− Supported only in the Enterprise Edition running on a 64-bit version of Windows Server 2008 Datacenter / Enterprise
− SQL Server does not automatically start using the new processor / memory
− You need to run RECONFIGURE
− Queries that are already running will not use the newly added memory / processor.
25. Hot-Add CPU: Affinity Masks
• Affinity masks control which CPUs are used by SQL Server, and
for what purpose
• Any affinity masks will need to be updated after hot-adding
new CPUs
− If the affinity mask is set to non-zero, you will need to
update it so that SQL Server knows it can use the new CPUs.
− On systems with > 32 CPUs, you will need to set the
affinity64 mask to pick up the new CPUs
− If you want to use the new CPUs for IO only, you must add
the relevant bits to the affinity I/O (or affinity64 I/O) mask
• If you are asked about affinity masks:
− All zeroes means that Windows decides which CPUs are used
− Non-zero: a single bit per CPU; if the bit is 1, SQL Server will use that CPU
− The same bit cannot be set in both the affinity mask AND the affinity I/O mask
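The bit arithmetic behind these rules can be sketched in Python. This only computes the mask values; it does not talk to SQL Server. The helper names (`cpu_mask`, `split_masks`, `overlapping_bits`) are hypothetical, and the split assumes that the affinity mask covers CPUs 0 to 31 and the affinity64 mask covers CPUs 32 to 63.

```python
from typing import Iterable, Tuple


def cpu_mask(cpus: Iterable[int]) -> int:
    """Build a bitmask with one bit per CPU number (bit n set = CPU n used)."""
    mask = 0
    for cpu in cpus:
        if not 0 <= cpu < 64:
            raise ValueError("this sketch only handles CPUs 0..63")
        mask |= 1 << cpu
    return mask


def split_masks(mask: int) -> Tuple[int, int]:
    """Split into (affinity mask for CPUs 0-31, affinity64 mask for CPUs 32-63)."""
    return mask & 0xFFFFFFFF, (mask >> 32) & 0xFFFFFFFF


def overlapping_bits(affinity: int, affinity_io: int) -> int:
    """Bits set in both masks; must be 0, since a CPU cannot be in both."""
    return affinity & affinity_io


# Hypothetical scenario: CPUs 0-3 were in use, CPUs 4 and 5 are hot-added.
before = cpu_mask(range(4))
after = cpu_mask(range(6))
print(f"mask before: {before} ({before:#b}), mask after: {after} ({after:#b})")
```

A non-zero result from `overlapping_bits` would indicate a configuration that breaks the last rule above.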
26. Fast Manual Failover
• In High Security mode (synchronous mirroring without a witness), manual failover is always used
• In SQL Server 2005, in an emergency situation the database on the mirror is closed and restarted to force recovery of the uncommitted part of the transaction log
− This can greatly increase the failover time
− Consider a database with hundreds of files, all of which have to be opened to start up the database
• SQL Server 2008 removes this step, thus speeding up failover and reducing the need for an emergency shutdown
27. SEND and REDO queues
[Diagram: amount of log over time, split into log sent to the mirror and log not sent to the mirror]
• SEND Queue
− Unsent log (log not yet sent to the mirror)
− Represents possible data loss
• REDO Queue
− Log still to be redone on the mirror
− Represents failover time
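As a rough illustration of how the two queues relate to an SLA, the sketch below turns queue sizes into an exposure estimate. Everything here is an assumption for the example: the function name, the idea of dividing the REDO queue by a measured redo rate, and the sample numbers. Real failover time also depends on factors this sketch ignores.

```python
def mirroring_exposure(send_queue_kb: float, redo_queue_kb: float,
                       redo_rate_kb_per_s: float) -> dict:
    """Estimate data-loss exposure and redo time from queue sizes.

    send_queue_kb:      log not yet sent to the mirror (possible data loss)
    redo_queue_kb:      log received by the mirror but not yet redone
    redo_rate_kb_per_s: assumed, measured redo throughput on the mirror
    """
    if redo_rate_kb_per_s <= 0:
        raise ValueError("redo rate must be positive")
    return {
        "unsent_log_kb": send_queue_kb,
        "estimated_redo_seconds": redo_queue_kb / redo_rate_kb_per_s,
    }


# Made-up numbers: 2 MB unsent, 50,000 KB waiting for redo at 10,000 KB/s.
print(mirroring_exposure(send_queue_kb=2048, redo_queue_kb=50_000,
                         redo_rate_kb_per_s=10_000))
```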
28. Peer-to-Peer Topology (?)
• SQL Server 2005 introduced peer-to-peer (or "two-way") Transactional Replication
− A great way to scale out the resources needed for the workload
− Partly also a way to have a redundant copy
• One major drawback: changing the peer-to-peer topology required stopping ALL activity on the servers in the topology
• In SQL Server 2008,
− these restrictions have been removed (in most cases),
− the peer-to-peer setup wizard in SSMS has also been upgraded
− partition switching can be replicated
29. Topology Wizard
• The wizard is now graphical, with drag-and-drop functionality for making topology connections
30. SQL Server 2012 & AlwaysOn | marketing
• Help reduce planned and unplanned downtime with the new
integrated high availability and disaster recovery solution, SQL Server
AlwaysOn.
• Simplify deployment and management of HA requirements using
integrated configuration and monitoring tools.
• Improve IT cost efficiency and performance using Active Secondary.
• Reduce planned downtime with Windows Server Core.
31. SQL Server 2012 & AlwaysOn | technical
AlwaysOn Failover Cluster Instances
As part of the SQL Server AlwaysOn offering, AlwaysOn Failover Cluster Instances leverages Windows Server Failover
Clustering (WSFC) functionality to provide local high availability through redundancy at the server-instance
level—a failover cluster instance (FCI). An FCI is a single instance of SQL Server that is installed across
Windows Server Failover Clustering (WSFC) nodes and, possibly, across multiple subnets. On the network, an
FCI appears to be an instance of SQL Server running on a single computer, but the FCI provides failover from
one WSFC node to another if the current node becomes unavailable.
AlwaysOn Availability Groups
AlwaysOn Availability Groups is an enterprise-level high-availability and disaster recovery solution introduced
in SQL Server 2012 to enable you to maximize availability for one or more user databases. AlwaysOn
Availability Groups requires that the SQL Server instances reside on Windows Server Failover Clustering
(WSFC) nodes.
Database mirroring
Avoid using this feature in new development work, and plan to modify applications that currently use this
feature. We recommend that you use AlwaysOn Availability Groups instead. Database mirroring is a solution
to increase database availability by supporting almost instantaneous failover. Database mirroring can be used
to maintain a single standby database, or mirror database, for a corresponding production database that is
referred to as the principal database. For more information, see Database Mirroring (SQL Server).
Log shipping
Like AlwaysOn Availability Groups and database mirroring, log shipping operates at the database level. You
can use log shipping to maintain one or more warm standby databases (referred to as secondary databases)
for a single production database that is referred to as the primary database. For more information about log
shipping, see About Log Shipping (SQL Server).
33. SLA - what does this have to do with the DBA?
• Production hours:
− Hours in which the partition / table / database must be available
− May be different for different parts of a database, for example, depending on the application
• Percentage of service uptime:
− The percentage of a given time range during which the service / partition / table / database is available
• Hours reserved for downtime:
− Downtime hours (maintenance windows) announced in advance make the users' work easier
• Customer support methods:
− Help desk response time
− DBA response time to an event
34. SLA - what does this have to do with the DBA?
• Number of users on the system
− Number of transactions processed per unit of time
• Acceptable performance levels for the various operations
− Minimum time required to replicate between the different servers
• Time limit for recovering data after a failure:
− Accidental deletion of data
− Database corruption
− SQL Server crash
− Server OS crash
• Time it takes to access the data online (e.g. reading / writing the sales table) so that sales can continue
− Maximum amount of space
− Maximum number of tables / databases
− Number of users in specific roles
35. Why is the SLA so important?
• In fact, it's more than just a signed agreement between the client and your boss.
• It is also a contract that YOU need to meet
• If the signed agreement promises zero downtime and zero data loss (an abstraction?), then you need to make sure you can fulfill this contract even in the event of corruption (data changed / deleted on purpose by an authorized user).
• If you cannot meet the SLA, the business is exposed to downtime and data loss
• The end result is that you submit your CV to a recruitment agency ...
36. Do you think you can meet your Service Level Agreement?
• You need to know the conditions / requirements of the SLA if you are to meet them
• How can you meet it if you do not know that an SLA exists?
• How can you review the contract if nobody invited you to the meeting where the Service Level Agreement was created?
• The end result is that you submit your CV to a recruitment agency ...
37. Do you think you can meet your SLA?
• The recovery plan looks great on paper, but have you ever tested it?
• Suppose this situation:
− We allow 15 minutes of unavailability for a database of 100 GB.
− We are able to substitute a copy of the user database within those 15 minutes
− What will you do if the database is damaged?
− What will you do in the event of a disk failure?
− What will you do if the motherboard burns out?
− What will you do if the FC cable is cut?
− How much time will it take to recover from a backup?
− How much time will it take to bring a backup tape from a second location 25 kilometers away, through the city center at 14:00?
Do you still meet the SLA of 15 minutes of downtime?
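A back-of-the-envelope check of this scenario can be sketched in Python. The 100 GB size, the 25 km distance and the 15-minute limit come from the slide; the restore throughput (200 MB/s) and the average driving speed (30 km/h) are pure assumptions for illustration, as are the function names. Only a tested restore gives real numbers.

```python
def restore_minutes(db_size_gb: float, throughput_mb_per_s: float) -> float:
    """Time to restore a backup of the given size at an assumed throughput."""
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    return db_size_gb * 1024 / throughput_mb_per_s / 60


def tape_transport_minutes(distance_km: float, avg_speed_kmh: float) -> float:
    """Time to bring a tape from the second location at an assumed speed."""
    if avg_speed_kmh <= 0:
        raise ValueError("speed must be positive")
    return distance_km / avg_speed_kmh * 60


def meets_sla(total_minutes: float, sla_minutes: float) -> bool:
    return total_minutes <= sla_minutes


# Slide figures: 100 GB database, tape 25 km away, 15 minutes allowed.
# Assumed figures: 200 MB/s restore throughput, 30 km/h in city traffic.
restore = restore_minutes(100, 200)
transport = tape_transport_minutes(25, 30)
total = restore + transport
print(f"restore {restore:.1f} min + transport {transport:.1f} min = {total:.1f} min")
print("SLA met" if meets_sla(total, 15) else "SLA missed")
```

Under these assumptions the restore alone would fit into the window, but fetching the tape would not, which is the point the slide is making.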
39. Summary
• You need to know that the SLA exists
• You must take part in creating the Service Level Agreement (requirements / features / technology)
• You need to have contingency plans - TESTED
• You must know your responsibilities
• You must be technically able to meet the SLA