
Defining availability for an IT service
Stuart Rance / November 2012
Twitter: @StuartRance
Email: stuart.rance@hp.com

Agenda

Service Warranty

Traditional view of Availability

End-to-end services and SLAs
Outage Frequency and Duration
Number of users affected
Critical business functions
Poor performance
Planned downtime
Measurement periods

How to measure availability
© Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

Service Warranty


Service Value Comes From…

Service Utility
What does the service do?
Functional requirements
Features, inputs, outputs…
“fit for purpose”

Service Warranty
How well does the service do it?
Non-functional requirements
Capacity, performance, availability, security, continuity…
“fit for use”


Service Warranty and Risks

[Chart: risks plotted by impact (low to high) against frequency (low to high). Risks shown, from the high-impact end of the chart to the low-impact end:]
natural disaster - fire, flood, adverse weather
man-made disaster - terrorism, malicious damage
security breach - hacker
denial of service attack
virus attack
internal security/fraud
insufficient capacity
data corruption
configuration issues
software failure
power/network failure
hardware failure
application error
planned downtime

Service Warranty and Risks

[The same impact/frequency chart, with overlays grouping the risks into four areas of warranty: Continuity, Security, Capacity, Availability]

Traditional View of Availability


Traditional View of Availability

Percentage Availability    Annual Downtime
99%                        87.6 hours (3½ days)
99.5%                      43.8 hours
99.9%                      8.8 hours
99.95%                     4.4 hours
99.99%                     53 minutes
99.999%                    5.3 minutes
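
The downtime figures follow directly from the percentages. A minimal Python sketch, assuming a 24x7 service and a 365-day year (8,760 hours); the function name `annual_downtime_hours` is illustrative and not part of the original slides:

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; assumes a 24x7 service and a non-leap year

def annual_downtime_hours(availability_percent):
    """Downtime per year permitted by a given percentage availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)

for target in (99, 99.5, 99.9, 99.95, 99.99, 99.999):
    hours = annual_downtime_hours(target)
    print(f"{target}%: {hours:.2f} hours ({hours * 60:.1f} minutes)")
```

A service with shorter agreed hours would have a smaller downtime allowance for the same percentage.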


The Traditional Calculation

Availability (%) = (AST – DT) / AST x 100

AST = Agreed Service Time
DT = Downtime
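
The traditional calculation subtracts downtime from the agreed service time and divides by the agreed service time. A small sketch of that arithmetic, where the function name and the example figures are assumptions chosen for illustration:

```python
def availability_percent(agreed_service_time, downtime):
    """Traditional availability: (AST - DT) / AST, expressed as a percentage.

    Both arguments must use the same unit (hours in the example below).
    """
    if agreed_service_time <= 0:
        raise ValueError("agreed service time must be positive")
    return (agreed_service_time - downtime) / agreed_service_time * 100

# Hypothetical example: a 24x7 service (8,760 hours a year) with 8 hours of downtime.
print(round(availability_percent(8760, 8), 2))  # 99.91
```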


What’s Wrong with Tradition?

What if some locations are OK and others aren’t?

What if some users are OK and others aren’t?

What if some operations work and others don’t?

What if the service is so slow that it is unusable?

What if there are frequent 5 second outages?

What are we actually measuring and reporting?


End-to-end Services and SLAs


Where to Measure Availability?

[Diagram: service components: Database, Network, Server, Desktop]

As Seen by the Customer / User…


Service Level Agreements

An SLA documents what has been agreed
From the perspective of the users and customers

Contents should include
Availability definitions
Targets
Measurement and reporting
Penalties

Every goal in an SLA must be SMART
Specific, Measurable, Achievable, Relevant, Time-based



MTBF = Mean Time Between Failures
MTBSi = Mean Time Between System Incidents
MTRS = Mean Time to Restore Service

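[Diagram: a timeline alternating between Up and Down. TRS (time to restore service) is the length of each Down period; TBF (time between failures) runs from the end of one Down period to the start of the next; TBSi (time between system incidents) runs from the start of one Down period to the start of the next.]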
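
The three measures can be derived from a simple incident log. The sketch below is illustrative only: the log format, the function name `incident_metrics`, and the sample data are assumptions, and incidents are taken to be sorted and non-overlapping.

```python
def incident_metrics(incidents):
    """Return (MTBF, MTBSI, MTRS) in hours.

    incidents: list of (failed_at, restored_at) pairs, in hours since the start
    of the measurement period, sorted by failed_at and non-overlapping.
    """
    if len(incidents) < 2:
        raise ValueError("need at least two incidents to measure time between them")
    pairs = list(zip(incidents, incidents[1:]))
    trs = [restored - failed for failed, restored in incidents]  # downtime per incident
    tbf = [nxt[0] - cur[1] for cur, nxt in pairs]   # restoration to next failure
    tbsi = [nxt[0] - cur[0] for cur, nxt in pairs]  # failure to next failure

    def mean(values):
        return sum(values) / len(values)

    return mean(tbf), mean(tbsi), mean(trs)

# Hypothetical incident log.
log = [(10, 12), (50, 51), (100, 103)]
mtbf, mtbsi, mtrs = incident_metrics(log)
print(mtbf, mtbsi, mtrs)  # 43.5 45.0 2.0
```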



Which of these is better?

Few long outages: MTBF = 19 days, MTTR = 1 day, Availability = 95%

Many short outages: MTBF = 22.8 hrs, MTTR = 1.2 hrs, Availability = 95%
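
Both patterns give the same percentage because availability can be computed as MTBF / (MTBF + MTTR). The sketch below reproduces the two figures and adds the number of outages per year, which is where the patterns differ; the function names are illustrative and a 365-day year is assumed.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: 24x7 service, non-leap year

def availability_from_mtbf(mtbf_hours, mttr_hours):
    """Availability (%) as uptime divided by the full failure-and-repair cycle."""
    return mtbf_hours / (mtbf_hours + mttr_hours) * 100

def outages_per_year(mtbf_hours, mttr_hours):
    """Approximate number of outages in a year for a repeating up/down cycle."""
    return HOURS_PER_YEAR / (mtbf_hours + mttr_hours)

long_rare = (19 * 24, 1 * 24)   # MTBF = 19 days, MTTR = 1 day
short_frequent = (22.8, 1.2)    # MTBF = 22.8 hrs, MTTR = 1.2 hrs

for label, (mtbf, mttr) in (("long, rare", long_rare), ("short, frequent", short_frequent)):
    print(f"{label}: {availability_from_mtbf(mtbf, mttr):.1f}% available, "
          f"{outages_per_year(mtbf, mttr):.2f} outages per year")
```

Which pattern is better depends on the business: one full day lost every few weeks, or more than an hour lost every day.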

Failover Events

How long does a failover take?
Between cluster members?
When a RAID disk fails?
When a network link fails?

Does failover have a business impact?
Do transactions have to be restarted?
What is the longest “short” outage that can be ignored?

What if the cluster continuously fails over?
What is the maximum frequency of these types of event?


Summary
Agree availability in terms of
Frequency of incidents
Duration of incidents

Agree failover events which won’t be counted
Frequency
Duration
Impact


An Agreement with the Business

Outage duration and frequency must be agreed
In terms that the business understands
With metrics that support the business mission

What might such an agreement look like?


Example Agreement

Outage Duration            Maximum Frequency
Up to 2 minutes            1 event in any hour
                           3 events in any day
                           5 events in any week
2 minutes to 30 minutes    1 event in any month
                           2 events in any quarter
30 minutes to 4 hours      1 event in any year

Maximum Annual Downtime
4 hours + (8 * 30 mins) = 8 hours
Availability = (8760 – 8) / 8760 = 99.9%
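
Limits phrased as "N events in any hour" suggest a rolling window rather than fixed clock periods. The sketch below checks a list of outage start times against such limits. It assumes that "any hour" means any rolling 60-minute window; the function name and the sample outages are hypothetical.

```python
from datetime import datetime, timedelta

def max_events_in_any_window(start_times, window):
    """Largest number of outage start times falling within any rolling window."""
    if window <= timedelta(0):
        raise ValueError("window must be positive")
    times = sorted(start_times)
    best = 0
    left = 0
    for right, current in enumerate(times):
        # Advance the left edge until the span is shorter than the window.
        while current - times[left] >= window:
            left += 1
        best = max(best, right - left + 1)
    return best

# Hypothetical short outages (each under 2 minutes) on one day.
short_outages = [
    datetime(2012, 11, 5, 10, 0),
    datetime(2012, 11, 5, 10, 30),
    datetime(2012, 11, 5, 15, 0),
]
per_hour = max_events_in_any_window(short_outages, timedelta(hours=1))
per_day = max_events_in_any_window(short_outages, timedelta(days=1))
print("hourly limit met:", per_hour <= 1)  # two events fall within one hour
print("daily limit met:", per_day <= 3)
```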


Number of Users Affected

Most failures do not cause complete loss of service

Typical scenario
Some users have no service at all
Other users completely unaffected

Extreme cases
Only one user is affected
Only one user is able to work!

Should these count as downtime or not?


User Outage Minutes

Potential User Minutes = Number of users * Agreed service time

User Outage Minutes = Number of affected users * Downtime


Potential User Minutes

Not every minute is equal
Day and time             Potential no. of users    Weekly PotentialUserMinutes
Mon – Fri 00:00-07:00    500                       5 x 7 x 60 x 500 = 1,050,000
Mon – Fri 07:00-09:00    2,500                     5 x 2 x 60 x 2500 = 1,500,000
Mon – Fri 09:00-18:00    5,000                     5 x 9 x 60 x 5000 = 13,500,000
Mon – Fri 18:00-21:00    1,000                     5 x 3 x 60 x 1000 = 900,000
Mon – Fri 21:00-00:00    500                       5 x 3 x 60 x 500 = 450,000
Sat – Sun                500                       2 x 24 x 60 x 500 = 1,440,000
WEEKLY TOTAL                                       18,840,000
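
The weekly total can be reproduced from the time bands. A short sketch using the user counts from the table; the data layout and function name are illustrative choices, not part of the original slides:

```python
# (time band, days per week, hours per day, potential number of users)
BANDS = [
    ("Mon-Fri 00:00-07:00", 5, 7, 500),
    ("Mon-Fri 07:00-09:00", 5, 2, 2500),
    ("Mon-Fri 09:00-18:00", 5, 9, 5000),
    ("Mon-Fri 18:00-21:00", 5, 3, 1000),
    ("Mon-Fri 21:00-00:00", 5, 3, 500),
    ("Sat-Sun", 2, 24, 500),
]

def potential_user_minutes(days, hours, users):
    """Minutes in the band multiplied by the number of potential users."""
    return days * hours * 60 * users

weekly_total = sum(potential_user_minutes(d, h, u) for _, d, h, u in BANDS)
print(f"{weekly_total:,}")  # 18,840,000
```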

User Outage Minutes Example

Lost email service
to 500 users
for 2 hours
on a Monday morning at 10:00
UserOutageMinutes = 500 * 2 * 60 = 60,000

Using data from previous slide
PotentialUserMinutes for the week = 18,840,000
Availability = (18,840,000 – 60,000) / 18,840,000 = 99.68%
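
The same example as a small calculation. The function name is an assumption; the figures are those used in the example:

```python
def user_weighted_availability(potential_user_minutes, user_outage_minutes):
    """Availability (%) weighted by the number of users affected."""
    remaining = potential_user_minutes - user_outage_minutes
    return remaining / potential_user_minutes * 100

user_outage_minutes = 500 * 2 * 60  # 500 users without email for 2 hours
print(round(user_weighted_availability(18_840_000, user_outage_minutes), 2))  # 99.68
```

For comparison, counting the same 2 hours as a full outage over a 168-hour week would give about 98.8%.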


What if there aren’t users?

Transaction based system

Manufacturing system

etc.


Critical Business Functions

Some failures only affect part of a service
ATMs can dispense money but not print statements
Can browse old emails but can’t send or receive
Reservation system can see bookings but not make new ones

It is up to the business to define the relative importance of each type of transaction

You can use transaction weightings to modify availability figures


Example Transaction Weightings

IT function that is not available               % Service Impact
Sending email                                   100%
Receiving email                                 100%
Using shared distribution list to send email    10%
Updating shared distribution lists              5%
Accessing shared calendars                      30%
Updating shared calendars                       10%

Why don’t these add up to 100%?
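
One way to apply weightings like these is to scale each partial outage by its service impact before subtracting it from the agreed service time. This is a sketch of that idea, not a method stated in the slides: the function name, the sample incidents, and the 30-day 24x7 month are assumptions, and overlapping partial outages are not handled.

```python
SERVICE_IMPACT = {
    "Sending email": 1.00,
    "Receiving email": 1.00,
    "Using shared distribution list to send email": 0.10,
    "Updating shared distribution lists": 0.05,
    "Accessing shared calendars": 0.30,
    "Updating shared calendars": 0.10,
}

def weighted_downtime_minutes(incidents):
    """incidents: list of (unavailable IT function, outage minutes)."""
    return sum(minutes * SERVICE_IMPACT[function] for function, minutes in incidents)

# Hypothetical month with two partial failures.
incidents = [("Accessing shared calendars", 120), ("Sending email", 30)]
agreed_service_minutes = 30 * 24 * 60  # assumption: 30-day month, 24x7 service
downtime = weighted_downtime_minutes(incidents)
availability = (agreed_service_minutes - downtime) / agreed_service_minutes * 100
print(f"weighted downtime: {downtime:.0f} minutes, availability: {availability:.2f}%")
```

Each weighting is an independent statement of how much of the service is lost when that one function fails, which is why the column is not a set of shares that must total 100%.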


What About Poor Performance?

Most SLAs have performance targets

What if performance is SO SLOW that the service can’t be used?
Some SLAs count this as downtime
Others count it separately, with its own penalties
The important thing is to discuss, agree, and document

IT can only agree performance if the customer agrees a maximum workload
It is the job of the business to forecast the work, not IT


Example Performance Agreement

IT function                Required response time (when service is available)
Login                      99% within 5 seconds
                           99.9% within 15 seconds
Seat availability check
Seat booking
Check in
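
Targets of the form "99% within 5 seconds" can be checked against measured response times. The sketch below reads the two targets as applying to Login, which is an interpretation of the table; the function names and sample measurements are hypothetical.

```python
def fraction_within(response_times, limit_seconds):
    """Share of measured response times at or under the limit."""
    if not response_times:
        raise ValueError("no response times measured")
    return sum(1 for t in response_times if t <= limit_seconds) / len(response_times)

def meets_targets(response_times, targets):
    """targets: list of (required fraction, limit in seconds); all must be met."""
    return all(
        fraction_within(response_times, limit) >= required
        for required, limit in targets
    )

LOGIN_TARGETS = [(0.99, 5), (0.999, 15)]

# Hypothetical measurements, in seconds.
good_sample = [1.2] * 992 + [8.0] * 8    # 99.2% within 5 s, 100% within 15 s
slow_sample = [1.2] * 980 + [8.0] * 20   # only 98% within 5 s
print(meets_targets(good_sample, LOGIN_TARGETS))  # True
print(meets_targets(slow_sample, LOGIN_TARGETS))  # False
```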


Planned Downtime

What effect does a planned outage have on availability?


If a planned outage falls within an agreed service window then it isn’t downtime
Some SLAs specify when maintenance will happen
Some SLAs allow additional planned downtime with sufficient notice


Measuring Availability


Measurement Period

Remember that Availability is defined as

Availability (%) = (AST – DT) / AST x 100

AST = Agreed Service Time
DT = Downtime

What time period should we use for the agreed service time?


Measurement Period

Availability after a single 8 hour incident

Weekly

Monthly

Quarterly

Annual
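
The effect of the measurement period can be worked through directly. This sketch assumes a 24x7 service, a 30-day month, and a 91-day quarter; those period lengths are assumptions, not figures taken from the slide.

```python
PERIOD_HOURS = {
    "Weekly": 7 * 24,
    "Monthly": 30 * 24,     # assumes a 30-day month
    "Quarterly": 91 * 24,   # assumes a 91-day quarter
    "Annual": 365 * 24,
}

def availability_percent(period_hours, downtime_hours):
    """(AST - DT) / AST as a percentage, with AST equal to the whole period."""
    return (period_hours - downtime_hours) / period_hours * 100

for period, hours in PERIOD_HOURS.items():
    print(f"{period}: {availability_percent(hours, 8):.2f}%")
```

Under these assumptions the same 8 hour incident gives roughly 95.24% measured weekly, 98.89% monthly, 99.63% quarterly, and 99.91% annually.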

Measuring Availability

You have a good definition of Availability
It is Specific about what will be delivered
It is Achievable
It is Relevant to the service you deliver
It is defined over a clear Time period

So what have we forgotten?
A definition is of no use at all if you can’t Measure it


How can you Measure Availability?

Service Desk Records
Fairly easy to implement, inexpensive
Can lead to disputes about accuracy of data

Instrument all components and calculate
Difficult to implement, expensive
May fail to detect complex or subtle failures

Use dummy transactions / clients to simulate
Actually measures end-to-end availability
May miss complex or subtle failures

Instrument applications to report end-to-end availability
Actually measures end-to-end availability
Must be included in the early stages of application design
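
The dummy transaction approach can be sketched as a probe that runs a test transaction repeatedly and records whether it succeeded. Everything here is hypothetical: the function names, the stub transaction, and the assumption that probes are evenly spaced so that the success rate approximates availability over time. A real probe would call the actual service end to end and wait between attempts.

```python
def run_probes(transaction, attempts):
    """Run a dummy transaction repeatedly; any exception counts as a failure."""
    results = []
    for _ in range(attempts):
        try:
            results.append(bool(transaction()))
        except Exception:
            results.append(False)
    return results

def availability_from_probes(results):
    """Percentage of probes that succeeded."""
    if not results:
        raise ValueError("no probe results")
    return sum(results) / len(results) * 100

# Stub standing in for a real end-to-end transaction (for example, a test login).
outcomes = iter([True, True, False, True])

def stub_transaction():
    return next(outcomes)

results = run_probes(stub_transaction, 4)
print(availability_from_probes(results))  # 75.0
```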

Summary

“How many 9s” is not good enough

Must account for
End-to-end service availability
Number and duration of outages
Number of users or transactions affected by incidents
Criticality of business functions affected by incidents
Performance of critical functions
Planned downtime
Agreed measurement period
Agreed measurement process

Everything must be documented in an SLA
Using SMART metrics


Thank you

Twitter: @StuartRance
Email: stuart.rance@hp.com

