Defining availability for
an IT service
Stuart Rance / November 2012
Twitter:   @StuartRance
Email:     stuart.rance@hp.com
Agenda

Service Warranty

Traditional view of Availability

End-to-end services and SLAs
Outage Frequency and Duration
Number of users affected
Critical business functions
Poor performance
Planned downtime
Measurement periods


How to measure availability
© Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
Service Warranty




Service Value Comes From…

Service Utility
What does the service do?
Functional requirements
Features, inputs, outputs…
“fit for purpose”


Service Warranty
How well does the service do it?
Non-functional requirements
Capacity, performance, availability, security, continuity…
“fit for use”




Service Warranty and Risks

[Figure: risks plotted by impact (vertical axis, low to high) against frequency (horizontal axis, low to high)]
Upper part of the chart (higher impact):
natural disaster (fire, flood, adverse weather)
man-made disaster (terrorism, malicious damage)
security breach (hacker)
denial of service attack
virus attack
internal security/fraud
Lower part of the chart (lower impact):
insufficient capacity
data corruption
configuration issues
software failure
power/network failure
hardware failure
application error
planned downtime
Service Warranty and Risks

[Figure: the impact against frequency risk chart repeated with four labels added: Continuity beside the disaster risks, Security beside the breach and attack risks, Capacity beside insufficient capacity, and Availability beside the failure risks]
Traditional View of
Availability




Traditional View of Availability



Percentage Availability    Annual Downtime
99%                        87.6 hours (3½ days)
99.5%                      43.8 hours
99.9%                      8.8 hours
99.95%                     4.4 hours
99.99%                     53 minutes
99.999%                    5.3 minutes
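
The downtime figures in this table come from multiplying the hours in a year by the unavailable fraction. A minimal Python sketch, assuming a 365-day year of 24x7 service (8,760 hours); the function name is illustrative and not part of the original slides:

```python
HOURS_PER_YEAR = 365 * 24  # 8760; assumes a non-leap year and 24x7 agreed service time


def annual_downtime_hours(availability_percent):
    """Hours of downtime per year allowed by a percentage availability target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for target in (99, 99.5, 99.9, 99.95, 99.99, 99.999):
    hours = annual_downtime_hours(target)
    # Report short allowances in minutes, as the table does.
    if hours >= 1:
        print(f"{target}%  {hours:.1f} hours")
    else:
        print(f"{target}%  {hours * 60:.1f} minutes")
```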



The Traditional Calculation




Availability (%) = (AST – DT) / AST × 100

AST = Agreed Service Time
DT = Downtime
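
As a sketch, the traditional calculation can be written as a small Python function. It assumes the usual form, (AST - DT) / AST expressed as a percentage, which matches the arithmetic used later in the deck ((8760 - 8) / 8760). The function name and the input checks are illustrative.

```python
def availability_percent(agreed_service_time, downtime):
    """Traditional availability: (AST - DT) / AST, as a percentage.

    Both arguments must use the same unit (for example hours).
    """
    if agreed_service_time <= 0:
        raise ValueError("agreed service time must be positive")
    if not 0 <= downtime <= agreed_service_time:
        raise ValueError("downtime must be between 0 and the agreed service time")
    return (agreed_service_time - downtime) / agreed_service_time * 100


# Example: 8 hours of downtime against a 24x7 year of 8760 hours.
print(round(availability_percent(8760, 8), 2))
```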




What’s Wrong with Tradition?

What if some locations are OK and others aren’t?

What if some users are OK and others aren’t?

What if some operations work and others don’t?

What if the service is so slow that it is unusable?

What if there are frequent 5 second outages?

What are we actually measuring and reporting?

End-to-end Services and
SLAs




Where to Measure Availability?




[Figure: the components behind a service: Database, Network, Server, Desktop]
As Seen by the Customer / User…




Service Level Agreements

An SLA documents what has been agreed
From the perspective of the users and customers


Contents should include
Availability definitions
Targets
Measurement and reporting
Penalties


Every goal in an SLA must be SMART
Specific, Measurable, Achievable, Relevant, Time-based



Outage Frequency and Duration

MTBF = Mean Time Between Failures
MTBSi = Mean Time Between System Incidents
MTRS = Mean Time to Restore Service

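[Figure: up/down timeline of a service. TBSi runs from the start of one incident to the start of the next, TBF covers the up time between incidents, and TRS covers each down period.]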
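
A minimal Python sketch of how the three means could be derived from an incident log. It assumes each incident is recorded as a (failure time, restore time) pair in hours, that incidents are sorted and do not overlap, and that TBF is measured from one restoration to the next failure while TBSi is measured from one failure to the next. The function name and the data layout are illustrative.

```python
def outage_metrics(incidents):
    """incidents: sorted, non-overlapping list of (failure_time, restore_time) in hours.

    Returns (mtbf, mtbsi, mtrs) in hours.
    """
    if len(incidents) < 2:
        raise ValueError("need at least two incidents to measure time between them")

    # TRS: how long each incident lasted.
    restore_times = [restore - fail for fail, restore in incidents]
    # TBF: up time from one restoration to the next failure.
    between_failures = [
        incidents[i + 1][0] - incidents[i][1] for i in range(len(incidents) - 1)
    ]
    # TBSi: from the start of one incident to the start of the next.
    between_incidents = [
        incidents[i + 1][0] - incidents[i][0] for i in range(len(incidents) - 1)
    ]

    def mean(values):
        return sum(values) / len(values)

    return mean(between_failures), mean(between_incidents), mean(restore_times)


# Three incidents, with times in hours from the start of the measurement period.
print(outage_metrics([(10, 12), (50, 51), (100, 103)]))
```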

Outage Frequency and Duration

Which of these is better?

[Figure: two up/down timelines]
Timeline 1: MTBF = 19 days, MTTR = 1 day, Availability = 95%
Timeline 2: MTBF = 22.8 hrs, MTTR = 1.2 hrs, Availability = 95%
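
The two cases report the same percentage but a very different number of outages. A short Python sketch of the arithmetic, assuming availability = MTBF / (MTBF + MTTR) and a 365-day year; the function names are illustrative:

```python
def availability_from_mtbf(mtbf, mttr):
    """Availability in percent from mean up time and mean repair time (same unit)."""
    return mtbf / (mtbf + mttr) * 100


def outages_per_year(mtbf_hours, mttr_hours):
    """Approximate number of outages in a 365-day year of 24x7 service."""
    return 365 * 24 / (mtbf_hours + mttr_hours)


cases = (
    ("long, rare outages", 19 * 24, 24),      # MTBF 19 days, MTTR 1 day
    ("short, frequent outages", 22.8, 1.2),   # MTBF 22.8 hrs, MTTR 1.2 hrs
)
for label, mtbf, mttr in cases:
    print(
        label,
        round(availability_from_mtbf(mtbf, mttr), 1),
        round(outages_per_year(mtbf, mttr), 2),
    )
```

Both cases come out at 95%, yet the first implies roughly 18 day-long outages a year and the second roughly one outage every day, which is why frequency and duration need to be agreed separately.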
Failover Events

How long does a failover take?
Between cluster members?
When a RAID disk fails?
When a network link fails?


Does failover have a business impact?
Do transactions have to be restarted?
What is the longest “short” outage that can be ignored?


What if the cluster continuously fails over?
What is the maximum frequency of these types of event?



Outage Frequency and Duration
Summary
Agree availability in terms of
Frequency of incidents
Duration of incidents


Agree failover events which won’t be counted
Frequency
Duration
Impact




An Agreement with the Business

Outage duration and frequency must be agreed
In terms that the business understands
With metrics that support the business mission

What might such an agreement look like?




Example Agreement


     Outage Duration              Maximum Frequency
     Up to 2 minutes              1 event in any hour
                                  3 events in any day
                                  5 events in any week
     2 minutes to 30 minutes      1 event in any month
                                  2 events in any quarter
     30 minutes to 4 hours        1 event in any year



     Maximum Annual Downtime
     4 hours + (8 * 30 mins) = 8 hours
     (8 = 2 events per quarter * 4 quarters; outages of up to 2 minutes are not included in this total)
     Availability = (8760 – 8) / 8760 = 99.9%
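
An agreement of this shape can be checked mechanically against an outage log. The Python sketch below is one possible approach, not the method used in the slides: it sorts outages into the three duration bands and slides a window over each band. Treating a month as 30 days, a quarter as 91 days and a year as 365 days is an assumption, as are all names in the code.

```python
from datetime import datetime, timedelta

# Each band: (longest outage in the band, [(maximum events, window length), ...]).
# Month = 30 days, quarter = 91 days and year = 365 days are assumptions.
BANDS = [
    (timedelta(minutes=2),
     [(1, timedelta(hours=1)), (3, timedelta(days=1)), (5, timedelta(weeks=1))]),
    (timedelta(minutes=30),
     [(1, timedelta(days=30)), (2, timedelta(days=91))]),
    (timedelta(hours=4),
     [(1, timedelta(days=365))]),
]


def find_breaches(outages):
    """outages: iterable of (start: datetime, duration: timedelta).

    Returns a list of breach descriptions (empty if the log is compliant).
    """
    breaches = []
    starts_by_band = [[] for _ in BANDS]
    for start, duration in sorted(outages):
        for band, (longest, _rules) in enumerate(BANDS):
            if duration <= longest:
                starts_by_band[band].append(start)
                break
        else:
            breaches.append(f"{start}: outage of {duration} exceeds every agreed band")

    for band, starts in enumerate(starts_by_band):
        for max_events, window in BANDS[band][1]:
            # Open a window at each outage and count the outages that fall inside it.
            for i, opened in enumerate(starts):
                count = sum(1 for s in starts[i:] if s - opened < window)
                if count > max_events:
                    breaches.append(
                        f"band {band}: {count} events within {window} "
                        f"of {opened} (limit {max_events})"
                    )
                    break  # one report per rule is enough
    return breaches


t0 = datetime(2012, 11, 1, 9, 0)
log = [(t0, timedelta(minutes=10)), (t0 + timedelta(days=5), timedelta(minutes=20))]
for breach in find_breaches(log):
    print(breach)
```

In this usage example the two outages of 10 and 20 minutes fall five days apart, so they break the "1 event in any month" rule while staying within "2 events in any quarter".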

Number of Users Affected

Most failures do not cause complete loss of service

Typical scenario
Some users have no service at all
Other users completely unaffected


Extreme cases
Only one user is affected
Only one user is able to work!


Should these count as downtime or not?


User Outage Minutes




Potential User Minutes = Number of users * Agreed service time


User Outage Minutes = Number of affected users * Downtime




Potential User Minutes

Not every minute is equal
     Day and time             Potential no. of users    Weekly PotentialUserMinutes
     Mon – Fri 00:00-07:00    500                       5 x 7 x 60 x 500  = 1,050,000
     Mon – Fri 07:00-09:00    2,500                     5 x 2 x 60 x 2500 = 1,500,000
     Mon – Fri 09:00-18:00    5,000                     5 x 9 x 60 x 5000 = 13,500,000
     Mon – Fri 18:00-21:00    1,000                     5 x 3 x 60 x 1000 = 900,000
     Mon – Fri 21:00-00:00    500                       5 x 3 x 60 x 500  = 450,000
     Sat – Sun                500                       2 x 24 x 60 x 500 = 1,440,000
     WEEKLY TOTAL                                       18,840,000
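
The weekly total can be reproduced with a few lines of Python. This is a sketch using the periods and user counts from the table; the names are illustrative:

```python
# (label, days per week, hours per day, potential number of users)
PERIODS = [
    ("Mon-Fri 00:00-07:00", 5, 7, 500),
    ("Mon-Fri 07:00-09:00", 5, 2, 2500),
    ("Mon-Fri 09:00-18:00", 5, 9, 5000),
    ("Mon-Fri 18:00-21:00", 5, 3, 1000),
    ("Mon-Fri 21:00-00:00", 5, 3, 500),
    ("Sat-Sun", 2, 24, 500),
]


def potential_user_minutes(days, hours, users):
    """Minutes of service that could have been used in one period of the week."""
    return days * hours * 60 * users


weekly_total = sum(potential_user_minutes(d, h, u) for _label, d, h, u in PERIODS)
print(weekly_total)
```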
User Outage Minutes Example

Lost email service to 500 users for 2 hours on a Monday morning at 10:00
UserOutageMinutes = 500 * 2 * 60 = 60,000


Using data from previous slide
PotentialUserMinutes for the week = 18,840,000
Availability = (18,840,000 – 60,000) / 18,840,000 = 99.68%
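
A minimal Python sketch of the same calculation, generalised to several outages in one measurement period. The function name and the (affected users, outage minutes) layout are illustrative:

```python
def user_minute_availability(potential_user_minutes, outages):
    """outages: list of (affected users, outage minutes).

    Returns availability in percent, weighted by the number of users affected.
    """
    lost = sum(users * minutes for users, minutes in outages)
    return (potential_user_minutes - lost) / potential_user_minutes * 100


# The slide's example: 500 users without email for 2 hours in a week
# with 18,840,000 potential user minutes.
print(round(user_minute_availability(18_840_000, [(500, 2 * 60)]), 2))
```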

What if there aren’t users?

Transaction based system




Manufacturing system




etc.



Critical Business Functions

Some failures only affect part of a service
ATMs can dispense money but not print statements
Can browse old emails but can’t send or receive
Reservation system can see bookings but not make new ones


It is up to the business to define the relative importance
of each type of transaction

You can use transaction weightings to modify
availability figures




Example Transaction Weightings


            IT function that is not available                  % Service Impact
            Sending email                                      100%
            Receiving email                                    100%
            Using shared distribution list to send email       10%
            Updating shared distribution lists                 5%
            Accessing shared calendars                         30%
            Updating shared calendars                          10%

            Why don’t these add up to 100%?
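
One possible way to apply such weightings is to count each outage as downtime in proportion to its service impact. The Python sketch below assumes that approach, which the slides do not spell out; it also assumes incidents do not overlap in time. The names are illustrative.

```python
# Service impact per unavailable function, as a fraction of a full outage.
IMPACT = {
    "sending email": 1.00,
    "receiving email": 1.00,
    "using shared distribution list to send email": 0.10,
    "updating shared distribution lists": 0.05,
    "accessing shared calendars": 0.30,
    "updating shared calendars": 0.10,
}


def weighted_availability(agreed_service_minutes, incidents):
    """incidents: list of (unavailable function, outage minutes).

    Each outage counts as downtime in proportion to its service impact.
    Assumes incidents do not overlap in time.
    """
    weighted_downtime = sum(IMPACT[function] * minutes for function, minutes in incidents)
    return (agreed_service_minutes - weighted_downtime) / agreed_service_minutes * 100


week = 7 * 24 * 60  # 24x7 agreed service time, in minutes
incidents = [("accessing shared calendars", 200), ("sending email", 30)]
print(round(weighted_availability(week, incidents), 2))
```

Each weighting is applied on its own to the function that failed, so the weights are independent impacts, not shares of a single total.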


What About Poor Performance?

Most SLAs have performance targets

What if performance is SO SLOW that service can’t be
used?
Some SLAs count this as downtime
Others count it separately, with its own penalties
The important thing is to discuss, agree, and document


IT can only agree performance if customer agrees
maximum workload
It is the job of the business to forecast the work, not IT



Example Performance Agreement



     IT function                Required response time (when service is available)
     Login                      99% within 5 seconds
                                99.9% within 15 seconds
     Seat availability check    95% within 10 seconds
                                99% within 30 seconds
     Seat booking               99% within 40 seconds
                                100% within 60 seconds
     Check in                   95% within 20 seconds
                                100% within 60 seconds
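
Targets of this form can be tested against measured response times. A minimal Python sketch, assuming the samples were collected while the service was available; the function names and the sample data are illustrative:

```python
def fraction_within(samples, limit_seconds):
    """Fraction of response-time samples at or below the limit."""
    if not samples:
        raise ValueError("no samples to evaluate")
    return sum(1 for s in samples if s <= limit_seconds) / len(samples)


def meets_targets(samples, targets):
    """targets: list of (required percent, limit in seconds)."""
    return all(
        fraction_within(samples, limit) * 100 >= percent for percent, limit in targets
    )


LOGIN_TARGETS = [(99, 5), (99.9, 15)]  # from the Login row of the agreement

# Hypothetical measurements: 9,995 logins in 1 second and 5 logins in 6 seconds.
login_samples = [1.0] * 9995 + [6.0] * 5
print(meets_targets(login_samples, LOGIN_TARGETS))
```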


Planned Downtime

What effect does a planned outage have on availability?

AST = Agreed Service Time

If planned outage is in a service window then it isn’t
downtime
Some SLAs specify when maintenance will happen
Some SLAs allow additional planned downtime with sufficient notice
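
A small Python sketch of how agreed maintenance windows might be excluded when counting downtime. It assumes the windows sit outside the agreed service time, that they do not overlap each other, and that all times are minutes on a shared clock. The function name is illustrative.

```python
def counted_downtime(outage, maintenance_windows):
    """outage and each window are (start, end) pairs in minutes on a shared clock.

    Returns the outage minutes that fall outside every maintenance window.
    Assumes the maintenance windows do not overlap each other.
    """
    start, end = outage
    excluded = 0
    for window_start, window_end in maintenance_windows:
        overlap = min(end, window_end) - max(start, window_start)
        if overlap > 0:
            excluded += overlap
    return (end - start) - excluded


# An outage from minute 100 to 220 with a maintenance window from 120 to 180:
# only the 60 minutes outside the window count as downtime.
print(counted_downtime((100, 220), [(120, 180)]))
```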




Measuring Availability




Measurement Period

Remember that Availability is defined as




Availability (%) = (AST – DT) / AST × 100

AST = Agreed Service Time
DT = Downtime

What time period should we use for the agreed service
time?



Measurement Period

Availability after a single 8 hour incident


Weekly


Monthly


Quarterly


Annual
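
The effect of the measurement period can be worked out directly. The Python sketch below assumes 24x7 agreed service time and approximate period lengths (30-day month, 91-day quarter, 365-day year). These assumptions and the names are not taken from the slides.

```python
INCIDENT_HOURS = 8

# Agreed service time per period, assuming 24x7 service.
# The month and quarter lengths are approximations.
PERIOD_HOURS = {
    "Weekly": 7 * 24,
    "Monthly": 30 * 24,
    "Quarterly": 91 * 24,
    "Annual": 365 * 24,
}


def availability_percent(ast_hours, downtime_hours):
    """(AST - DT) / AST as a percentage."""
    return (ast_hours - downtime_hours) / ast_hours * 100


for period, hours in PERIOD_HOURS.items():
    print(f"{period}: {availability_percent(hours, INCIDENT_HOURS):.2f}%")
```

The same 8 hour incident gives a much lower figure over a week than over a year, so the reporting period has to be part of the agreement.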
Measuring Availability

You have a good definition of Availability
It is Specific about what will be delivered
It is Achievable
It is Relevant to the service you deliver
It is defined over a clear Time period


So what have we forgotten?
A definition is of no use at all if you can’t Measure it




How can you Measure Availability?

Service Desk Records
Fairly easy to implement, inexpensive
Can lead to disputes about accuracy of data


Instrument all components and calculate
Difficult to implement, expensive
May fail to detect complex or subtle failures


Use dummy transactions / clients to simulate
Actually measures end-to-end availability
May miss complex or subtle failures

Instrument applications to report end-to-end availability
Actually measures end-to-end availability
Must be included in the early stages of application design
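
As an illustration of the dummy transaction approach, the Python sketch below runs a probe repeatedly and reports the share of successful attempts. The probe is a hypothetical callable supplied by the caller, for example a scripted login. The function name, its parameters and the 5 second limit are assumptions. The sketch only measures elapsed time; it does not interrupt a probe that hangs.

```python
import time


def measure_availability(probe, attempts, interval_seconds=0.0, max_seconds=5.0):
    """Run probe() `attempts` times and return availability as a percentage.

    An attempt counts as available only if probe() returns a true value
    within max_seconds. Any exception counts as unavailable.
    """
    successes = 0
    for _ in range(attempts):
        started = time.monotonic()
        try:
            ok = bool(probe())
        except Exception:
            ok = False  # an error from the dummy transaction counts as unavailable
        elapsed = time.monotonic() - started
        if ok and elapsed <= max_seconds:
            successes += 1
        time.sleep(interval_seconds)
    return successes / attempts * 100


# Usage with a stand-in probe that fails on its third call.
results = iter([True, True, False, True])
print(measure_availability(lambda: next(results), attempts=4))
```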
Summary

“How many 9s” is not good enough

Must account for
End-to-end service availability
Number and duration of outages
Number of users or transactions affected by incidents
Criticality of business functions affected by incidents
Performance of critical functions
Planned downtime
Agreed measurement period
Agreed measurement process

Everything must be documented in an SLA
Using SMART metrics



Thank you




  Twitter: @StuartRance
  Email:     stuart.rance@hp.com


Más contenido relacionado

Similar a Stuart rance defining availability for an it service

Business Driven Security Securing the Smarter Planet pcty_020710_rev
Business Driven Security Securing the Smarter Planet pcty_020710_revBusiness Driven Security Securing the Smarter Planet pcty_020710_rev
Business Driven Security Securing the Smarter Planet pcty_020710_revShanker Sareen
 
Cso oow12-summit-sonny-sing hv4
Cso oow12-summit-sonny-sing hv4Cso oow12-summit-sonny-sing hv4
Cso oow12-summit-sonny-sing hv4OracleIDM
 
MBM's InterGuard Security Suite
MBM's InterGuard Security SuiteMBM's InterGuard Security Suite
MBM's InterGuard Security SuiteCharles McNeil
 
360-Degree Approach to DR / BC
360-Degree Approach to DR / BC360-Degree Approach to DR / BC
360-Degree Approach to DR / BCAISDC
 
Infromation Security as an Institutional Priority
Infromation Security as an Institutional PriorityInfromation Security as an Institutional Priority
Infromation Security as an Institutional Priorityzohaibqadir
 
Hp Fortify Pillar
Hp Fortify PillarHp Fortify Pillar
Hp Fortify PillarEd Wong
 
Designing your applications with a security twist 2007
Designing your applications with a security twist 2007Designing your applications with a security twist 2007
Designing your applications with a security twist 2007Blue Slate Solutions
 
Dirty Little Secret - Mobile Applications Invading Your Privacy
Dirty Little Secret - Mobile Applications Invading Your PrivacyDirty Little Secret - Mobile Applications Invading Your Privacy
Dirty Little Secret - Mobile Applications Invading Your PrivacyTyler Shields
 
Jedi mind tricks for building application security programs
Jedi mind tricks for building application security programsJedi mind tricks for building application security programs
Jedi mind tricks for building application security programsSecurity BSides London
 
DSS ITSEC Conference 2012 - Cyberoam Layer8 UTM
DSS ITSEC Conference 2012 - Cyberoam Layer8 UTMDSS ITSEC Conference 2012 - Cyberoam Layer8 UTM
DSS ITSEC Conference 2012 - Cyberoam Layer8 UTMAndris Soroka
 

Similar a Stuart rance defining availability for an it service (14)

Mobile Security
Mobile Security Mobile Security
Mobile Security
 
Mobile Security
Mobile Security Mobile Security
Mobile Security
 
Business Driven Security Securing the Smarter Planet pcty_020710_rev
Business Driven Security Securing the Smarter Planet pcty_020710_revBusiness Driven Security Securing the Smarter Planet pcty_020710_rev
Business Driven Security Securing the Smarter Planet pcty_020710_rev
 
Cso oow12-summit-sonny-sing hv4
Cso oow12-summit-sonny-sing hv4Cso oow12-summit-sonny-sing hv4
Cso oow12-summit-sonny-sing hv4
 
MBM's InterGuard Security Suite
MBM's InterGuard Security SuiteMBM's InterGuard Security Suite
MBM's InterGuard Security Suite
 
Stream 2 - Don't Risk IT
Stream 2 - Don't Risk ITStream 2 - Don't Risk IT
Stream 2 - Don't Risk IT
 
360-Degree Approach to DR / BC
360-Degree Approach to DR / BC360-Degree Approach to DR / BC
360-Degree Approach to DR / BC
 
Infromation Security as an Institutional Priority
Infromation Security as an Institutional PriorityInfromation Security as an Institutional Priority
Infromation Security as an Institutional Priority
 
Hp Fortify Pillar
Hp Fortify PillarHp Fortify Pillar
Hp Fortify Pillar
 
Nebezpecny Internet Novejsi Verze
Nebezpecny Internet Novejsi VerzeNebezpecny Internet Novejsi Verze
Nebezpecny Internet Novejsi Verze
 
Designing your applications with a security twist 2007
Designing your applications with a security twist 2007Designing your applications with a security twist 2007
Designing your applications with a security twist 2007
 
Dirty Little Secret - Mobile Applications Invading Your Privacy
Dirty Little Secret - Mobile Applications Invading Your PrivacyDirty Little Secret - Mobile Applications Invading Your Privacy
Dirty Little Secret - Mobile Applications Invading Your Privacy
 
Jedi mind tricks for building application security programs
Jedi mind tricks for building application security programsJedi mind tricks for building application security programs
Jedi mind tricks for building application security programs
 
DSS ITSEC Conference 2012 - Cyberoam Layer8 UTM
DSS ITSEC Conference 2012 - Cyberoam Layer8 UTMDSS ITSEC Conference 2012 - Cyberoam Layer8 UTM
DSS ITSEC Conference 2012 - Cyberoam Layer8 UTM
 

Stuart rance defining availability for an it service

  • 1. Defining availability for an IT service Stuart Rance / November 2012 Twitter: @StuartRance Email: stuart.rance@hp.com
  • 2. Agenda Service Warranty Traditional view of Availability End-to-end services and SLAs Outage Frequency and Duration Number of users affected Critical business functions Poor performance Planned downtime Measurement periods How to measure availability 2 © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
  • 3. Service Warranty © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
  • 4. Service Value Comes From… Service Utility What does the service do? Functional requirements Features, inputs, outputs… “fit for purpose” Service Warranty How well does the service do it? Non-functional requirements Capacity, performance, availability, security, continuity… “fit for use” 4 © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
  • 5. Service Warranty and Risks high natural disaster- fire, flood, adverse weather man made disaster- terrorism, malicious damage security breach- hacker denial of service attack virus attack internal security/fraud impact insufficient capacity data corruption configuration issues software failure power/ network failure hardware failure application error planned downtime low low frequency high 5 © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
  • 6. Service Warranty and Risks high natural disaster- fire, flood, adverse weather man made disaster- terrorism, malicious Continuity damage security breach- hacker Security denial of service attack virus attack internal security/fraud impact insufficient capacity Capacity data corruption configuration issues software failure Availability power/ network failure hardware failure application error planned downtime low low frequency high 6 © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
  • 7. Traditional View of Availability © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
  • 8. Traditional View of Availability Percentage Availability Annual Downtime 99% 87.6 hours (3½ days) 99.5% 43.8 hours 99.9% 8.8 hours 99.95% 4.4 hours 99.99% 53 minutes 99.999% 5.3 minutes 8 © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
  • 9. The Traditional Calculation AST = Agreed Service Time DT = Downtime 9 © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
  • 10. What’s Wrong with Tradition? What if some locations are OK and others aren’t What if some users are OK and others aren’t What if some operations work and others don’t What if the service is so slow that it is unusable? What if there are frequent 5 second outages? What are we actually measuring and reporting? 10 © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
  • 11. End-to-end Services and SLAs © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
  • 12. Where to Measure Availability? Database Network Server Desktop 12 © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
  • 13. As Seen by the Customer / User… 13 © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
  • 14. Service Level Agreements An SLA documents what has been agreed From the perspective of the users and customers Contents should include Availability definitions Targets Measurement and reporting Penalties Every goal in an SLA must be SMART Specific, Measurable, Achievable, Relevant, Time-based 14 © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
  • 15. Outage Frequency and Duration MTBF = Mean Time Between Failures MTBSi = Mean Time Between System Incidents MTRS = Mean Time to Restore Service TBSi Up TBF TRS TRS Down 15 © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
  • 16. Outage Frequency and Duration Which of these is better? Up MTBF = 19 days MTTR = 1 day Availability = 95% Dow n MTBF = 22.8 hrs MTTR = 1.2 hrs Availability = 95% Up Dow n 16 © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
  • 17. Failover Events How long does a failover take? Between cluster members? When a RAID disk fails? When a network link fails? Does fail over have a business impact? Do transactions have to be restarted? What is the longest “short” outage that can be ignored? What if the cluster continuously fails over? What is the maximum frequency of these types of event 17 © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
  • 18. Outage Frequency and Duration Summary Agree availability in terms of Frequency of incidents Duration of incidents Agree failover events which won’t be counted Frequency Duration Impact 18 © Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
  • 19. An Agreement with the Business: Outage duration and frequency must be agreed in terms that the business understands, with metrics that support the business mission. What might such an agreement look like?
  • 20. Example Agreement:
      Outage Duration            Maximum Frequency
      Up to 2 minutes            1 event in any hour; 3 events in any day; 5 events in any week
      2 minutes to 30 minutes    1 event in any month; 2 events in any quarter
      30 minutes to 4 hours      1 event in any year
      Maximum Annual Downtime = 4 hours + (8 * 30 mins) = 8 hours
      Availability = (8760 – 8) / 8760 = 99.9%
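The downtime arithmetic on this slide can be reproduced as a worst-case calculation. The sketch below assumes a 24x7 agreed service time (8760 hours per year) and, like the slide, leaves outages of up to 2 minutes out of the total; the variable names are illustrative only.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, assuming the service is agreed to run 24x7

# Worst case the agreement allows: (maximum events per year, maximum hours per event).
# Outages of up to 2 minutes are not counted, matching the slide's arithmetic.
worst_case = [
    (1, 4.0),      # 30 minutes to 4 hours: 1 event in any year
    (2 * 4, 0.5),  # 2 to 30 minutes: 2 events in any quarter, over 4 quarters
]

max_downtime_hours = sum(events * hours for events, hours in worst_case)
availability = (HOURS_PER_YEAR - max_downtime_hours) / HOURS_PER_YEAR

print(f"Maximum annual downtime: {max_downtime_hours:g} hours")
print(f"Availability: {availability:.1%}")
```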
  • 21. Number of Users Affected: Most failures do not cause complete loss of service. Typical scenario: some users have no service at all; other users are completely unaffected. Extreme cases: only one user is affected; only one user is able to work! Should these count as downtime or not?
  • 22. User Outage Minutes: Potential User Minutes = Number of users * Agreed service time. User Outage Minutes = Number of affected users * Downtime.
  • 23. Potential User Minutes: Not every minute is equal.
      Day and time             Potential no. of users   Weekly PotentialUserMinutes
      Mon – Fri 00:00-07:00    500                      5 x 7 x 60 x 500 = 1,050,000
      Mon – Fri 07:00-09:00    2,500                    5 x 2 x 60 x 2500 = 1,500,000
      Mon – Fri 09:00-18:00    5,000                    5 x 9 x 60 x 5000 = 13,500,000
      Mon – Fri 18:00-21:00    1,000                    5 x 3 x 60 x 1000 = 900,000
      Mon – Fri 21:00-00:00    500                      5 x 3 x 60 x 500 = 450,000
      Sat – Sun                500                      2 x 24 x 60 x 500 = 1,440,000
      WEEKLY TOTAL                                      18,840,000
  • 24. User Outage Minutes Example: Lost email service to 500 users for 2 hours on a Monday morning at 10:00. UserOutageMinutes = 500 * 2 * 60 = 60,000. Using data from previous slide, PotentialUserMinutes for the week = 18,840,000. Availability = (18,840,000 – 60,000) / 18,840,000 = 99.68%.
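The user-minutes calculation from the last three slides can be checked with a short script. This is a sketch using the slide's own figures; the names `BANDS` and `user_outage_minutes` are illustrative and not part of the presentation.

```python
# (days per week, hours per day, potential users) for each time band on the slide.
BANDS = [
    (5, 7, 500),    # Mon-Fri 00:00-07:00
    (5, 2, 2500),   # Mon-Fri 07:00-09:00
    (5, 9, 5000),   # Mon-Fri 09:00-18:00
    (5, 3, 1000),   # Mon-Fri 18:00-21:00
    (5, 3, 500),    # Mon-Fri 21:00-00:00
    (2, 24, 500),   # Sat-Sun
]

# Potential user minutes = number of users * agreed service time, summed over bands.
potential_user_minutes = sum(days * hours * 60 * users for days, hours, users in BANDS)


def user_outage_minutes(affected_users: int, downtime_minutes: int) -> int:
    """User outage minutes = number of affected users * downtime."""
    return affected_users * downtime_minutes


# The slide's example: 500 users without email for 2 hours.
lost = user_outage_minutes(affected_users=500, downtime_minutes=2 * 60)
availability = (potential_user_minutes - lost) / potential_user_minutes

print(potential_user_minutes, lost, f"{availability:.2%}")
```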
  • 25. What if there aren’t users? Transaction based system, manufacturing system, etc.
  • 26. Critical Business Functions: Some failures only affect part of a service. ATMs can dispense money but not print statements. Can browse old emails but can’t send or receive. Reservation system can see bookings but not make new ones. It is up to the business to define the relative importance of each type of transaction. You can use transaction weightings to modify availability figures.
  • 27. Example Transaction Weightings:
      IT function that is not available                % Service Impact
      Sending email                                    100%
      Receiving email                                  100%
      Using shared distribution list to send email     10%
      Updating shared distribution lists               5%
      Accessing shared calendars                       30%
      Updating shared calendars                        10%
      Why don’t these add up to 100%?
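One way weightings like these could modify an availability figure is to scale each incident's duration by its service impact. The slides do not say how to combine several functions failing at once, so the sketch below assumes the largest single impact applies (which keeps the result at or below 100% of elapsed time); the function name and that rule are assumptions, not the presenter's method.

```python
# Service impact per unavailable IT function, taken from the example weightings.
IMPACT = {
    "sending email": 1.00,
    "receiving email": 1.00,
    "using shared distribution list to send email": 0.10,
    "updating shared distribution lists": 0.05,
    "accessing shared calendars": 0.30,
    "updating shared calendars": 0.10,
}


def weighted_downtime(minutes: float, unavailable: list[str]) -> float:
    """Scale an incident's duration by the worst impact among the failed functions.

    Assumption: simultaneous failures take the largest single impact, not the sum.
    """
    return minutes * max(IMPACT[name] for name in unavailable)


# A one-hour incident in which shared calendars cannot be read or updated.
calendar_incident = weighted_downtime(60, ["accessing shared calendars", "updating shared calendars"])
# A one-hour incident in which email cannot be sent counts in full.
email_incident = weighted_downtime(60, ["sending email"])

print(f"{calendar_incident:.1f} and {email_incident:.1f} weighted minutes")
```

The weighted minutes can then be used as the downtime term in whichever availability calculation the SLA defines.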
  • 28. What About Poor Performance? Most SLAs have performance targets. What if performance is SO SLOW that the service can’t be used? Some SLAs count this as downtime. Others count it separately, with its own penalties. The important thing is to discuss, agree, and document. IT can only agree performance if the customer agrees maximum workload. It is the job of the business to forecast the work, not IT.
  • 29. Example Performance Agreement:
      IT function                Required response time (when service is available)
      Login                      99% within 5 seconds; 99.9% within 15 seconds
      Seat availability check    95% within 10 seconds; 99% within 30 seconds
      Seat booking               99% within 40 seconds; 100% within 60 seconds
      Check in                   95% within 20 seconds; 100% within 60 seconds
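Targets of the form "X% within N seconds" can be checked against measured response times by counting the fraction of samples under each limit. The sketch below is illustrative: the function names and the sample data are invented for the example, and only the Login targets come from the slide.

```python
def fraction_within(response_times: list[float], limit_seconds: float) -> float:
    """Fraction of measured response times at or under the limit."""
    return sum(t <= limit_seconds for t in response_times) / len(response_times)


def meets_targets(response_times: list[float], targets: list[tuple[float, float]]) -> bool:
    """targets is a list of (required fraction, limit in seconds) pairs."""
    return all(
        fraction_within(response_times, limit) >= required
        for required, limit in targets
    )


# Login targets: 99% within 5 seconds, 99.9% within 15 seconds.
LOGIN_TARGETS = [(0.99, 5), (0.999, 15)]

# Hypothetical measurements, in seconds.
good_sample = [1.2] * 1990 + [7.0] * 9 + [20.0]  # 99.5% within 5 s, 99.95% within 15 s
slow_sample = [1.2] * 980 + [7.0] * 20           # only 98% within 5 s

print(meets_targets(good_sample, LOGIN_TARGETS), meets_targets(slow_sample, LOGIN_TARGETS))
```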
  • 30. Planned Downtime: What effect does a planned outage have on availability? AST = Agreed Service Time. If a planned outage is in a service window then it isn’t downtime. Some SLAs specify when maintenance will happen. Some SLAs allow additional planned downtime with sufficient notice.
  • 31. Measuring Availability
  • 32. Measurement Period: Remember that Availability is defined as Availability = (AST – DT) / AST x 100%, where AST = Agreed Service Time and DT = Downtime. What time period should we use for the agreed service time?
  • 33. Measurement Period: Availability after a single 8 hour incident, for Weekly, Monthly, Quarterly and Annual measurement periods.
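The effect of the measurement period can be illustrated by applying (AST – DT) / AST to the same 8 hour incident over different periods. The sketch below assumes a 24x7 agreed service time, a 30-day month and a 91-day quarter; those period lengths are assumptions for illustration, not figures from the slide.

```python
INCIDENT_HOURS = 8

# Agreed service time per measurement period, assuming 24x7 service.
PERIOD_HOURS = {
    "Weekly": 7 * 24,
    "Monthly": 30 * 24,    # assumed 30-day month
    "Quarterly": 91 * 24,  # assumed 91-day quarter
    "Annual": 365 * 24,
}

# Availability = (AST - DT) / AST for a single incident in each period.
results = {
    period: (ast - INCIDENT_HOURS) / ast
    for period, ast in PERIOD_HOURS.items()
}

for period, value in results.items():
    print(f"{period:<10}{value:.2%}")
```

Under these assumptions the same incident costs nearly five percentage points in a weekly report but less than a tenth of a point in an annual one, so the period has to be agreed along with the target.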
  • 34. Measuring Availability: You have a good definition of Availability. It is Specific about what will be delivered. It is Achievable. It is Relevant to the service you deliver. It is defined over a clear Time period. So what have we forgotten? A definition is of no use at all if you can’t Measure it.
  • 35. How can you Measure Availability? Service Desk records: fairly easy to implement, inexpensive; can lead to disputes about accuracy of data. Instrument all components and calculate: difficult to implement, expensive; may fail to detect complex or subtle failures. Use dummy transactions / clients to simulate: actually measures end-to-end availability; may miss complex or subtle failures. Instrument applications to report end-to-end availability: actually measures end-to-end availability; must be included in the early stages of application design.
  • 36. Summary: “How many 9s” is not good enough. Must account for: end-to-end service availability; number and duration of outages; number of users or transactions affected by incidents; criticality of business functions affected by incidents; performance of critical functions; planned downtime; agreed measurement period; agreed measurement process. Everything must be documented in an SLA, using SMART metrics.
  • 37. Thank you. Twitter: @StuartRance Email: stuart.rance@hp.com