Australian organisations are taking their information technology disaster recovery seriously. However, with many organisations currently focused on cost reduction, unrealised opportunities exist to achieve disaster recovery objectives more economically.
In the CERTITUDE 2012 Information Technology Disaster Recovery Survey report released on the 8th of November, Eric Keser, a director and principal consultant with CERTITUDE Technology Risk Services, said, “Australian organisations spend about three percent of their annual IT budget on disaster recovery. However, spending well above the average on disaster recovery does not necessarily provide greater protection against system outages. Some of the Survey respondents spend more than ten percent of their annual IT budget on disaster recovery. Despite this budget, these respondents experienced about twelve percent of all outages in the past two years reported in the Survey.”
The Survey is the first of its kind conducted by CERTITUDE. It specifically focused on the disaster recovery practices of Australian organisations. Keser said, “The Survey shows that there are many opportunities for Australian organisations to get more from their IT disaster recovery expenditure.”
This is consistent with the movement CERTITUDE has seen in recent years, where its clients are asking for help not only to design disaster recovery solutions, but also to find ways to improve the cost efficiency of recovery implementation and maintenance.
For example, the Survey found that up to seventeen percent of respondents reported system disruptions caused by the failure of third-party service providers (e.g. electricity, IT operations, or telecommunications providers). Keser said, “This highlights the opportunity, at nominal cost, to reduce such causes by improving the integration of disaster recovery into existing service level and third-party management processes.”
“IT disaster recovery is poorly embedded into other processes, with forty percent or less of respondents having embedded disaster recovery into project management, service level management, the service desk, and third-party management processes. These are existing IT processes that could help prevent, or minimise the harm caused by, the common causes of outages,” said Keser.
Keser said, “Better internal controls can prevent other causes of system disruptions reported in the Survey as well.” These causes often relate to failures in change management, capacity planning, and IT environmental management. Such processes are all usually within the organisation’s direct control, and therefore should not be costly to improve. Yet as few as thirty percent of respondents identify and evaluate the performance of these key disaster recovery controls.
About knowing how much recovery capability is needed, Keser said, “Most respondents involve their users in the determination of disaster recovery requirements. However
2. DEMOGRAPHICS
- Organisations operating in Australia
- 12 of the 19 ANZSIC industries
- Representation of all employee sizes
- All annual IT spend levels, except for $0.5m to $1m
Certitu
3. BUDGET
Respondents spend around 3% of their IT budget on disaster recovery. However, money doesn’t necessarily buy fewer IT outages.
- Most outages were reported by those who spent 1% of their IT budget on DR
- Respondents who spent more than 10% incurred 12% of all outages reported
- Those with IT budgets of $100k or less spent nearly nothing on DR
[Charts: DR Budget (% of IT); Outages vs DR Spend]
4. RECOVERY LOCATION
Small and/or geographically non-dispersed organisations have difficulty finding suitable recovery locations.
- Most respondents (55.88%) recover to the same city
- Size and geographical presence have a significant influence on recovery location
- Respondents who have a regional presence are taking full advantage of their geographical diversity
[Charts: Location; Recovery Site Location]
5. MATURITY
Higher levels of disaster recovery maturity can reduce system disruption.
- Most describe their DR maturity as ‘repeatable, but intuitive’, or ‘defined’
- Size does not influence maturity
- The higher the maturity, the lower the number of outages and harm (e.g. average and longest duration)
[Charts: Maturity; Outages vs Maturity]
6. STANDARDS & REGULATIONS
Disaster recovery standards and guides do not significantly influence most organisations’ disaster recovery.
- Standards have no significant influence on disaster recovery
- Broader standards have greater influence than DR specific ones
- There are changes to APRA’s Practice Standards that affect DR
[Charts: Standards / Guidelines; Regulation / Legislation]
7. PROCESS INTEGRATION
Disaster recovery is poorly embedded into project and service level management, as well as service desk processes.
- Most have DR embedded into IT Service Continuity, ICT Infrastructure, Availability, Change, Incident, Security and Financial Management
- Few have DR embedded into Release Management, Service Desk and Service Level Management
[Chart: Where DR is Embedded]
8. THREATS
Trends learned from incident and problem management are not often used to identify DR threats and opportunities to prevent future system disruption.
- Most use various forms of risk assessment to identify threats
- Few (<30%) use information recorded by incident and problem management processes to identify threats
[Chart: Where DR Threats are Identified]
9. KEY CONTROLS
The management of service levels and third-party service providers is being overlooked as a means of controlling disaster recovery risk.
- Few evaluate important DR controls such as managing performance, capacity and problems
- Even fewer recognise the importance of managing service levels and third-party providers
[Charts: Manage Changes; Manage Physical Environment; Manage Performance & Capacity; Manage Problems; Define & Manage Service Levels; Manage Third-Party Providers. Legend entries are truncated in the source: Identified, but…; Identified and…; Not Identifi…]
10. DISRUPTIONS
- Nearly half experienced unplanned outages in the past 2 years
- Direct correlation between maturity, and outage frequency and duration
[Charts: Outages; Average (hrs); Longest (hrs)]
11. DISRUPTIONS
Many system disruptions are essentially self-inflicted.
- Many causes of disruption can be controlled by processes that are in the direct control of the organisation
- Processes that help manage third parties are neglected even though many outages are caused by third parties
[Chart: Root Causes]
12. RECOVERY REQUIREMENTS
Users are involved in determining disaster recovery requirements.
- Work-arounds and system dependencies are well considered
- The re-entry and processing of lost data, and the clearing of any work backlog, are not well considered
[Charts: RTO Considerations; RPO Considerations]
13. EXPECTATIONS & IMPACT
The most difficult area of harm to quantify, reputation, is of the greatest concern.
- Users are involved but expectations are not well managed
- Reputational damage was of high concern, and is the most difficult to actually measure and quantify
- Operational and financial impacts also ranked highly
[Charts: Expectation Management; Areas of Harm]
14. DESIGN & TECHNOLOGY
Technologies in production are well utilised for recovery capability. However, use of DR architecture is not widespread.
- Only 75% of respondents make good use of the DR architecture
- 12% have no DR architecture at all
- Most make good use of existing technologies in their production environment
- Cloud-based services are not popular
[Charts: Use of DR Architecture; Use of Production Technologies]
15. DOCUMENTATION
Plans are often out of date, and supporting documentation is often unidentified or unavailable.
- 38% review or update their documentation at least once every year
- 94% use generic word processing tools to document their disaster recovery plans
- Supporting documentation is often neglected
[Charts: Documentation Status; Documentation Tools]
16. TRAINING
Many respondents use disaster recovery testing as the primary method of training.
- 47% have never conducted disaster recovery training
- Some considered regular disaster recovery testing to be the best form of training
[Charts: Training Frequency; Training Methods]
17. TESTING
Few respondents (34%) have their recovery tests independently evaluated and reported.
- Most test at least once every year (note APRA)
- 8% do no testing at all
- A wide range of testing methods is used, with failover to the DR site the most popular
[Charts: Testing Frequency; Testing Methods]
This is the first information technology disaster recovery survey (the Survey) that Certitude has conducted. Certitude surveyed numerous organisations in Australia from a wide range of industries. The Survey specifically focused on the disaster recovery practices of Australian organisations, and therefore presents findings that are most relevant to the Australian market. In August and September 2012, respondents completed the online Survey, which asked a number of questions concerning Information Technology Disaster Recovery (DR) in their organisation.
The results of the Survey indicate that, broadly, disaster recovery in Australian organisations is well managed. However, with many organisations currently focused on cost reduction, opportunities exist that could enable organisations to achieve their disaster recovery objectives more economically. Some of these opportunities are illustrated in the key findings of the Survey.
The majority of the total IT outages reported in the past two years were experienced by respondents who spent around 1% of their IT budget on disaster recovery. However, a substantial number (around 12%) of the total outages reported were experienced by respondents who spent a relatively large proportion of their IT budget (more than 10%) on disaster recovery.
Of the respondents with an annual IT budget of less than or equal to $100,000, close to 0% of the annual IT budget was spent on disaster recovery. In comparison, respondents with an annual IT budget of more than $500m spent over 10% of their annual IT budget on disaster recovery.
On average, the percentage of annual IT budget spent on disaster recovery is around 3%.
The majority of respondents (55.88%) recover their systems to a location within the same city. Organisational size (i.e. by number of employees) and geographical presence appear to have a significant influence on recovery location. Small organisations typically recover locally or within the same city. This illustrates a problem that small and/or geographically non-dispersed organisations encounter. They do not own, and therefore have no easy access to, other suitable recovery locations, and the cost to subscribe to third-party recovery facilities may be prohibitive for these organisations.
In contrast, respondents who have a regional presence appear to be taking full advantage of their geographical diversity by recovering to facilities they own in other locations.
The majority of respondents described the maturity of their disaster recovery as ‘repeatable, but intuitive’, or ‘defined’. Around 2.5% of respondents described the maturity of their disaster recovery as ‘optimised’. The size of an organisation does not appear to influence maturity.
However, there were notable differences in maturity across different respondent industries. Respondents from mining, manufacturing, transport and storage, and communication services, on average, described their maturity as ‘repeatable, but intuitive’ or lower. The financial services, education, health and community services, and professional services industries, on average, described their maturity as ‘defined’ or higher.
In the past two years, the following percentage of respondents, by maturity, experienced an outage:
- ‘Optimised’ = 0%
- ‘Managed and Measurable’ or ‘Defined’ = 33.3%
- ‘Repeatable, but Intuitive’ = 50%
- ‘Initial/Adhoc’ = 100%.
It appears that improving the maturity of an organisation's disaster recovery is likely to reduce system disruption.
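As an illustrative aside, the outage figures quoted above can be tabulated to check that they fall consistently as maturity rises. The short Python sketch below uses only the percentages reported in the Survey. The variable and function names are hypothetical, and the ordering of maturity levels from lowest to highest is an assumption based on the order in which the Survey lists them.

```python
# Percentage of respondents, by self-described maturity, who experienced an
# outage in the past two years (figures as reported in the Survey text).
# Assumption: levels are ordered here from lowest to highest maturity.
OUTAGE_RATE_BY_MATURITY = [
    ("Initial/Adhoc", 100.0),
    ("Repeatable, but Intuitive", 50.0),
    ("Managed and Measurable / Defined", 33.3),
    ("Optimised", 0.0),
]


def falls_with_maturity(rates):
    """Return True if each successive maturity level has a strictly lower outage rate."""
    values = [rate for _, rate in rates]
    return all(lower > higher for lower, higher in zip(values, values[1:]))


for level, rate in OUTAGE_RATE_BY_MATURITY:
    print(f"{level:<35} {rate:5.1f}%")

print("Outage rate falls as maturity rises:", falls_with_maturity(OUTAGE_RATE_BY_MATURITY))
```

This is only a consistency check on four reported figures. It says nothing about sample sizes or causation, which the Survey text does not provide.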
For the most part, it appears that existing disaster recovery relevant standards, guidelines, regulation, and legislation have no real influence on organisations’ disaster recovery.
Particularly interesting is that broader standards and guidelines such as ISO 27001 and ISO 22320 appear to be of greater influence than disaster recovery and Business Continuity Management (BCM) specific standards and guidelines such as AS/NZS 5050 and the Australian National Audit Office’s (ANAO’s) BCM Practice Guide.
Note: APRA's APS / LPS 232 and GPS 222 have all been superseded by CPS 232 as at 1 July 2012. Some of the changes to be aware of include:
a) A regulated institution cannot just perform a BIA for critical business operations. It must perform the analysis for all operations in order to determine which are critical.
b) Clarifications concerning the role and obligations of the board (or equivalent) in complying with the standards.
c) An extension to the standard to include registered life Non Operating Holding Companies (NOHCs).
d) Greater clarity around the application of the standard to foreign branches.
e) A new requirement for life companies to conduct periodic reviews of their business continuity plans using internal auditors or external experts.
f) Under CPS 232, new powers for APRA to request that an external expert undertakes an assessment of BCM arrangements for ADIs and general insurers.
g) A new requirement for Level 2 insurance groups to comply with the Prudential Standard GPS 222 Risk Management: Level 2 Insurance Group BCM requirements.
Most respondents (over 50%) have disaster recovery mostly or completely embedded into their IT Service Continuity, ICT Infrastructure, Availability, Change, Incident, Security and Financial Management processes.
Few (around 44%) have disaster recovery embedded into their Project Management processes. Fewer still (less than 40%) have embedded disaster recovery into other important processes such as Release Management, Service Desk and Service Level Management processes.
Embedding disaster recovery activities into everyday IT processes can help achieve disaster recovery objectives in a very cost efficient manner, and improve disaster recovery awareness across the organisation.
Embedding disaster recovery into existing IT processes may negate the need to maintain a standalone disaster recovery process that may become neglected over time. For example, embedding disaster recovery considerations and sign-off in change requests may reduce the possibility that a production change will reduce the disaster recovery capability. Doing this may also prevent a new system being commissioned without an established disaster recovery solution.
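The change request example above could be enforced by a simple gate in whatever change management tool an organisation uses. The Python sketch below is one hypothetical way to express that check. The ChangeRequest record, its field names, and the dr_gate function are all assumptions made for illustration and do not describe any particular product or the Survey's method.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChangeRequest:
    """Hypothetical change record; the field names are illustrative only."""
    summary: str
    affects_production: bool
    dr_impact_assessed: bool = False
    dr_signed_off: bool = False


def dr_gate(change: ChangeRequest) -> List[str]:
    """Return the reasons a change should be held; an empty list means it may proceed."""
    issues = []
    # Only production changes are assumed to need a DR check in this sketch.
    if change.affects_production:
        if not change.dr_impact_assessed:
            issues.append("DR impact not assessed")
        if not change.dr_signed_off:
            issues.append("DR sign-off missing")
    return issues


new_system = ChangeRequest("Commission new payroll system", affects_production=True)
print(dr_gate(new_system))
```

A change that has been assessed and signed off returns an empty list, so the same function can drive either a hard block or a warning, depending on how strictly the organisation wants to apply it.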
The majority of respondents identify threats to IT service continuity by using disaster recovery specific risk assessments, broader IT risk assessments, or enterprise-wide risk assessments.
Few (less than 30%) used information recorded by their incident and problem management processes to identify threats. This represents a missed opportunity to analyse past threats and then to improve risk mitigation activities in order to prevent future recurrence.
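Analysing past incidents need not be elaborate. As a minimal sketch, assuming incident records can be exported with a root cause field, the Python below counts recurring root causes so they can be fed into a risk assessment. The records, category names, and the recurring_causes function are hypothetical and are not Survey data.

```python
from collections import Counter

# Hypothetical incident records; the root cause categories are illustrative.
incidents = [
    {"id": 1, "root_cause": "change management"},
    {"id": 2, "root_cause": "third-party provider"},
    {"id": 3, "root_cause": "change management"},
    {"id": 4, "root_cause": "capacity planning"},
    {"id": 5, "root_cause": "third-party provider"},
    {"id": 6, "root_cause": "change management"},
]


def recurring_causes(records, min_count=2):
    """Return (root cause, count) pairs seen at least min_count times, most frequent first."""
    counts = Counter(record["root_cause"] for record in records)
    return [(cause, n) for cause, n in counts.most_common() if n >= min_count]


for cause, n in recurring_causes(incidents):
    print(f"{cause}: {n} incidents")
```

Causes that recur are candidates for a specific threat entry and a mitigating control, rather than being handled one incident at a time.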
Most respondents identify and evaluate several key controls that can protect against unplanned system outages. These include: Manage Changes, Ensure System Security, Enterprise-wide Business Continuity Planning, Manage the Physical Environment, and Manage Operations. However, many respondents had only identified, but not evaluated, other important key controls such as managing performance, capacity and problems.
Significantly, many respondents did not appear to recognise the importance of having and ensuring the operational effectiveness of key controls related to managing service levels and third-party providers.
In addition, some respondents do not identify problem management as an important disaster recovery control. These respondents may experience unnecessary harm, due to not identifying potential causes of disruption, or not escalating minor issues appropriately before they cause a disruption.
The identification and validation of key controls can often significantly, and cost effectively, reduce the likelihood and consequences of system disruption.
Nearly half of the respondents (47.06%) had experienced a major and unplanned system disruption in the past two years. Of these, most experienced an average outage of one to five hours, and a longest outage of less than 12 hours (half a day). 6.25% of the respondents experienced one or more outages of greater than 72 hours.
While service providers and vendor hardware failures caused a significant number of the reported disruptions, areas that are predominantly in the direct control of an organisation caused a notable number. These could fairly be regarded as ‘self-inflicted’ as they relate to failures in change management, capacity planning, and IT environmental management (see red coloured root causes on the chart below).
Encouragingly, most respondents determine their disaster recovery requirements with representation from users through a Business Impact Analysis (BIA). Also, most respondents consider important factors, such as work-arounds and system dependencies, when determining Recovery Time Objectives and Recovery Point Objectives.
However, nearly half the respondents had not adequately considered the re-entry and processing of lost data, and the clearing of any work backlog. This may indicate that while users were involved in the determination of requirements, their engagement may have been inadequate. This may lead to:
a) A gap between disaster recovery capability and business expectations, and over or under investment in capability;
b) Inaccurate or incomplete MAOs, RTOs and RPOs;
c) Noncompliance with relevant regulations and law.
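To show why re-entry and backlog matter when setting recovery objectives, the rough Python sketch below estimates how long it might take to return to normal after systems are restored. The model is an assumption made for illustration and is not taken from the Survey. It treats lost data (bounded by the RPO) and work that queued during the outage (bounded by the RTO) as hours of work to be redone, cleared using whatever spare capacity the business has on top of its normal workload.

```python
def catch_up_hours(rto_hours: float, rpo_hours: float, spare_capacity: float) -> float:
    """Estimate hours needed after recovery to re-enter lost data and clear the backlog.

    Assumptions (illustrative only):
    - rpo_hours of work must be re-entered because the data was lost;
    - rto_hours of work queued up while systems were unavailable;
    - spare_capacity is the fraction of normal throughput available for
      catch-up on top of business as usual (e.g. 0.25 = 25% extra).
    """
    if spare_capacity <= 0:
        raise ValueError("spare_capacity must be greater than zero")
    return (rto_hours + rpo_hours) / spare_capacity


# Example: a 4 hour RTO and a 1 hour RPO, with 25% spare capacity.
print(catch_up_hours(rto_hours=4, rpo_hours=1, spare_capacity=0.25))  # 20.0
```

Under these assumptions, an outage that meets a four hour RTO can still leave users working through a backlog for 20 further hours. That gap between technical recovery and business expectations is the kind that closer user engagement is meant to uncover.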
Despite a high participation of users in the determination of disaster recovery requirements, overall user expectations appear to be poorly managed. Over half the respondents thought that they only partially managed unrealistic recovery expectations, if at all.
Failing to manage unrealistic expectations may lead to dissatisfied users, and unnecessary expenditure on disaster recovery implementation and maintenance. It can also diminish the importance of user responsibilities in minimising the harm caused by system disruption (e.g. through the deployment of work-arounds).
Of all the potential areas of damage caused by unplanned system outages, reputational damage was of high concern for the greatest number of respondents. Approximately 72% stated that their organisation’s reputation would be either completely or mostly harmed if an unplanned system disruption occurred.
The recognition that reputational damage is significant to many organisations presents a small problem in building a business case for disaster recovery. Unlike other typical areas of harm, reputational damage is the most difficult to actually measure and quantify.
Reputational harm was closely followed by the operational and financial impacts that could cause the most harm to the respondents’ organisations.
Most respondents (approximately 70%) have some form of disaster recovery architecture; however, only around 75% of these make good use of it. Around 12% of respondents either had no disaster recovery architecture, or were intending to develop one.
Despite the availability of cloud services, most respondents do not use cloud-based backup services. Automation tools specific to disaster recovery are also not widely used.
Leveraging technologies that already exist in an organisation’s production environment can provide improved and cost effective recovery capability. Of all the technologies presented in the survey, the majority of respondents (80% or more) have made use of technologies that already exist in their production environments. These include: database replication, off-site tape backup, and virtualisation. Other technologies widely used to aid recovery include: disk/host-based backup, host failover clustering, in-built application recovery tools (e.g. Exchange 2010, SharePoint), load-balancing, and SAN replication.
Around 6% of respondents said that they have never reviewed or updated their disaster recovery documentation. In contrast, about 38% of respondents review or update their documentation at least once every year. Some respondents also review or update their disaster recovery documentation as a continuous part of their change management process, either bi-monthly, or when specified by their customers.
The majority of respondents (around 94%) use generic word processing tools to document their disaster recovery plans and associated documentation. Around half of the respondents also use generic systems such as their intranets and document management systems to publish and maintain their documentation.
Cloud-based services have not gained popularity, with no respondent reporting the use of such services to store and disseminate disaster recovery documentation. About 6% of respondents use other tools such as their CMDB and Service Management Software.
Surprisingly, about 47% of respondents said that they have never conducted disaster recovery training. This may be because some respondents considered regular disaster recovery testing to be the best form of training.
In contrast to the above, one respondent said that they conducted training bi-monthly. Some respondents conducted on-the-job training.