Incident-vs-Problem Management White Paper

Nova Corporation White Paper

Incident Management and
Problem Management
DESS Task Order 1
Dan Goebel, ITIL Expert

Nova Corporation 
2/14/15 FOUO 1

February 14, 2015
Incident vs. Problem Management
Background
For an IT shop using IT Service Management to deliver IT services to the business, Incident
Management and Problem Management are the most frequently used processes. As a result,
confusion sometimes arises as to where one ends and the other starts.

In this paper we’ll explore each process individually and identify when the handoff between
them should occur. We’ll also cover metrics, Key Performance Indicators (KPIs), Critical
Success Factors (CSFs), and a simple out-of-the-box process flow for each one. Additionally,
we will look at the inputs and outputs of each process.

This white paper should go a long way in helping DISA ensure that services get restored
quickly and that problems are resolved to prevent further incidents.

Appendices A and B contain cut-sheets (one-page summaries) for Incident Management and
Problem Management, respectively.

Incident Management
An incident is defined as an unplanned interruption to an IT service or reduction in the quality
of an IT service. Failure of a configuration item that has not yet impacted service is also an
incident, for example failure of one disk from a mirror set.

The management of incidents almost always occurs via interaction with the Service Desk. The
Service Desk is defined as a function that stands on its own; it is not part of the Incident
Management process, though the two are tightly integrated.

The goal is to restore normal function or service operation as quickly as possible with minimal
impact. This is to ensure that Service Level Agreements (SLAs) are met. The SLA may often
refer to availability and this is where Incident Management helps to meet those thresholds.

An incident is triggered by an event, or a series of events, that disrupts service. These
events may be reported via the Service Desk, via users, or through an Event Management tool.
Keep in mind that some events do not disrupt service and therefore do not cause an incident.

Key to successful Incident Management is having an Incident Management Model. This model
should include:

• Necessary steps to be taken to handle an incident

• The order in which these steps should be taken

• Responsibilities, possibly through a RACI chart

• Timescale and thresholds for completion of actions

• Escalation procedures

• Necessary evidence-preservation activities, which are especially important in
security-related events

Incidents will need to be identified, logged, categorized, prioritized, diagnosed, escalated,
resolved, and finally closed. This is illustrated in the following generic flow chart:
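
The lifecycle stages named above can also be sketched in code. The following is a minimal Python illustration, not part of ITIL or of any DISA tool: the `Stage` enum, the `next_stages` and `advance` helpers, and the choice to treat escalation as the only optional stage are all assumptions made for this example.

```python
from enum import Enum


class Stage(Enum):
    IDENTIFIED = 1
    LOGGED = 2
    CATEGORIZED = 3
    PRIORITIZED = 4
    DIAGNOSED = 5
    ESCALATED = 6
    RESOLVED = 7
    CLOSED = 8


# Assumption of this sketch: escalation is optional, because an incident
# diagnosed at the first tier may be resolved without being escalated.
OPTIONAL_STAGES = {Stage.ESCALATED}


def next_stages(current):
    """Return the stages an incident may move to from `current`."""
    ordered = list(Stage)
    idx = ordered.index(current)
    allowed = []
    for stage in ordered[idx + 1:]:
        allowed.append(stage)
        if stage not in OPTIONAL_STAGES:
            break  # stop at the first mandatory stage
    return allowed


def advance(current, target):
    """Move to `target`, refusing any transition that skips a mandatory stage."""
    if target not in next_stages(current):
        raise ValueError(f"cannot move from {current.name} to {target.name}")
    return target
```

With this sketch, a diagnosed incident may move either to escalation or straight to resolution, while an attempt to close an incident that has only been identified is rejected.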

One of the problems at DISA is that escalation is confused with Problem Management. For
instance, the DESS contract states for Task Order 1 that an incident affecting 10 or more users
gets assigned to Problem Management. Treating an escalation threshold, in this case 10 users,
as a criterion for moving work to another process is not consistent with the ITIL standard.
As we shall see in Problem Management, the number of users affected matters little, because
the service being analyzed in Problem Management is assumed to be up and running with a
workaround. In the case of servers or network equipment whose service degrades over a short
period of time, for instance, the workaround could be to reboot the equipment every three
days. Operationally this is not acceptable as a permanent state; therefore, the next step is to
move the issue to Problem Management to find and fix the cause of the quickly degrading
service.
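
The distinction between escalation and handoff can be made concrete with a small Python sketch. It only illustrates the argument of this paper; the `Incident` fields, the two function names, and the exact decision rules are hypothetical, and the threshold constant simply mirrors the 10-user figure quoted from the contract.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    users_affected: int
    service_restored: bool      # True once service runs again, even via a workaround
    root_cause_known: bool
    recurring: bool = False     # e.g. the same fault keeps coming back


# Hypothetical threshold; mirrors the "10 or more users" figure quoted above.
ESCALATION_USER_THRESHOLD = 10


def needs_escalation(incident):
    """Escalation stays inside Incident Management and may consider user count."""
    return (not incident.service_restored
            and incident.users_affected >= ESCALATION_USER_THRESHOLD)


def needs_problem_record(incident):
    """Handoff to Problem Management depends on the cause, not on user count."""
    return (incident.service_restored
            and (not incident.root_cause_known or incident.recurring))
```

Under these assumed rules, an unresolved outage affecting 50 users is escalated but not handed off, while the equipment that is kept running by a reboot every three days becomes a problem record even if only a few users are affected.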

Metrics
The following are some possible metrics for understanding how effective the Incident
Management process is for the business.

These can be drawn from the following sources: Incident Management system reports, labor or
HR reports, and process and tool assessment audit findings.

Ref  Metric
A    Total # of incidents
B    Avg time to resolve Severity 1 and Severity 2 incidents
C    # of incidents resolved within agreed service levels
D    # of High/Major incidents
E    # of incidents with customer impact
F    # of incidents reopened
G    Total available labor hours to work on incidents (non-Service Desk)
H    Total labor hours spent resolving incidents (non-Service Desk)
I    Incident Management tooling support level
J    Incident Management process maturity

KPI (Key Performance Indicators)
KPIs are calculations or measurements used to indicate the performance level of an operation
or process. They provide input for actionable management decisions and feed the Act portion
of the Deming Plan-Do-Check-Act (PDCA) cycle. KPIs are derived from the metrics listed above:
each is either a metric taken as-is or a calculation using two or more metrics, and each shows
whether performance is within acceptable tolerance thresholds.
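
As a rough illustration of how a KPI is calculated from two or more metrics, the Python sketch below computes the ratio-style KPIs listed in this paper's Incident Management KPI table (C/A, E/A, F/A, and H/G). The function names, dictionary keys, and sample figures are invented for the example; only the ratios themselves come from the paper.

```python
def ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    if denominator == 0:
        return None
    return numerator / denominator


def incident_kpis(m):
    """Derive ratio-style KPIs from a dict keyed by metric reference letter."""
    return {
        "incident_resolution_rate": ratio(m["C"], m["A"]),  # C/A
        "customer_impact_rate": ratio(m["E"], m["A"]),      # E/A
        "incident_reopen_rate": ratio(m["F"], m["A"]),      # F/A
        "labor_utilization_rate": ratio(m["H"], m["G"]),    # H/G
    }


# Made-up sample figures, for illustration only.
sample = {"A": 200, "C": 180, "E": 40, "F": 10, "G": 1000, "H": 250}
kpis = incident_kpis(sample)
```

Returning `None` for a zero denominator is a design choice of this sketch: a reporting period with no incidents has no meaningful resolution rate, and that is easier to spot than a misleading zero.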

Ref  KPI                                                      Calculation
1    Total # of incidents                                     A
2    # of High/Major incidents                                D
3    Incident resolution rate                                 C/A
4    Customer impact rate                                     E/A
5    Incident reopen rate                                     F/A
6    Avg time to resolve Severity 1 and Severity 2 incidents  B
7    Incident labor utilization rate                          H/G
8    Incident Management tooling support level                I
9    Incident Management process maturity level               J

CSF (Critical Success Factor)
Critical success factors are defined as the limited number of areas in which satisfactory
results will ensure successful competitive performance for the organization. They are the few
key areas where things must go right for the business to flourish. CSFs sit a level above KPIs:
each critical success factor is supported by a set of KPIs, which are in turn built on metrics,
as can be seen in the CSF table for Incident Management below.

CSF                                   KPI
Quickly Resolve Incidents             5, 6, 8
Maintain IT Service Quality           1, 2, 3, 4, 8, 9
Improve IT and Business Productivity  7, 8
Maintain User Satisfaction            4, 8, 9

Problem Management
A problem is defined as the cause of one or more incidents.

Problem Management concerns itself with managing the lifecycle of problems. The objective
here is to prevent future problems and resulting incidents from occurring, eliminate recurring
incidents, and minimize the impact of incidents that cannot be prevented.

Problem Management’s activities include:

• Diagnosing the root cause of incidents and determining the resolution to those problems

• Ensuring that the resolution is implemented through Change Management and Release and
Deployment Management

• Keeping track of workarounds and resolutions

While separate from Incident Management, Problem Management is the process of coming up
with workarounds, analyzing related incidents, identifying patterns of faults and solutions,
and feeding this information into Change Management and Release and Deployment Management
for permanent fixes to recurring problems and incidents. It is reasonable to assume that the
personnel who work in Problem Management are Tier II and Tier III engineers.
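
Keeping track of workarounds and relating incidents to them is commonly done with a known error database (KEDB), as summarized in the Problem Management cut-sheet. The Python sketch below is a toy in-memory version under stated assumptions: the class names, fields, identifiers, and naive keyword matching are all hypothetical, and a real KEDB would live in the ITSM tool.

```python
from dataclasses import dataclass, field


@dataclass
class KnownError:
    problem_id: str
    symptom_keywords: set
    workaround: str
    permanent_fix: str = ""      # empty until a change has been implemented
    related_incidents: list = field(default_factory=list)


class KnownErrorDatabase:
    """A toy in-memory store of known errors and their workarounds."""

    def __init__(self):
        self._entries = []

    def add(self, entry):
        self._entries.append(entry)

    def match(self, incident_id, description):
        """Link an incident to the first known error sharing a keyword with it."""
        words = set(description.lower().split())
        for entry in self._entries:
            if entry.symptom_keywords & words:
                entry.related_incidents.append(incident_id)
                return entry
        return None


kedb = KnownErrorDatabase()
kedb.add(KnownError("PRB-001", {"degraded", "slow"},
                    "Reboot the equipment every three days"))
hit = kedb.match("INC-042", "Server response is slow again")
```

Here the new incident is tied to the existing problem record, so the Service Desk can apply the documented workaround while the permanent fix moves through Change Management.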

Problem Management has two major sub-processes: Reactive and Proactive. Reactive Problem
Management is handled under the Service Operation stage of the ITIL lifecycle, while Proactive
Problem Management falls under Continual Service Improvement. Below are the Problem
Management flow, metrics, KPIs, and CSFs.

Metrics
These can be drawn from the following sources: Incident Management system reports, Problem
Management system reports, labor or HR reports, and process and tool assessment audit
findings.

Ref  Metric
A    # of repeat incidents
B    # of Major Problems
C    Total # of incidents
D    Total # of problems in the pipeline
E    # of problems removed
F    # of known errors (root cause known and workaround in place)
G    # of problems reopened
H    # of problems with customer impact
I    Avg problem resolution time - Severity 1 and 2, in days
J    Total available labor hours to work on problems
K    Total labor hours spent working on and coordinating problems
L    Problem Management tooling support level
M    Problem Management process maturity

KPI
Ref  KPI                                        Calculation
1    Incident repeat rate                       A/C
2    # of Major Problems                        B
3    Problem resolution rate                    E/D
4    Problem workaround rate                    F/D
5    Problem reopen rate                        G/D
6    Customer impact rate                       H/D
7    Avg problem resolution time                I
8    Problem labor utilization rate             K/J
9    Problem Management tooling support level   L
10   Problem Management process maturity level  M

CSF
CSF                                                               KPI
Minimize impact of problems (reduce incident frequency/duration)  1, 2, 4, 6, 7
Reduce unplanned labor spent on Incidents                         1, 3, 4, 5, 8, 9
Improve quality of services being delivered                       2, 6
Resolve problems and errors efficiently and effectively           3, 4, 5, 7, 8, 9, 10

Conclusion
In order to be fully successful in implementing ITSM at DISA, these processes will need to be
fully defined using sources such as the ITIL Service Operation manual and the DOD DESMF
v 2.0. Process maturity will first need to be assessed via audit using ISO 15504 methods and
then continually monitored via Continual Service Improvement processes as outlined in ITIL V3
and the ISO 20000 PDCA cycle. Problems cause incidents: reduce problems and you’ll reduce
incidents.

Sources
1. ITIL Wiki, http://wiki.en.it-processmaps.com/index.php/Main_Page, accessed Feb. 13,
2015

2. ITIL V3 Service Operations, Axelos, Incident and Problem Management sections, 2007

3. Steinberg, Randy A. Measuring ITSM, Trafford Publishing, 2013

4. Manktelow, James. "Critical Success Factors: Identifying the things that really matter for
success." Critical Success Factors. Mind Tools, n.d. Web. 17 Feb. 2015.
<http://www.mindtools.com/pages/article/newLDR_80.htm>

Appendix A: Incident Management

Introduction
Handles all incidents, including failures, questions by users or technical staff, and event that
may be automatically triggered

Definition – an Incident is an unplanned interruption to an IT service or the reduction in the
quality of an IT service. Failure of a CI that has not yet affected service is also an incident.

Objective – to resume regular state as quickly as possible to minimize impact

Scope – all incidents reported by users or tools. A service request is not an incident

Business Value
▪ Reduce downtime
▪ Align services with business priorities
▪ Establish priorities

Basic Concepts
▪ Time limits – agreed to by operational level agreements (OLAs) and Underpinning Contracts
  (UCs – see Supplier Management)
▪ Incident models – a way to determine the steps that are necessary to execute the process
  correctly, especially in terms of types and priorities of incidents
▪ Major incident – a separate procedure is required for major incidents – i.e. shorter
  timeframes and higher urgency – must be agreed upon beforehand

Activities, Methods, and Techniques
The incident management process has the following steps
▪ Identification – incident must be known, monitoring tools are helpful
▪ Registration – all relevant information must be registered
▪ Classification – important for later analysis, must be consistent
▪ Prioritization – establish urgency and impact
▪ Diagnoses – record greatest possible number of symptoms; if known then resolve then
▪ Escalation –
  - functional escalation – tier 1 to tier 2 to tier 3; must have timeframes between
  - hierarchical escalation – call upon management to get problem solved; may follow
    functional escalation
▪ Investigation – each tier makes its own diagnoses and documents
▪ Resolution and recovery – when solution is found it must be tested – by user, centrally
  and by supplier if necessary
▪ Closing – Service desk closes ticket

Interfaces
▪ Problem Management
▪ Configuration Management
▪ Change Management
▪ Capacity Management
▪ Availability Management
▪ Service Level Management

Metrics
▪ total # of incidents
▪ # and % of major incidents
▪ Avg cost per incident
▪ # and % of correctly allocated incidents
▪ % of incidents handled in the agreed amount of time

Implementation
▪ Detect incidents as quickly as possible
▪ All incidents must be registered (use Remedy)
▪ Previous knowledge must be available to learn from
▪ Must be integrated with the CMDB to help determine relationship between CIs
▪ Integrated with SLA; helps determine impact and priority
▪ Critical Success Factors
  - a good service desk
  - clearly defined SLA targets
  - adequate support staff
  - Integrated support tools
  - OLA and UC to shape behavior of support personnel

Risks
▪ If too many incidents come in; then incidents cannot be handled in the timeframe agreed to
▪ If a support tool does not warn of a lack of progress on an incident; then it can become
  stagnant
▪ If no integration or lack of tools; then there will be a lack of adequate information sources
▪ If no OLAs or UCs; then no coinciding objectives and unaligned processes
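
The prioritization step in the Incident Management cut-sheet establishes priority from urgency and impact. One common way to do that is a lookup matrix, sketched below in Python. The three levels, the priority numbers, and the `prioritize` function are illustrative assumptions; the actual values would have to be agreed in the SLA.

```python
# Hypothetical 3x3 matrix: (impact, urgency) -> priority, where 1 is highest.
LEVELS = ("high", "medium", "low")

PRIORITY_MATRIX = {
    ("high", "high"): 1,
    ("high", "medium"): 2,
    ("medium", "high"): 2,
    ("high", "low"): 3,
    ("medium", "medium"): 3,
    ("low", "high"): 3,
    ("medium", "low"): 4,
    ("low", "medium"): 4,
    ("low", "low"): 5,
}


def prioritize(impact, urgency):
    """Look up the priority for an incident's impact and urgency."""
    if impact not in LEVELS or urgency not in LEVELS:
        raise ValueError("impact and urgency must be one of: " + ", ".join(LEVELS))
    return PRIORITY_MATRIX[(impact, urgency)]
```

Keeping the matrix as data rather than as nested conditionals makes it easy to review with the business and to change when service levels are renegotiated.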

Appendix B: Problem Management

Introduction
Definition – A Problem is the unknown cause of one or more incidents

Objective – to prevent problems and incidents, eliminate repeating incidents and minimize the
impact of incidents that cannot be prevented

Scope – all activities needed to diagnose the underlying cause of incidents and to find a
solution to these problems. Uses change and configuration management to implement any fixes

Business Value
▪ Ensure improvements in the availability and quality of the IT service provisions
▪ Resolution information is used to accelerate incident handling and identify permanent
  solutions
▪ Reduces # of incidents and handling time yielding shorter disruption times and fewer
  disruptions overall

Basic Concepts
▪ Known-error – a problem that has a documented root cause and a workaround
▪ Workaround – reducing or eliminating the impact of an incident or problem for which a full
  resolution is not yet available
▪ A known error DB (KEDB) is used for faster diagnoses, the creation of a problem model for
  handling future problems
▪ Problem Model – steps that need to be taken, responsibilities of people involved and
  necessary timescales

Risks
▪ If too many incidents come in; then incidents cannot be handled in the timeframe agreed to
▪ If a support tool does not warn of a lack of progress on an incident; then it can become
  stagnant
▪ If no integration or lack of tools; then there will be a lack of adequate information sources
▪ If no OLAs or UCs; then no coinciding objectives and unaligned processes

Activities, Methods, and Techniques
2 important processes
▪ Reactive problem management – performed by service operations
  - identification – via (1) service desk identifies an unknown cause of one or more
    incidents = problem registration or (2) analysis of incident by support group reveals a
    problem or (3) automated tracing of error via tool or (4) supplier reports problem
  - registration – requires date and time stamp and historic report for control
  - classification – is the same as incident management
  - prioritization – repaired or replaced, costs?, resources needed to solve problem, time
  - investigation and diagnoses – different types of diagnoses: chronological analysis,
    Pain Value, Kepner-Tregoe, brainstorming, Ishikawa diagrams, Pareto
  - workaround – needs to be done as quick as possible while keeping the problem open
    for final solution
  - identified known errors – must be put into a known error database, this way other
    incidents that come up can be tied to the same problem if necessary
  - resolution – as soon as a solution is found it should be applied immediately going
    through proper change management to include thorough testing
  - conclusion – only after full evaluation and applied solution also applied to incidents
    associated with this problem
  - review - perform lessons learned
▪ Proactive problem Management – a part of Service Operations, but handled mostly under
  CSI in conjunction with the KEDB.

Interfaces
▪ Incident Management
▪ Configuration Management
▪ Change Management
▪ Capacity Management
▪ Availability Management
▪ Service Level Management
▪ Financial Management

Implementation
▪ Highly dependent on the maturity of incident management processes and tools as these
  help to identify problems

Metrics
▪ Total number of problems registered within a given period
▪ % of problems that were resolved within SLA targets (and its inverse)
▪ # and % of problems for which more time was needed to resolve them
▪ Backlog of outstanding problems and trend
▪ Average $ of handling a problem
▪ # of problems outstanding according to classification
▪ % of successful major problem reviews
▪ # of known errors added to the KEDB
▪ Accuracy % of KEDB (from DB checks)
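
Pareto analysis is one of the diagnosis techniques listed in the Problem Management cut-sheet. A minimal Python sketch of the idea follows: it ranks incident categories by frequency and keeps the few that account for most of the volume, which are natural candidates for problem investigation. The `pareto` function, the 80% cutoff, and the category names are assumptions for illustration.

```python
from collections import Counter


def pareto(categories, cutoff=0.8):
    """Return the most frequent categories that together cover `cutoff` of incidents."""
    counts = Counter(categories)
    total = sum(counts.values())
    selected, covered = [], 0
    for category, count in counts.most_common():
        if covered / total >= cutoff:
            break
        selected.append((category, count))
        covered += count
    return selected


# Made-up incident categories, for illustration only.
incidents = ["email"] * 50 + ["vpn"] * 30 + ["printer"] * 12 + ["badge"] * 8
top = pareto(incidents)
```

In this sample, two of the four categories account for 80 of the 100 incidents, so they would be examined first for an underlying problem.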
