Essential elements of data center facility operations
1. Essential Elements of Data
Center Facility Operations
Schneider Electric
Data Center Science Center
White Paper 196
Schneider Electric – Data Center Science Center WP 196 Presentation – February 2014
2. According to the Uptime Institute’s analysis of its “abnormal incident”
reporting (AIR) database, 70% of data center outages are directly
attributable to human error. This figure highlights the critical
importance of an effective operations and maintenance (O&M) program.
This presentation describes unique management principles and provides
a comprehensive, high-level overview of the program elements needed
to operate a mission critical facility efficiently and reliably
throughout its life cycle. Practical management tips and advice
are also given.
3. Introduction
Importance of operations and maintenance (O&M) program
• Most facility outages attributable to human (operator) error
• Majority of data center facility TCO is in OPEX, not CAPEX; OPEX is where the
greatest potential cost savings reside
• Largest portion of OPEX is energy costs, which are rising
• Drive for energy efficiency is reducing capacity safety margins and system
redundancy, increasing the importance of proactive
maintenance and data center infrastructure
management (DCIM)
• High levels of facility automation and equipment
performance data have created new opportunities
for enhancing reliability while reducing costs,
when properly managed
4. Mission Critical Mentality
Failure is not an option
● Focuses on risk mitigation
● Grasps interconnectedness of facility
and IT systems
● Data center availability is paramount
● Highly complex, fast-paced changes
in mission critical facility
● Challenging to manage
● Unique outside pressures
● Government regulations
● Customer audits
NOTE: In this paper, only system planning is covered. System planning refers to the power, cooling, racks,
and other support infrastructure systems. Planning related to the IT equipment is not discussed here.
5. Mission Critical Mentality
Code of Conduct
“Mission Critical Mindset” principles | Impact
Focused on risk mitigation in all operational and maintenance activities, work processes, and procedures | Proactively deals with all potential threats to system availability and worker/occupant safety
Acting with confidence and patience that is an outgrowth of careful planning and preparation | Prevents risks from becoming problems; enables faster response times and fewer errors if problems do arise
Analytical, process-driven approach to risk avoidance and problem solving | Helps identify and mitigate risk in complex environments; ensures predictable and safe operation
Comprehensive understanding of the function and interconnectedness of facility systems and components | Quickly identify and resolve potential threats or actual problems; avoid or reduce system downtime
Commitment to continuous learning and process improvement | Increases skills and operational efficiency to maintain an edge in a constantly changing environment
6. 12 Essential Elements of an O&M Program
Environmental Health and Safety
● Key components include
● Injury, illness prevention
● Electrical safety
● Hazard analysis
● Hazard communication
7. 12 Essential Elements of an O&M Program
Environmental Health and Safety
Key Program Attributes | Description
Safety plans and training | Written safety plans must be established that describe the safe work practices and procedures to be observed by all workers. Regular training on the program elements must also be conducted.
Hazard analysis | All operational procedures shall start with an analysis of the possible hazards involved. Risks must be identified and safety measures assigned.
Lockout/tagout procedures | Proper procedures to prevent the unexpected energizing or startup of machines or equipment, or the release of stored energy, shall be used when servicing or maintaining equipment.
Personal protective equipment (PPE) | Appropriate protective equipment should be provided, properly sized, stored, maintained, and utilized as required to mitigate identified safety hazards.
Hazardous material handling | Hazardous materials must be properly identified, labeled, stored, maintained, and used in conformance with manufacturer’s requirements, local laws, and ordinances.
Hazard communications program | Includes a list of hazardous chemicals, use of material safety data sheets (MSDS), proper labeling of all hazardous materials containers, and employee training on use of and protection from hazardous materials.
Compliance with all applicable health and safety laws and regulations | Requirements will likely vary by region and by level of government (e.g., local, state, federal).
8. 12 Essential Elements of an O&M Program
Personnel Management
● Hiring and training
● Competent, team-oriented people with
mission critical mentality
● Well-rounded team
● Develop staffing model
● Clearly defined roles and responsibilities
9. 12 Essential Elements of an O&M Program
Emergency Preparedness and Response
● Develop emergency operating
procedures – EOPs – for all high-risk
failure scenarios
● Develop, rehearse escalation
procedures
● Conduct regular scenario drills
● Formal failure analysis for significant
facility events
See White Paper 199, “Data Center Emergency Preparedness and Response”, for
more information.
10. 12 Essential Elements of an O&M Program
Maintenance Management
● Key tasks
● Asset management
● Work order management
● Spare parts management
● Ensure continuous power and cooling performance
● Improved reliability with
● Good asset intelligence
● Proactive, preventative, and predictive
maintenance plan
● Results in
● More accurate maintenance budget
forecasts
● Minimized TCO and downtime
11. 12 Essential Elements of an O&M Program
Maintenance Management > Asset Management
● Accurate, consistent tracking of critical facility assets
● Computerized maintenance management system (CMMS)
● Record, track, and manage asset data and maintenance history
● Scope of service (SOS)
● Defines maintenance frequency, specific activities, number of man-hours
● Establishes standard for procurement of
● Service agreements
● Maintenance scheduling
● Procedure development
● Continuous program improvement
12. 12 Essential Elements of an O&M Program
Maintenance Management > Asset Management
● Recommended asset management information
● Type - top level classification (e.g. electrical,
mechanical, fire system)
● Sub-type (e.g. PDU, UPS, CRAH)
● Text description of asset
● Make - asset manufacturer name
● Model - manufacturer model #
● Size or rating
● Location ID (room/area)
● Trade responsible for maintenance
● Manufacturer serial #
● Install date
● Warranty expiration date
● Date asset to be replaced
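The recommended fields can be sketched as a single record type. This is a minimal illustration and does not reflect the schema of any particular CMMS: the class name, field names, helper methods, and the sample values (manufacturer "ExampleCo", model "EX-500", serial "SN-0001", and all dates) are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class AssetRecord:
    """One critical facility asset, using the recommended fields (names are assumptions)."""
    asset_type: str                 # top-level classification, e.g. "electrical"
    sub_type: str                   # e.g. "PDU", "UPS", "CRAH"
    description: str                # free-text description of the asset
    make: str                       # manufacturer name
    model: str                      # manufacturer model number
    rating: str                     # size or rating, kept as text, e.g. "500 kVA"
    location_id: str                # room/area identifier
    trade: str                      # trade responsible for maintenance
    serial_number: str              # manufacturer serial number
    install_date: date
    warranty_expiration: Optional[date] = None
    replacement_date: Optional[date] = None

    def under_warranty(self, today: date) -> bool:
        # True while a warranty date is recorded and has not yet passed.
        return self.warranty_expiration is not None and today <= self.warranty_expiration

    def due_for_replacement(self, today: date) -> bool:
        # True once the planned replacement date has been reached.
        return self.replacement_date is not None and today >= self.replacement_date


# Hypothetical sample asset; every value here is made up for illustration.
ups = AssetRecord(
    asset_type="electrical",
    sub_type="UPS",
    description="UPS feeding data hall A",
    make="ExampleCo",
    model="EX-500",
    rating="500 kVA",
    location_id="ROOM-101",
    trade="electrician",
    serial_number="SN-0001",
    install_date=date(2012, 6, 1),
    warranty_expiration=date(2015, 6, 1),
    replacement_date=date(2027, 6, 1),
)

print(ups.under_warranty(date(2014, 2, 1)))
print(ups.due_for_replacement(date(2014, 2, 1)))
```

Keeping warranty and replacement dates on the same record as the identification fields is one way to let budget forecasts and maintenance history be queried from a single source.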
13. 12 Essential Elements of an O&M Program
Maintenance Management > Work Order Management
● Tool for service process management
● Allows work to be
● Correctly prioritized
● Assigned the right resources
● Completed on schedule
● Standalone ticketing system OR
● Integrated work order module in a
CMMS or DCIM system
● Provide valuable information to facility personnel
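As a rough sketch of what "correctly prioritized, assigned the right resources, completed on schedule" could look like in a ticketing tool, the example below orders open work orders and flags those that are unassigned or overdue. The `WorkOrder` type, the priority scale (1 = most urgent), the function names, and the sample orders are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class WorkOrder:
    order_id: str
    task: str
    priority: int                       # assumed scale: 1 = most urgent
    due: date
    assignee: Optional[str] = None      # None means no resource assigned yet
    completed_on: Optional[date] = None


def open_queue(orders: List[WorkOrder]) -> List[WorkOrder]:
    """Open orders, most urgent first; ties are broken by earliest due date."""
    pending = [o for o in orders if o.completed_on is None]
    return sorted(pending, key=lambda o: (o.priority, o.due))


def needs_attention(orders: List[WorkOrder], today: date) -> List[str]:
    """IDs of open orders that are unassigned or past their due date."""
    return [o.order_id for o in open_queue(orders)
            if o.assignee is None or o.due < today]


# Hypothetical sample data.
orders = [
    WorkOrder("WO-1", "Quarterly UPS battery inspection", 2, date(2014, 2, 20), "tech-a"),
    WorkOrder("WO-2", "Replace failed CRAH fan", 1, date(2014, 2, 10)),
    WorkOrder("WO-3", "Annual PDU thermal scan", 3, date(2014, 1, 31), "tech-b",
              completed_on=date(2014, 1, 30)),
    WorkOrder("WO-4", "Generator load test", 2, date(2014, 2, 5), "tech-a"),
]

today = date(2014, 2, 12)
print([o.order_id for o in open_queue(orders)])
print(needs_attention(orders, today))
```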
14. 12 Essential Elements of an O&M Program
Maintenance Management > Spare Parts Management
● Shortens mean time to recovery (MTTR)
● Inventory should include parts with lead times longer than acceptable
downtime
● Maintain spare parts list
● Stock frequently used items
● Re-evaluate annually
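The stocking rule above (hold parts whose lead time exceeds acceptable downtime, plus frequently used items) can be expressed as a simple filter. This is a sketch only: the `Part` type, the hour-based units, the 4-hour threshold, and the sample catalog are assumptions and not figures from the white paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Part:
    name: str
    lead_time_hours: float          # supplier lead time, in hours (assumed unit)
    frequently_used: bool = False


def parts_to_stock(parts: List[Part], acceptable_downtime_hours: float) -> List[str]:
    """Names of parts to hold on site.

    A part is stocked if it is used frequently, or if waiting for delivery
    would take longer than the downtime the business can accept.
    """
    return [p.name for p in parts
            if p.frequently_used or p.lead_time_hours > acceptable_downtime_hours]


# Hypothetical catalog; lead times are invented for illustration.
catalog = [
    Part("UPS power module", lead_time_hours=336),
    Part("Air filter", lead_time_hours=24, frequently_used=True),
    Part("Door gasket", lead_time_hours=2),
]

print(parts_to_stock(catalog, acceptable_downtime_hours=4))
```

Re-running the filter with updated lead times and usage flags is one way to carry out the annual re-evaluation.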
15. 12 Essential Elements of an O&M Program
Change Management
● Method of Procedure (MOP)
process
● Detailed checklist of
specified tasks
● MOP helps control work
activity along with
● Operational procedure
development and review
● Risk analysis and
communication
● Structured work practices
● Vendor/contractor
supervision
16. 12 Essential Elements of an O&M Program
Documentation Management
● Facilitates development of
● Accurate procedures
● Proper training
● Workplace safety
● Process improvement
● Document management software application
● System to keep critical infrastructure records
organized, up-to-date
● Detailed checklist of specified tasks
● Manual process can also work
17. 12 Essential Elements of an O&M Program
Training
● Establish training program that organizes operational and maintenance
tasks into categories
● Mapped to capability levels – basic, intermediate, advanced
● Train and evaluate personnel to certify them
● Require annual recertification exams
● Ongoing education keeps personnel current
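One way to tie certification level to tasking is to check, before assigning work, that a technician's certification is both current and at or above the level the task requires. The sketch below assumes a 365-day recertification interval as a stand-in for "annual"; the type names, function names, and sample dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import IntEnum


class Level(IntEnum):
    BASIC = 1
    INTERMEDIATE = 2
    ADVANCED = 3


# Assumption: "annual" recertification is modeled as a 365-day validity window.
RECERT_INTERVAL = timedelta(days=365)


@dataclass
class Certification:
    technician: str
    level: Level
    certified_on: date

    def is_current(self, today: date) -> bool:
        # Valid from the certification date until the interval has elapsed.
        return timedelta(0) <= today - self.certified_on <= RECERT_INTERVAL


def may_perform(cert: Certification, task_level: Level, today: date) -> bool:
    """True if the certification is current and covers the task's level."""
    return cert.is_current(today) and cert.level >= task_level


# Hypothetical technician record.
cert = Certification("tech-a", Level.INTERMEDIATE, date(2013, 9, 1))

print(may_perform(cert, Level.INTERMEDIATE, date(2014, 2, 1)))
print(may_perform(cert, Level.ADVANCED, date(2014, 2, 1)))
print(may_perform(cert, Level.BASIC, date(2014, 10, 1)))
```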
18. 12 Essential Elements of an O&M Program
Infrastructure Management
● System to match facility resources with changing IT requirements
● Prevent downtime
● Improve resiliency
and response
● Reduce operating
expenses
● Provide a sound
basis for capacity
planning decisions
● Three key tasks
● Facility monitoring
● Capacity management
● IT/Facilities integration
19. 12 Essential Elements of an O&M Program
Quality Management
● Key components
● Quality Assurance (QA): Typified by process and procedure
standardization
● Quality Control (QC): Quality checks, inspections, and audits
● Continuous Quality Improvement
20. 12 Essential Elements of an O&M Program
Energy Management
● Energy typically the single
largest data center expense
● 3 core tasks of an effective
energy management program
● Performance benchmarking
● Efficiency analysis
● Strategic energy sourcing
● Optimized energy sourcing
● Reduce exposure to price volatility
● Secure pricing that fits budget and business objectives
21. 12 Essential Elements of an O&M Program
Financial Management
● Financial-related issues can impact facility’s
day-to-day availability and resiliency
● Processes should focus on
● Purchasing
● Invoice matching
● Financial reporting/analysis
● Facility managers and purchasing department
should maintain close relationship
22. 12 Essential Elements of an O&M Program
Performance Monitoring and Review
● Regularly monitor and review facility
performance
● Determines health and effectiveness
of O&M program
● Shows where it is trending
● Quality process should incorporate
facility KPIs
● Benefits
● Aligns operational activities with
business goals
● Positive reinforcement for innovation
and process improvement
23. Common Mistakes
Common Mistakes | Description
Maintenance program is not driven by metrics | Often the result of poor asset management. No linkage made between break/fix maintenance activities and preventative maintenance.
Poor training | Training is not formalized and/or is not taken seriously. Over-reliance on technician “shadowing”. No linkage between certification level and tasking.
Ineffective change management | Inadequate risk analysis. Poor or non-existent procedures. No defined process for performing critical work tasks.
Failure to consistently test & evaluate skills | Existing skills/training level not formally evaluated. Scenario drills are not employed. Incident and drill results are not evaluated.
Poor documentation | No coherent sequence of operations. Drawings and schedules are outdated. Lack of revision control and/or lack of digitization.
Failure to develop and implement a quality control system | Lack of governance or resources to measure, monitor, and review performance.
Stuck in manual mode | Failure to implement CMMS, EDMS, DCIM, etc.
Overconfidence | Assumption that future performance can be predicted by past experience.
24. Facility Operations Services
Using Outside Vendors for O&M Programs
● Offer services for both existing and new data centers
● Advise on
● Develop
● Implement
● Operate
See White Paper 198, “How to Write an Effective RFP for Data Center Facility
Operations Services”, for more information.
25. 12 Essential Elements of an O&M Program
Performance Monitoring and Review > Recommended Facility KPIs
● Critical load uptime
● Load redundancy
maintained
● Support system uptime
● Safety policy and procedure
adherence
● Procedure development,
management and use
● Maintenance completion
● Staffing coverage
● Security policy
conformance
● Emergency preparedness
drills
● Emergency response
procedure adherence
● Quality control/improvement
● Training compliance
● Process improvement
● Operational reporting
● Proper event notification and
escalation
● Timely and accurate cost reporting
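The slide names the KPIs without defining how they are calculated. As an illustration, two of them (critical load uptime and maintenance completion) could be computed as simple percentages. The formulas, function names, and sample figures below are assumptions, not definitions from the white paper.

```python
def uptime_percent(period_hours: float, downtime_hours: float) -> float:
    """Share of the reporting period during which the critical load was up."""
    if period_hours <= 0:
        raise ValueError("period_hours must be positive")
    if not 0 <= downtime_hours <= period_hours:
        raise ValueError("downtime_hours must be between 0 and period_hours")
    return 100.0 * (period_hours - downtime_hours) / period_hours


def completion_percent(completed: int, scheduled: int) -> float:
    """Share of scheduled maintenance tasks that were completed."""
    if scheduled <= 0:
        raise ValueError("scheduled must be positive")
    return 100.0 * completed / scheduled


# Hypothetical year: 8760 hours with 30 minutes of critical load downtime,
# and 47 of 50 scheduled maintenance tasks completed.
print(round(uptime_percent(8760, 0.5), 4))
print(completion_percent(47, 50))
```

Whatever definitions a facility adopts, computing each KPI the same way every reporting period is what makes the trend meaningful.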
26. Conclusion
● Efficient Operations & Maintenance program
● Mitigates threats, effects of human error
● Focus on 12 essential elements of O&M program
● Must have facilities operation team with “mission critical” mindset
● Operational philosophy focuses on
● Risk mitigation
● Preparedness
● Standardized processes
● Continuous improvement
27. Resources
Facility Operations Maturity Model for Data Centers, White Paper 197
How to Write an Effective RFP for Data Center Facility Operations Services, White Paper 198
Data Center Emergency Preparedness and Response, White Paper 199
Classification of Data Center Infrastructure Management (DCIM) Tools, White Paper 104
How Data Center Infrastructure Management (DCIM) Software Improves Planning and Cuts Operational Costs, White Paper 107
Avoiding Common Pitfalls of Evaluating and Implementing DCIM Software, White Paper 170
Browse all APC white papers
whitepapers.apc.com
Browse all APC TradeOff Tools™
tools.apc.com