Capacity Management for SAN
1. Metron
Capacity Management
for SAN Attached Storage
Warning: Low Disk Space
2. Metron-Athene
• Established 1986
• Stable ownership
• Consistent Focus on CM
• Industry Leadership
www.metron-athene.com
3. Athene
z/OS, HP-UX, AIX, Solaris, Linux
Data Source
Acquire Framework DB/Application
Virtual Server Custom
Control Center
Capacity Database
4. Objectives
• Trends in storage technology.
• Define two distinct aspects of storage capacity.
• Examine key areas related to capacity management of SAN
attached storage.
• Equate with business value.
• Show how tools like Athene can help you achieve your goals.
• Provide ideas about how to proceed with improving storage
capacity management processes in your environment.
5. Trends
• Solid state devices
• Cloud storage
• Embedded storage (e.g. Exadata, vBlock)
• Big data (e.g. Hadoop)
• Tiered storage
• Primary de-duplication
• FCoE, 16 Gbps Fiber, and 10 Gbps Ethernet
6. Two Distinct Aspects of Storage Capacity
Disk Performance Capacity
Response, IOPs
Disk Space Capacity
Bytes
7. Space Capacity – Growth (measurable)
Changing demands for storage – Slope of line
8. Space Capacity - History
Growth can result in increasing cost and complexity
9. Space Capacity – Growth and Cost Factors
Growth
• Business as usual (Trend)
• Acquisitions
• New applications and projects
Costs
• Equipment, including power
• Resource management, including people
• Storage use by application (Billable Customers)
10. Space Capacity – Storage as a Service
How much are customers consuming?
Don’t forget about the IT department and other insiders!
11. Space Capacity – Tiered Service Model
Define what tiers are (platinum, gold, silver, etc…)
Rates should be adjusted on a frequent basis.
Estimate growth versus storage cost declines.
Billing is an effective way to create accountability.
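The tiered billing idea can be sketched as a small calculation. The tier names come from the slide; the rates, the customer names, and the `monthly_charge` helper are hypothetical and only illustrate the shape of a chargeback report.

```python
# Hypothetical monthly rates in dollars per GB for each tier (illustrative values only).
RATES_PER_GB = {"platinum": 0.50, "gold": 0.30, "silver": 0.10}

def monthly_charge(allocations_gb, rates=RATES_PER_GB):
    """Return the total monthly charge for one customer.

    allocations_gb maps a tier name to the GB allocated in that tier.
    """
    return sum(gb * rates[tier] for tier, gb in allocations_gb.items())

# Internal consumers such as IT are listed too, so unbilled storage stays visible.
customers = {
    "sales": {"gold": 2000, "silver": 5000},
    "it_department": {"platinum": 500, "silver": 1000},
}

charges = {name: monthly_charge(alloc) for name, alloc in customers.items()}
for name, charge in charges.items():
    print(f"{name}: ${charge:,.2f}")
```

Because rates are meant to be adjusted frequently, keeping them in one table (here `RATES_PER_GB`) makes a rate change a one-line edit.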
12. Space Capacity – Management Support
Effective storage management happens with a bridge to
business results, and building that bridge begins with a solid
foundation. Make the business value self-evident.
13. Space Capacity – Business View
With management backing, important processes can be implemented
Business IT
• Capacity budgeting and
inventory management
• Mandatory storage
request process
• Storage mapping to
determine ownership
• Chargeback of some form
• Define executive reporting
requirements
Once the bridge is built, reporting information can flow freely
14. Space Capacity – Who is Responsible
Managing storage capacity requires work.
Storage administrators typically have limited time and
higher priorities in their complex environments.
15. Space Capacity – Over and Under Provisioning
Administrators may have no choice but to over-allocate,
which results in low utilization.
It is important to define
exactly what ‘Utilization’ is
for your storage.
Many factors determine
what ‘Right Sized’ means
for each system.
But, running out of space
means only one thing to all.
16. Space Capacity – Doing the Technical Work
After roles and responsibilities are assigned and business
requirements are complete, technical solutions can be implemented
to optimize storage space management, including databases.
Trending, forecasting, and exceptions.
17. Space Capacity – Different Viewpoints
Business, Application, Host, Storage Array, Billing Tier
If billing for storage, ensure transparency with detailed reports
18. Space Capacity – Virtual Environments and Clusters
Managing storage in a clustered and/or virtual environment can be challenging
because it is shared among all hosts and virtual machines running on it.
• Manage capacity at a high level
• Account for storage use at a low
level, e.g. VM or DB
• If billing, be cautious of different
tiers being allocated to the same
cluster.
• Don’t forget about overhead
Overcommit with thin provisioning
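A minimal sketch of how overcommit from thin provisioning could be tracked. The figures and the function names `overcommit_ratio` and `physical_utilization` are assumptions for illustration; they are not taken from Athene or any vendor tool.

```python
def overcommit_ratio(provisioned_gb, physical_gb):
    """Capacity promised to consumers divided by capacity that physically exists."""
    return sum(provisioned_gb) / physical_gb

def physical_utilization(written_gb, physical_gb):
    """Fraction of the physical datastore that has actually been written."""
    return sum(written_gb) / physical_gb

# Hypothetical datastore shared by four virtual machines.
provisioned = [500, 500, 1000, 2000]   # thin-provisioned sizes presented to the VMs
written = [120, 300, 450, 900]         # space actually consumed by each VM
physical = 2500                        # physical size of the shared datastore

ratio = overcommit_ratio(provisioned, physical)
used = physical_utilization(written, physical)
print(f"overcommit ratio: {ratio:.2f}, physical utilization: {used:.1%}")
```

A ratio above 1.0 means more space has been promised than exists, so the trend to watch is the physical utilization rather than the provisioned sizes.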
19. Space Capacity – Storage Virtualization
Pooling physical storage from multiple sources into logical groupings
• Simplifies Administration
• Can be a centralized source for
collecting data
• If using it as a data source, beware
of double counting with the backend
• Don’t forget about overhead for
replication
There is a wide variety of techniques for virtualizing storage. Be aware of
the implications for data collection and reporting
20. Space Capacity – Best Practices
Find dark and hidden storage, where it has been
allocated and never used, or plugged into a different box.
Use thin provisioning and de-duplication
where possible.
Include data retention policies for
storage space management.
Account for overhead from
RAID, replication, file systems, etc…
Understand the value of data in deciding where to put
it, how to protect it, and how long to keep it.
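The overhead point can be made concrete with a rough usable-capacity estimate. The RAID fractions below are the standard ones for each level; the 5% file system overhead and the example disk counts are assumptions, and replication would reduce the figure further.

```python
def usable_fraction(raid_level, disks):
    """Fraction of raw capacity left after RAID protection."""
    if raid_level == "raid10":
        return 0.5                      # every block is mirrored
    if raid_level == "raid5":
        return (disks - 1) / disks      # one disk's worth of parity
    if raid_level == "raid6":
        return (disks - 2) / disks      # two disks' worth of parity
    raise ValueError(f"unsupported RAID level: {raid_level}")

def usable_tb(disks, disk_tb, raid_level, fs_overhead=0.05):
    """Usable TB after RAID and an assumed file system overhead fraction."""
    raw = disks * disk_tb
    return raw * usable_fraction(raid_level, disks) * (1.0 - fs_overhead)

# Hypothetical group: eight 2 TB disks in RAID 6.
raw_tb = 8 * 2.0
net_tb = usable_tb(disks=8, disk_tb=2.0, raid_level="raid6")
print(f"raw {raw_tb:.1f} TB -> usable {net_tb:.1f} TB")
```

Stating which of these figures a "utilization" percentage is measured against avoids comparing numbers that use different denominators.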
21. Space Capacity – Best Practices
Understand the limitations of linear regression when trending
and forecasting data. Use statistics like R^2 to confirm.
Be sure to account for all
variables when ‘Right Sizing’!
Include directory and file level
reporting for file servers if possible.
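A sketch of trending with ordinary least squares, reporting R^2 alongside the forecast so that a poor fit is visible. The sample data, the 60 TB capacity, and the helper names are hypothetical.

```python
def linear_fit(xs, ys):
    """Least-squares line through (xs, ys); returns slope, intercept, and R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared

def days_until_full(slope, intercept, capacity):
    """Day on which the trend line reaches capacity, or None if usage is not growing."""
    if slope <= 0:
        return None
    return (capacity - intercept) / slope

# Hypothetical samples: day number and TB used.
days = [0, 30, 60, 90, 120]
used_tb = [40.0, 42.1, 43.9, 46.2, 48.0]

slope, intercept, r_squared = linear_fit(days, used_tb)
full_day = days_until_full(slope, intercept, capacity=60.0)
print(f"growth {slope:.3f} TB/day, R^2 {r_squared:.3f}, full around day {full_day:.0f}")
```

A low R^2 suggests the growth is not well described by a straight line, for example after an acquisition or a new project, and the forecast should then be treated with caution.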
22. Performance Capacity – Response Impacts
SAN or storage array performance problems can have serious
impacts over a long duration, and be difficult to identify.
23. Performance Capacity – Metrics
Understand the limitations of certain metrics
• Measured response is the best metric
for identifying trouble.
• Host utilization only shows busy time,
it doesn’t give capacity for SAN.
• Physical IOPs is an important
measure of throughput. All disks have
their limits.
• Queue Length is a good indicator that
a limitation has been reached
somewhere.
24. Performance Capacity – Metric Thresholds
Many times critical host disk metrics are not
breached during impactful events.
Consider using
Statistical
Process Control
Are these potential problems having a real impact?
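One simple form of statistical process control is to derive limits from a baseline period and flag samples outside them. The sketch below assumes limits at three standard deviations from the baseline mean; the response times are made-up values.

```python
from statistics import mean, stdev

def control_limits(baseline, sigmas=3.0):
    """Lower and upper control limits derived from a baseline sample."""
    center = mean(baseline)
    spread = stdev(baseline)
    return center - sigmas * spread, center + sigmas * spread

def out_of_control(samples, limits):
    """Return (index, value) pairs for samples outside the control limits."""
    lower, upper = limits
    return [(i, v) for i, v in enumerate(samples) if v < lower or v > upper]

# Hypothetical disk response times in milliseconds.
baseline_ms = [5.1, 4.9, 5.3, 5.0, 4.8, 5.2, 5.1, 4.9, 5.0, 5.2]
new_ms = [5.0, 5.4, 9.7, 5.1]

limits = control_limits(baseline_ms)
exceptions = out_of_control(new_ms, limits)
print(f"limits: {limits[0]:.2f} to {limits[1]:.2f} ms, exceptions: {exceptions}")
```

A 9.7 ms sample would not breach a typical fixed threshold such as 20 ms, yet it is far outside what is normal for this disk, which is the kind of event a static threshold misses.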
25. Performance Capacity – Metric Thresholds (Host)
Other times certain metrics like utilization are indicating
impactful events, but ample capacity is still available.
26. Performance Capacity – Metric Thresholds (Host)
Queue lengths from the previous utilization indicate that it may
not currently be impacting response, but headroom is unknown.
27. Performance Capacity – Metric Thresholds (Host)
The high utilization can be seen generating large amounts
of I/O in this chart.
28. Performance Capacity – Architecture (Array)
• Front End Processors
• Shared Cache
• Back End Processors
• Disk Storage
29. Performance Capacity – Metric Thresholds (Array)
Front end processors are typically the first to bottleneck
30. Performance Capacity – Metric Thresholds (Array)
Impact of utilization on response for a single processor
Curves based on simple queuing with normal distribution
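The shape of such a curve can be reproduced with the textbook single-server formula R = S / (1 - U). This is an assumption: the formula is the M/M/1 result, which presumes exponentially distributed arrivals and service times, and it may not be the exact model behind the slide's curves. The 5 ms service time is illustrative.

```python
def response_time(service_time_ms, utilization):
    """Single-server queueing response time: R = S / (1 - U)."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in the range [0, 1)")
    return service_time_ms / (1.0 - utilization)

# Hypothetical processor with a 5 ms service time.
curve = {u: response_time(5.0, u) for u in (0.1, 0.5, 0.7, 0.9, 0.95)}
for u, r in curve.items():
    print(f"utilization {u:.0%}: response {r:.1f} ms")
```

The response time doubles by 50% utilization and then climbs steeply, which is why headroom shrinks much faster than the utilization figure alone suggests.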
31. Performance Capacity – Component Breakdown
Service time versus response time – different metrics
32. Performance Capacity – Workload Profiles
I/O profile has a big impact on performance. Be sure to
include it when comparing applications.
Test with tools like Iometer, IOzone, Bonnie, etc…
35. Performance Capacity – Best Practices
• Choose service levels and establish baselines.
• Use available data sources, vendor utilities, etc…
• Consolidate reporting tools and data. (Athene)
36. Storage Capacity – Final Thoughts
• Talk with storage team about current state of reporting and fill in the gaps.
• Fabric and network utilization might be in scope.
• Set priorities for where to spend time and effort.
• Simplify where possible.
• Work to establish formal naming conventions where needed.
• Tools won’t help without knowledge, experience, and commitment.
37. Storage Capacity – Thank you for attending
Capacity Management
for SAN Attached Storage
Dale Feiste
Metron-Athene Inc.
dale@metron-athene.com
Editor's notes
- A good first step to implementing effective capacity management for SAN attached storage is to ensure that you are managing the non-SAN specific aspects of storage first. A second important step is recognizing what limitations and gaps exist from the host perspective.
Keep in mind the level at which disk space runs out (e.g. file systems, drives, volumes, etc…). Typically this is where monitoring is configured, but it can be proactive. Also remember that multiple I/O requests can be in flight at the same time just like other networking protocols, controlled by queue depth settings.
- Aggregate data to the appropriate level for reporting to a given audience.
Highlight storage for IT, unknown, and other unbillable storage. If customers have a blank check they will consume a lot more storage. Having many tools that all consume data can add up. Athene consolidates your data for capacity management. Make sure all allocated storage has an owner.
All storage is not created equal. Opposing forces of growth and decreasing cost of storage. If costs stop decreasing, like CPU speeds stopped increasing, look out. Physical limits can be reached for storage density. The primary focus of billing is creating accountability first, rather than ensuring exact financial accounting of real costs. Yeah, it may not be all real, but it’s better than an open checkbook.
Ideally you could do a business study, then create a business plan based on those results (i.e. cost/benefit analysis). Need a compelling story to generate interest.
- How much storage can administrators manage? It depends on many factors.
Are we talking utilization on the host or SAN side? Does it include overheads for file systems, RAID, DR, etc…? Right size for backups, growth, variability, etc… Start with most important low hanging fruit.
Proactive management with automated trending. Be aware that fighting fires is more glamorous and visible. It’s easy to get buried with data, filter out the noise with exceptions and filters (10% of 10GB vs. 10% of 1TB). All trend lines are not created equal.
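The 10% of 10GB versus 10% of 1TB point suggests an exception rule that looks at both the percentage free and the absolute space free. A minimal sketch, with hypothetical thresholds and file systems:

```python
def needs_attention(size_gb, used_gb, min_free_pct=10.0, min_free_gb=50.0):
    """Flag a file system only when free space is low in both relative and absolute terms."""
    free_gb = size_gb - used_gb
    free_pct = 100.0 * free_gb / size_gb
    return free_pct < min_free_pct and free_gb < min_free_gb

# Hypothetical file systems: (name, size in GB, used in GB).
file_systems = [
    ("/var", 10, 9.2),       # 8% free, under 1 GB left
    ("/data", 1000, 910),    # 9% free, but 90 GB left
    ("/home", 200, 100),     # 50% free
]

flagged = [name for name, size, used in file_systems if needs_attention(size, used)]
print(flagged)
```

Here `/data` is below the percentage threshold but still has 90 GB free, so it is filtered out as noise, while the small `/var` file system is reported.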
Storage vMotion in vSphere 5 will load balance based on datastore performance. Thin provisioning may not be appropriate in situations where delays for expanding storage are not acceptable.
Compare advantages of using virtual storage to distribute over more spindles versus specific placement, admin and performance. Mention types of in-band versus out-of-band virtualization. Host, SAN, and Array components required.
- How do you find dark and hidden storage? Compare allocated versus what shows up on hosts and asset management.
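That comparison can be sketched as a set difference between what the array reports as allocated and what the hosts actually see. The LUN names and sizes are hypothetical; in practice both lists would come from array and host inventory data.

```python
# Hypothetical inventory: LUNs allocated on the array, with size in GB.
allocated_on_array = {
    "lun-001": 500,
    "lun-002": 250,
    "lun-003": 750,
    "lun-004": 1000,
}

# Hypothetical inventory: LUNs that show up on at least one host.
seen_on_hosts = {"lun-001", "lun-003"}

# Dark storage: allocated on the array but not visible on any host.
dark = {lun: gb for lun, gb in allocated_on_array.items() if lun not in seen_on_hosts}
dark_total_gb = sum(dark.values())
print(f"dark LUNs: {sorted(dark)}, total {dark_total_gb} GB")
```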
- Also, proportion of samples over a threshold and variability.
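Both measures are simple to compute. In this sketch the variability is expressed as the coefficient of variation, which is one possible choice rather than the one the presenter necessarily used; the samples and the 20 ms threshold are made up.

```python
from statistics import mean, pstdev

def fraction_over(samples, threshold):
    """Proportion of samples strictly above the threshold."""
    return sum(1 for s in samples if s > threshold) / len(samples)

def coefficient_of_variation(samples):
    """Standard deviation relative to the mean, as a unitless measure of variability."""
    return pstdev(samples) / mean(samples)

# Hypothetical response-time samples in milliseconds.
response_ms = [4, 6, 5, 25, 5, 30, 6, 4, 5, 10]

over = fraction_over(response_ms, threshold=20)
variability = coefficient_of_variation(response_ms)
print(f"{over:.0%} of samples over 20 ms, coefficient of variation {variability:.2f}")
```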
It can also be in the reverse where the host looks okay, but there is an impact. Measured I/O response is the best way to determine what the OS is experiencing. Also, significant changes from normal can indicate problems.
If the line waiting for service increases, either your throughput or service time has increased. Queues don’t typically increase in a linear fashion, and things can fall apart quickly when this spikes up. Can be good for monitoring and diagnosis but not planning.
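The relationship between the queue, throughput, and time in the system is Little's Law: the average number of requests in the system equals throughput multiplied by response time. A short sketch with illustrative numbers (the function name is hypothetical):

```python
def avg_outstanding_ios(iops, response_ms):
    """Little's Law: average requests in the system = throughput x response time."""
    return iops * (response_ms / 1000.0)

# Hypothetical device: 2000 IOPs at 5 ms, then the same throughput at 20 ms.
normal = avg_outstanding_ios(2000, 5.0)
degraded = avg_outstanding_ios(2000, 20.0)
print(f"outstanding I/Os: normal {normal:.0f}, degraded {degraded:.0f}")
```

At constant throughput, a fourfold rise in response time shows up as a fourfold rise in outstanding I/Os, which is why a growing queue is a useful diagnostic signal.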
- Individual disks may go to completely different areas of backend storage. An impact in one area can be traced back through to the root problem.