In this session, Cormac Hogan and Julienne Pham of VMware take a comprehensive look at the setup, policy management, failure handling, and monitoring tools needed to perform a successful Proof of Concept, empowering attendees to implement their own VSAN POCs.
VMworld 2015: Conducting a Successful Virtual SAN Proof of Concept
1. Conducting a Successful Virtual SAN
Proof of Concept
Cormac Hogan, VMware, Inc
Julienne Pham, VMware, Inc
STO4572
#STO4572
2. • This presentation may contain product features that are currently under development.
• This overview of new technology represents no commitment from VMware to deliver these
features in any generally available product.
• Features are subject to change, and must not be included in contracts, purchase orders, or
sales agreements of any kind.
• Technical feasibility and market demand will affect final delivery.
• Pricing and packaging for any new technologies or features discussed or presented have not
been determined.
Disclaimer
CONFIDENTIAL 2
4. Agenda
1 Introduction to STO4572 Session
2 Introduction to Virtual SAN
3 Initial considerations for a proof of concept on VSAN
4 Tools available to conduct a successful proof of concept
5 POC validation scenarios
6 Measuring Performance
7 Moving from POC to Production
5. This Session…
• Virtual SAN has been available for 18 months
• VMware recognizes that conducting a Virtual SAN proof of concept can be challenging
• Since the launch of Virtual SAN, additional tools for managing, monitoring and troubleshooting
Virtual SAN have become available
• This session discusses the tools available to vSphere and Virtual SAN administrators and how they can help deliver a Virtual SAN proof of concept
• The session also covers considerations for moving Virtual SAN from POC to production
6. Unprecedented Customer Momentum
2000+ Customers in
the first 15 months
In my experience VMware solutions are
rock solid…we’re ready to nearly double
our VSAN deployment.
It really did work as advertised…the fact
that I have been able to set it and forget
it is huge!
7. Introduction to VMware Virtual SAN
• Storage scale out architecture
built into the hypervisor
• Aggregates locally attached storage
from each ESXi host in a cluster
• Dynamic capacity and
performance scalability
• Flash optimized storage solution
• Fully integrated with vSphere and interoperable:
• vMotion, DRS, HA, VDP, VR …
• VM-centric data operations
9. Before Considering a Virtual SAN PoC
• Accelerate
• Use Case
• Planning
• Outcome
10. Organization Challenges
Culture barrier
• The fear of what you do not know, and the lack of control and visibility
Storage team operations
• New methodology
• A new way to see things and operate
• Converged compute and storage
Support
• Single point of contact
• No vendor finger-pointing
11. Technical Requirements
Hardware
• EVO:RAIL, VSAN Ready Node, or do-it-yourself
• Uniform configuration
Networking
• Shared vs. dedicated network
• Distributed Switch vs. Standard Switch
• Multicast
Storage
• Controller choices
• RAID0 vs. pass-through
• SSD/HDD ratio choices
• Performance vs. endurance
• SAS expanders
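The SSD/HDD ratio choice above can be sketched numerically. A commonly cited VMware design guideline for hybrid VSAN is to provision flash cache equal to at least 10% of the anticipated *consumed* capacity (before FTT replication overhead); the function names below are illustrative, and the exact rule for your release should be confirmed in the VSAN design and sizing guide.

```python
# Rough flash-cache sizing sketch for a hybrid VSAN design.
# Assumes the commonly cited guideline of flash >= 10% of anticipated
# consumed capacity, measured before FTT overhead (an assumption to
# verify against the design guide for your VSAN version).

def consumed_capacity_gb(vm_count, avg_vmdk_gb, utilization=1.0):
    """Anticipated consumed capacity before FTT overhead, in GB."""
    return vm_count * avg_vmdk_gb * utilization

def required_cache_gb(consumed_gb, cache_ratio=0.10):
    """Minimum flash cache to plan for, in GB."""
    return consumed_gb * cache_ratio

usage = consumed_capacity_gb(vm_count=100, avg_vmdk_gb=50, utilization=0.7)
print(round(required_cache_gb(usage)))  # 350 GB of flash for 3.5 TB consumed
```

The point of the sketch: cache is sized from what the VMs actually consume, not raw datastore capacity, which is why thin provisioning and expected utilization matter in a POC bill of materials.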
12. What I Need to
Be Successful
Tools to conduct a successful Virtual SAN POC
13. Success Tool #1: Health Plugin
• Introduced with Virtual SAN 6.0
• Incorporated into the vSphere Web Client
• The Virtual SAN Health Check tool includes:
– General health
– Proactive tests
– Virtual SAN HCL health
– Physical disk health
• Especially useful for observing injected errors and verifying that they have been remediated
14. Success Tool #1: Health Plugin
• Proactive tests that run on the Virtual SAN cluster, useful before production:
– VM creation test
– Storage load test
– Multicast performance test
15. Success Tool #2: RVC/Virtual SAN Observer
• Native tools installed on the VCSA and on vCenter Server for Windows
• Used for configuration and status of the Virtual SAN cluster
• For on-demand performance and activity monitoring at:
– VM level
– Host level
– VMDK level
– HDD/SSD level
• Any anomalies show up with the metric in question highlighted in red
16. Success Tool #2: RVC/Virtual SAN Observer
Virtual SAN Operation
vsan.apply_license_to_cluster
vsan.enable_vsan_on_cluster
vsan.disable_vsan_on_cluster
vsan.clear_disks_cache
vsan.cluster_change_autoclaim
vsan.cluster_set_default_policy
vsan.enter_maintenance_mode
vsan.fix_renamed_vms
vsan.object_reconfigure
vsan.host_wipe_vsan_disks
vsan.recover_spbm
vsan.reapply_vsan_vmknic_config
Virtual SAN Information
Cluster
vsan.check_limits
vsan.check_state
vsan.cluster_info
vsan.cmmds_find
vsan.whatif_host_failures
vsan.resync_dashboard
Disk
vsan.disk_object_info
vsan.disks_info
vsan.disks_stats
Host
vsan.host_info
vsan.host_consume_disks
Networking
vsan.lldpnetmap
VM
vsan.vm_object_info
vsan.vm_perf_stats
vsan.vmdk_stats
vsan.obj_status_report
vsan.object_info
Virtual SAN Monitoring / Troubleshooting
vsan.support_information
vsan.observer
17. Success Tool #3: Virtual SAN Pack for vROps
• Integrates with the comprehensive vSphere monitoring software vRealize Operations 6.0.1
• Available with the Advanced or Enterprise edition
• Collects SSD/HDD disk performance across the cluster
• Collects SMART information
• Monitors information across multiple levels:
– Disk group
– Host
– Cluster
– Datacenter
18. Custom Dashboards
In the VSAN cluster:
• Disk group throughput
• SSD/MD information
• Capacity usage by host
19. Success Tool #4: Log Insight
• Built in with VMware vSphere
• Troubleshooting tool
• Log analytics tool
• Any Virtual SAN failure can be correlated between hosts and disk groups
• Tracks Virtual SAN operations
Storage – VSAN view
Storage – VSAN Interactive Analytic view
21. PoC Validation
• What are the most important validation tests?
1. Successful VSAN configuration
2. Successful VM deployments on the VSAN datastore
3. VM availability in the event of failures (host, storage device, network)
4. VSAN serviceability
5. VM performance meets expectations
22. Case #1 – Successfully Deploy VSAN
• Ensure correct vSphere versions
• Appropriate licenses are available (if PoC is going to take a long time)
• Ensure the network is in place. Remember the multicast requirement, so prep the network team.
• Minimum of three servers.
• Minimum of three servers contributing storage:
– At least one storage controller – check the HCL, verify drivers and firmware are valid
– At least one flash device (SSD, PCIe) for cache – make sure these are on HCL
– At least one magnetic disk or flash device for capacity – check the HCL
– Or consider VSAN Ready Nodes as an option …
Remember, the VSAN Health Check will do most of this work for you
23. Case #1 – Successfully Deploy VSAN
Check the Virtual SAN Health Check plugin regularly. Run it after every test! Also use it to make sure you fixed the problem you previously introduced.
24. Case #2: Successful VM Deployment
Use the Health Check to do initial VM deployment check
Part of the Proactive Tests: verifies that VMs can be created on the VSAN cluster.
25. Case #2: Successful VM Deployment
I created a new VM, but I am not sure where the VM is stored
Component host location
26. Case #3: VM Availability in the Event of Failures
• There are various failures that may be introduced as part of a typical POC
– Host failure
– Flash device / Magnetic Disk failure – Cache/Capacity failures
– Network failure
• The primary objective is to ensure that the VM continues to be available in
the event of a failure. This might mean the VM is restarted on another node
in the cluster.
• vSphere HA also has a role to play here. It is integrated with Virtual SAN.
27. Case #3.1: Host Failures
• How many hosts do I really need?
• A minimum of 3 hosts is needed to support VSAN
• What about rebuilding after a failure or maintenance mode operations?
• If you want virtual machines to remain highly available on VSAN during these scenarios, consider configuring additional capacity, i.e. a minimum of 4 nodes
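The host-count reasoning above can be sketched as a small formula. With VSAN's mirroring policy, an object tolerating FTT failures needs FTT+1 replicas plus witness components spread across 2*FTT+1 hosts, and one extra host leaves room to rebuild during a failure or maintenance window. The helper name is illustrative.

```python
# Minimum host count sketch for VSAN availability planning.
# RAID-1 mirroring with NumberOfFailuresToTolerate (FTT) spreads
# FTT+1 replicas plus witnesses across 2*FTT+1 hosts; one spare host
# gives rebuild headroom (the "minimum 4 nodes for FTT=1" advice).

def min_hosts(ftt, rebuild_headroom=True):
    hosts = 2 * ftt + 1
    return hosts + 1 if rebuild_headroom else hosts

print(min_hosts(1, rebuild_headroom=False))  # 3  (bare minimum)
print(min_hosts(1))                          # 4  (recommended for a POC)
```

Note the caveat from the slide: with only 3 hosts, a failed or evacuated node leaves nowhere to rebuild the missing replica, so the cluster runs without its fault tolerance until the host returns.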
28. Case #3.2: Storage Failures
• The Virtual SAN 6.0 Proof Of Concept Guide has details on how to inject temporary disk errors
for the purpose of testing
– A real disk failure results in immediate rebuild activity initiated by VSAN
Eject/offline/unplug → component state "Absent": wait 60 minutes before remediation
Device failure → component state "Degraded": immediate remediation
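The absent-versus-degraded behavior can be sketched as a tiny decision function. Per the speaker notes, the wait for absent components is governed by the ClomdRepairDelay timeout, 60 minutes by default; the function itself is an illustrative model, not a VSAN API.

```python
# Sketch of the VSAN component-failure handling described above.
# "Degraded" (a real device error) triggers an immediate rebuild;
# "absent" (device pulled or host offline) waits out ClomdRepairDelay
# in case the device returns, 60 minutes by default.

CLOMD_REPAIR_DELAY_MIN = 60  # default; a tunable advanced setting

def rebuild_delay_minutes(state):
    if state == "degraded":
        return 0                       # immediate remediation
    if state == "absent":
        return CLOMD_REPAIR_DELAY_MIN  # no rebuild until the timer expires
    raise ValueError(f"unknown component state: {state}")

print(rebuild_delay_minutes("degraded"))  # 0
print(rebuild_delay_minutes("absent"))    # 60
```

This is why an ejected disk shows no rebuild traffic for an hour while a genuinely failed disk resyncs right away; budget your POC test windows accordingly.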
29. Case #3.3: Network Failure
Part of the Proactive Tests: verifies that multicast performance is acceptable for the VSAN cluster.
Multicast configuration is the most common issue.
30. Case #3.4: Validating Rebuild Activity after Failure
• Virtual SAN might need to move data around in the background: policy change, host failure, long-term/permanent component loss, user-triggered reconfiguration, maintenance mode, etc.
• The UI Resync Dashboard shows the VMs that are resyncing and the remaining bytes to sync
Remember: test one thing at a time!
31. Case #4: VSAN Serviceability
I want to update one of my ESXi hosts in a VSAN cluster. What do I do?
VSAN provides multiple options for maintenance mode.
32. Case #4: VSAN Serviceability
Ensure Availability
– Loss of VM compliance
– Short maintenance window
– Short storage preparation
– Limited free storage space required
Full Data Migration
– Full VM data compliance
– More than one hour of maintenance
– Long storage preparation
– Free storage space required on the other nodes
No Data Migration
– No VM availability ensured
– Short maintenance window
– No impact on storage preparation or free space
33. Case #4: Management – Disks Serviceability
The disk serviceability feature enables identification of magnetic disks and flash devices that need to be replaced.
34. Case #4: Management – Disk/Disk Group Evacuation
• Allows you to evacuate data from disk groups and individual disks before removing a
disk/disk group from a Virtual SAN host
• Allows Virtual SAN to ensure all workloads stay fully compliant with their policy!
• Supported in the UI, ESXCLI and RVC
• Check box in the “Remove disk/disk group” UI screen
36. How to Test Performance…
• The distributed architecture of VMware Virtual SAN dictates that reasonable performance is
achieved when the pooled compute and storage resources in the cluster are well utilized
• This usually means a number of VMs each running the specified workload should be distributed
in the cluster and run in a consistent manner to deliver aggregated performance
• This part of an evaluation can be complex and time-consuming
• Real application workloads are best, but …
– synthetic workloads (IOmeter) might be easier to set up
– simplistic workloads don’t really reflect what Virtual SAN can do
• Worth a read: Pro Tips For Storage Performance Testing
– http://blogs.vmware.com/storage/2015/08/12/tips-storage-performance-testing/
37. Performance Testing Considerations
Is the test utilizing the distributed storage resources of Virtual SAN?
• Multiple VMs across multiple hosts will deliver better performance than a single VM on one host
Is the working set fully in cache, utilizing flash performance?
• Read-cache misses will incur latency
Is the workload cache friendly?
• Sustained sequential write workloads fill cache, which must then be destaged. Mixed R/W
workloads are best
Is the cache warmed?
• Initial results from starts of tests will not be reflective of overall performance
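The cache questions above can be turned into a quick feasibility check before a benchmark run. On hybrid VSAN, 70% of each cache device is used as read cache (the documented default split); a benchmark working set larger than that read cache guarantees read-cache misses and disk-bound latency. The helper name and example sizes are illustrative.

```python
# Back-of-the-envelope check: will the benchmark working set fit in
# the read cache? Hybrid VSAN dedicates 70% of a cache device to
# read cache by default (assumption: default split, hybrid config).
# A working set bigger than that means cache misses, skewing results
# against what a realistic (~5% of dataset) working set would see.

def read_cache_fits(working_set_gb, cache_device_gb, read_cache_frac=0.70):
    return working_set_gb <= cache_device_gb * read_cache_frac

# A 400 GB SSD gives roughly 280 GB of read cache on hybrid VSAN:
print(read_cache_fits(200, 400))  # True  - working set stays in cache
print(read_cache_fits(300, 400))  # False - expect read-cache misses
```

If the check fails, either shrink the benchmark working set to something representative or size the cache tier up; otherwise the POC numbers will not correspond to production behavior.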
38. Performance Considerations
• Application
– Single vs. multiple workers
– Working set size – is it all in cache?
– Sequential workloads versus random workloads – cache friendly workload?
– Outstanding I/Os – do you have a decent queue depth on the storage controllers?
– Block size – if synthetic, does it represent the typical application block size?
– Guest file system considerations – raw or not?
• VSAN
– Cache warm up considerations
– Number of magnetic disk drives/striping considerations
– Performance during failures and rebuild activity
39. Performance Test with IOmeter
• Do NOT forget to warm the SSD before your performance test
• First test:
– Single worker
– < 8 Outstanding I/O
– Write I/O Data Pattern will use repeating bytes
– 4KB I/O size
– 70% Read/30% Write
– 100% Random
• Consider moving, over time, to:
– multiple workers
– multiple VMs
– multiple hosts
– Increasing OIO – latency versus IOPS
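The "increasing OIO – latency versus IOPS" trade-off above follows Little's Law: throughput equals outstanding I/O divided by per-IO latency. A short sketch (illustrative function name) makes the relationship concrete when interpreting IOmeter results.

```python
# Little's Law applied to storage benchmarking: IOPS = OIO / latency.
# Raising outstanding I/O (queue depth) buys throughput at the cost
# of latency; pick the balance your application SLA actually needs.

def iops(outstanding_io, latency_ms):
    """Steady-state IOPS for a given queue depth and per-IO latency."""
    return outstanding_io / (latency_ms / 1000.0)

print(round(iops(8, 2.0)))   # 4000 IOPS at queue depth 8, 2 ms latency
print(round(iops(32, 4.0)))  # 8000 IOPS: more OIO, more IOPS, higher latency
```

This is why a single-worker, low-OIO first test (as recommended above) reports modest IOPS: it is measuring latency, not the aggregate throughput the cluster can sustain.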
40. Virtual SAN Health Check Plugin – Proactive Storage Tests
• Run this performance test in a non-production environment
• It will create ~10-20 VMDKs per host which will be distributed by VSAN onto physical disks
and then issue a synthetic IO workload on all VMDKs on all hosts in parallel
• A way to validate IOPS and bandwidth requirements
41. From PoC to Production
Day 2 Operation Considerations
42. Considerations
HA/DR
• Stretched Cluster
• Use of VR/SRM
Monitoring
• Set up alarms
• Use vROps
• vSAN Health Plugin
Operations
• Maintenance mode
• Workflow
• Third-party tools
• SSD/HD rebuild
Design for Scaling
• Scripted install
• Capacity planning
46. Case #4: Other Ways of Monitoring VSAN Activity
• VSAN Health Check Plugin
– Rerun tests and check if any of the many checks have failed
– Any check that fails will also generate an alarm (new in version 6.1)
– Links to VMware KB articles via AskVMware to assist with troubleshooting
• vRealize Operations Manager with the storage pack for VSAN
– Ships with a number of preconfigured dashboards
– Surfaces various events and warnings that are specific to VSAN
– Provides troubleshooting guidance
• vRealize Log Insight
– Examines logs from VSAN events as well as VSAN traces
47. Case #4: Monitoring VSAN Activity
Number of Virtual SAN clusters
Virtual machine objects
Top Virtual SAN issues
Virtual SAN alerts
VM information through vROps
48. Case #4 : Monitoring VSAN Activity
Magnetic disks used by this Virtual SAN cluster
Storage performance
Disk latencies through vROps
49. Case #4 : Observing VSAN Activity
Host disconnected from the network
Impact of failure on VSAN, along with recommendations on what to do next
Speaker notes
Most importantly since General Availability in March 2014, in just the first 9 months of selling we have over 1000 customers, including several brand names which will soon be added to this list.
Customers have been pleasantly surprised with how reliably VSAN has performed in their tier-1 production use cases; because it works as advertised customers are coming back to expand their VSAN deployments.
Introduced in early 2014.
Scale out architecture, starting with small, 3 node cluster and add nodes (ESXi hosts) as needed.
Uses local storage from each host – can have compute nodes but you need at least 3 nodes contributing storage
Full interop with vSphere features.
VM-centric data operations: mirroring, striping, cache reservation, pre-allocate disk space … all done on a per VM basis.
Works with most vSphere hardware.
Two VSAN types: all-flash and hybrid.
Are you prepared for the next generation of Storage?
What are you trying to do? << this is the important part! What do you want the PoC to achieve??? What will make a successful POC?
What type of applications are you planning to run on Virtual SAN?
What outcome are you trying to achieve?
Accelerate:
Are you ready to accelerate your business?
When you are looking at VSAN in a PoC, you are in a mindset to disrupt your business and to understand what VSAN is and how it will impact your business operations.
Use Case:
You want to achieve an objective – a View use case, less OpEx, more flexibility.
Planning:
Set aside time and resources to get a better understanding of the technology, or involve VMware in your PoC to save time.
You want to get the best out of the PoC in a short time frame to validate your objectives.
Outcome:
The most important part, as it will help you to:
Validate the PoC
Determine whether it failed or succeeded against your expectations
Provide a decision point for adoption
Virtual SAN is a disruptive technology and changes the business game.
So you will meet some barriers.
Culture barrier:
IT process: where is the barrier between the networking team, the storage team, and the vSphere team?
You have a new set of tools that you have to become familiar with.
The IT workflow for troubleshooting is different.
A new concept, a new adventure.
SAS expanders are now being qualified on a per ready-node basis.
Our initial goal with VSAN: use any components to build yourself a distributed storage solution. I am sure we would love nothing more than to just give our customers the VSAN software and let them deploy it on whatever combination of host, controller, and flash device that they want. In reality, this is simply not possible. We have found that there are too many inter-dependencies (and nuances in behavior) between controllers, drivers, driver firmware, magnetic disks, SSDs, PCIe flash devices, and flash device firmware for this to happen. Stuff that is just supposed to work, but doesn't. This is exactly why we started to qualify SAS expanders (and flash devices and driver and firmware versions). It's not that we're trying to be difficult; it's because we have encountered situations where these components "do something funky" and we want to protect our current customers (and future customers) from hitting these issues if they decide to roll out a VSAN solution. Maintaining an HCL is the only way we can offer our customers hardware choice while still ensuring that the components have been rigorously tested.
We will also have a number of stretched cluster checks introduced for VSAN 6.1 (shipping with vSphere 6.0U1)
Troubleshooting guide link
Plans to include some functionalities in vSphere Web Client
Log Insight can ingest any data in UTF-8, UTF-16BE, UTF-16LE.
However VSAN trace files are in binary today, so cannot be ingested. If you convert the binary then Log Insight could properly ingest it — note that all log analysis products in the market require non-binary data. ESXi does not forward VSAN trace data either today. VSAN traces cannot be ingested live from an ESXi host today without adding something to the ESXi host to handle the existing limitations.
The Log Insight team has been working with the VSAN team to address some of the limitations so Log Insight could be used for real-time analysis of VSAN trace data.
VSAN dashboards are in the vSphere content pack within Log Insight, so you can analyze some VSAN (non-trace) data today.
Serviceability – replacing drives, maintenance mode, rolling upgrades
What does it check?
That there isn’t some underlying hardware issue preventing a VM from being deployed with a default policy
That you don't have some silly default policy that cannot be met by the configuration, e.g. FTT=3
That ATS (Atomic Test & Set) locking is functioning
This is a VM with FTT=1. VMs with higher spec policies will have more components.
Ask if the audience understands that a VM on VSAN is now a set of objects, not files.
Objects are in turn made up of components, which can be many depending on stripe width (RAID0) and failures to tolerate (RAID1).
Speaker notes:
Although we require a minimum of 3 nodes to a VSAN Cluster, a better approach might be to build 4 node clusters.
This way when there is a failure or more importantly a maintenance task which takes one node out of the cluster, you have the possibility of keeping your fault tolerance setting in place during this period, provided there is enough capacity left in the cluster.
Of course, rebuild activity will only occur when there are available resources.
In order to inject errors, the health check includes a feature to do this. It may need 3rd party tools installed.
Devices that are removed are considered “Absent”.
A timeout value defined by ClomdRepairDelay needs to expire before VSAN takes remedial action.
By default, this is 60 minutes.
This means that there is no rebuild activity until this timer expires.
Many ways to simulate a VSAN network failure otherwise:
Pull a cable
Remove uplinks from VSS or DVS
Remove VSAN VMkernel adapter
This was only visible in RVC in 5.5 – vsan.resync_dashboard
Limited to 100 VMs per ESXi host.
Maintenance mode places components on the host in an ABSENT state. Don't do any further testing while a host is in maintenance mode if the "ensure accessibility" or "no data migration" option was chosen.
Keep in mind the requirement to have additional resources. Full Data migration won’t be possible with a 3 node cluster.
Risk in doing maintenance mode with 3 nodes only
Light LED on failures
When a disk hits a permanent error, it can be challenging to find where that disk sits in the chassis to find and replace it.
When an SSD or MD encounters a permanent error, VSAN automatically turns the disk LED on.
Turn disk LED on/off
Users might need to locate a disk, so VSAN supports manually turning an SSD or MD LED on/off.
Marking a disk as SSD
Some SSDs might not be recognized as SSDs by ESX.
Disks can be tagged/untagged as SSDs
Marking a disk as local
Some SSDs/MDs might not be recognized by ESX as local disks.
Disks can be tagged/untagged as local disks.
Software-Defined Storage
Flexible resource management
Common control across heterogeneous resources
Granular VM-centric SLA management
VSAN stats are not in vCenter – you need RVC or vROps to get that information.
A good overview of how to do valid storage performance testing - http://blogs.vmware.com/storage/2015/08/12/tips-storage-performance-testing/
A single VM will only consume resources on one host. Deploy multiple VMs.
We aim for a 90% read-cache hit rate. Of course, on all-flash VSAN this isn't an issue, since read-cache misses are serviced from flash too.
VSAN is a caching system. The idea is to keep the working set of your application/guest in cache. When considering hybrid storage configurations (e.g. mixed flash and disk), the most important factor will be to estimate the size of your "working set", i.e. the proportion of your entire data set that will be actively accessed. Most observed working sets are less than 5% of the total dataset size, but there are exceptions. If your tests size your working set too large, you'll get a less-than-ideal picture of hybrid performance that won't correspond with reality.
Allow your benchmark to run for some time before starting to gather metrics.
Make the point about OIO. Is it IOPS or Latency is the goal? Lower OIO = lower latency, Higher OIO = more IOPS. Find a balance.
http://www.vmware.com/files/pdf/products/vsan/VMware-Virtual-SAN6-Proof-Of-Concept-Guide.pdf
Cannot compare previous application performance on existing infrastructure to the new one.
New Hardware/New storage specifications.
We used IOblazer internal tools to generate workload
There is a tool for that
Upgrade
Fault domain design
Stretched cluster location
Depending on the physical setup
VM volume growth
Rebalance task triggers at 80% full
VSAN mode in Auto or Manual
Use scripted install for future servers
Same make, model, and specifications
Update host profile for Virtual SAN
Backup/DR
Use of vSphere Replication
Third-party tools
Improved storage operations and locating physical disk slots