In this session Cormac Hogan and I go over the top 10 things to know about vSAN. This is based on two years of questions/answers from our field and customers. Useful for any VMware vSAN customer!
#STO1264BU #STO1264BE
VMworld 2017 - Top 10 things to know about vSAN
1. Cormac Hogan - @CormacJHogan
Duncan Epping - @DuncanYB
STO1264BU
#VMworld #STO1264BU
Top 10 things to know
about VMware vSAN
2. • This presentation may contain product features that are currently under development.
• This overview of new technology represents no commitment from VMware to deliver these
features in any generally available product.
• Features are subject to change, and must not be included in contracts, purchase orders, or
sales agreements of any kind.
• Technical feasibility and market demand will affect final delivery.
• Pricing and packaging for any new technologies or features discussed or presented have not
been determined.
Disclaimer
3. Agenda
1. vSAN is Object Storage (Duncan)
2. How vSAN survives Device, Host, Rack and Site Failures (Cormac)
3. Which features are not policy driven (Duncan)
4. The impact of changing a policy on the fly (Cormac)
5. Things you may not know about the Health Check (Duncan)
6. Troubleshooting options you may not know about (Cormac)
7. Things to know about dealing with disk replacements (Duncan)
8. The impact of unicast on vSAN network topologies (Cormac)
9. Getting the most out of Monitoring and Logging (Duncan)
10. Understanding congestion and how to avoid it (Cormac)
7. Failures To Tolerate (FTT) with RAID-1 Mirroring
• To tolerate N failures with RAID-1 (default) requires N+1 replicas
– Example: To tolerate 1 host or disk failure, 2 copies or “replicas” of data needed
• Challenge: “Split-brain” scenario
Data Data
Host 1 Host 2 Host 3 Host 4
RAID-1 mirroring, FTT=1
8. Failures To Tolerate (FTT) with RAID-1 Mirroring - Witness
• To tolerate N failures requires N+1 replicas and 2N+1 hosts in the cluster
• For an object to remain active, requires >50% of component votes.
• If 2N+1 is not satisfied by components, witness added – serves as “tie-breaker”
• Witness component provides quorum.
Witness
Data Data
Host 1 Host 2 Host 3 Host 4
RAID-1 mirroring, FTT=1
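The replica and quorum rules above can be sketched as a small calculation (an illustrative sketch, not VMware code):

```python
def raid1_requirements(ftt):
    """For RAID-1 mirroring with a given Failures To Tolerate (FTT),
    return the number of data replicas and the minimum host count."""
    replicas = ftt + 1        # N failures tolerated -> N+1 copies
    min_hosts = 2 * ftt + 1   # 2N+1 hosts needed for quorum
    # If the data replicas alone cannot form a majority of votes,
    # vSAN adds witness components as tie-breakers.
    witness_needed = replicas < min_hosts
    return replicas, min_hosts, witness_needed

# FTT=1: two replicas across three hosts, plus a witness for quorum.
print(raid1_requirements(1))  # (2, 3, True)
```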
11. Stripe Width vs Chunking due to Component Size
• Number of Disk Stripes per Object has a dependency on the number of available capacity
devices per cluster
– Number of Disk Stripes per Object cannot be greater than number of capacity devices
– Striped components from same object cannot reside on the same physical disks
– Components are RAID-0 stripes
• vSAN ‘chunks’ components that are greater than 255GB into multiple smaller components
– Resulting components may reside on the same capacity device
– Components may be considered as RAID-0 concatenation
• vSAN also ‘chunks’ components when the capacity disk size is smaller than the requested
VMDK size, e.g. 250GB VMDK, but 200GB physical drive.
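The chunking rule can be illustrated with a short sketch (illustrative only, not the actual vSAN placement logic):

```python
def chunk_components(vmdk_gb, max_component_gb=255):
    """Split a VMDK-sized object into vSAN components of at most
    255GB each, per the chunking rule described above."""
    sizes = []
    remaining = vmdk_gb
    while remaining > 0:
        chunk = min(remaining, max_component_gb)
        sizes.append(chunk)
        remaining -= chunk
    return sizes

# A 600GB VMDK becomes three components: 255 + 255 + 90.
print(chunk_components(600))  # [255, 255, 90]
```

Note that the resulting components may legitimately land on the same capacity device, unlike policy-driven RAID-0 stripes.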
13. Handling Failures
esxi-01 esxi-02 esxi-03
vmdk
RAID-1
FTT=1
esxi-04
witnessvmdk
~50% of I/O ~50% of I/O
X
• If a host or a disk group or a disk
fails, there’s still a copy left of the
data
• With Failures To Tolerate (FTT)
you specify how many failures you
can tolerate
– With RAID-1 one can protect
against 3 failures
– With RAID-6 one can protect
against 2 failures
– With RAID-5, one can protect
against 1 failure
14. FD2/RACK2
esxi-03
esxi-04
Fault Domains, increasing availability through rack awareness
• Create fault domains to increase availability - 8 node cluster with 4 fault domains (2 nodes in each)
FD1 = esxi-01, esxi-02 : FD2 = esxi-03, esxi-04 : FD3 = esxi-05, esxi-06 : FD4 = esxi-07, esxi-08
• To protect against one rack failure, only 2 replicas plus a witness are required, placed across 3 fault domains!
FD3/RACK3
esxi-05
esxi-06
FD4/RACK4
esxi-07
esxi-08
esxi-01
esxi-02
FD1/RACK1
vmdk vmdk witness
RAID-1
X
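The same 2N+1 quorum rule applies to fault domains instead of individual hosts, which a one-line sketch makes explicit (illustrative, not VMware code):

```python
def fault_domains_needed(ftt):
    """To tolerate N rack (fault domain) failures, components must be
    spread across 2N+1 fault domains -- the host-level quorum rule
    lifted to the rack level."""
    return 2 * ftt + 1

# FTT=1 rack awareness: 2 replicas + 1 witness across 3 fault domains.
print(fault_domains_needed(1))  # 3
```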
16. Local and Remote Protection for Stretched Clusters
vSphere vSAN
Cluster Cluster
5ms RTT, 10GbE
• Redundancy locally and
across sites with vSAN 6.6
• With site failure, vSAN
maintains availability with
local redundancy in surviving
site
• Individual device or host
failures can still be tolerated
even when only a single site
is available
• Not all VMs need protection in
vSAN stretched cluster
• Some VMs can be deployed
with zero Failures To Tolerate
policy driven using SPBM
RAID-6
3rd site for
witness
RAID-6
RAID-1
X
17. How many failures can I tolerate?
Description
Primary
FTT
Secondary
FTT
FTM
Hosts
per site
Stretched
Config
Single site
capacity
Total cluster
capacity
Standard Stretched across locations
with local protection
1 1 RAID-1 3 3+3+1 200% of VM 400% of VM
Standard Stretched across locations
with local RAID-5
1 1 RAID-5 4 4+4+1 133% of VM 266% of VM
Standard Stretched across locations
with local RAID-6
1 2 RAID-6 6 6+6+1 150% of VM 300% of VM
Standard Stretched across locations
no local protection
1 0 RAID-1 1 1+1+1 100% of VM 200% of VM
Not stretched, only local RAID-1 0 1 RAID-1 3 n/a 200% of VM n/a
Not stretched, only local RAID-5 0 1 RAID-5 4 n/a 133% of VM n/a
Not stretched, only local RAID-6 0 2 RAID-6 6 n/a 150% of VM n/a
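The capacity columns of the table follow from the per-RAID-level overheads, which a hypothetical helper can reproduce (RAID-1 stores FTT+1 full copies; RAID-5 uses a 3+1 layout at 1.33x; RAID-6 uses a 4+2 layout at 1.5x; a stretched cluster doubles the total):

```python
RAID_MULTIPLIER = {
    "RAID-1": lambda ftt: (ftt + 1) * 100,  # e.g. 200% of VM at FTT=1
    "RAID-5": lambda ftt: 133,              # fixed 3+1 erasure coding
    "RAID-6": lambda ftt: 150,              # fixed 4+2 erasure coding
}

def cluster_capacity(ftm, secondary_ftt, stretched):
    """Return (single-site %, total cluster %) of VM size consumed,
    matching the table above. Illustrative sketch only."""
    single_site = RAID_MULTIPLIER[ftm](secondary_ftt)
    # A stretched cluster keeps a full copy in each site.
    return single_site, single_site * 2 if stretched else None

print(cluster_capacity("RAID-5", 1, True))  # (133, 266)
```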
20. • Nearline deduplication and compression per
disk group level
– Enabled on a cluster level
– Deduplicated when de-staging from cache tier
to capacity tier
– Fixed block length deduplication (4KB Blocks)
• Compression after deduplication
– If a block compresses to <= 2KB, the compressed block is stored
– Otherwise the full 4KB block is stored
Beta
Deduplication and Compression for Space Efficiency
SSD
SSD
1. VM issues write
2. Write acknowledged by cache
3. Cold data to memory
4. Deduplication
5. Compression
6. Data written to capacity
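The destage pipeline on this slide can be modeled as a toy sketch: deduplicate fixed 4KB blocks by content hash, then keep the compressed form only if it fits in 2KB (illustrative only, not the actual vSAN implementation):

```python
import hashlib
import zlib

BLOCK = 4096  # vSAN deduplicates fixed-length 4KB blocks

def destage(blocks, store):
    """Toy model of the dedupe-then-compress destage path."""
    for block in blocks:
        key = hashlib.sha1(block).hexdigest()
        if key in store:             # duplicate block: bump refcount only
            store[key]["refs"] += 1
            continue
        compressed = zlib.compress(block)
        if len(compressed) <= 2048:  # compression pays off: store <= 2KB
            store[key] = {"data": compressed, "refs": 1}
        else:                        # otherwise store the full 4KB block
            store[key] = {"data": block, "refs": 1}
    return store

store = destage([b"\x00" * BLOCK, b"\x00" * BLOCK], {})
print(len(store))  # 1 -- identical blocks deduplicated
```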
21. vSAN Data-at-Rest Encryption
• Datastore level, data-at-rest encryption
for all objects on vSAN Datastore
• No need for self encrypting drives
(SEDs), reducing cost and complexity
• Works with all vSAN features, including
dedupe and compression
• Integrates with all KMIP compliant key
management technologies (Check HCL)
SSD
SSD
1. VM issues write
2. Written encrypted to cache
3. When destaged >> decrypted
4. Deduplication
5. Compression
6. Encrypt
7. Data written to capacity
22. Thin Swap aka Sparse Swap
• By default Swap is fully reserved on vSAN
• Which means that a VM with 8GB of memory takes 16GB of Disk Capacity
– RAID-1 is applied by default
• Did you know you can disable this?
– esxcfg-advcfg -s 1 /vSAN/SwapThickProvisionDisabled
• Only recommended when not overcommitting memory
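The capacity impact of sparse swap is a back-of-the-envelope calculation (illustrative sketch; a fully reserved swap object is the size of VM memory, mirrored per the default RAID-1 FTT=1 policy):

```python
def swap_capacity_gb(vm_memory_gb, sparse_swap=False, ftt=1):
    """Disk capacity consumed by a VM's swap object on vSAN.
    Sparse (thin) swap reserves nothing up front."""
    if sparse_swap:
        return 0
    return vm_memory_gb * (ftt + 1)  # 8GB memory -> 16GB on disk

print(swap_capacity_gb(8))  # 16
```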
25. Changing Failures To Tolerate (FTT) …
• “FTT” defines the number of
hosts, disk or network failures a
storage object can tolerate.
• For “n” failures tolerated with
RAID-1, “n+1” copies of the object
are created and “2n+1” hosts
contributing storage are required!
• When you increase FTT, required
disk capacity will go up!
• This can be done with no rebuild,
as long as you are not changing
the RAID protection type / Fault
Tolerance Method (FTM).
esxi-01 esxi-02 esxi-03
vmdk
RAID-1
FTT=1
… esxi-08
witness vmdk
~50% of I/O ~50% of I/O
vmdk
26. Changing stripe width…
• Defines the minimum number of
capacity devices across which
each replica of a storage object
is distributed.
• Additional stripes ‘may’ result in
better performance, in areas
like write destaging, and fetching
of reads
• But a high stripe width may put
more constraints on flexibility
of meeting storage compliance
policies
• This will introduce a rebuild of
objects if the stripe width is
increased on-the-fly.
esxi-01 esxi-02 esxi-03
stripe-2a
RAID-1
esxi-04
witness stripe-2b
RAID-0 RAID-0
stripe-1a
stripe-1b
FTT=1
Stripe width=2
27. Changing RAID protection …
• The ability to tolerate failures in
vSAN is provided by the selected
RAID protection level (FTM).
• Protection options are RAID-1,
RAID-5 and RAID-6.
• RAID-1 provides best
performance, but RAID-5 and
RAID-6 provide capacity
savings.
• Changing the RAID level on-the-
fly will introduce a rebuild of
objects.
esxi-01 esxi-02 esxi-03
Segment-3
RAID-5
esxi-04
Segment-4Segment-2Segment-1
FTT=1, EC=R5
28. Which policy changes require a rebuild?
Policy Change
Rebuild
Required?
Comment
Increasing/Decreasing Number of Failures To Tolerate No
As long as (a) RAID protection is unchanged and (b)
Read Cache Reservation = 0 (hybrid)
Enabling/Disabling checksum No
Increasing/Decreasing Stripe Width Yes
Changing RAID Protection Yes
RAID-1 to/from RAID-5/6, and vice-versa.
RAID-5 to/from RAID-6, and vice-versa.
Increasing the Object Space Reservation Yes
Object Space Reservation can only be 0% or 100%
when deduplication is enabled.
Changing the Read Cache Reservation Yes Applies to hybrid only
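The table above can be condensed into a hypothetical lookup function (an illustrative sketch of the rules, not VMware code):

```python
def rebuild_required(change):
    """Does a given policy change force vSAN to rebuild objects?
    Mirrors the table above."""
    rules = {
        "ftt": False,               # if RAID type unchanged and RC reservation = 0
        "checksum": False,
        "stripe_width": True,
        "raid_protection": True,    # e.g. RAID-1 <-> RAID-5/6
        "object_space_reservation": True,
        "read_cache_reservation": True,  # hybrid clusters only
    }
    return rules[change]

print(rebuild_required("stripe_width"))  # True
```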
30. Introduced in vSAN 6.0
• Incorporated in the vSphere Web Client and Host Client!
• vSAN Health Check tool include:
– Limits checks
– vSAN HCL health
– Physical disk health
– Network health
– Stretched cluster health
31. But did you also know?
• Health check is weighted, look at top failure/problem first
• You can disable a health check through RVC
– Run it in interactive mode and disable what you need
vsan.health.silent_health_check_configure -i /localhost/vSAN-DC/computers/vSAN-Cluster
– Disable a specific test
vsan.health.silent_health_check_configure -a vmotionpingsmall /localhost/vSAN-DC/computers/vSAN-Cluster
– And check the status of all checks
vsan.health.silent_health_check_status /localhost/vSAN-DC/computers/vSAN-Cluster
32. Intelligent, Automated Operations with vSAN Config Assist
• Simplify HCI Management with
prescriptive one-click controller firmware
and driver upgrades
• HCL aware. Pulls correct OEM firmware
and drivers for selected controllers from
participating vendors including Dell,
Lenovo, Fujitsu, and SuperMicro
• Validate and remediate software
configuration settings for vSAN
• Configuration wizards validate vSAN
settings and ensure best practice
compliance
vSphere vSAN
vSAN Datastore
33. Cloud-connected Performance Diagnostics
• Provides diagnostics for
benchmarking & PoCs
• Specify one of three predefined areas
of focus for benchmarks:
– Max IOPS
– Max Throughput
– Min Latency
• Integration into HCIBench
• Output automatically sent to cloud for
analysis
• Provides results of analysis in UI
• Detects issues and suggests
remediation steps by tying to specific
KB articles
vSphere vSAN
HCIBench
Analysis
Detect issues
Visible to GSS
Feedback
Site results
Links to KBs
vSAN Cloud Analytics
VMware Customer Experience Improvement Program
35. Host reboots are not troubleshooting steps!!!
You also don’t try to rip off your wheels while you driving?
36. CLI tools
• esxcli
• RVC – Ruby vSphere Console (available in your vCenter Server)
• PowerCLI
• cmmds-tool
• python /usr/lib/vmware/vsan/bin/vsan-health-status.pyc
This one is useful as it
decodes a lot of the
UUIDs
• RVC will eventually be deprecated, and cmmds-tool and vsan-health-status.pyc are
primarily support tools
• VMware has enhanced esxcli to provide further troubleshooting features
37. • [root@esxi-dell-b:/usr] esxcli vsan
Usage: esxcli vsan {cmd} [cmd options]
Available Namespaces:
cluster Commands for vSAN host cluster configuration
datastore Commands for vSAN datastore configuration
debug Commands for vSAN debugging
health Commands for vSAN Health
iscsi Commands for vSAN iSCSI target configuration
network Commands for vSAN host network configuration
resync Commands for vSAN resync configuration
storage Commands for vSAN physical storage configuration
faultdomain Commands for vSAN fault domain configuration
maintenancemode Commands for vSAN maintenance mode operation
policy Commands for vSAN storage policy configuration
trace Commands for vSAN trace configuration
esxcli commands (as of vSAN 6.6)
38. Get limits info via esxcli vsan debug
[root@esxi-dell-e:~] esxcli vsan debug limit get
Component Limit Health: green
Max Components: 9000
Free Components: 8990
Disk Free Space Health: green
Lowest Free Disk Space: 83 %
Used Disk Space: 192329932062 bytes
Used Disk Space (GB): 179.12 GB
Total Disk Space: 1515653332992 bytes
Total Disk Space (GB): 1411.56 GB
Read Cache Free Reservation Health: green
Reserved Read Cache Size: 0 bytes
Reserved Read Cache Size (GB): 0.00 GB
Total Read Cache Size: 0 bytes
Total Read Cache Size (GB): 0.00 GB
Read Cache is only
relevant to hybrid vSAN.
For AF-vSAN, this is the
expected output
39. Get controller info via esxcli vsan debug
[root@esxi-dell-e:~] esxcli vsan debug controller list
Device Name: vmhba1
Device Display Name: Intel Corporation Wellsburg AHCI Controller
Used By vSAN: false
PCI ID: 8086/8d62/1028/0601
Driver Name: vmw_ahci
Driver Version: 1.0.0-39vmw.650.1.26.5969303
Max Supported Queue Depth: 31
Device Name: vmhba0
Device Display Name: Avago (LSI) Dell PERC H730 Mini
Used By vSAN: true
PCI ID: 1000/005d/1028/1f49
Driver Name: lsi_mr3
Driver Version: 6.910.18.00-1vmw.650.0.0.4564106
Max Supported Queue Depth: 891
Information pertinent to
vSAN
41. Did you know there are two different component states?
Absent vs. Degraded
• Component marked Degraded when it is unlikely
component will return
– Failed drive (PDL)
– Rebuild starts immediately!
• Component marked Absent when component may return
– Host rebooted
– Drive is pulled
– Network partition / isolation
– Rebuild starts after 60 minutes
• Avoids unnecessary rebuilds
– Or simply click "Repair object immediately"
– Advanced setting: vSAN.ClomRepairDelay
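The repair timing rule distinguishes the two states, which a short sketch makes concrete (illustrative only; the real behavior is governed by the VSAN.ClomRepairDelay advanced setting):

```python
def rebuild_delay_minutes(state, clom_repair_delay=60):
    """Degraded components rebuild immediately; absent components wait
    out the repair delay (60 minutes by default) in case they return,
    avoiding unnecessary rebuilds."""
    if state == "degraded":  # e.g. failed drive (PDL) -- won't return
        return 0
    if state == "absent":    # e.g. host reboot, pulled drive, partition
        return clom_repair_delay
    raise ValueError(f"unknown component state: {state}")

print(rebuild_delay_minutes("absent"))  # 60
```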
42. Disk Failure, what happens?
• Deduplication and compression disabled
– If capacity device fails, then drive is unmounted, disk group stays online
• Exception: Only one capacity device, then DG cannot remain available
– If cache device fails, then entire disk group is unmounted
• Deduplication and compression enabled
– Any device in disk group fails, then disk group is unmounted
43. Did you know we detect devices degrading?
Degraded Device Handling
• Smarter intelligence in detecting impending
drive failures
• If replica exists, components on suspect
device marked as “absent” with standard
repair process
• If last replica, proactive evacuation of
components occurs on suspect device
• Any evacuation failures will be shown in UI
44. What then? Place it in maintenance mode!
• Conducts a precheck for free space
prior to maintenance mode
decommissioning
• Dialog and report shown prior to
entering maintenance mode
• Decommission check occurs for disk
and disk group removals
• Increased granularity for faster, more
efficient decommissioning
• Reduced amount of temporary space
during decommissioning effort
46. Multicast introduces complexity
• Multicast needs to be enabled on the switch/routers of the
physical network.
• Internet Group Management Protocol (IGMP) used
within an L2 domain for group membership (follow
switch vendor recommendations)
• Protocol Independent Multicast (PIM) used for
routing multicast traffic to a different L3 domain
• VMware received a lot of feedback from our customers on
how they would like to see multicast removed as a
requirement from vSAN.
• In vSAN 6.6, all multicast traffic was moved to unicast.
47. Simplified, Cloud-friendly Networking with Unicast
• Multicast no longer used
• Easier configurations for single
site and stretched clusters
• vSAN changes over to unicast
when cluster upgrade to 6.6
completes
• No compromises in CPU
utilization
• Little effect on network traffic
when using unicast over multicast
vSphere vSAN
Multicast
48. Member Coordination with Unicast
• vSAN cluster IP address list is maintained by vCenter and is
pushed to each node.
• This includes
– vSAN cluster UUID, vSAN node UUID, IP address and
unicast port
– Witness Node or Data Node
– Does the node support unicast or not?
• The following changes will trigger an update from vCenter:
– A vSAN cluster is formed
– A new vSAN node is added or removed from vSAN enabled
cluster
– An IP address change or vSAN UUID change on an existing
node
vCenter
49. Upgrade / Mixed Cluster Considerations with unicast
vSAN Cluster
Software
Configuration
Disk Format
Version(s)
CMMDS Mode Comments
6.6 Only Nodes* Version 5 Unicast
Permanently operates in unicast.
Cannot switch to multicast.
6.6 Only Nodes*
Version 3 or
below
Unicast
6.6 nodes operate in unicast mode.
Switches to multicast when < vSAN 6.6 node added.
Mixed 6.6 and
vSAN pre-6.6
Nodes
Version 5
(Version 3 or
below)
Unicast
6.6 nodes with v5 disks operate in unicast mode. Pre-
6.6 nodes with v3 disks will operate in multicast mode.
This will cause a cluster partition!
Mixed 6.6 and
vSAN pre-6.6
Nodes
Version 3 or
Below
Multicast
Cluster operates in multicast mode. All vSAN nodes
must be upgraded to 6.6 to switch to unicast mode. Disk
format upgrade to v5 will make unicast mode
permanent.
Mixed 6.6 and
vSAN 5.X Nodes
Version 1 Multicast
Operates in multicast mode. All vSAN nodes must be
upgraded to 6.6 to switch to unicast mode. Disk format
upgrade to v5 will make unicast mode permanent.
Mixed 6.6 and
vSAN 5.X Nodes
Version 5
(Version 1)
Unicast
6.6 nodes operate in unicast mode.
5.x nodes with v1 disks operate in multicast mode. This
will cause a cluster partition!
50. vSAN 6.6 only nodes – additional considerations with unicast
• A uniform vSAN 6.6+ cluster will communicate using unicast, even when disk-groups that are
not formatted with version 5 are present
• If no host has on-disk format version 5, vSAN will revert from unicast to multicast mode if a non-
vSAN 6.6 node is added to the cluster.
• If a vSAN 6.6+ cluster has at least one node with the v5 on-disk format, it will only ever
communicate in unicast.
• This means that a non-vSAN 6.6 node added to this cluster will not be able to communicate
with the vSAN 6.6 nodes
– this non-vSAN 6.6 node will be partitioned
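The mode selection described in the table and bullets above condenses into a small decision function (an illustrative sketch, not the actual CMMDS logic):

```python
def cluster_mode(all_nodes_66, any_v5_disk_format):
    """Which communication mode a vSAN cluster ends up in,
    per the upgrade/mixed-cluster rules above."""
    if not all_nodes_66:
        return "multicast"            # pre-6.6 nodes force multicast
    if any_v5_disk_format:
        return "unicast (permanent)"  # cannot revert to multicast
    # Uniform 6.6 cluster, but no v5 disk format yet: unicast, though
    # it reverts to multicast if a pre-6.6 node joins.
    return "unicast (revertible)"

print(cluster_mode(True, True))  # unicast (permanent)
```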
52. Health Check is awesome, but provides you
with current info. What about historic data
and trends?
Monitor your environment closely with:
• Web Client
• vCenter VOBs
• VROps
• LogInsight
• Sexigraf (http://www.sexigraf.fr/)
• Or anything else that you want to use
53. Add custom alarms using VOBs
• VOB >> VMkernel Observation
• You can check the following log to see what has been triggered:
/var/log/vobd.log
• You can find the full list of VOBs here:
/usr/lib/vmware/hostd/extensions/hostdiag/locale/en/event.vmsg
• For those you find useful you create a custom alarm!
https://docs.vmware.com/en/VMware-vSphere/6.0/com.vmware.vsphere.virtualsan.doc/GUID-FB21AEB8-204D-4B40-B154-42F58D332966.html
54. Predefined Dashboard – vSAN Operations Overview
• Metric “sparklines” showing
history of current state
• Similar layout across all
dashboards
– Aggregate cluster
statistics
– Cluster specific statistics
• Alert history
• Component limits
• CPU and Memory statistics
55. Log Insight Content Pack for vSAN
• Provides collection of dashboards and
widgets for immediate analysis of vSAN
activities
• Exposes non-error events to provide
context
• Content pack collects vSAN urgent
traces
• vSAN traces high volume, binary and
compressed
• Urgent traces automatically
decompressed, converted from binary to
human readable format. (vSAN 6.2 and
newer)
• [root@esx01:~] esxcli vsan trace
get
56. Did you know there’s an “interactive widget”?
Tip: Use Interactive Analytics
within widget
• Allows for a good starting point
of narrowing down focus
• Builds initial query for you
58. vSAN Architecture Overview
DiskLib : Opens VMDK for read and writing
Cluster Level Object Manager (CLOM)
Ensures cluster can implement the policy
Cluster Monitoring, Membership and Directory Services
(CMMDS)
Log Structured Object Manager (LSOM)
issues read/writes to physical disks
Distributed Object Manager (DOM)
applies policies
Reliable Datagram Transport (RDT)
for copying objects
59. Congestions – IO layer
Logical Log Physical Log
Caching Tier Capacity Tier
IO
LSOM
LLOG congestions PLOG congestions
Congestions are bottlenecks, meaning vSAN is running at reduced performance.
High latency on devices due to writes affects reads at the same time!
Reads/Writes
Overloaded tier
High latency
Overloaded tier
High latency
60. – SSD LLOG/PLOG congestions
• Overload on either the LLOG (caching tier) or the PLOG (capacity tier)
• Insufficient resources on the physical/hardware layer for the workload that is being tested/run
– High network errors (Observable via vSAN observer or new in vSAN 6.6 with vCenter)
• A high error count indicates reduced traffic flow or an overload on the physical NICs
• Network over utilized, often seen on re-sync, or when vSAN network is shared with other aggressive
network traffic types, e.g. vMotion
– vSAN Software Layer
• De-staging from caching tier to capacity tier (with Deduplication/Compression)
• Checksum (< 6.0P04 or < 6.5ep02)
– CPU contention
• CPU contention in the ESXi scheduler, seen when the BIOS is set up incorrectly (see KB)
Possible causes of Congestions