vSAN Performance and Resiliency at Scale

3. Disk layout in a single vSAN server
[Diagram: disk groups, each with a cache tier and a capacity tier, contribute to a single vSAN datastore in the vSphere cluster]
§ Max 64 nodes
§ Min 2 nodes (ROBO)
§ Max 5 disk groups per host
§ 2 tiers per disk group
4. vSAN: a very quick overview
vSAN Datastore:
§ Pools local storage into a single resource pool
§ Delivers enterprise-grade scale & performance
§ Managed through policies
§ Integrates compute & storage management into a single pane
5. vSAN Component Layout
HFT = 2, FTM = RAID-1, Stripe Width = 2
[Diagram: a 512 GB VMDK object. A RAID-1 node (R1) mirrors across three RAID-0 nodes (R0); each R0 is striped into components C1 and C2 (SIZE = 256 GB each). Witness components not shown]
Note: no blocks are allocated at this time.
6. Each replica on a different fault domain (e.g. host)
HFT = 2, FTM = RAID-1, Stripe Width = 2
[Diagram: the same 512 GB VMDK RAID tree, with each RAID-0 replica placed on a different fault domain. Components are 256 GB; blocks are allocated in 4 MB chunks. Witness components not shown]
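The component math on these two slides can be sketched as a quick calculation. This is a simplified illustration, not vSAN's placement logic; the function name and return shape are invented for the example, and witnesses are ignored:

```python
# Simplified model of vSAN component counts for FTM = RAID-1.
# Not actual vSAN code; it only illustrates how HFT and stripe width
# multiply out into components (witness components ignored).

def raid1_components(vmdk_gb, hft, stripe_width):
    replicas = hft + 1                    # RAID-1 mirrors to tolerate hft failures
    component_gb = vmdk_gb / stripe_width # RAID-0 splits each replica into stripes
    return {
        "replicas": replicas,
        "data_components": replicas * stripe_width,
        "component_size_gb": component_gb,
    }

layout = raid1_components(vmdk_gb=512, hft=2, stripe_width=2)
print(layout)  # 3 replicas, 6 data components of 256 GB each
```

With HFT = 2 and stripe width 2 this reproduces the tree above: three RAID-0 replicas, each split into two 256 GB components.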
7. CMMDS: Maintains an inventory of all things vSAN
©2018 VMware, Inc.
C: Cluster
M: Membership
M: Monitoring
D: Directory
S: Service
v Distributed directory service
v In-memory, persisted on disk
v Elects the object owner
[Diagram: CMMDS inventory covers vSAN objects and placement, storage policies, RAID configurations, and cluster membership]
8. Master receives updates from all other nodes
v The master node receives updates from all hosts in the cluster.
v Other nodes subscribe to object-specific updates.
[Diagram: Master Node, Backup Node, and Agent Node; CMMDS inventory: vSAN objects and placement, storage policies, RAID configurations, cluster membership]
9. CLOM: Ensures an object has a configuration that matches the policy
CLOM: Cluster Level Object Manager
v One per node
v Finds a placement configuration that will meet the policy
v Needs to be aware of the placement of all objects on the node
v Communicates with the CMMDS service running on the same node
[Diagram: components (C) spread across hosts; one host acts as Master]
10. DOM: Manages I/O flow from the VM
DOM: Distributed Object Manager
v One per object
v Implements the placement configuration prescribed by CLOM
v Ensures object consistency (creation, rebuild & reconfiguration)
v Implements distributed RAID logic
[Diagram: components (C) spread across hosts; one host acts as Master]
12. Schematic representation of a single VMDK deployment
Each partition elects its own CMMDS master.
[Diagram: the object's components (C1, C2 in each replica) and witness W split across Partition-01 and Partition-02]
1. In the partition where the object has quorum and availability, a DOM owner (distributed object owner) is created.
2. In the other partition, the object is in an inaccessible state.
13. Schematic representation of a single VMDK deployment
Each partition elects its own CMMDS master when there is a network partition.
[Diagram: the object's components and witness split across Partition-01 and Partition-02]
1. The partition that holds a majority of components meets the liveness criteria for the object: the object has quorum and availability there.
2. vSphere HA restarts the VM in the partition that meets the liveness criteria.
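The quorum decision in this partition scenario can be sketched as a simple vote count. This is a simplified model; real vSAN assigns votes per component, and the witness exists to break ties:

```python
# Simplified quorum check: an object is accessible in a partition only if
# that partition holds a strict majority of the object's votes.
# Illustrative only; real vSAN vote assignment is more involved.

def has_quorum(votes_in_partition, total_votes):
    return votes_in_partition > total_votes / 2

# Object with components C1, C2 (one vote each) plus witness W (one vote):
total = 3
print(has_quorum(2, total))  # partition holding a component plus W: True
print(has_quorum(1, total))  # partition holding only one component: False
```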
15. All-flash I/O flow: architectural layout
[Diagram: hosts H1, H2, H3. The VMDK's DOM (distributed object owner) sits on one host and mirrors I/O to Replica-1 and Replica-2, each backed by a cache tier and a capacity tier]
16. All-flash I/O flow: DOM and LSOM
[Diagram: as above; DOM (distributed object owner) coordinates the replicas, while LSOM (log-structured object manager) manages the cache and capacity tiers on each host]
17. All-flash I/O flow: I/O issued by VM
[Diagram: VM → DOM → LSOM; the VMDK is a vSAN object]
1. The VM issues a write; it is handled by the object's DOM (one per object).
18. All-flash I/O flow: DOM checks for free space
1. VM issues a write.
2. DOM checks for conflicting I/Os on the same I/O range and serializes the request.
19. All-flash I/O flow: DOM sends prepare request to LSOM
1. VM issues a write.
2. DOM checks for conflicting I/Os.
3. DOM sends a prepare request to LSOM on each replica.
20. All-flash I/O flow: LSOM commits to cache
1. VM issues a write.
2. DOM checks for conflicting I/Os.
3. DOM sends a prepare request to LSOM.
4. LSOM commits the write to cache (no dedupe at this point).
21. All-flash I/O flow: CMMDS master is not on the I/O path
Steps 1–4 as above. The I/O flow doesn't go through the CMMDS master.
22. All-flash I/O flow: I/O ack propagated back to VM
1. VM issues a write.
2. DOM checks for conflicting I/Os.
3. DOM sends a prepare request to LSOM.
4. LSOM commits to cache.
5. LSOM sends an ack back to DOM.
6. DOM sends an ack back to the VM.
23. All-flash I/O flow: DOM sends ack back to LSOM
Steps 1–6 as above, then:
7. DOM sends an ack back to LSOM.
24. All-flash I/O flow: elevator de-stages to capacity
1. Block allocation: if the logical block is already allocated, over-write it; if not, allocate a logical block in a 4 MB chunk.
2. Dedupe, compress, encrypt.
3. Write to media in 4 KB chunks.
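The numbered write path on slides 17–24 (commit to cache, ack, then background de-staging) can be sketched as follows. The class and method names are invented for illustration and are not the actual DOM/LSOM interfaces:

```python
# Illustrative sketch of the all-flash write path described above:
# writes land in the cache-tier log and are acked immediately; a
# background elevator later de-stages them to capacity in 4 MB chunks.

CHUNK = 4 * 1024 * 1024  # logical blocks are allocated in 4 MB chunks

class Lsom:
    def __init__(self):
        self.cache = []       # write-buffer log (cache tier)
        self.capacity = {}    # allocated 4 MB logical chunks (capacity tier)

    def prepare_and_commit(self, offset, data):
        self.cache.append((offset, data))  # step 4: commit to cache
        return "ack"                       # step 5: ack back to DOM

    def destage(self):
        # Elevator: allocate-or-overwrite per 4 MB chunk; dedupe, compress,
        # and encrypt would happen here before the write to media.
        for offset, data in self.cache:
            chunk = offset // CHUNK
            self.capacity.setdefault(chunk, bytearray(CHUNK))  # allocate if new
            start = offset % CHUNK
            self.capacity[chunk][start:start + len(data)] = data  # over-write
        self.cache.clear()

lsom = Lsom()
assert lsom.prepare_and_commit(0, b"hello") == "ack"  # acked from cache
lsom.destage()
print(sorted(lsom.capacity))  # one 4 MB logical chunk was allocated
```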
27. Schematic representation of how resync works
A. A component is in a degraded state and a full resync is initiated: a new RAID-0 replica (C1, C2) is created and resync begins.
[Diagram: RAID-1 tree over RAID-0 replicas with witness component W; the degraded replica remains while the new one resyncs]
28. Schematic representation of how resync works
B. The full resync completes and the degraded components are marked for deletion.
[Diagram: as above, with the degraded replica marked for deletion]
29. Schematic representation of how resync works
By contrast, partial rebuilds have fewer blocks to resync.
[Diagram: RAID-1 tree with a degraded component, witness component, and a partial repair]
30. Examples of partial rebuild
[Diagram: partial repair of a degraded component vs. full rebuild of a new RAID-0 replica (C1, C2)]
Partial rebuild:
v Block-level copy; no RAID tree construction
v Triggers: host comes out of maintenance mode; recovery from transient failure
Full rebuild:
v Partial or full reconstruction of the RAID tree
31. Examples of rebuilds
[Diagram: partial repair vs. full rebuild]
Triggers:
v Host comes out of maintenance mode
v Recovery from transient failure
v Permanent disk or host failure
v Disk rebalancing
v Delta writes
32. Finally, changing the storage config is a full rebuild
[Diagram: partial repair vs. full rebuild]
In addition to the triggers above:
v Storage policy change
34. First permanent failure initiates rebuild
[Diagram: Replica-1, Replica-2, Replica-3]
1. Event 1: the first host goes down.
2. vSAN begins a full rebuild.
35. Intuition on planning for availability
The probability of an availability impact is the joint probability of:
v a first failure, followed by
v at least 2 more failures before the rebuild completes.
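Under a toy failure model, the joint probability above can be made concrete. The assumptions (independent host failures at a constant annual rate, Poisson arrivals within the rebuild window) are illustrative, not a vSAN sizing formula:

```python
# Toy model: probability that at least k more independent failures land
# inside the rebuild window after the first failure. Assumes a constant
# per-host failure rate and Poisson arrivals; purely illustrative.

import math

def p_extra_failures(n_hosts, annual_rate, rebuild_hours, k_needed=2):
    # Expected number of failures across the cluster during the window.
    lam = n_hosts * annual_rate * rebuild_hours / (24 * 365)
    # P(X >= k_needed) for X ~ Poisson(lam).
    return 1 - sum(math.exp(-lam) * lam**k / math.factorial(k)
                   for k in range(k_needed))

# Hypothetical: 32 hosts, 5% annual failure rate per host, 8-hour rebuild.
print(f"{p_extra_failures(32, 0.05, 8):.2e}")
```

Note how the result scales with the rebuild window: halving the resync time roughly quarters the probability of two further failures landing inside it, which is the motivation for the resync-speed work later in the deck.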
36. Factors affecting availability
Probability of component failure:
v Type of failure: disk, disk group, server
v Size of the cluster
v MTBF ratings

37. Factors affecting availability (continued)
Data to resync:
v Duration of failure: permanent vs. transient
v Type of failure: disk, disk group, server

38. Factors affecting availability (continued)
Time to resync:
v Size of the cluster: larger clusters have higher resync parallelization
v Resync bandwidth allocation
39. Approaches to improving availability (and durability)
Reduce component failures:
v Select enterprise-grade drives with higher endurance and higher MTBF
v Degraded device handling

40. Approaches to improving availability (and durability) (continued)
Reduce the amount of data to resync:
v CLOM repair delay settings
v Avoid policy changes
v Point fix
v Smart repairs
v What-if assessments

41. Approaches to improving availability (and durability) (continued)
Improve resync ETAs:
v Adaptive Resynchronization
v General performance improvements
43. Write buffer architecture
• Writes go to a first-tier device in a fast sequential log.
• Native device bandwidth absorbs short bursts.
• Cold data is deduplicated and compressed as it moves out to the second tier.
[Diagram: guest writes → first tier → destaging → capacity tier]
44. Write buffer architecture
The de-staging process is slower than first-tier writes. With sustained write workloads, we need to smoothly find an equilibrium.
[Diagram: guest writes → first tier → destaging → capacity tier; chart of bandwidth over time: first-tier bandwidth vs. capacity-tier bandwidth]
45. Congestion in action (pre-Adaptive Resync)
• We make this transition via a congestion signal.
• Congestion is adaptive: apply a greater throttle until we reach equilibrium.
• Congestion stops rising when the incoming rate equals the de-staging rate.
[Diagram: guest writes → first tier → destaging → capacity tier; chart: first-tier bandwidth converging to capacity-tier bandwidth at equilibrium]
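The equilibrium-seeking behavior can be simulated with a minimal feedback loop. This is a sketch of the idea only, not vSAN's actual congestion algorithm:

```python
# Minimal congestion feedback loop: throttle incoming writes harder as the
# write buffer fills, until the admitted rate matches the de-stage rate.
# Rates and the linear throttle are illustrative.

def simulate(incoming=1000, destage=400, buffer_cap=10_000, steps=200):
    fill, admitted = 0, incoming
    for _ in range(steps):
        congestion = fill / buffer_cap           # 0..1 signal from buffer fill
        admitted = incoming * (1 - congestion)   # throttle grows with congestion
        fill = max(0, fill + admitted - destage) # buffer drains at destage rate
    return admitted

rate = simulate()
print(round(rate))  # the admitted rate settles at the de-stage rate (400)
```

The buffer fill level stops rising exactly when the throttled incoming rate equals the de-stage rate, which is the equilibrium the slide describes.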
46. Queueing delay: is high latency a hardware problem or a sizing problem?
Storage devices have some parallelism, but there is a limit:
• At first, more outstanding I/O means more bandwidth (same latency).
• Once we hit max parallelism, more outstanding I/O means more latency (same bandwidth).
[Charts: bandwidth vs. outstanding I/O; latency vs. outstanding I/O]
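The two curves can be reproduced from Little's law (latency = outstanding I/O / throughput), assuming a device with a fixed internal parallelism limit; the numbers here are illustrative:

```python
# Little's law sketch: below max parallelism, more outstanding I/O buys
# bandwidth at constant latency; past it, bandwidth is flat and extra
# outstanding I/O only shows up as queueing latency. Numbers illustrative.

SERVICE_MS = 0.1      # per-I/O service time at the device (assumed)
MAX_PARALLEL = 32     # device's internal parallelism limit (assumed)

def behavior(oio):
    in_service = min(oio, MAX_PARALLEL)
    bandwidth_iops = in_service / (SERVICE_MS / 1000)  # concurrent ops / service time
    latency_ms = oio / bandwidth_iops * 1000           # Little's law: L = OIO / X
    return bandwidth_iops, latency_ms

for oio in (8, 32, 128):
    bw, lat = behavior(oio)
    print(f"OIO={oio:4d}  bandwidth={bw:9.0f} IOPS  latency={lat:.2f} ms")
```

Below 32 outstanding I/Os latency stays at the service time while bandwidth grows; at 128 outstanding I/Os bandwidth is unchanged but latency has quadrupled, which is the "sizing problem" signature.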
47. Often, high latency is the most visible symptom
Queueing delay: is high latency a hardware problem or a sizing problem?
[Charts: bandwidth vs. outstanding I/O; latency vs. outstanding I/O]
Did we push the system too far? Or is there an issue with the hardware?
48. Adaptive Resync: customer-visible before and after
Before: the more resyncs were happening, the larger their share of destage bandwidth.
• Many resyncs + low workload → drives up latency of VM I/O
• Few resyncs + high workload → resync takes a long time
Adaptive Resync: resync should get 20% of the bandwidth (if contended).
• It can use more if the guest I/O is underutilizing the device.
Upgrades, policy changes, and rebalances should not be scary or take too long due to unfairness.
49. What does congestion try to do before Adaptive Resync?
We were using congestion to provide three different properties:
• Discover the bandwidth of the devices
• Fairly balance different classes of I/O (80% guest I/O, 20% resync I/O)
• Push back on clients to slow down
New approach: have a separate layer for each guarantee.
[Diagram: Bandwidth Regulator → Fairness Scheduler → Back Pressure → Backend]
50. Adaptive Resync deep dive
Per-disk-group scheduler:
• The bandwidth regulator discovers the de-staging rate (adaptive signal: write-buffer fill; adaptive throttle: bandwidth limit).
• The dispatch scheduler fairly balances different classes of I/O (80% guest I/O, 20% resync I/O).
• Back-pressure congestion pushes back on clients to slow down (adaptive signal: scheduler queue fill; adaptive throttle: latency per op).
[Diagram: clients → DOM back-pressure congestion → dispatch scheduler (queues generate back-pressure) → bandwidth regulator (adaptive, driven by WB fill) → LSOM fullness signal (LSOM congestion)]
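The two adaptive throttles on this slide can be sketched as a pair of simple controllers. The gains and curve shapes are invented for illustration and are not the shipping implementation:

```python
# Sketch of the two adaptive throttles: a bandwidth regulator driven by
# write-buffer fill, and client back-pressure driven by scheduler queue
# fill. Linear/quadratic shapes and constants are illustrative only.

def bandwidth_limit(wb_fill_frac, device_bw):
    # Fuller write buffer -> lower admitted bandwidth toward the backend.
    return device_bw * max(0.0, 1.0 - wb_fill_frac)

def backpressure_delay_ms(queue_fill_frac, max_delay_ms=10.0):
    # Fuller dispatch queues -> more latency injected per client op.
    return max_delay_ms * min(1.0, queue_fill_frac) ** 2

print(bandwidth_limit(0.25, device_bw=1000))  # 750.0 admitted
print(backpressure_delay_ms(0.5))             # 2.5 ms added per op
```

Keeping the two signals separate is the point of the layered design: buffer fill governs how fast the backend is fed, while queue fill governs how hard clients are slowed down.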
51. The technical challenges
Manage write-buffer fullness:
• Adaptively discover the bandwidth (adding latency is not fair)
• Fairly share between I/O classes: resync, VM, namespace
• Easy, because you can see what's waiting
Put backpressure on the clients:
• Difficult to share bandwidth across hosts: you can't see across the wire into what's waiting on the other side
• Allocating and reclaiming shares is complex and timing-based
• Instead we use latency: we don't need to see what's waiting
58. Diving into the backend
Answer the following questions:
• Too much outstanding I/O?
• Is it first-tier latency?
• Is it de-staging latency?
• Device or network issue?
59. Diagram
[Scheduler diagram, as on slide 50] The top half shows whether we have too much outstanding I/O.
60. Diagram
[Scheduler diagram, as on slide 50] This is where we can see if it is a sizing issue (too much I/O queuing up).
61. Diagram
[Scheduler diagram, as on slide 50] A very high amount of outstanding I/O causes backpressure congestion.
62. Diagram
[Scheduler diagram, as on slide 50] Backend = latency including the queues and everything below them.
63. Diagram
[Scheduler diagram, as on slide 50] Disk groups are where we see first-tier latency.
64. Diagram
[Scheduler diagram, as on slide 50] This is where we see the de-stage rate.
65. Diagram
[Scheduler diagram, as on slide 50] Disk group congestion shows the signal from LSOM.
66. Diagram
[Scheduler diagram, as on slide 50] Disk group congestion comes from WB fill. Also:
• Many log entries (small writes, many objects)
• Component congestion (small writes, one object)
• Memory usage (rare)
67. Diving into the backend
Answer the following questions:
• Is it first-tier performance?
• Is it de-staging performance?
• Too much outstanding I/O?
• Device or network issue?
70.
• Guest and resync I/O should be in a 4:1 ratio.
• The ratio is measured on normalized bandwidth (penalty for small I/Os).
• If one type is not using its whole bandwidth, the other can claim the leftover.
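The 4:1 split with leftover claiming can be sketched as a work-conserving allocation. This is illustrative only; the real scheduler operates on normalized bandwidth with penalties for small I/Os, which this sketch omits:

```python
# Work-conserving 4:1 (guest:resync) bandwidth split. If one class demands
# less than its share, the other may claim the leftover. Illustrative only.

def share(total_bw, guest_demand, resync_demand, ratio=(4, 1)):
    g_share = total_bw * ratio[0] / sum(ratio)   # guest's guaranteed 80%
    r_share = total_bw * ratio[1] / sum(ratio)   # resync's guaranteed 20%
    guest = min(guest_demand, g_share)
    resync = min(resync_demand, r_share)
    # Each class may claim unused bandwidth up to its remaining demand.
    leftover = total_bw - guest - resync
    guest += min(leftover, guest_demand - guest)
    leftover = total_bw - guest - resync
    resync += min(leftover, resync_demand - resync)
    return guest, resync

print(share(1000, guest_demand=900, resync_demand=500))  # contention: strict 4:1 split
print(share(1000, guest_demand=100, resync_demand=500))  # guest idle: resync claims leftover
```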
73. Get ahead of the curve: vSAN private beta
• vSAN Data Protection: native enterprise-grade protection
• vSAN File Services: expanding vSAN beyond block storage
• Cloud Native Storage: persistent storage for containers
Sign up at http://www.vmware.com/go/vsan-beta