vSAN Performance and Resiliency at Scale

3. Disk layout in a single vSAN server
[Diagram: disk groups, each with a cache tier and a capacity tier, contribute to a single vSAN datastore in the vSphere cluster]
§ Max 64 nodes
§ Min 2 nodes (ROBO)
§ Max 5 disk groups per host
§ 2 tiers per disk group
4. vSAN: a very quick overview
vSAN Datastore:
§ Pools local storage into a single resource pool
§ Delivers enterprise-grade scale & performance
§ Managed through policies
§ Integrates compute & storage management into a single pane
5. vSAN Component Layout
HFT = 2, FTM = RAID-1, Stripe Width = 2
[Diagram: a 512 GB VMDK object. A RAID-1 node (R1) mirrors across three RAID-0 nodes (R0); each R0 is striped into components C1 and C2 (SIZE = 256 GB each). Witness components not shown]
Note: no blocks are allocated at this time.
6. Each replica on a different fault domain (e.g. host)
HFT = 2, FTM = RAID-1, Stripe Width = 2
[Diagram: the same 512 GB VMDK RAID tree, with each RAID-0 replica placed on a different fault domain. Components are 256 GB; blocks are allocated in 4 MB chunks. Witness components not shown]
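The component math on these two slides can be sketched as a quick calculation. This is a simplified illustration, not vSAN's placement logic; the function name and return shape are invented for the example, and witnesses are ignored:

```python
# Simplified model of vSAN component counts for FTM = RAID-1.
# Not actual vSAN code; it only illustrates how HFT and stripe width
# multiply out into components (witness components ignored).

def raid1_components(vmdk_gb, hft, stripe_width):
    replicas = hft + 1                    # RAID-1 mirrors to tolerate hft failures
    component_gb = vmdk_gb / stripe_width # RAID-0 splits each replica into stripes
    return {
        "replicas": replicas,
        "data_components": replicas * stripe_width,
        "component_size_gb": component_gb,
    }

layout = raid1_components(vmdk_gb=512, hft=2, stripe_width=2)
print(layout)  # 3 replicas, 6 data components of 256 GB each
```

With HFT = 2 and stripe width 2 this reproduces the tree above: three RAID-0 replicas, each split into two 256 GB components.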
7. CMMDS: Maintains an inventory of all things vSAN
©2018 VMware, Inc.
C: Cluster
M: Membership
M: Monitoring
D: Directory
S: Service
v Distributed directory service
v In-memory, persisted on disk
v Elects the object owner
[Diagram: CMMDS inventory covers vSAN objects and placement, storage policies, RAID configurations, and cluster membership]
8. Master receives updates from all other nodes
v The master node receives updates from all hosts in the cluster.
v Other nodes subscribe to object-specific updates.
[Diagram: Master Node, Backup Node, and Agent Node; CMMDS inventory: vSAN objects and placement, storage policies, RAID configurations, cluster membership]
9. CLOM: Ensures an object has a configuration that matches the policy
CLOM: Cluster Level Object Manager
v One per node
v Finds a placement configuration that will meet the policy
v Needs to be aware of the placement of all objects on the node
v Communicates with the CMMDS service running on the same node
[Diagram: components (C) spread across hosts; one host acts as Master]
10. DOM: Manages I/O flow from the VM
DOM: Distributed Object Manager
v One per object
v Implements the placement configuration prescribed by CLOM
v Ensures object consistency (creation, rebuild & reconfiguration)
v Implements distributed RAID logic
[Diagram: components (C) spread across hosts; one host acts as Master]
12. Schematic representation of a single VMDK deployment
Each partition elects its own CMMDS master.
[Diagram: the object's components (C1, C2 in each replica) and witness W split across Partition-01 and Partition-02]
1. In the partition where the object has quorum and availability, a DOM owner (distributed object owner) is created.
2. In the other partition, the object is in an inaccessible state.
13. Schematic representation of a single VMDK deployment
Each partition elects its own CMMDS master when there is a network partition.
[Diagram: the object's components and witness split across Partition-01 and Partition-02]
1. The partition that holds a majority of components meets the liveness criteria for the object: the object has quorum and availability there.
2. vSphere HA restarts the VM in the partition that meets the liveness criteria.
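The quorum decision in this partition scenario can be sketched as a simple vote count. This is a simplified model; real vSAN assigns votes per component, and the witness exists to break ties:

```python
# Simplified quorum check: an object is accessible in a partition only if
# that partition holds a strict majority of the object's votes.
# Illustrative only; real vSAN vote assignment is more involved.

def has_quorum(votes_in_partition, total_votes):
    return votes_in_partition > total_votes / 2

# Object with components C1, C2 (one vote each) plus witness W (one vote):
total = 3
print(has_quorum(2, total))  # partition holding a component plus W: True
print(has_quorum(1, total))  # partition holding only one component: False
```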
15. All-flash I/O flow: architectural layout
[Diagram: hosts H1, H2, H3. The VMDK's DOM (distributed object owner) sits on one host and mirrors I/O to Replica-1 and Replica-2, each backed by a cache tier and a capacity tier]
16. All-flash I/O flow: DOM and LSOM
[Diagram: as above; DOM (distributed object owner) coordinates the replicas, while LSOM (log-structured object manager) manages the cache and capacity tiers on each host]
17. All-flash I/O flow: I/O issued by VM
[Diagram: VM → DOM → LSOM; the VMDK is a vSAN object]
1. The VM issues a write; it is handled by the object's DOM (one per object).
18. All-flash I/O flow: DOM checks for free space
1. VM issues a write.
2. DOM checks for conflicting I/Os on the same I/O range and serializes the request.
19. All-flash I/O flow: DOM sends prepare request to LSOM
1. VM issues a write.
2. DOM checks for conflicting I/Os.
3. DOM sends a prepare request to LSOM on each replica.
20. All-flash I/O flow: LSOM commits to cache
1. VM issues a write.
2. DOM checks for conflicting I/Os.
3. DOM sends a prepare request to LSOM.
4. LSOM commits the write to cache (no dedupe at this point).
21. All-flash I/O flow: CMMDS master is not on the I/O path
Steps 1–4 as above. The I/O flow doesn't go through the CMMDS master.
22. All-flash I/O flow: I/O ack propagated back to VM
1. VM issues a write.
2. DOM checks for conflicting I/Os.
3. DOM sends a prepare request to LSOM.
4. LSOM commits to cache.
5. LSOM sends an ack back to DOM.
6. DOM sends an ack back to the VM.
23. All-flash I/O flow: DOM sends ack back to LSOM
Steps 1–6 as above, then:
7. DOM sends an ack back to LSOM.
24. All-flash I/O flow: elevator de-stages to capacity
1. Block allocation: if the logical block is already allocated, over-write it; if not, allocate a logical block in a 4 MB chunk.
2. Dedupe, compress, encrypt.
3. Write to media in 4 KB chunks.
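The numbered write path on slides 17–24 (commit to cache, ack, then background de-staging) can be sketched as follows. The class and method names are invented for illustration and are not the actual DOM/LSOM interfaces:

```python
# Illustrative sketch of the all-flash write path described above:
# writes land in the cache-tier log and are acked immediately; a
# background elevator later de-stages them to capacity in 4 MB chunks.

CHUNK = 4 * 1024 * 1024  # logical blocks are allocated in 4 MB chunks

class Lsom:
    def __init__(self):
        self.cache = []       # write-buffer log (cache tier)
        self.capacity = {}    # allocated 4 MB logical chunks (capacity tier)

    def prepare_and_commit(self, offset, data):
        self.cache.append((offset, data))  # step 4: commit to cache
        return "ack"                       # step 5: ack back to DOM

    def destage(self):
        # Elevator: allocate-or-overwrite per 4 MB chunk; dedupe, compress,
        # and encrypt would happen here before the write to media.
        for offset, data in self.cache:
            chunk = offset // CHUNK
            self.capacity.setdefault(chunk, bytearray(CHUNK))  # allocate if new
            start = offset % CHUNK
            self.capacity[chunk][start:start + len(data)] = data  # over-write
        self.cache.clear()

lsom = Lsom()
assert lsom.prepare_and_commit(0, b"hello") == "ack"  # acked from cache
lsom.destage()
print(sorted(lsom.capacity))  # one 4 MB logical chunk was allocated
```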
27. Schematic representation of how resync works
A. A component is in a degraded state and a full resync is initiated: a new RAID-0 replica (C1, C2) is created and resync begins.
[Diagram: RAID-1 tree over RAID-0 replicas with witness component W; the degraded replica remains while the new one resyncs]
28. Schematic representation of how resync works
B. The full resync completes and the degraded components are marked for deletion.
[Diagram: as above, with the degraded replica marked for deletion]
29. Schematic representation of how resync works
By contrast, partial rebuilds have fewer blocks to resync.
[Diagram: RAID-1 tree with a degraded component, witness component, and a partial repair]
30. Examples of partial rebuild
[Diagram: partial repair of a degraded component vs. full rebuild of a new RAID-0 replica (C1, C2)]
Partial rebuild:
v Block-level copy; no RAID tree construction
v Triggers: host comes out of maintenance mode; recovery from transient failure
Full rebuild:
v Partial or full reconstruction of the RAID tree
31. Examples of rebuilds
[Diagram: partial repair vs. full rebuild]
Triggers:
v Host comes out of maintenance mode
v Recovery from transient failure
v Permanent disk or host failure
v Disk rebalancing
v Delta writes
32. Finally, changing the storage config is a full rebuild
[Diagram: partial repair vs. full rebuild]
In addition to the triggers above:
v Storage policy change
34. First permanent failure initiates rebuild
[Diagram: Replica-1, Replica-2, Replica-3]
1. Event 1: the first host goes down.
2. vSAN begins a full rebuild.
35. Intuition on planning for availability
The probability of an availability impact is the joint probability of:
v a first failure, followed by
v at least 2 more failures before the rebuild completes.
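Under a toy failure model, the joint probability above can be made concrete. The assumptions (independent host failures at a constant annual rate, Poisson arrivals within the rebuild window) are illustrative, not a vSAN sizing formula:

```python
# Toy model: probability that at least k more independent failures land
# inside the rebuild window after the first failure. Assumes a constant
# per-host failure rate and Poisson arrivals; purely illustrative.

import math

def p_extra_failures(n_hosts, annual_rate, rebuild_hours, k_needed=2):
    # Expected number of failures across the cluster during the window.
    lam = n_hosts * annual_rate * rebuild_hours / (24 * 365)
    # P(X >= k_needed) for X ~ Poisson(lam).
    return 1 - sum(math.exp(-lam) * lam**k / math.factorial(k)
                   for k in range(k_needed))

# Hypothetical: 32 hosts, 5% annual failure rate per host, 8-hour rebuild.
print(f"{p_extra_failures(32, 0.05, 8):.2e}")
```

Note how the result scales with the rebuild window: halving the resync time roughly quarters the probability of two further failures landing inside it, which is the motivation for the resync-speed work later in the deck.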
36. Factors affecting availability
Probability of component failure:
v Type of failure: disk, disk group, server
v Size of the cluster
v MTBF ratings

37. Factors affecting availability (continued)
Data to resync:
v Duration of failure: permanent vs. transient
v Type of failure: disk, disk group, server

38. Factors affecting availability (continued)
Time to resync:
v Size of the cluster: larger clusters have higher resync parallelization
v Resync bandwidth allocation
39. Approaches to improving availability (and durability)
Reduce component failures:
v Select enterprise-grade drives with higher endurance and higher MTBF
v Degraded device handling

40. Approaches to improving availability (and durability) (continued)
Reduce the amount of data to resync:
v CLOM repair delay settings
v Avoid policy changes
v Point fix
v Smart repairs
v What-if assessments

41. Approaches to improving availability (and durability) (continued)
Improve resync ETAs:
v Adaptive Resynchronization
v General performance improvements
43. Write buffer architecture
• Writes go to a first-tier device in a fast sequential log.
• Native device bandwidth absorbs short bursts.
• Cold data is deduplicated and compressed as it moves out to the second tier.
[Diagram: guest writes → first tier → destaging → capacity tier]
44. Write buffer architecture
The de-staging process is slower than first-tier writes. With sustained write workloads, we need to smoothly find an equilibrium.
[Diagram: guest writes → first tier → destaging → capacity tier; chart of bandwidth over time: first-tier bandwidth vs. capacity-tier bandwidth]
45. Congestion in action (pre-Adaptive Resync)
• We make this transition via a congestion signal.
• Congestion is adaptive: apply a greater throttle until we reach equilibrium.
• Congestion stops rising when the incoming rate equals the de-staging rate.
[Diagram: guest writes → first tier → destaging → capacity tier; chart: first-tier bandwidth converging to capacity-tier bandwidth at equilibrium]
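The equilibrium-seeking behavior can be simulated with a minimal feedback loop. This is a sketch of the idea only, not vSAN's actual congestion algorithm:

```python
# Minimal congestion feedback loop: throttle incoming writes harder as the
# write buffer fills, until the admitted rate matches the de-stage rate.
# Rates and the linear throttle are illustrative.

def simulate(incoming=1000, destage=400, buffer_cap=10_000, steps=200):
    fill, admitted = 0, incoming
    for _ in range(steps):
        congestion = fill / buffer_cap           # 0..1 signal from buffer fill
        admitted = incoming * (1 - congestion)   # throttle grows with congestion
        fill = max(0, fill + admitted - destage) # buffer drains at destage rate
    return admitted

rate = simulate()
print(round(rate))  # the admitted rate settles at the de-stage rate (400)
```

The buffer fill level stops rising exactly when the throttled incoming rate equals the de-stage rate, which is the equilibrium the slide describes.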
46. Queueing delay: is high latency a hardware problem or a sizing problem?
Storage devices have some parallelism, but there is a limit:
• At first, more outstanding I/O means more bandwidth (same latency).
• Once we hit max parallelism, more outstanding I/O means more latency (same bandwidth).
[Charts: bandwidth vs. outstanding I/O; latency vs. outstanding I/O]
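The two curves can be reproduced from Little's law (latency = outstanding I/O / throughput), assuming a device with a fixed internal parallelism limit; the numbers here are illustrative:

```python
# Little's law sketch: below max parallelism, more outstanding I/O buys
# bandwidth at constant latency; past it, bandwidth is flat and extra
# outstanding I/O only shows up as queueing latency. Numbers illustrative.

SERVICE_MS = 0.1      # per-I/O service time at the device (assumed)
MAX_PARALLEL = 32     # device's internal parallelism limit (assumed)

def behavior(oio):
    in_service = min(oio, MAX_PARALLEL)
    bandwidth_iops = in_service / (SERVICE_MS / 1000)  # concurrent ops / service time
    latency_ms = oio / bandwidth_iops * 1000           # Little's law: L = OIO / X
    return bandwidth_iops, latency_ms

for oio in (8, 32, 128):
    bw, lat = behavior(oio)
    print(f"OIO={oio:4d}  bandwidth={bw:9.0f} IOPS  latency={lat:.2f} ms")
```

Below 32 outstanding I/Os latency stays at the service time while bandwidth grows; at 128 outstanding I/Os bandwidth is unchanged but latency has quadrupled, which is the "sizing problem" signature.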
47. Often, high latency is the most visible symptom
Queueing delay: is high latency a hardware problem or a sizing problem?
[Charts: bandwidth vs. outstanding I/O; latency vs. outstanding I/O]
Did we push the system too far? Or is there an issue with the hardware?
48. Adaptive Resync: customer-visible before and after
Before: the more resyncs were happening, the larger their share of destage bandwidth.
• Many resyncs + low workload → drives up latency of VM I/O
• Few resyncs + high workload → resync takes a long time
Adaptive Resync: resync should get 20% of the bandwidth (if contended).
• It can use more if the guest I/O is underutilizing the device.
Upgrades, policy changes, and rebalances should not be scary or take too long due to unfairness.
49. What does congestion try to do before Adaptive Resync?
We were using congestion to provide three different properties:
• Discover the bandwidth of the devices
• Fairly balance different classes of I/O (80% guest I/O, 20% resync I/O)
• Push back on clients to slow down
New approach: have a separate layer for each guarantee.
[Diagram: Bandwidth Regulator → Fairness Scheduler → Back Pressure → Backend]
50. Adaptive Resync deep dive
Per-disk-group scheduler:
• The bandwidth regulator discovers the de-staging rate (adaptive signal: write-buffer fill; adaptive throttle: bandwidth limit).
• The dispatch scheduler fairly balances different classes of I/O (80% guest I/O, 20% resync I/O).
• Back-pressure congestion pushes back on clients to slow down (adaptive signal: scheduler queue fill; adaptive throttle: latency per op).
[Diagram: clients → DOM back-pressure congestion → dispatch scheduler (queues generate back-pressure) → bandwidth regulator (adaptive, driven by WB fill) → LSOM fullness signal (LSOM congestion)]
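The two adaptive throttles on this slide can be sketched as a pair of simple controllers. The gains and curve shapes are invented for illustration and are not the shipping implementation:

```python
# Sketch of the two adaptive throttles: a bandwidth regulator driven by
# write-buffer fill, and client back-pressure driven by scheduler queue
# fill. Linear/quadratic shapes and constants are illustrative only.

def bandwidth_limit(wb_fill_frac, device_bw):
    # Fuller write buffer -> lower admitted bandwidth toward the backend.
    return device_bw * max(0.0, 1.0 - wb_fill_frac)

def backpressure_delay_ms(queue_fill_frac, max_delay_ms=10.0):
    # Fuller dispatch queues -> more latency injected per client op.
    return max_delay_ms * min(1.0, queue_fill_frac) ** 2

print(bandwidth_limit(0.25, device_bw=1000))  # 750.0 admitted
print(backpressure_delay_ms(0.5))             # 2.5 ms added per op
```

Keeping the two signals separate is the point of the layered design: buffer fill governs how fast the backend is fed, while queue fill governs how hard clients are slowed down.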
51. The technical challenges
Manage write-buffer fullness:
• Adaptively discover the bandwidth (adding latency is not fair)
• Fairly share between I/O classes: resync, VM, namespace
• Easy, because you can see what's waiting
Put backpressure on the clients:
• Difficult to share bandwidth across hosts: you can't see across the wire into what's waiting on the other side
• Allocating and reclaiming shares is complex and timing-based
• Instead we use latency: we don't need to see what's waiting
58. Diving into the backend
Answer the following questions:
• Too much outstanding I/O?
• Is it first-tier latency?
• Is it de-staging latency?
• Device or network issue?
59. Diagram
[Scheduler diagram, as on slide 50] The top half shows whether we have too much outstanding I/O.
60. Diagram
[Scheduler diagram, as on slide 50] This is where we can see if it is a sizing issue (too much I/O queuing up).
61. Diagram
[Scheduler diagram, as on slide 50] A very high amount of outstanding I/O causes backpressure congestion.
62. Diagram
[Scheduler diagram, as on slide 50] Backend = latency including the queues and everything below them.
63. Diagram
[Scheduler diagram, as on slide 50] Disk groups are where we see first-tier latency.
64. Diagram
[Scheduler diagram, as on slide 50] This is where we see the de-stage rate.
65. Diagram
[Scheduler diagram, as on slide 50] Disk group congestion shows the signal from LSOM.
66. Diagram
[Scheduler diagram, as on slide 50] Disk group congestion comes from WB fill. Also:
• Many log entries (small writes, many objects)
• Component congestion (small writes, one object)
• Memory usage (rare)
67. Diving into the backend
Answer the following questions:
• Is it first-tier performance?
• Is it de-staging performance?
• Too much outstanding I/O?
• Device or network issue?
70.
• Guest and resync I/O should be in a 4:1 ratio.
• The ratio is measured on normalized bandwidth (penalty for small I/Os).
• If one type is not using its whole bandwidth, the other can claim the leftover.
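The 4:1 split with leftover claiming can be sketched as a work-conserving allocation. This is illustrative only; the real scheduler operates on normalized bandwidth with penalties for small I/Os, which this sketch omits:

```python
# Work-conserving 4:1 (guest:resync) bandwidth split. If one class demands
# less than its share, the other may claim the leftover. Illustrative only.

def share(total_bw, guest_demand, resync_demand, ratio=(4, 1)):
    g_share = total_bw * ratio[0] / sum(ratio)   # guest's guaranteed 80%
    r_share = total_bw * ratio[1] / sum(ratio)   # resync's guaranteed 20%
    guest = min(guest_demand, g_share)
    resync = min(resync_demand, r_share)
    # Each class may claim unused bandwidth up to its remaining demand.
    leftover = total_bw - guest - resync
    guest += min(leftover, guest_demand - guest)
    leftover = total_bw - guest - resync
    resync += min(leftover, resync_demand - resync)
    return guest, resync

print(share(1000, guest_demand=900, resync_demand=500))  # contention: strict 4:1 split
print(share(1000, guest_demand=100, resync_demand=500))  # guest idle: resync claims leftover
```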
73. Get ahead of the curve: vSAN private beta
• vSAN Data Protection: native enterprise-grade protection
• vSAN File Services: expanding vSAN beyond block storage
• Cloud Native Storage: persistent storage for containers
Sign up at http://www.vmware.com/go/vsan-beta