vSAN Resiliency and
Performance @ Scale
Sumit Lahiri Product Line Manager
Eric Knauft Staff Engineer
#vmworld#HCI2427BU
HCI2427BU
Agenda
©2018 VMware, Inc.
​vSAN quick deep dive
​I/O flow
​Resynchronization
​Availability
​Performance Deep Dive : Eric
Disk layout in a single vSAN server
disk group disk group disk group disk group disk group
Disk groups contribute to single vSAN datastore in vSphere cluster
Cache
Capacity
vSAN Datastore
§ Max 64 nodes
§ Min 2 nodes (ROBO)
§ Max 5 disk groups per host
§ 2 tiers per disk group
vSAN very quick Overview
vSAN Datastore
§ Pools local storage into a single resource pool
§ Delivers enterprise-grade scale & performance
§ Managed through policies
§ Integrates compute & storage management into a single pane
vSAN Component Layout
VMDK (512GB)
R1 (RAID-1)
R0 (RAID-0) · R0 (RAID-0) · R0 (RAID-0)
C1 C2 (components) · C1 C2 (components) · C1 C2 (components)
HFT = 2, FTM = RAID-1, Stripe Width = 2
Note: No blocks are allocated at this time
(Component size = 256 GB)
Witness components not shown
Each replica on a different Fault Domain (e.g. a host)
R1 (RAID-1)
R0 (RAID-0) · R0 (RAID-0) · R0 (RAID-0)
C1 C2 (components) · C1 C2 (components) · C1 C2 (components)
HFT = 2, FTM = RAID-1, Stripe Width = 2
(Component size = 256 GB)
Witness components not shown
VMDK (512 GB)
(Blocks: 4 MB)
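The layout math above can be sketched in a few lines. This is an illustrative back-of-envelope model, not vSAN's actual placement logic; the witness tie-breaker rule in particular is heavily simplified here.

```python
def component_layout(hft, stripe_width, vmdk_gb):
    """Illustrative layout math for a mirrored (RAID-1) object.

    Assumptions (for illustration only): FTM is RAID-1, each replica is a
    RAID-0 of `stripe_width` components, and a witness component is added
    as a tie-breaker when the vote count would otherwise be even.
    """
    replicas = hft + 1                       # RAID-1 keeps HFT+1 full copies
    data_components = replicas * stripe_width
    # Quorum needs a strict majority of votes; with an even component
    # count a witness acts as tie-breaker (real vote math is richer).
    witnesses = 1 if data_components % 2 == 0 else 0
    raw_capacity_gb = replicas * vmdk_gb     # raw capacity consumed
    return data_components, witnesses, raw_capacity_gb
```

For the slide's example (HFT = 2, stripe width 2, 512 GB VMDK) this yields six 256 GB data components and 1,536 GB of raw capacity.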
CMMDS: maintains an inventory of all things vSAN
C: Cluster, M: Membership, M: Monitoring, D: Directory, S: Service
§ Distributed directory service
§ In-memory
§ Persisted on disk
§ Elects the Object Owner
​vSAN Objects and Placement
​Storage Policies
​RAID Configurations
​Cluster Membership
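The CMMDS pattern described here (an in-memory directory on the master, persisted entries, and nodes subscribing only to the updates they care about) can be sketched as a toy publish/subscribe directory. Names and structure are illustrative, not the real implementation:

```python
from collections import defaultdict

class TinyDirectory:
    """Toy CMMDS-style directory: the master keeps an in-memory inventory
    of cluster entries and pushes an update only to nodes that subscribed
    to that entry's key (purely illustrative)."""

    def __init__(self):
        self.entries = {}                     # in-memory inventory
        self.subscribers = defaultdict(list)  # key -> callbacks

    def subscribe(self, key, callback):
        self.subscribers[key].append(callback)

    def publish(self, key, value):
        self.entries[key] = value
        for cb in self.subscribers[key]:      # only interested nodes hear
            cb(key, value)
```

A node that owns an object would subscribe to that object's key and ignore everything else, which is why directory updates scale with interest, not cluster size.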
Master receives updates from all other nodes
Backup Node · Agent Node · Master Node
​vSAN Objects and Placement
​Storage Policies
​RAID Configurations
Receives updates from all hosts in the cluster
Other nodes subscribe for object-specific updates
​Cluster Membership
CLOM: ensures the object has a configuration that matches the policy
CLOM: Cluster Level Object Manager
§ One per node
§ Finds a placement configuration that will meet the policy
§ Needs to be aware of the placement of all objects on the node
§ Communicates with the CMMDS service running on the same node
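The placement decision CLOM makes can be illustrated with a tiny greedy picker: one distinct fault domain (host) per replica, each with enough free capacity. This is a simplification for intuition only; the real algorithm weighs far more constraints.

```python
def place_replicas(hosts_free_gb, replicas, replica_gb):
    """Toy CLOM-style placement: choose one distinct fault domain (host)
    per replica, preferring the hosts with the most free capacity.
    Greedy and illustrative only."""
    candidates = sorted(hosts_free_gb, key=hosts_free_gb.get, reverse=True)
    chosen = [h for h in candidates if hosts_free_gb[h] >= replica_gb][:replicas]
    if len(chosen) < replicas:
        raise ValueError("not enough fault domains to satisfy the policy")
    return chosen
```

With hypothetical hosts `{"h1": 500, "h2": 300, "h3": 100, "h4": 400}` and three 256 GB replicas, the picker lands on h1, h4 and h2; h3 is skipped because it cannot hold a replica.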
DOM: manages the I/O flow from the VM
DOM: Distributed Object Manager
§ One per object
§ Implements the placement configuration prescribed by CLOM
§ Ensures object consistency (creation, rebuild & reconfiguration)
§ Implements the distributed RAID logic
Schematic Representation of single VMDK deployment
Steady state layout
C1
C2
C1
C2
W
Master
DOM: Distributed object owner
Schematic Representation of single VMDK deployment
Each partition elects its CMMDS
C1
C2
C1
C2
W
Master
DOM: Distributed object owner · Each partition elects its Master
Object has Quorum and Availability
1
2
Partition-01 Partition-02
DOM owner created
Inaccessible state
Schematic Representation of single VMDK deployment
Each partition elects its CMMDS when there is a network partition
C1
C2
C1
C2
W
Master
DOM: Distributed object owner
1
2
The VM fails over (via HA) to the partition that meets the liveness criteria
Partition meets the liveness criteria for the object
Partition-01 Partition-02
Each partition elects its Master
Object has Quorum & Availability
Agenda
​vSAN quick deep dive
​I/O flow
​Resync
​Availability
All Flash I/O flow: architectural layout
H1 H2 H3
VMDK
Cache Tier
Capacity Tier
Replica -1 Replica -2
Capacity Tier
DOM: Distributed object owner
All Flash I/O flow: DOM and LSOM
H1 H2 H3
VMDK
Cache Tier
Capacity Tier
Replica -1 Replica -2
DOM: Distributed object owner
LSOM: Log-structured object manager
All Flash I/O flow: I/O issued by VM
H1 H1
1
VMDK vSAN Object
VM issues write DOM: one per object
VM DOM LSOM
All Flash I/O flow: DOM checks for conflicting I/O
H1 H1
1
VMDK vSAN Object
2
VM issues write
§ Check for conflicting I/Os on the same I/O range
§ Serialize the request
VM DOM LSOM
All Flash I/O flow: DOM sends prepare request to LSOM
H1 H1
1 VM issues write
VMDK vSAN Object
2
Check for conflicting I/Os
3
3
Send prepare
request to LSOM
VM DOM LSOM
All Flash I/O flow: LSOM commits to cache
H1 H1
1 VM issues write
VMDK vSAN Object
2
Check for conflicting I/Os
3
3
Send prepare request
to LSOM
4
§ LSOM commits to cache
§ No dedupe at this stage (dedupe happens later, at destage)
4
VM DOM LSOM
All Flash I/O flow: CMMDS master is not on the I/O path
H1 H1
1 VM issues write
VMDK vSAN Object
2
Check for conflicting I/Os
3
3
Send prepare request
to LSOM
4
4
VM DOM LSOM
I/O flow doesn’t go through the CMMDS master
LSOM commits
to cache
All Flash I/O flow: I/O ack propagated back to VM
H1 H1
1 VM issues write
VMDK vSAN Object
2
Check for conflicting I/Os
3
3
Send prepare request
to LSOM
4
LSOM commits
to cache
4
5
Sends Ack back to DOM
6
Sends Ack back to VM
VM DOM LSOM
All Flash I/O flow: DOM sends ack back to LSOM
H1 H1
1 VM issues write
VMDK vSAN Object
2
3
3
Send prepare request
to LSOM
4
LSOM commits
to cache
4
5
Sends Ack back to DOM
6
Sends Ack back to VM
VM DOM LSOM
Check for conflicting I/Os
7 DOM sends ack back to LSOM
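The numbered write path walked through on the previous slides can be summarized as a trace. This is a narration aid only; the function and message names are illustrative, not real vSAN APIs.

```python
def write_path(io, replica_hosts):
    """Trace of the write steps from these slides: serialize conflicting
    writes, prepare/commit on every replica's cache, ack the VM, then
    acknowledge the commit back to LSOM."""
    log = ["1: VM issues write",
           "2: DOM checks for conflicting I/O on the same range and serializes it"]
    for host in replica_hosts:
        log.append(f"3: DOM sends prepare request to LSOM on {host}")
        log.append(f"4: LSOM on {host} commits {io} to the cache tier")
        log.append(f"5: LSOM on {host} acks DOM")
    log.append("6: DOM acks the VM")  # guest-visible latency ends here
    log.append("7: DOM sends the ack back to LSOM")
    return log
```

Note the key property from the slides: the CMMDS master never appears in this trace; only DOM and the LSOM instances on the replica hosts are on the data path.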
All Flash I/O flow: Elevator de-stages to capacity
VMDK vSAN Object
1 Block allocation: is the block allocated?
Yes: over-write the block. No: allocate a logical block in a 4 MB chunk
2 Dedupe, compress, encrypt
3 Write to media in 4 KB chunks
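The destage steps above can be sketched as a small function: allocate the 4 MB logical chunk if needed, run the space-efficiency pipeline, then emit 4 KB media writes. The pipeline is stubbed out; everything here is illustrative, not the actual elevator code.

```python
CHUNK_MB = 4   # logical allocation unit from the slide
BLOCK_KB = 4   # media write granularity from the slide

def transform(data):
    # Placeholder for the dedupe / compress / encrypt pipeline.
    return data

def destage(block_addr, allocated_chunks, data):
    """Toy elevator destage step: overwrite if the 4 MB chunk is already
    allocated, otherwise allocate it first; then write to capacity media
    in 4 KB blocks."""
    chunk = block_addr // (CHUNK_MB * 1024 * 1024)
    if chunk not in allocated_chunks:
        allocated_chunks.add(chunk)          # allocate logical 4 MB chunk
    payload = transform(data)
    step = BLOCK_KB * 1024
    return [payload[i:i + step] for i in range(0, len(payload), step)]
```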
Agenda
​vSAN quick deep dive
​I/O flow
​Resync
Availability
Schematic representation of how Resync works
Example of full Resync
R1
R0 R0
W
C2 C2 C2 C2
Degraded state
Witness component
Schematic representation of how Resync works
Full Resync is initiated
R1
R0 R0
W
C2 C2 C2 C2
Witness component
A
R0
C1 C2
Begin Resync
Begin ResyncDegraded state
Schematic representation of how Resync works
Full Resync completes and the degraded component is marked for deletion
R1
R0 R0
W
C2 C2 C2 C2
Witness component
R0
C1 C2
Marked for deletion
B: Resync completes & degraded components are marked for deletion
Degraded state
Schematic representation of how Resync works
By contrast, partial rebuilds have fewer blocks to resync
R1
R0 R0
W
C2 C2 C2 C2
Degraded state
Witness component
Partial Repair
Examples of partial rebuild
R0
C2 C2
Degraded state
Partial Repair
R0
C2 C2
A
R0
C1 C2
Begin Resync
Begin Resync
Partial Rebuild · Full Rebuild
Host comes out of maintenance mode
Recovery from transient failure
Partial or full reconstruction of the RAID tree
§ Block-level copy
§ No RAID tree construction
Examples of rebuilds
R0
C2 C2
Degraded state
Partial Repair
R0
C2 C2
A
R0
C1 C2
Begin Resync
Begin Resync
Partial Rebuild Full Rebuild
Host comes out of maintenance mode
Recovery from transient failure
Permanent disk or host failure
Disk Rebalancing
Delta Writes
Finally, changing the storage configuration triggers a full rebuild
R0
C2 C2
Degraded state
Partial Repair
R0
C2 C2
A
R0
C1 C2
Begin Resync
Begin Resync
Partial Rebuild Full Rebuild
Host comes out of maintenance mode
Recovery from transient failure
Permanent disk or host failure
Disk Rebalancing
Delta Writes
Storage policy change
Agenda
​vSAN quick deep dive
​I/O flow
​Resync
Availability
First permanent failure initiates rebuild
Replica -1 Replica -2
Replica -3
1 Event 1: The first host is down
2 vSAN begins a full rebuild
Intuition on planning for Availability
The probability of an availability impact is the joint probability of:
§ a first failure, followed by
§ at least 2 more failures before the rebuild completes
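That joint probability can be put into rough numbers. The model below is a deliberately crude back-of-envelope sketch (independent failures, a fixed rebuild window); real availability modelling must account for correlated and transient failures, repair delays, and more.

```python
from math import comb

def availability_impact_probability(n_components, annual_fail_rate,
                                    rebuild_hours, extra_failures=2):
    """Back-of-envelope: P(first failure) times P(at least
    `extra_failures` of the remaining components also fail before the
    rebuild window closes), assuming independent failures."""
    p_first = 1 - (1 - annual_fail_rate) ** n_components
    hourly = annual_fail_rate / (365 * 24)
    p_in_window = 1 - (1 - hourly) ** rebuild_hours
    m = n_components - 1
    p_extra = sum(comb(m, k) * p_in_window**k * (1 - p_in_window)**(m - k)
                  for k in range(extra_failures, m + 1))
    return p_first * p_extra
```

Even this crude model shows the levers from the next slides: halving the rebuild time or shrinking the failure domain directly shrinks the joint probability.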
Factors affecting Availability
Probability of component failure
§ Scope of failure: disk, disk group, server
§ Size of the cluster
§ MTBF ratings
Data to Resync
§ Duration of failure: permanent vs. transient
§ Scope of failure: disk, disk group, server
Time to Resync
§ Size of cluster: larger clusters allow higher resync parallelization
§ Resync bandwidth allocation
Approaches to improving Availability (and Durability)
Reduce Component Failures
§ Select enterprise-grade drives with higher endurance and higher MTBF ratings
§ Degraded device handling
Amount of data to Resync
§ CLOM repair delay settings
§ Avoid policy changes
§ Point Fix
§ Smart Repairs
§ What-if Assessments
Resync ETAs
§ Adaptive Resynchronization
§ General performance improvements
Performance Deep Dive
Agenda
• Performance Fundamentals
• Adaptive Resync Architecture
• Monitoring Tools
Write Buffer Architecture
​Writes go to a first-tier device in a fast sequential log
​Native device bandwidth absorbs short bursts
​Cold data is deduplicated and compressed as it moves out to the second tier
Guest Writes First Tier
Capacity Tier
destaging
Write Buffer Architecture
​Writes go to a first-tier device in a fast sequential log
​Native device bandwidth absorbs short bursts
​Cold data is deduplicated and compressed as it moves out to the second tier
This de-staging process is slower than first-tier writes
If we have sustained write workloads, we need to smoothly find equilibrium
Guest Writes First Tier
Capacity Tier
destaging
(Chart: bandwidth over time; first-tier vs. capacity-tier bandwidth)
Congestion In Action (Pre-Adaptive Resync)
​We make this transition via a congestion signal
​Congestion is adaptive: apply a greater throttle until we reach equilibrium
​Congestion stops rising when the incoming rate equals the de-staging rate
Guest Writes First Tier
Capacity Tier
destaging
(Chart: bandwidth over time; congestion throttles first-tier bandwidth down to equilibrium with the capacity tier)
Queueing Delay
Is high latency a hardware problem or a sizing problem?
​Storage devices have some parallelism, but there is a limit
​At first, more outstanding IO means more bandwidth (same latency)
​Once we hit max parallelism, more outstanding IO means more latency (same bandwidth)
(Charts: bandwidth vs. outstanding IO; latency vs. outstanding IO)
​Often high latency is the most visible symptom
Queueing Delay
Is high latency a hardware problem or a sizing problem?
(Charts: bandwidth vs. outstanding IO; latency vs. outstanding IO)
​Did we push the system too far?
​Or is there an issue with the hardware?
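The two curves on these slides fall out of a tiny queueing model (essentially Little's law: outstanding IO = throughput × latency). The parallelism limit and service time below are made-up illustration values, not properties of any real device.

```python
def model_device(oio, max_parallelism, service_ms):
    """Toy queueing model from the slide: below the device's parallelism
    limit, more outstanding IO adds bandwidth at flat latency; past it,
    bandwidth is flat and latency grows linearly with queue depth."""
    if oio <= max_parallelism:
        latency_ms = service_ms
        iops = oio * 1000 / service_ms
    else:
        iops = max_parallelism * 1000 / service_ms  # saturated bandwidth
        latency_ms = oio * 1000 / iops              # queueing delay grows
    return iops, latency_ms
```

So for a hypothetical device with parallelism 8 and 1 ms service time, 4 outstanding IOs give 4,000 IOPS at 1 ms, while 16 outstanding IOs still give 8,000 IOPS but at 2 ms: high latency caused by sizing, not by broken hardware.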
Adaptive Resync: Customer-Visible Before and After
​Before: the more resyncs were happening, the larger their share of destage bandwidth.
• Many resyncs + low workload → drives up latency of VM IO
• Few resyncs + high workload → resync takes a long time
​Adaptive Resync: resync gets 20% of the bandwidth (if contended)
• Resync can use more if guest IO is underutilizing the device
​Upgrades, policy changes, and rebalances should not be scary or take too long due to unfairness.
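The stated policy (resync guaranteed roughly 20% when contended, with either class free to claim bandwidth the other leaves idle) can be sketched as a simple allocator. The proportions come from the talk; the algorithm below is a simplification, not the shipping scheduler.

```python
def split_bandwidth(total_mbps, guest_demand, resync_demand, resync_share=0.20):
    """Illustrative split: guarantee resync its share when contended,
    then let each class claim whatever the other leaves unused."""
    resync = min(resync_demand, resync_share * total_mbps)
    guest = min(guest_demand, total_mbps - resync)
    # hand leftover to whichever side still has demand
    leftover = total_mbps - guest - resync
    resync = min(resync_demand, resync + leftover)
    leftover = total_mbps - guest - resync
    guest = min(guest_demand, guest + leftover)
    return guest, resync
```

Fully contended, a 1,000 MB/s device splits 800/200; if the guest only wants 100 MB/s, resync is free to take the remaining 900.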
What Congestion tried to do before Adaptive Resync
​We were using Congestion to provide three different properties:
• Discover the bandwidth of the devices
• Fairly balance different classes of IO (80% guest IO, 20% resync IO)
• Push back on clients to slow down
​New approach: have a separate layer for each guarantee.
Bandwidth Regulator
Fairness Scheduler
Back Pressure
Backend
Adaptive Resync Deep Dive
​Per-disk-group scheduler
​Bandwidth regulator discovers the destaging rate
• Adaptive signal: write buffer fill
• Adaptive throttle: bandwidth limit
​Dispatch scheduler fairly balances different classes of IO
• (80% guest IO, 20% resync IO)
​Back-pressure congestion pushes back on clients to slow down
• Adaptive signal: scheduler queue fill
• Adaptive throttle: latency per op
(Diagram: Clients → back-pressure congestion → Dispatch Scheduler → Bandwidth Regulator → LSOM; queues generate back-pressure, and the write-buffer fill signal from LSOM drives the adaptive bandwidth limit)
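The back-pressure throttle ("adaptive signal: queue fill, adaptive throttle: latency per op") can be illustrated as a simple control curve: no added latency while the scheduler queue is shallow, then a per-op delay that ramps up as the queue fills. The thresholds and maximum delay below are invented for the example.

```python
def backpressure_delay_ms(queue_fill, low=0.3, high=0.9, max_delay_ms=8.0):
    """Illustrative back-pressure curve: queue_fill is the scheduler
    queue's fraction full (0..1). Delay per op ramps linearly between
    the low and high watermarks so clients slow down before overflow."""
    if queue_fill <= low:
        return 0.0
    if queue_fill >= high:
        return max_delay_ms
    return max_delay_ms * (queue_fill - low) / (high - low)
```

Using latency as the throttle is what makes this work across hosts: the sender does not need to see the remote queue, it just observes slower ops and naturally backs off.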
The Technical Challenges
​Manage Write Buffer Fullness
• We want to fairly share the write buffer
• Adaptively discover the bandwidth
• Adding latency is not fair
• Fairly share between IO classes: Resync, VM, Namespace
• Easy because you can see what's waiting
​Put Backpressure on the Clients
• Difficult to share bandwidth across hosts
• Can't see across the wire into what's waiting on the other side
• Would need to allocate and reclaim shares (complex, timing-based)
• Instead we use latency: no need to see what's waiting
And you can monitor all of this in vSphere
We'll show the graphs at every layer
Cluster-Level View
Sequential Write Workload
Virtual Machine View
Sequential Write Workload
Diving into the backend
Answer the following questions:
• Too much outstanding IO?
• Is it first-tier latency?
• Is it de-staging latency?
• Device or network issue?
59©2018 VMware, Inc.
The top half of the diagram shows whether we have too much outstanding IO
60©2018 VMware, Inc.
This is where we can see if it is a sizing issue (too much IO queuing up)
61©2018 VMware, Inc.
A very high amount of outstanding IO causes back-pressure congestion
62©2018 VMware, Inc.
Backend = latency including the queues and everything below them
63©2018 VMware, Inc.
Disk groups are where we see first-tier latency
64©2018 VMware, Inc.
This is where we see the de-stage rate
65©2018 VMware, Inc.
Disk-group congestion shows the signal from LSOM
66©2018 VMware, Inc.
Disk-group congestion comes from write-buffer fill. Also:
• Many log entries (small writes, many objects)
• Component congestion (small writes, one object)
• Memory usage (rare)
67©2018 VMware, Inc.
Diving into the backend
Answer the following questions:
• Is it first-tier performance?
• Is it de-staging performance?
• Too much outstanding IO?
• Device or network issue?
What about resync fairness?
• Guest and resync IO should be in a 4:1 ratio
• The ratio is measured on normalized bandwidth (a penalty is applied for small IOs)
• If one type is not using its whole bandwidth, the other can claim the leftover
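The "normalized bandwidth" idea can be sketched as a cost function: charge every IO as if it were at least some minimum size, so a flood of tiny ops cannot claim an outsized share. The 32 KB floor below is an assumption for illustration, not a documented vSAN constant.

```python
def normalized_mbps(iops, io_size_kb, floor_kb=32):
    """Illustrative normalized bandwidth: small IOs are charged as if
    they were at least `floor_kb` large, penalizing tiny ops when the
    4:1 guest-to-resync ratio is enforced."""
    charged_kb = max(io_size_kb, floor_kb)
    return iops * charged_kb / 1024
```

Under this model, 1,000 × 4 KB ops are charged as 31.25 MB/s (not 3.9), so they compete with large sequential IO on roughly even terms.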
Resync fairness applies even when we have congestion
​Now you can upgrade and do maintenance with peace of mind
Get Ahead of the Curve – vSAN Private Beta
​vSAN Data Protection: native enterprise-grade protection
​vSAN File Services: expanding vSAN beyond block storage
​Cloud Native Storage: persistent storage for containers
Sign up at http://www.vmware.com/go/vsan-beta
vSAN Performance and Resiliency at Scale
vSAN Performance and Resiliency at Scale
vSAN Performance and Resiliency at Scale

Más contenido relacionado

La actualidad más candente

Dell VMware Virtual SAN Ready Nodes
Dell VMware Virtual SAN Ready NodesDell VMware Virtual SAN Ready Nodes
Dell VMware Virtual SAN Ready Nodes
Andrew McDaniel
 
Introdução ao windows server
Introdução ao windows serverIntrodução ao windows server
Introdução ao windows server
GuiTelmoRicardo
 

La actualidad más candente (20)

VxRail Appliance - Modernize your infrastructure and accelerate IT transforma...
VxRail Appliance - Modernize your infrastructure and accelerate IT transforma...VxRail Appliance - Modernize your infrastructure and accelerate IT transforma...
VxRail Appliance - Modernize your infrastructure and accelerate IT transforma...
 
vSAN architecture components
vSAN architecture componentsvSAN architecture components
vSAN architecture components
 
What’s New in VMware vSphere 7?
What’s New in VMware vSphere 7?What’s New in VMware vSphere 7?
What’s New in VMware vSphere 7?
 
Vmware training presentation
Vmware training presentationVmware training presentation
Vmware training presentation
 
VMware Virtual SAN Presentation
VMware Virtual SAN PresentationVMware Virtual SAN Presentation
VMware Virtual SAN Presentation
 
VMware Tanzu Introduction
VMware Tanzu IntroductionVMware Tanzu Introduction
VMware Tanzu Introduction
 
Dell VMware Virtual SAN Ready Nodes
Dell VMware Virtual SAN Ready NodesDell VMware Virtual SAN Ready Nodes
Dell VMware Virtual SAN Ready Nodes
 
VMware vSphere 6.0 - Troubleshooting Training - Day 5
VMware vSphere 6.0 - Troubleshooting Training - Day 5VMware vSphere 6.0 - Troubleshooting Training - Day 5
VMware vSphere 6.0 - Troubleshooting Training - Day 5
 
Integrating Linux Systems with Active Directory Using Open Source Tools
Integrating Linux Systems with Active Directory Using Open Source ToolsIntegrating Linux Systems with Active Directory Using Open Source Tools
Integrating Linux Systems with Active Directory Using Open Source Tools
 
VMware NSX 101: What, Why & How
VMware NSX 101: What, Why & HowVMware NSX 101: What, Why & How
VMware NSX 101: What, Why & How
 
Designing your XenApp 7.5 Environment
Designing your XenApp 7.5 EnvironmentDesigning your XenApp 7.5 Environment
Designing your XenApp 7.5 Environment
 
From Pivotal to VMware Tanzu: What you need to know
From Pivotal to VMware Tanzu: What you need to knowFrom Pivotal to VMware Tanzu: What you need to know
From Pivotal to VMware Tanzu: What you need to know
 
VMware vSAN - Novosco, June 2017
VMware vSAN - Novosco, June 2017VMware vSAN - Novosco, June 2017
VMware vSAN - Novosco, June 2017
 
VMware Cloud on AWS 環境のデータ保護ベスト・プラクティスご紹介!!
VMware Cloud on AWS 環境のデータ保護ベスト・プラクティスご紹介!!VMware Cloud on AWS 環境のデータ保護ベスト・プラクティスご紹介!!
VMware Cloud on AWS 環境のデータ保護ベスト・プラクティスご紹介!!
 
Red Hat Insights
Red Hat InsightsRed Hat Insights
Red Hat Insights
 
Presentation v mware virtual san 6.0
Presentation   v mware virtual san 6.0Presentation   v mware virtual san 6.0
Presentation v mware virtual san 6.0
 
NSX-T Architecture and Components.pptx
NSX-T Architecture and Components.pptxNSX-T Architecture and Components.pptx
NSX-T Architecture and Components.pptx
 
Introdução ao windows server
Introdução ao windows serverIntrodução ao windows server
Introdução ao windows server
 
Hci solution with VxRail
Hci solution with VxRailHci solution with VxRail
Hci solution with VxRail
 
What is the Citrix?
What is the Citrix?What is the Citrix?
What is the Citrix?
 

Similar a vSAN Performance and Resiliency at Scale

Storage and hyper v - the choices you can make and the things you need to kno...
Storage and hyper v - the choices you can make and the things you need to kno...Storage and hyper v - the choices you can make and the things you need to kno...
Storage and hyper v - the choices you can make and the things you need to kno...
Louis Göhl
 
Hyper-V Best Practices & Tips and Tricks
Hyper-V Best Practices & Tips and TricksHyper-V Best Practices & Tips and Tricks
Hyper-V Best Practices & Tips and Tricks
Amit Gatenyo
 

Similar a vSAN Performance and Resiliency at Scale (20)

Running DataStax Enterprise in VMware Cloud and Hybrid Environments
Running DataStax Enterprise in VMware Cloud and Hybrid EnvironmentsRunning DataStax Enterprise in VMware Cloud and Hybrid Environments
Running DataStax Enterprise in VMware Cloud and Hybrid Environments
 
infraxstructure: Stas Levitan, "Always On" business in cloud - 2016"
infraxstructure: Stas Levitan, "Always On" business in cloud - 2016"infraxstructure: Stas Levitan, "Always On" business in cloud - 2016"
infraxstructure: Stas Levitan, "Always On" business in cloud - 2016"
 
Containerize Legacy .NET Framework Web Apps for Cloud Migration - ENT201 - Ch...
Containerize Legacy .NET Framework Web Apps for Cloud Migration - ENT201 - Ch...Containerize Legacy .NET Framework Web Apps for Cloud Migration - ENT201 - Ch...
Containerize Legacy .NET Framework Web Apps for Cloud Migration - ENT201 - Ch...
 
Containerize Legacy .NET Framework Web Apps for Cloud Migration
Containerize Legacy .NET Framework Web Apps for Cloud Migration Containerize Legacy .NET Framework Web Apps for Cloud Migration
Containerize Legacy .NET Framework Web Apps for Cloud Migration
 
Building enterprise class disaster recovery as a service to aws - session spo...
Building enterprise class disaster recovery as a service to aws - session spo...Building enterprise class disaster recovery as a service to aws - session spo...
Building enterprise class disaster recovery as a service to aws - session spo...
 
Containerize Legacy .NET Framework Web Apps for Cloud Migration (WIN305) - AW...
Containerize Legacy .NET Framework Web Apps for Cloud Migration (WIN305) - AW...Containerize Legacy .NET Framework Web Apps for Cloud Migration (WIN305) - AW...
Containerize Legacy .NET Framework Web Apps for Cloud Migration (WIN305) - AW...
 
Run Stateful Apps on Kubernetes with VMware PKS - Highlight WebLogic Server
Run Stateful Apps on Kubernetes with VMware PKS - Highlight WebLogic Server Run Stateful Apps on Kubernetes with VMware PKS - Highlight WebLogic Server
Run Stateful Apps on Kubernetes with VMware PKS - Highlight WebLogic Server
 
Monitoring CloudStack and components
Monitoring CloudStack and componentsMonitoring CloudStack and components
Monitoring CloudStack and components
 
Connectivity Options for VMware Cloud on AWS Software Defined Data Centers (S...
Connectivity Options for VMware Cloud on AWS Software Defined Data Centers (S...Connectivity Options for VMware Cloud on AWS Software Defined Data Centers (S...
Connectivity Options for VMware Cloud on AWS Software Defined Data Centers (S...
 
Running Production Workloads in VMware Cloud on AWS (ENT313-S) - AWS re:Inven...
Running Production Workloads in VMware Cloud on AWS (ENT313-S) - AWS re:Inven...Running Production Workloads in VMware Cloud on AWS (ENT313-S) - AWS re:Inven...
Running Production Workloads in VMware Cloud on AWS (ENT313-S) - AWS re:Inven...
 
ENT208 Transform your Business with VMware Cloud on AWS
ENT208 Transform your Business with VMware Cloud on AWSENT208 Transform your Business with VMware Cloud on AWS
ENT208 Transform your Business with VMware Cloud on AWS
 
EMC VSPEX for Virtualizing Your Data Center
EMC VSPEX for Virtualizing Your Data CenterEMC VSPEX for Virtualizing Your Data Center
EMC VSPEX for Virtualizing Your Data Center
 
Zerto Virtual Replication 4.5
Zerto Virtual Replication 4.5Zerto Virtual Replication 4.5
Zerto Virtual Replication 4.5
 
Storage and hyper v - the choices you can make and the things you need to kno...
Storage and hyper v - the choices you can make and the things you need to kno...Storage and hyper v - the choices you can make and the things you need to kno...
Storage and hyper v - the choices you can make and the things you need to kno...
 
The Unofficial VCAP / VCP VMware Study Guide
The Unofficial VCAP / VCP VMware Study GuideThe Unofficial VCAP / VCP VMware Study Guide
The Unofficial VCAP / VCP VMware Study Guide
 
Hyper-V Best Practices & Tips and Tricks
Hyper-V Best Practices & Tips and TricksHyper-V Best Practices & Tips and Tricks
Hyper-V Best Practices & Tips and Tricks
 
AWS Summit Auckland - Sponsor Presentation - Zerto
AWS Summit Auckland - Sponsor Presentation - ZertoAWS Summit Auckland - Sponsor Presentation - Zerto
AWS Summit Auckland - Sponsor Presentation - Zerto
 
ZERTO Introduction to End User Presentation
ZERTO Introduction to End User PresentationZERTO Introduction to End User Presentation
ZERTO Introduction to End User Presentation
 
VMware HCI solutions - 2020-01-16
VMware HCI solutions - 2020-01-16VMware HCI solutions - 2020-01-16
VMware HCI solutions - 2020-01-16
 
Load Balancing, Failover and Scalability with ColdFusion
Load Balancing, Failover and Scalability with ColdFusionLoad Balancing, Failover and Scalability with ColdFusion
Load Balancing, Failover and Scalability with ColdFusion
 

Último

%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
masabamasaba
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
VictorSzoltysek
 
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
masabamasaba
 
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
masabamasaba
 

Último (20)

10 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 202410 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 2024
 
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the past
 
%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...

vSAN Performance and Resiliency at Scale

  • 1. vSAN Resiliency and Performance @ Scale Sumit Lahiri Product Line Manager Eric Knauft Staff Engineer #vmworld#HCI2427BU HCI2427BU
  • 2. Agenda 2©2018 VMware, Inc. ​vSAN quick deep dive ​I/O flow ​Resynchronization ​Availability ​Performance Deep Dive : Eric
  • 3. Disk layout in a single vSAN server disk group disk group disk group disk group disk group Disk groups contribute to single vSAN datastore in vSphere cluster Cache Capacity vSAN Datastore § Max 64 nodes § Min 2 nodes (ROBO) § Max 5 Disk Groups per host § 2 Tiers per Disk Group
  • 4. vSAN very quick Overview vSAN Datastore § Pools local storage into a single resource pool § Delivers enterprise-grade scale & performance § Managed through policies § Integrates compute & storage management into a single pane
  • 5. vSAN Component Layout VMDK (512GB) R1 R0 R0 C1 C2 (components) (RAID-1) (RAID-0) C1 C2 (components) R0 C1 C2 (components) HFT = 2, FTM = RAID-1, Stripe Width = 2 Note: No blocks are allocated at this time (SIZE = 256GB) Witness components not shown (RAID-0) (RAID-0)
  • 6. Each replica on different Fault Domain (e.g. host) R1 R0 R0 C1 C2 (components) (RAID-1) (RAID-0) C1 C2 (components) R0 C1 C2 (components) HFT = 2, FTM = RAID-1, Stripe Width = 2 (SIZE = 256GB) Witness components not shown VMDK (512GB) (RAID-0) (RAID-0) (BLOCKS: 4MB)
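The policy on the two slides above can be sanity-checked with a small sketch (my own illustrative helper, not a vSAN API): a RAID-1 object keeps FTT + 1 full replicas, and each replica is split RAID-0 across Stripe Width components.

```python
# Hypothetical helper, not part of vSAN: count the data components a
# mirrored (RAID-1) object needs for a given storage policy.
def data_components(failures_to_tolerate: int, stripe_width: int) -> int:
    replicas = failures_to_tolerate + 1   # RAID-1 keeps FTT + 1 full copies
    return replicas * stripe_width        # each copy is RAID-0 striped

# Policy from the slide: HFT = 2, Stripe Width = 2
# -> 3 replicas x 2 stripes = 6 data components (witnesses not counted).
print(data_components(2, 2))  # 6
```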
  • 7. 7©2018 VMware, Inc. CMMDS: Maintains inventory of all things vSAN C: Cluster M: Membership M: Monitoring D: Directory S: Service v Distributed Directory Service v In-memory v Persisted on disk v Elects Object Owner ​vSAN Objects and Placement ​Storage Policies ​RAID Configurations ​Cluster Membership
  • 8. 8©2018 VMware, Inc. Master receives updates from all other nodes Backup Node Agent Node Master Node ​vSAN Objects and Placement ​Storage Policies ​RAID Configurations Receives updates from all hosts in the cluster Other Nodes subscribe for object specific updates ​Cluster Membership
  • 9. 9©2018 VMware, Inc. CLOM : ensures object has configuration that matches the policy CLOM: cluster level object manager C C C C C CLOM: Cluster level Object Manager v One per node v Find placement configuration that will meet the policy v Needs to be aware of the placement of all objects on the node v Communicates with CMMDS service running on the same node Master C C C C
  • 10. 10©2018 VMware, Inc. DOM : Manages I/O flow from VM DOM: Distributed Object Manager C C C C DOM : Distributed Object Manager v One per object v Implements the placement configuration prescribed by CLOM v Ensure Object consistency (creation, rebuild & reconfiguration) v Implements distributed RAID logic Master One per Object C C C C
  • 11. 11©2018 VMware, Inc. Schematic Representation of single VMDK deployment Steady state layout C1 C2 C1 C2 W Master DOM: Distributed object owner
  • 12. 12©2018 VMware, Inc. Schematic Representation of single VMDK deployment Each partition elects its CMMDS C1 C2 C1 C2 W Master DOM: Distributed object owner Each partition elects its Master Object has Quorum and Availability 1 2 Partition-01 Partition-02 DOM owner created Inaccessible state
  • 13. 13©2018 VMware, Inc. Schematic Representation of single VMDK deployment Each partition elects its CMMDS when there is a network partition C1 C2 C1 C2 W Master DOM: Distributed object owner 1 2 VM HA to the partition that meets the liveness criteria Partition meets the liveness criteria for object Partition-01 Partition-02 Each partition elects its Master Object has Quorum & Availability
  • 14. Agenda 14©2018 VMware, Inc. ​vSAN quick deep dive ​I/O flow ​Resync ​Availability
  • 15. 15©2018 VMware, Inc. All Flash I/O flow: architectural layout H1 H2 H3 VMDK Cache Tier Capacity Tier Replica-1 Replica-2 Capacity Tier DOM: Distributed object owner
  • 16. 16©2018 VMware, Inc. All Flash I/O flow: DOM and LSOM H1 H2 H3 VMDK Cache Tier Capacity Tier Replica-1 Replica-2 DOM: Distributed object owner LSOM: Log-structured object manager
  • 17. 17©2018 VMware, Inc. All Flash I/O flow: I/O issued by VM H1 H1 1 VMDK vSAN Object VM issues write DOM: one per object VM DOM LSOM
  • 18. 18©2018 VMware, Inc. All Flash I/O flow: DOM checks for conflicting I/Os H1 H1 1 VMDK vSAN Object 2 VM issues write v Check for conflicting I/Os on the same I/O range v and serialize the request VM DOM LSOM
  • 19. 19©2018 VMware, Inc. All Flash I/O flow: DOM sends prepare request to LSOM H1 H1 1 VM issues write VMDK vSAN Object 2 Check for conflicting I/Os 3 3 Send prepare request to LSOM VM DOM LSOM
  • 20. 20©2018 VMware, Inc. All Flash I/O flow: LSOM commits to cache H1 H1 1 VM issues write VMDK vSAN Object 2 Check for conflicting I/Os 3 3 Send prepare request to LSOM 4 v LSOM commits to cache v No Dedupe 4 VM DOM LSOM
  • 21. 21©2018 VMware, Inc. All Flash I/O flow: CMMDS master is not on the I/O path H1 H1 1 VM issues write VMDK vSAN Object 2 Check for conflicting I/Os 3 3 Send prepare request to LSOM 4 4 VM DOM LSOM I/O flow doesn’t go through the CMMDS master LSOM commits to cache
  • 22. 22©2018 VMware, Inc. All Flash I/O flow: I/O ack propagated back to VM H1 H1 1 VM issues write VMDK vSAN Object 2 Check for conflicting I/Os 3 3 Send prepare request to LSOM 4 LSOM commits to cache 4 5 Sends Ack back to DOM 6 Sends Ack back to VM VM DOM LSOM
  • 23. 23©2018 VMware, Inc. All Flash I/O flow: DOM sends ack back to LSOM H1 H1 1 VM issues write VMDK vSAN Object 2 3 3 Send prepare request to LSOM 4 LSOM commits to cache 4 5 Sends Ack back to DOM 6 Sends Ack back to VM VM DOM LSOM Check for conflicting I/Os DOM sends ack back to LSOM 7
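The numbered steps on slides 17–23 can be collapsed into a toy sketch (illustrative only; `Lsom`, `dom_write`, and the ack strings are invented names, not vSAN's real interfaces, and steps 6 and 7 are collapsed here even though the real flow acks the VM first):

```python
# Toy sketch of the write sequence described on the slides.
class Lsom:
    def __init__(self):
        self.log = []                     # write-buffer log on the cache tier

    def prepare(self, offset, data):
        self.log.append((offset, data))   # step 4: commit to cache, no dedupe yet
        return "ack"                      # step 5: ack back to DOM

    def commit_ack(self, offset):
        pass                              # step 7: log entry may now be retired

def dom_write(replicas, offset, data, in_flight):
    if offset in in_flight:               # step 2: conflicting I/O on same range
        raise RuntimeError("overlapping write in flight: serialize it")
    in_flight.add(offset)
    acks = [r.prepare(offset, data) for r in replicas]   # step 3: prepare to each LSOM
    assert all(a == "ack" for a in acks)
    for r in replicas:
        r.commit_ack(offset)              # step 7: DOM acks back to LSOM
    in_flight.discard(offset)
    return "ack-to-vm"                    # step 6: ack back to the VM

replicas = [Lsom(), Lsom()]
print(dom_write(replicas, offset=0, data=b"x", in_flight=set()))
```

Note how the CMMDS master never appears in the sequence: as slide 21 stresses, cluster membership is off the I/O path.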
  • 24. 24©2018 VMware, Inc. All Flash I/O flow: Elevator de-stages to capacity VMDK vSAN Object 1 Block Allocation: Is Allocated? Over-write block Allocate logical block at 4MB chunk No Yes 2 Dedupe, compress, encrypt 3 Write to media @ 4KB chunk
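The allocate-or-overwrite decision on slide 24 can be sketched as follows (an assumed simplification of the elevator logic, with steps 2 and 3 reduced to comments):

```python
# Assumed simplification of the destage decision: logical space is
# allocated in 4 MB chunks, data reaches the capacity tier in 4 KB blocks.
CHUNK = 4 * 1024 * 1024    # 4 MB logical allocation unit
BLOCK = 4 * 1024           # 4 KB physical write unit

def destage(allocated_chunks, offset):
    chunk = offset // CHUNK
    if chunk not in allocated_chunks:   # "Is Allocated?" -> No
        allocated_chunks.add(chunk)     # allocate a logical 4 MB chunk
    # else: Yes -> overwrite in place
    # step 2 (dedupe, compress, encrypt) omitted; step 3 writes 4 KB blocks
    return chunk, offset // BLOCK

chunks = set()
print(destage(chunks, 5 * 1024 * 1024))  # (1, 1280); next write to chunk 1 overwrites
```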
  • 25. Agenda 25©2018 VMware, Inc. ​vSAN quick deep dive ​I/O flow ​Resync Availability
  • 26. 26©2018 VMware, Inc. Schematic representation of how Resync works Example of full Resync R1 R0 R0 W C2C2C2C2 Degraded state Witness component
  • 27. 27©2018 VMware, Inc. Schematic representation of how Resync works Full Resync is initiated R1 R0 R0 W C2C2C2C2 Witness component A R0 C1 C2 Begin Resync Begin ResyncDegraded state
  • 28. 28©2018 VMware, Inc. Schematic representation of how Resync works Full Resync completes and degraded component is marked for deletion R1 R0 R0 W C2C2C2C2 Witness component R0 C1 C2 Marked for deletion B Resync completes & degraded components are marked for deletion Degraded state
  • 29. 29©2018 VMware, Inc. Schematic representation of how Resync works By contrast, partial rebuilds have fewer blocks to resync R1 R0 R0 W C2C2C2C2 Degraded state Witness component Partial Repair
  • 30. 30©2018 VMware, Inc. Examples of partial rebuild R0 C2C2 Degraded state Partial Repair R0 C2C2 A R0 C1 C2 Begin Resync Begin Resync Partial Rebuild Full Rebuild Host comes out of maintenance mode Recovery from transient failure Partial or full reconstruction of RAID tree v Block level copy v No RAID tree construction
  • 31. 31©2018 VMware, Inc. Examples of rebuilds R0 C2C2 Degraded state Partial Repair R0 C2C2 A R0 C1 C2 Begin Resync Begin Resync Partial Rebuild Full Rebuild Host comes out of maintenance mode Recovery from transient failure Permanent disk or host failure Disk Rebalancing Delta Writes
  • 32. 32©2018 VMware, Inc. Finally changing storage config is full rebuild R0 C2C2 Degraded state Partial Repair R0 C2C2 A R0 C1 C2 Begin Resync Begin Resync Partial Rebuild Full Rebuild Host comes out of maintenance mode Recovery from transient failure Permanent disk or host failure Disk Rebalancing Delta Writes Storage policy change
  • 33. Agenda 33©2018 VMware, Inc. ​vSAN quick deep dive ​I/O flow ​Resync Availability
  • 34. 34©2018 VMware, Inc. First permanent failure initiates rebuild Replica -1 Replica -2 Replica -3 Event 1: The first host is down 1 2 vSAN begins full rebuild
  • 35. 35©2018 VMware, Inc. Intuition on planning for Availability Probability of Availability Impact is: Joint probability of: v First failure followed by v at least 2 more failures before rebuild completes
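That intuition can be made concrete with toy numbers (the probabilities below are assumed for illustration, not vSAN measurements), treating host failures as independent:

```python
from math import comb

# Assumed toy numbers: per-host probability of failing during one rebuild
# window, in a 10-host cluster (9 hosts remain after the first failure).
p = 0.001
hosts_remaining = 9

def p_at_least(k, n, prob):
    # binomial tail: probability that k or more of n hosts fail
    return sum(comb(n, i) * prob**i * (1 - prob)**(n - i) for i in range(k, n + 1))

# Joint probability from the slide: a first failure, then at least two
# more failures among the remaining hosts before the rebuild completes.
p_impact = p * p_at_least(2, hosts_remaining, p)
print(f"{p_impact:.2e}")  # tiny, and it shrinks further as the rebuild window shrinks
```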
  • 36. 36©2018 VMware, Inc. Factors affecting Availability Probability of component failure v Type of failure: disk, disk group, server v Size of the cluster v MTBF ratings
  • 37. 37©2018 VMware, Inc. Factors affecting Availability Probability of component failure v Scope of failure: disk, disk group, server v Size of the cluster v MTBF ratings Data to Resync v Duration of failure: permanent vs. transient v Type of failure: disk, disk group and server
  • 38. 38©2018 VMware, Inc. Factors affecting Availability Probability of component failure v Type of failure: disk, disk group, server v Size of the cluster v MTBF ratings Data to Resync v Duration of failure: permanent vs. transient v Type of failure: disk, disk group and server Time to Resync v Size of Cluster: larger clusters have higher resync parallelization v Resync bandwidth allocation
  • 39. 39©2018 VMware, Inc. v Select enterprise grade drives with higher endurance and higher MTBFs v Degraded device handling Approaches to improving Availability (and Durability) Reduce Component Failures
  • 40. 40©2018 VMware, Inc. v Select enterprise grade drives with higher endurance and higher MTBFs v Degraded device handling v CLOM repair delay settings v Avoid policy changes v Point Fix v Smart Repairs v What-if Assessments Approaches to improving Availability (and Durability) Reduce Component Failures Amount of data to Resync
  • 41. 41©2018 VMware, Inc. v Select enterprise grade drives with higher endurance and higher MTBFs v Degraded device handling v CLOM repair delay settings v Avoid policy changes v Point Fix v Smart Repairs v What-if Assessments v Adaptive Resynchronization v General performance Improvements Approaches to improving Availability (and Durability) Reduce Component Failures Amount of data to Resync Resync ETAs
  • 42. 42©2018 VMware, Inc. Performance Deep Dive Agenda • Performance Fundamentals • Adaptive Resync Architecture • Monitoring Tools
  • 43. 43©2018 VMware, Inc. Write Buffer Architecture Writes go to a first tier device in a fast sequential log Native device bandwidth to absorb short bursts Cold data is deduplicated and compressed as it moves out to second tier Guest Writes First Tier Capacity Tier destaging
  • 44. 44©2018 VMware, Inc. Write Buffer Architecture Writes go to a first tier device in a fast sequential log Native device bandwidth to absorb short bursts Cold data is deduplicated and compressed as it moves out to second tier This de-staging process is slower than first tier writes. If we have sustained write workloads, we need to smoothly find equilibrium Guest Writes First Tier Capacity Tier destaging Time Bandwidth 1st Tier Bandwidth Capacity Tier Bandwidth
  • 45. 45©2018 VMware, Inc. Congestion In Action (Pre-Adaptive Resync) We make this transition via a congestion signal Congestion is adaptive – apply a greater throttle until we reach equilibrium Congestion stops rising when incoming rate equals de-staging rate Guest Writes First Tier Capacity Tier destaging Time Bandwidth 1st Tier Bandwidth Capacity Tier Bandwidth Congestion Equilibrium
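The feedback loop on this slide, where congestion rises until the admitted write rate matches the de-stage rate, can be simulated with a minimal sketch (our own simplification; the real algorithm and its constants are internal to vSAN):

```python
# Minimal feedback-loop sketch: raise congestion while the write buffer is
# filling, ease off while it drains, until incoming matches de-staging.
def settle(incoming, destage_rate, steps=1000):
    congestion, fill = 0, 0.0
    admitted = 0.0
    for _ in range(steps):
        admitted = incoming * max(0.0, 1 - congestion / 255)  # throttle writes
        fill += admitted - destage_rate     # write buffer fills or drains
        fill = max(fill, 0.0)
        if fill > 0:
            congestion = min(255, congestion + 1)  # buffer filling: throttle more
        else:
            congestion = max(0, congestion - 1)    # buffer draining: ease off
    return admitted, congestion

rate, c = settle(incoming=500, destage_rate=300)
print(round(rate))  # the admitted rate converges toward the 300 MB/s de-stage bandwidth
```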
  • 46. 46©2018 VMware, Inc. Storage devices have some parallelism, but there is a limit At first, more outstanding IO means more bandwidth (same latency) Once we hit max parallelism, more outstanding IO means more latency (same bandwidth) Queueing Delay Is high latency a hardware problem or a sizing problem? Outstanding IO Bandwidth Outstanding IO Latency
  • 47. 47©2018 VMware, Inc. Often high latency is the most visible symptom Queueing Delay Is high latency a hardware problem or a sizing problem? Outstanding IO Bandwidth Outstanding IO Latency Did we push the system too far? Or is there an issue with hardware?
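The bandwidth and latency curves on these two slides follow directly from Little's Law (a general queueing result, not anything vSAN-specific); a toy device model makes the sizing-vs-hardware distinction concrete:

```python
# Toy device model (assumed numbers): up to 32 ops in flight, 1 ms service
# time each. Below the parallelism limit, more outstanding IO buys
# bandwidth; above it, it only buys latency.
def latency_ms(oio, max_parallelism=32, per_op_ms=1.0):
    bandwidth_iops = min(oio, max_parallelism) * 1000 / per_op_ms
    # Little's Law: L = lambda * W  =>  W = L / lambda
    return oio / bandwidth_iops * 1000

print(latency_ms(16))    # below the limit: ~1 ms, the OIO bought bandwidth
print(latency_ms(128))   # above the limit: ~4 ms, the OIO only bought latency
```

So a latency spike with a matching rise in outstanding IO suggests a sizing problem; latency rising at constant, modest OIO points at the hardware or network instead.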
  • 48. 48©2018 VMware, Inc. Before: the more resyncs were happening, the larger their share of destage bandwidth. • Many resyncs + low workload → drives up latency of VM IO • Few resyncs + high workload → resync takes a long time Adaptive Resync: resync should get 20% of the bandwidth (if contended) • We can use more if the guest IO is underutilizing the device Upgrade, policy change, and rebalance should not be scary or take too long due to unfairness. Adaptive Resync Customer Visible Before-and-After
  • 49. 49©2018 VMware, Inc. We are using Congestion to provide three different properties: • Discover the bandwidth of the devices • Fairly balance different classes of IO (80% guest IO, 20% resync IO) • Push back on clients to slow down New approach: have a separate layer for each guarantee. What does Congestion try to do before Adaptive Resync Bandwidth Regulator Fairness Scheduler Back Pressure Backend
  • 50. 50©2018 VMware, Inc. Adaptive Resync Deep Dive Per Disk-Group scheduler Bandwidth regulator discovers the destaging rate • Adaptive signal: write buffer fill • Adaptive throttle: bandwidth limit Dispatch Scheduler fairly balances different classes of IO • (80% guest IO, 20% resync IO) Back pressure congestion pushes back on clients to slow down • Adaptive signal: scheduler queue fill • Adaptive throttle: latency per op Bandwidth Regulator DOM LSOM Fullness signal (LSOM congestion) Dispatch Scheduler Queues generate Back-pressure Clients Back-pressure Congestion WB Fill Adaptive Adaptive
  • 51. 51©2018 VMware, Inc. We want to fairly share the write buffer Adaptively discover the bandwidth • Adding latency is not fair Fairly share between IO Classes • Resync • VM • Namespace Easy because you can see what's waiting. Difficult to share bandwidth across hosts Can't see across the wire into what's waiting on the other side Need to allocate and reclaim shares. • Complex, timing-based Instead we use latency • Don't need to see what's waiting Manage Write Buffer Fullness Put Backpressure on the Clients The Technical Challenges
  • 52. 52©2018 VMware, Inc. And you can monitor this all in vSphere We’ll show the graphs at every layer
  • 53. 53©2018 VMware, Inc. Cluster Level View Sequential Write Workload
  • 54. 54©2018 VMware, Inc. Diagram Bandwidth Regulator DOM LSOM Fullness signal (congestion) Dispatch Scheduler Queues generate Back-pressure Clients Back-pressure Congestion WB Fill
  • 55. 55©2018 VMware, Inc. Virtual Machine View Sequential Write Workload
  • 58. 58©2018 VMware, Inc. Diving into the backend Answer the following questions: • Too many Outstanding IO? • Is it first tier latency? • Is it de-staging latency? • Device or Network Issue?
  • 59. 59©2018 VMware, Inc. Diagram Bandwidth Regulator DOM LSOM Fullness signal (congestion) Dispatch Scheduler Queues generate Back-pressure Clients Back-pressure Congestion WB Fill The top half shows if we have too much Outstanding IO
  • 60. 60©2018 VMware, Inc. Diagram Bandwidth Regulator DOM LSOM Fullness signal (congestion) Dispatch Scheduler Queues generate Back-pressure Clients Back-pressure Congestion WB Fill This is where we can see if it is a sizing issue (too much IO queuing up)
  • 61. 61©2018 VMware, Inc. Diagram Bandwidth Regulator DOM LSOM Fullness signal (congestion) Dispatch Scheduler Queues generate Back-pressure Clients Back-pressure Congestion WB Fill A very high amount of Outstanding IO causes backpressure congestion
  • 62. 62©2018 VMware, Inc. Diagram Bandwidth Regulator DOM LSOM Fullness signal (congestion) Dispatch Scheduler Queues generate Back-pressure Clients Back-pressure Congestion WB Fill Backend = Latency including queues and below
  • 63. 63©2018 VMware, Inc. Diagram Bandwidth Regulator DOM LSOM Fullness signal (congestion) Dispatch Scheduler Queues generate Back-pressure Clients Back-pressure Congestion WB Fill Disk Groups are where we see first tier latency
  • 64. 64©2018 VMware, Inc. Diagram Bandwidth Regulator DOM LSOM Fullness signal (congestion) Dispatch Scheduler Queues generate Back-pressure Clients Back-pressure Congestion WB Fill This is where we see the de-stage rate
  • 65. 65©2018 VMware, Inc. Diagram Bandwidth Regulator DOM LSOM Fullness signal (congestion) Dispatch Scheduler Queues generate Back-pressure Clients Back-pressure Congestion WB Fill Disk Groups congestion shows the signal from LSOM
  • 66. 66©2018 VMware, Inc. Diagram Bandwidth Regulator DOM LSOM Fullness signal (congestion) Dispatch Scheduler Queues generate Back-pressure Clients Back-pressure Congestion WB Fill Disk Groups congestion comes from WB Fill Also: • Many log entries (small writes, many objects) • Component Congestion (small writes, one object) • Memory usage (rare)
  • 67. 67©2018 VMware, Inc. Diving into the backend Answer the following questions: • Is it first tier performance? • Is it de-staging performance? • Too many Outstanding IO? • Device or Network Issue?
  • 69. ©2018 VMware, Inc. 69 What about resync fairness?
  • 70. 70©2018 VMware, Inc. • Should be in 4:1 ratio • Ratio is measured on normalized bandwidth (penalty for small IOs) • If one type is not using the whole bandwidth, the other can claim the leftover
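The 4:1 leftover-claiming rule on this slide can be sketched numerically (an assumed model for illustration, not the shipping scheduler's code):

```python
# Assumed model of the 4:1 policy: guest IO gets 80% and resync 20% of
# normalized bandwidth when both are contending, and either class may
# claim bandwidth the other leaves unused.
def split(capacity, guest_demand, resync_demand):
    guest = min(guest_demand, 0.8 * capacity)     # guaranteed guest share
    resync = min(resync_demand, 0.2 * capacity)   # guaranteed resync share
    leftover = capacity - guest - resync
    guest += min(guest_demand - guest, max(leftover, 0.0))    # guest claims leftover
    leftover = capacity - guest - resync
    resync += min(resync_demand - resync, max(leftover, 0.0))  # then resync
    return guest, resync

print(split(1000, guest_demand=900, resync_demand=50))   # resync underuses: guest gets 900
print(split(1000, guest_demand=900, resync_demand=400))  # contended: 800 / 200 split
```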
  • 71. 71©2018 VMware, Inc. Resync Fairness Applies even when we have congestion
  • 72. 72©2018 VMware, Inc. Now you can upgrade and do maintenance with peace of mind
  • 73. 73©2018 VMware, Inc. Get Ahead of the Curve – vSAN Private Beta ​vSAN Data Protection ​Native enterprise-grade protection ​vSAN File Services ​Expanding vSAN beyond block storage ​Cloud Native Storage ​Persistent storage for containers Sign up at http://www.vmware.com/go/vsan-beta