Managing Memory Bandwidth Antagonism @ Scale
David Lo, Dragos Sbirlea, Rohit Jnagal
Borg Model
● Large clusters with multi-tenant hosts.
● Run a mix of:
○ high- and low-priority workloads.
○ latency-sensitive and batch workloads.
● Isolation through bare-metal containers (cgroups/namespaces).
○ Cgroups and perf to monitor host and job performance.
○ Cgroups and h/w controls to manage on-node performance.
○ Cluster scheduling and balancing to manage service performance.
(Goals: Efficiency, Availability, Performance)
The Memory Bandwidth Problem
● Large variation in performance on multi-tenant hosts.
● On average, saturation events are few, but:
○ they periodically cause significant cluster-wide performance degradation.
● Some workloads are much more seriously affected than others.
○ This does not necessarily correlate with the victim’s memory bandwidth use.
[Figure: latency over time, spiking when an antagonist task starts]
Note: This talk is focused on the memory bandwidth (membw) problem for general servers and does not cover GPUs and other special devices. Similar techniques apply there too.
Memory BW Saturation is Increasing Over Time
[Chart: fraction of machines that experienced mem BW saturation, rising from Jan 2018 to Nov 2018]
Why It Is a (Bigger) Problem Now
● Large machines need to pack more jobs to maintain utilization, resulting in more “noisy neighbor” problems.
● ML workloads are memory BW intensive.
Understanding the Scope : Socket-Level Monitoring
● Track per-socket local and remote memory bandwidth use.
● Identify per-platform thresholds for performance dips (saturation).
● Characterize saturation by platform and cluster.
[Diagram: Socket 0 and Socket 1, each with local and remote memory read/write bandwidth paths]
Platform and Cluster Variation
Saturation behavior varies with platform and cluster, due to:
● hardware differences (membw/core ratio)
● workload (large CPU consumers run on bigger platforms)
[Charts: saturation distribution by platform and by cluster]
Monitoring Sockets ↣ Monitoring Tasks
● Socket-level information gives the magnitude of the problem and the hot-spots.
● Need task-level information to identify:
○ Abusers: tasks using a disproportionate amount of bandwidth.
○ Victims: tasks seeing a performance drop.
● New platforms provide task-level memory bandwidth monitoring, but:
○ the RDT cgroup interface was on its way out of the kernel;
○ there is no such data on older platforms.
For our purposes, a rough attribution of memory bandwidth was good enough.
[Charts: total memory bandwidth against the saturation threshold; per-task memory BW breakdown]
Per-task Memory Bandwidth Estimation
● Summary of requirements:
○ Local and remote bandwidth breakdown.
○ Compatible with the cgroup model.
● What's available in hardware?
○ Uncore counters (IMC, CHA)
■ Difficult to attribute to a HyperThread => cgroup.
○ CPU PMU counters
■ Counters are HyperThread-local.
■ Work with cgroup profiling mode.
[Diagram: DDR DIMMs and IMC feeding two CPU cores, each with a CHA and two HyperThreads (HT0, HT1)]
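As an aside on the cgroup profiling mode mentioned above, here is a hedged sketch of counting events for a single cgroup with perf; the cgroup name is an assumption, and the generic cycles event stands in for the CPU-specific memory-bandwidth events introduced on the next slide:

import subprocess

# Hedged sketch: count events for one cgroup via perf's cgroup mode (-G,
# which requires system-wide mode -a). "cycles" is a stand-in for the
# CPU-specific OFFCORE_RESPONSE encodings; "mycgroup" (a cgroup under the
# perf_event hierarchy) is an assumed name.
result = subprocess.run(
    ["perf", "stat", "-a", "-e", "cycles", "-G", "mycgroup",
     "--", "sleep", "1"],
    capture_output=True, text=True)
print(result.stderr)  # perf stat writes its counter summary to stderr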
Which CPU Perfmon to Use?
● OFFCORE_RESPONSE for Intel CPUs (documented in Intel SDM Vol 3).
● Programmable filter to specify events of interest (e.g. DRAM local and DRAM remote).
● Captures both demand load and HW prefetcher traffic.
● Online documentation of the meaning of bits, per CPU (download.01.org).
● How to interpret: cache lines/sec × 64 bytes/cache line = bandwidth.
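A minimal sketch of that conversion, assuming two readings of an OFFCORE_RESPONSE-style cache-line counter taken one sampling interval apart (how the counter is read is left abstract):

CACHE_LINE_BYTES = 64  # x86 cache line size

def bandwidth_bytes_per_sec(lines_t0, lines_t1, interval_sec):
    # The event counts cache lines transferred; convert the delta to bytes/sec.
    return (lines_t1 - lines_t0) * CACHE_LINE_BYTES / interval_sec

# Example: 30M cache lines in 1 second -> ~1.92 GB/s
print(bandwidth_bytes_per_sec(0, 30_000_000, 1.0) / 1e9)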
Insights from Task Measurement
Abuser insights
● For a large percentage of the time, a single consumer uses up most of the bandwidth.
● That consumer’s share of CPU is much lower than its share of membw.
Victim insights
● Many jobs are sensitive to membw saturation.
● Jobs are sensitive even though they are not big users of membw.
Guidance on enforcement options
● How much saturation would we avoid if we do X?
● Which jobs would get caught in the crossfire?
[Charts: number of jobs by CPI degradation on saturation (as a fraction); combinations of jobs (by CPU requirements) during saturation]
Enforcement : Controlling Different Workloads

Priority  | Moderate BW usage | Heavy BW usage
High      | Isolate           | Disable
Medium    | Isolate           | Reactive rescheduling
Low       | Throttle          | Throttle
What Can We Do ? Node and Cluster Level Actuators
Node
● Memory Bandwidth Allocation in hardware: use HW QoS to apply max limits to tasks overusing memory bandwidth.
● CPU throttling for indirect control: limit the CPU access of over-using tasks to indirectly limit the memory bandwidth they use.
Cluster
● Reactive evictions & re-scheduling: hosts experiencing memory BW saturation signal the scheduler to redistribute the bigger memory bandwidth users to lightly-loaded machines.
● Disabling heavy antagonist workloads: tasks that saturate a socket by themselves cannot be effectively redistributed; if slowing them down is not an option, de-schedule them.
Node : CPU Throttling
+ Very effective in reducing saturation.
+ Works on all platforms.
- Too coarse in granularity.
- Interacts poorly with autoscaling & load-balancing.
[Diagram: Socket 0 (saturated) and Socket 1; the CPUs running memBW over-users are throttled]
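One concrete throttling mechanism is shrinking the set of CPUs a job may run on via the cpuset controller; a minimal sketch, assuming a cgroup-v1 cpuset hierarchy and an illustrative job cgroup name:

# Restrict the over-using job to CPUs 0-3 (it may previously have had 0-15);
# fewer runnable CPUs indirectly caps the memory bandwidth it can generate.
with open("/sys/fs/cgroup/cpuset/job42/cpuset.cpus", "w") as f:
    f.write("0-3")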
Throttling - Enforcement Algorithm
Every x seconds:
1. The socket memory BW saturation detector reads the socket perf counters.
2. If socket BW > saturation threshold, the cgroup memory BW estimator profiles potentially eligible tasks from socket and cgroup perf counters.
3. The memory BW enforcer, using a policy filter and the CPU runnable mask, selects eligible tasks for throttling and throttles them.
4. If socket BW < unthrottle threshold, unthrottle tasks.
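A sketch of this control loop, with the counter reads and the throttling mechanism behind hypothetical helpers (read_socket_bw, profile_cgroup_bw, policy_filter, restrict_cpus, and restore_cpus are illustrative names, not real APIs), and with assumed example thresholds:

import time

PERIOD_SEC = 5           # "every x seconds" from the slide
SATURATE_BPS = 80e9      # example saturation threshold (assumed)
UNTHROTTLE_BPS = 60e9    # example unthrottle threshold (assumed)

def enforcement_loop(socket_id, helpers):
    throttled = set()
    while True:
        bw = helpers.read_socket_bw(socket_id)            # socket perf counters
        if bw > SATURATE_BPS:
            # Attribute bandwidth per cgroup, then pick throttle candidates.
            usage = helpers.profile_cgroup_bw(socket_id)  # {cgroup: bytes/s}
            for cg in helpers.policy_filter(usage):       # eligible tasks only
                helpers.restrict_cpus(cg)                 # shrink runnable mask
                throttled.add(cg)
        elif bw < UNTHROTTLE_BPS:
            for cg in throttled:
                helpers.restore_cpus(cg)                  # undo the throttling
            throttled.clear()
        time.sleep(PERIOD_SEC)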
Node : Memory Bandwidth Allocation
Intel RDT Memory Bandwidth Allocation, supported through resctrl in the kernel (more on that later).
+ Reduces bandwidth without lowering CPU utilization.
+ More fine-grained than CPU-level controls.
- Newer platforms only.
- Can’t isolate well between hyperthreads.
Cluster : Reactive Re-Scheduling
In many cases, there are:
● a low percentage of saturated sockets in the cluster, and
● multiple tasks contributing to saturation.
Re-scheduling the tasks to less loaded machines can avoid slow-downs.
Does not help with large antagonists that can saturate any socket they run on.
[Diagram: a saturated host calls the observer for help (1); the scheduler evicts a task (2) and reschedules it (3) onto a lightly-loaded host]
Handling Cluster-Wide Saturation
Low priority jobs can be dealt with at node level through throttling.
If SLOs do not permit throttling and the antagonists cannot be redistributed:
● Disable (kick out of the cluster).
● Users can then reconfigure their service to use a different product.
● Area of continual work.
Alternative:
● Colocate multiple antagonists (that’s just working around SLOs).
[Charts: a cluster membw distribution amenable to rescheduling vs. one amenable to job disabling, each relative to the saturation threshold]
Results : CPU Throttling + Rescheduling
[Results chart]
Results : Rebalancing
[Results chart]
resctrl : HW QoS Support in Kernel
● New, unified interface: resctrl.
● resctrl is a big improvement over the previous non-standard cgroup interface.
● Uniform way of monitoring/controlling HW QoS across vendors/architectures:
○ AMD, ARM, Intel.
● (Non-exhaustive) list of HW features supported:
○ Memory BW monitoring
○ Memory BW throttling
○ L3 cache usage monitoring
○ L3 cache partitioning
Intro to HW QoS Terms and Concepts
(x86 terminology below)
● CLass of Service ID (CLOSID): maps to a QoS configuration. Typically O(10) unique ones in HW.
● Resource Monitoring ID (RMID): used to tag workloads and the resources they use, so that their resource usage can be aggregated. Typically O(100) unique ones in HW.
[Diagram: two CLOSIDs - high priority (CLOSID 0: 100% L3 cache, 100% mem BW) and low priority (CLOSID 1: 50% L3 cache, 20% mem BW) - with RMID0-RMID4 tagging Workloads A, B, and C]
Overview of resctrl Filesystem
Documentation: https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt
resctrl/
|- groupA/               <- a resource control group; represents one unique HW CLOSID
| |- mon_groups/
| | |- monA/             <- a monitoring group; represents one unique HW RMID
| | | |- mon_data/       <- resource usage data for this monitoring group
| | | |- tasks           <- TIDs in this monitoring group
| | | |- ...
| | |- monB/
| | |- mon_data/
| | |- ...
| |- schemata            <- QoS configuration for the resource control group
| |- tasks               <- TIDs in the resource control group
| |- ...                 (includes mon_data/: usage data for the entire control group)
|- groupB/
|- ...
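A minimal sketch of driving this filesystem from Python, assuming resctrl is already mounted (mount -t resctrl resctrl /sys/fs/resctrl), root privileges, and RDT-capable hardware; the group names and TID are illustrative:

import os

RESCTRL = "/sys/fs/resctrl"

def write(path, value):
    # resctrl control files take plain text; tasks takes one TID per write()
    with open(path, "w") as f:
        f.write(value)

# mkdir of a top-level directory allocates a CLOSID for a new control group.
group = os.path.join(RESCTRL, "groupA")
os.makedirs(group, exist_ok=True)

# mkdir under mon_groups/ allocates an RMID for a new monitoring group.
mon = os.path.join(group, "mon_groups", "monA")
os.makedirs(mon, exist_ok=True)

# QoS configuration: all 8 L3 ways and 90% memory BW on both sockets.
write(os.path.join(group, "schemata"), "L3:0=ff;1=ff\n")
write(os.path.join(group, "schemata"), "MB:0=90;1=90\n")

# Move a thread into the control group, then into its monitoring group.
tid = "12345"  # illustrative TID
write(os.path.join(group, "tasks"), tid)
write(os.path.join(mon, "tasks"), tid)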
Example Usage of resctrl Interfaces
$ cat groupA/schemata
L3:0=ff;1=ff
MB:0=90;1=90
$ READING0=$(cat groupA/mon_data/mon_L3_00/mbm_total_bytes)
$ sleep 1
$ READING1=$(cat groupA/mon_data/mon_L3_00/mbm_total_bytes)
$ echo $((READING1-READING0))
1816234126
● L3:0=ff;1=ff - allowed to use 8 cache ways of L3 on both sockets.
● MB:0=90;1=90 - per-core memory BW constrained to 90% on both sockets.
● Compute memory BW by taking a rate; here, BW ≈ 1.8 GB/s over the 1-second interval.
Reconciling resctrl and cgroups: First Try
Use case: dynamically apply memory BW throttling if the machine is in trouble.
1. Node SW creates 2 resctrl groups: no_throttle and bw_throttled.
2. On cgroup creation, logically assign cgroupX to no_throttle.
3. Create a mongroup for cgroupX in no_throttle.
4. Start cgroupX.
5. Move TIDs into no_throttle/tasks.
6. Move TIDs into no_throttle/mon_groups/cgroupX/tasks.
7. Move TIDs of a high-BW user into bw_throttled.
resctrl/
|- no_throttle/          << #1
| |- mon_groups/
| | |- cgroupX/          << #3
| | | |- mon_data/
| | | |- tasks           << #6 ↻
| | | |- ...
| | |- monB/
| | |- mon_data/
| | |- ...
| |- schemata
| |- tasks               << #5 ↻
| |- ...
|- bw_throttled/         << #1
|- ...
Challenges with Naive Approach
(Same use case and steps 1-7 as on the previous slide.)
● Steps 5-6: there is a race in moving TIDs if the cgroup is creating threads; it is expensive when there are lots of TIDs, and expensive to deal with the race.
● Step 7: desynchronization of L3 cache occupancy data, since existing data is tagged with an old RMID.
A Better Approach for resctrl and cgroups
● What if we had a 1:1 mapping of cgroups to resctrl groups?
○ To change QoS configs, just rewrite schemata.
○ More efficient: removes the need to move TIDs around.
○ Keeps the existing RMID, preventing the L3 occupancy desynchronization issue.
○ 100% compatible with the existing resctrl abstraction.
● CHALLENGE: with the existing system, we would run out of CLOSIDs very quickly.
● SOLUTION: share CLOSIDs between resource control groups with the same schemata.
● Google-developed kernel patch for this functionality to be released soon.
● Demonstrates the need to make the cgroup model a first-class consideration for QoS interfaces.
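The CLOSID-sharing solution can be pictured as reference-counted deduplication keyed on the schemata contents. The sketch below is a user-space illustration of the idea only, not the actual kernel patch:

class ClosidPool:
    """Share a limited pool of CLOSIDs among groups with identical schemata."""

    def __init__(self, num_closids):
        self.free = list(range(num_closids))
        self.by_schemata = {}   # schemata string -> (closid, refcount)

    def acquire(self, schemata):
        # Groups with the same schemata reuse one CLOSID.
        if schemata in self.by_schemata:
            closid, refs = self.by_schemata[schemata]
            self.by_schemata[schemata] = (closid, refs + 1)
            return closid
        if not self.free:
            raise RuntimeError("out of CLOSIDs")
        closid = self.free.pop()
        self.by_schemata[schemata] = (closid, 1)
        return closid

    def release(self, schemata):
        closid, refs = self.by_schemata[schemata]
        if refs == 1:
            del self.by_schemata[schemata]
            self.free.append(closid)   # CLOSID can be reprogrammed later
        else:
            self.by_schemata[schemata] = (closid, refs - 1)

pool = ClosidPool(num_closids=16)
a = pool.acquire("MB:0=100;1=100")   # first group gets a fresh CLOSID
b = pool.acquire("MB:0=100;1=100")   # same schemata -> same CLOSID
assert a == b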
cgroups and resctrl After the Change
Use case: dynamically apply memory BW throttling if the machine is in trouble.
1. Create a resctrl group cgroupX.
2. Write a no-throttling configuration to cgroupX/schemata.
3. Start cgroupX.
4. Move TIDs into cgroupX/tasks.
5. Rewrite the schemata of the high-BW-using cgroup to throttle it.
resctrl/
|- cgroupX/              << #1
| |- mon_groups/
| | |- mon_data/
| | |- ...
| |- schemata            << #2
| |- tasks               << #4 ↻
| |- ...
|- high_bw_cgroup/
| |- schemata            << #5
| |- ...
|- ...
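With this model, step 5 reduces to a single file write; a minimal sketch (the group name and the 20% cap are illustrative):

# Throttle the offending group by rewriting its schemata in place:
# cap memory BW to 20% on both sockets, leaving its TIDs and RMID untouched.
with open("/sys/fs/resctrl/high_bw_cgroup/schemata", "w") as f:
    f.write("MB:0=20;1=20\n")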
µArch Features & Container Runtimes
● Measuring µArch impact is not a first-class component of most container runtimes.
○ Can’t manage what we can’t see...
● Most container runtimes expose isolation knobs per container.
● Managing µArch isolation requires node- and cluster-level feedback loops.
○ Dual operating mode: admins & users.
○ Performance isolation is not necessarily controllable by end-users.
We would love to contribute to a standard framework around performance management for container runtimes.
(Goals: Efficiency, Availability, Performance)
Takeaways and Future Work
● Memory bandwidth and low-level isolation issues are becoming more significant.
● Continuous monitoring is critical to run successful multi-tenant hosts.
● Defining requirements for h/w providers and s/w interfaces on QoS knobs.
○ Critical to have these solutions work for containers / process groups.
● Increasing the success rate of the current approach:
○ Handling minimum guaranteed membw usage.
○ Handling logically related jobs - Borg allocs.
● A general framework would help collaboration.
● Future: memory BW scheduling (based on hints):
○ based on membw usage;
○ based on membw sensitivity.
Thanks !
Find us at the conf or reach out at: davidlo@, dragoss@, jnagal@, eranian@ (all @google.com).