CAT @ Scale
Deploying cache isolation in a mixed-workload environment
Rohit Jnagal jnagal@google
David Lo davidlo@google
Borg : Google cluster manager
● Admits, schedules, starts, restarts, and
monitors the full range of applications
that Google runs.
● Mixed-workload system - two-tiered
: latency-sensitive (front-end tasks)
: latency-tolerant (batch tasks)
● Uses containers/cgroups to isolate
applications.
Borg: Efficiency with multiple tiers
Large Scale Cluster Management at Google with Borg
Isolation in Borg
Borg: CPU isolation for latency-sensitive (LS) tasks
● Linux Completely Fair Scheduling (CFS) is a throughput-oriented
scheduler; no support for differentiated latency
● Google-specific extensions for low-latency scheduling response
● Enforce strict priority for LS tasks over batch workloads
○ LS tasks always preempt batch tasks
○ Batch never preempts latency-sensitive on wakeup
○ Bounded execution time for batch tasks
● Batch tasks treated as minimum weight entities
○ Further tuning to ensure aggressive distribution of batch tasks over available cores
Borg : NUMA Locality
Good NUMA locality can have a
significant performance impact
(10-20%)*
Borg isolates LS tasks to a single
socket, when possible
Batch tasks are allowed to run on
all sockets for better throughput
* The NUMA experience
Borg : Enforcing locality for performance
Borg isolates LS tasks to a single socket, when possible
Batch tasks are allowed to run on all sockets for better throughput
Affinity masks for tasks on a machine:
[Figure: LS1, LS2, and LS3 each confined to one socket's CPUs; Batch allowed on all CPUs of Socket 0 (0-7) and Socket 1 (8-15)]
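On Linux, masks like these can be applied with `sched_setaffinity`. A minimal sketch of the policy above, assuming an illustrative two-socket, 16-CPU layout (the CPU numbering and the `pin` helper are not from the talk):

```python
import os

# Illustrative two-socket layout: 8 CPUs per socket.
SOCKET0 = set(range(0, 8))
SOCKET1 = set(range(8, 16))

def affinity_for(task_class: str) -> set[int]:
    """Return the CPU set a task class may run on.

    LS tasks are confined to a single socket for NUMA locality;
    batch tasks may run on all sockets for better throughput.
    """
    if task_class == "ls":
        return SOCKET0
    if task_class == "batch":
        return SOCKET0 | SOCKET1
    raise ValueError(task_class)

def pin(pid: int, task_class: str) -> None:
    # Applies the mask to an existing process/thread (Linux only).
    os.sched_setaffinity(pid, affinity_for(task_class))
```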
Borg : Dealing with LS-LS interference
Use reserved CPU sets to limit interference for highly sensitive jobs
○ Better wakeup latencies
○ Still allows batch workloads as they have minimum weight and always yield
Affinity masks for tasks on a machine:
[Figure: LS1, LS3, and LS4 on their sockets' CPUs; LS2 on a reserved CPU set that other LS tasks avoid; Batch allowed on all CPUs of Socket 0 (0-7) and Socket 1 (8-15)]
Borg : Micro-architectural interference
● Use exclusive CPU sets to limit microarchitectural interference
○ Disallow batch tasks from running on cores of an LS task
Affinity masks for tasks on a machine:
[Figure: LS1, LS2 (reserved), LS4, and Batch as before; LS3 on exclusive CPUs that batch tasks may not use, across Socket 0 (0-7) and Socket 1 (8-15)]
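The reserved/exclusive distinction can be stated as a pair of mask computations. A sketch, assuming the same illustrative 16-CPU machine (the example core assignments are not from the talk):

```python
ALL_CPUS = set(range(16))  # two sockets, 8 CPUs each (illustrative)

def other_ls_mask(reserved: set[int], exclusive: set[int]) -> set[int]:
    # Reserved and exclusive cores both keep *other LS tasks* away.
    return ALL_CPUS - reserved - exclusive

def batch_mask(exclusive: set[int]) -> set[int]:
    # Batch is minimum-weight and always yields, so reserved cores still
    # admit it; only exclusive cores shut batch out entirely.
    return ALL_CPUS - exclusive
```

The point of the two tiers: reserved sets address scheduling interference from comparable-weight LS tasks, while exclusive sets also address micro-architectural interference from batch.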
Borg : Isolation for highly sensitive tasks
● CFS offers low scheduling latency
● NUMA locality provides local memory and cache
● Reserved cores keep LS tasks with comparable weights from
interfering
● Exclusive cores keep cache-heavy batch tasks away from L1, L2
caches
This should be as good as running on a non-shared infrastructure!
Co-located Exclusive LS & streaming MR
[Chart: latency of an exclusive LS job (github.com/google/multichase) over time — the exclusive job shows great latency until the start of the streaming MR]
Performance for latency sensitive tasks
At lower utilization, latency
sensitive tasks need more
cache protection.
Interference can degrade performance by up to 300%, even when all other resources are well isolated.
Mo Cores, Mo Problems
*Heracles
CAT
Resource Director Technology (RDT)
● Monitoring:
○ Cache Monitoring Technology (CMT)
○ Memory Bandwidth Monitoring (MBM)
● Allocation:
○ Cache Allocation Technology (CAT)
■ L2 and L3 Caches
○ Code and Data Prioritization (CDP)
Actively allocate resources to achieve better QoS and performance
Allows general grouping to enable monitoring/allocation for VMs,
containers, and arbitrary threads and processes
Introduction to CAT
Cache Allocation Technology (CAT)
● Provides software control to
isolate last-level cache access
between applications.
● CLOS: Class of service
corresponding to a cache
allocation setting
● CBM: Cache Capacity Bitmasks
to map a CLOS id to an
allocation mask
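A CBM is a bitmask over cache ways, and CAT hardware requires the set bits to form one contiguous run. A sketch of building and validating a CBM (the 20-way L3 default is an assumption, not a figure from the talk):

```python
def cbm_for_ways(first_way: int, n_ways: int) -> int:
    """Bitmask granting cache ways [first_way, first_way + n_ways)."""
    return ((1 << n_ways) - 1) << first_way

def is_valid_cbm(cbm: int, total_ways: int = 20) -> bool:
    """CAT requires a non-empty, contiguous run of set bits within the cache."""
    if cbm <= 0 or cbm >= (1 << total_ways):
        return False
    low = (cbm & -cbm).bit_length() - 1   # index of the lowest set bit
    shifted = cbm >> low                  # drop trailing zeros
    return shifted & (shifted + 1) == 0   # True iff the bits are one solid run
```

Each CLOS id would then be programmed with one such mask, and tasks tagged with a CLOS inherit its allocation.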
Setting up CAT
Let’s add CAT to our service ...
Add CAT to the mix
[Chart: LS latency over time — annotated with the start of the streaming MR and with restricting MR cache use to 50%]
CAT Deployment: Batch Jails
[Diagram: the L3 split into a batch jail (shared between all tasks, including LS) holding data for batch jobs, and a dedicated region (only LS can use) holding data for latency-sensitive jobs]
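The jail amounts to two overlapping CBMs: batch is confined to the low ways, while LS may use everything, so the jail is shared and the high ways are LS-only. A sketch, assuming an illustrative 20-way L3 (the way counts are not from the talk):

```python
L3_WAYS = 20  # illustrative way count

def jail_masks(jail_ways: int) -> tuple[int, int]:
    """Return (batch_cbm, ls_cbm) for a batch jail of jail_ways ways.

    Batch is jailed into the low ways; LS may use the whole cache,
    so the jail region is shared while the high ways are LS-only.
    """
    batch_cbm = (1 << jail_ways) - 1
    ls_cbm = (1 << L3_WAYS) - 1
    return batch_cbm, ls_cbm
```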
CAT Deployment: Cache cgroup
[Diagram: cgroup hierarchy — Root with cache, cpu, and memory controllers, each holding tasks T1 and T2]
● Every app gets its own cgroup
● Set CBM for all batch tasks to
same mask
● Easy to inspect, recover
● Easy to integrate into existing
container mechanisms
○ Docker
○ Kubernetes
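With a cache cgroup per app, applying the jail policy reduces to writing the same mask into every batch cgroup. A sketch of the bookkeeping — the mount point and `cache.cbm` file name are hypothetical, since the cache cgroup patch never landed upstream, and the masks are illustrative:

```python
BATCH_CBM = 0x3FF    # shared jail mask for every batch task (illustrative)
FULL_CBM = 0xFFFFF   # whole cache for LS tasks (illustrative 20-way L3)

def cgroup_writes(jobs: dict[str, str]) -> dict[str, int]:
    """Map each job's (hypothetical) cache cgroup file to the CBM to write.

    jobs maps job name -> "ls" or "batch". All batch jobs get the same
    mask, which keeps the setup easy to inspect and recover.
    """
    return {
        f"/sys/fs/cgroup/cache/{name}/cache.cbm":
            BATCH_CBM if cls == "batch" else FULL_CBM
        for name, cls in jobs.items()
    }
```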
CAT experiments with YouTube transcoder
CAT experiments with YouTube
CPI is a good measure of cache interference
[Chart: CPI vs. antagonist cache occupancy (0%, 25%, 50%, 75%, 100% of L3); lower is better]
Production rollout
Impact of batch jails
Higher gains for smaller jails
[Chart: LS tasks average CPI comparison across jail sizes, relative to a +0% baseline; lower is better]
Batch jails deployment
Batch jailing shifts CPI lower
Higher benefits of CAT for tail tasks
[Charts: comparison of LS tasks CPI PDF; LS tasks CPI percentile comparison]
Batch jails deployment
Smaller jails lead to higher impact on batch jobs
[Chart: batch tasks average CPI comparison; lower is better]
The Downside: Increased memory pressure
Jailing the LLC increases DRAM BW pressure for batch
[Chart: system memory BW over time — a BW spike when a BW-hungry batch job starts, subsiding when it stops]
Controlling memory bandwidth impact
Intel RDT: CMT (Cache Monitoring Technology)
- Monitor and profile cache usage pattern for all applications
Intel RDT: MBM (Memory Bandwidth Monitoring)
- Monitor memory bandwidth usage per application
Controls:
- CPU throttling
- Scheduling
- Newer platforms will provide control for memory bandwidth per
application
Controlling infrastructure processes
Many system daemons tend to periodically thrash caches
- None of them are latency sensitive
- Stable behavior, easy to identify
Jailing for daemons!
- Requires ability to restrict kernel threads to a mask
What about the noisy neighbors?
Noisy neighbors hurting
performance (Intel RDT)
● Use CMT to detect; CAT to
control
● Integrated into CPI² signals
○ CPI² built for noisy-neighbor detection
○ Dynamically throttle noisy tasks
○ Possibly tag for scheduling hints
[Diagram: CPI² pipeline — Observer, Master, and Nodes exchanging CPI samples and CPI specs]
CMT issues with cgroups
● Usage model: many, many cgroups, but we can't run perf on all of them all the time
○ Run perf periodically on a sample of cgroups
○ Use the same RMID for a bunch of cgroups
○ Rotate cgroups out every sampling period
● HW counts cache allocations minus deallocations, not occupancy:
○ Cache lines allocated before perf runs are not accounted for
○ Can get nonsensical results, even zero cache occupancy
○ The workaround requires running perf for the lifetime of the monitored cgroup
○ Unacceptable context-switch overhead
● David Carrillo-Cisneros & Stephane Eranian working on a newer
version for CMT support with perf
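The sampling workaround amounts to a simple rotation: with only a handful of RMIDs, monitor a different batch of cgroups each period. A sketch (RMID count and cgroup names are illustrative, not from the talk):

```python
def rotation_schedule(cgroups: list[str], n_rmids: int):
    """Yield, per sampling period, the cgroups currently holding an RMID.

    Only n_rmids cgroups can be monitored at once, so the set rotates.
    Occupancy accumulated before a cgroup holds an RMID is not counted,
    which is the under-counting problem described above.
    """
    i = 0
    while True:
        yield [cgroups[(i + k) % len(cgroups)] for k in range(n_rmids)]
        i = (i + n_rmids) % len(cgroups)

# e.g. 5 cgroups, 2 RMIDs: periods cover [c0,c1], [c2,c3], [c4,c0], ...
```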
CAT implementation
Cache Cgroup
[Diagram: cgroup hierarchy — Root with cache, cpu, and memory controllers, each holding tasks T1 and T2]
● Every app gets its own cgroup
● Set CBM for all batch tasks to
same mask
● Easy to inspect, recover
● Easy to integrate into existing
container mechanisms
○ Docker
○ Kubernetes
● Issues with the patch:
○ Per-socket masks
○ Not a good fit?
○ Thread-based isolation
vs cgroup v2
New patch: rscctrl interface
● Patches from Fenghua Yu at Intel
○ Mounted under /sys/fs/rscctrl
○ Currently used for L2 and L3 cache masks
○ Create new grouping with mkdir /sys/fs/rscctrl/LS1
○ Files under /sys/fs/rscctrl/LS1:
■ tasks: threads in the group
■ cpus: cpus to control with the setting in this group
■ schemas: write L2 and L3 CBMs to this file
● Aligns better with the h/w capabilities provided
● Gives finer control without worrying about cgroup restrictions
● Gives control over kernel threads as well as user threads
● Allows resource allocation policies to be tied to certain cpus across all
contexts
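Driving rscctrl is then a mkdir plus three file writes. A sketch that just builds the strings — the exact `schemas` syntax shown here (per-socket `L3:0=…;1=…` masks) is an assumption based on the per-socket CBMs the patch exposes, and nothing is written to the filesystem:

```python
def schemas_line(l3_cbms: dict[int, int]) -> str:
    """Build an L3 schemas line mapping socket id -> CBM.

    Assumed format, e.g. "L3:0=fffff;1=3ff" (per-socket masks).
    """
    body = ";".join(f"{sock}={cbm:x}" for sock, cbm in sorted(l3_cbms.items()))
    return f"L3:{body}"

def setup_group(name: str, l3_cbms: dict[int, int]) -> list[str]:
    """Shell-equivalent steps to create and program a group (not executed)."""
    base = f"/sys/fs/rscctrl/{name}"
    return [
        f"mkdir {base}",
        f"echo $TID > {base}/tasks",                      # threads in the group
        f"echo '{schemas_line(l3_cbms)}' > {base}/schemas",  # L2/L3 CBMs
    ]
```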
Current Kernel patch progress
David Carrillo-Cisneros, Fenghua Yu, Vikas Shivappa, and others at Intel
working on improving CMT and MBM support for cgroups
Changes to support cgroup monitoring, as opposed to the attach-to-a-process-forever model
Challenges being faced:
● Sampled collections
● Not enough RMIDs to go around
○ Use per-package allocation of RMIDs
○ Reserved RMIDs (do not rotate)
Takeaways
● With larger machines, isolation between
workloads is more important than ever.
● RDT extensions work really well at scale:
○ Easy to set up static policies.
○ Lots of flexibility.
● CAT is only one of the first isolation/monitoring features.
○ Avoid ad-hoc solutions.
● At Google, we love cgroups and containers:
○ Rolled out cgroup-based CAT support to the fleet.
● Let’s get the right abstractions in place.
If you are interested,
talk to us here or find us
online:
jnagal
davidlo
davidcc
eranian
@google
Thanks!
● Friday 8am - 1pm @ Google's Toronto office
● Hear real life experiences of two companies using GKE
● Share war stories with your peers
● Learn about future plans for microservice management
from Google
● Help shape our roadmap
g.co/microservicesroundtable
† Must be able to sign digital NDA
Join our Microservices Customer Roundtable
