Automating the Hunt for Non-Obvious Sources of Latency Spreads

False sharing references and power management can trigger wide latency spreads, but are neither directly observable nor easily traced to causes. This talk describes how to diagnose the problems quickly, and outlines several remedies.

  1. 1. Automating the Hunt for Non-Obvious Sources of Latency Spreads Kshitij Doshi, Sr. Principal Engr at Intel Harshad S. Sane, Principal Engr at Intel Datacenter & AI
  2. 2. Kshitij Doshi ■ Ph.D., Rice Univ – Comm. efficient parallel algorithms ■ Performance of Systems, DB, Cloud-native apps ■ Research interests in storage, memory, distributed systems ■ 20 y at Intel; previously, 13 y at Unix Systems Labs & Novell. Datacenter & AI
  3. 3. Harshad Sane ■ Harshad Sane is a Principal Engineer in Intel's Data Center and AI group ■ Deep technical expertise in system software, memory, and CPU architectures. ■ Specializes in Performance Engineering, with extensive experience in telemetry, observability, monitoring, and software optimization. Datacenter & AI
  4. 4. Agenda ■ Section 1 - About tail latency spreads ■ Section 2 - Two non-obvious causes of latency escapes ■ Section 3 - How to decide if either of them is hurting your application ■ Section 4 - Mitigations, if they are hurting your application ■ Section 5 - Summary
  5. 5. Hurdles are not always predictable. Courtesy: bing.com/images
  6. 6. ScyllaDB is engineered for use cases needing high throughput and predictable, low latencies... https://resources.scylladb.com/videos/build-low-latency-applications-in-rust-on-scylladb (diagram: Query / Commitlog / Compaction queues feeding a Userspace I/O Scheduler in front of the Disk; 0.5 msec)
  7. 7. (diagrams: a 3-tier architecture with Frontend, Database, and Services; a microservices architecture.) Latency landmines can be present, however, in other layers, in inter-service interactions, or in infrastructure services.
  8. 8. When Small Performance Fluctuations Magnify Into Sudden, Large Spikes in Response Times… frequently there is some issue that intersects in an unpredictable manner with the execution of normal hotspots.
  9. 9. Consider a Streamlined Flow of Execution, repeating over and over with minor perturbations in end-to-end latencies for each iteration.
  10. 10. Where Something Goes Out of Balance Momentarily and Causes a Hiccup. Such a hiccup . . . propagates and throws both timing and resource usage out of balance, for some period of time. But this period of non-streamlined flow can feed on itself and produce secondary spikes in end-to-end latencies, even as overall flow throughput evens out.
  11. 11. Consider two such issues . . .
  12. 12. First issue: A first module in an application (diagram: wait 100 ms; T0: set random seed S; T1–T5: use S) and a second module in the application (diagram: producer puts X into queue L; consumer gets Y from queue L). The producer frequently modifies the tail of queue L, while the consumer frequently modifies the head of queue L.
  13. 13. First issue: false sharing. (Same diagram as the previous slide.) The threads working in the two modules, which have no logical intersection, do however get cross-coupled if the variable S ends up on a cacheline that is also used for storing either or both of the head / tail pointers of queue L. Not a significant problem unless updates of queue L become frequent.
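To make the cross-coupling concrete, here is a minimal C++ sketch (not from the talk) of how a rarely-written seed and frequently-updated queue pointers can end up on one cacheline, and how alignas padding separates them; the struct and field names are illustrative assumptions.

```cpp
#include <atomic>
#include <cstdint>

// Unpadded layout: the seed used by the first module and the queue
// head/tail pointers used by the producer/consumer can share one
// 64-byte cacheline, so frequent queue updates keep invalidating the
// line that the seed readers have to pull back in (false sharing).
struct SharedStateUnpadded {
    std::atomic<uint64_t> seed;   // written once, read by T1..T5
    std::atomic<void*>    head;   // modified frequently by the consumer
    std::atomic<void*>    tail;   // modified frequently by the producer
};

// Padded layout: each hot field gets its own cacheline, so readers of
// `seed` no longer suffer coherence misses caused by queue traffic.
struct alignas(64) SharedStatePadded {
    alignas(64) std::atomic<uint64_t> seed;
    alignas(64) std::atomic<void*>    head;
    alignas(64) std::atomic<void*>    tail;
};

static_assert(sizeof(SharedStatePadded) >= 3 * 64,
              "each hot field should occupy its own cacheline");
```

Where the toolchain supports it, C++17's std::hardware_destructive_interference_size (in <new>) can replace the hard-coded 64.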
  14. 14. Second Issue: CPU Active (P-states) and Sleep (C-states) States. (diagram: high CPU frequency = high power consumption, more instructions/time; low CPU frequency = low power consumption, fewer instructions/time; one of several sleep states = very low power consumption, active idle.) Operating system and hardware algorithms, together with system configuration parameters, determine the conditions under which CPUs transition among different states of operation.
  15. 15. Second Issue: CPU Power Management Transitions. (Same diagram.) Transitions out of deeper sleep states into active execution take many microseconds up to the point of normal instruction execution, and also experience transient effects of colder caches.
  16. 16. Second issue: CPU Power (and Sleep) State Transitions. (Same diagram.) Transitions out of deeper sleep states into active execution take many microseconds up to the point of normal instruction execution, and also experience transient effects of colder caches. Transitions from low power states to normal execution go through a series of frequency step-ups, causing software actions that are dependency-chained to stall due to inter-thread or inter-process data/event waiting.
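A rough way to observe this cost on a live system is to time how long a notified thread takes to actually resume after it has been idle long enough for power management to act. The following is a minimal sketch, not the talk's methodology; the 200 ms idle gap is an arbitrary choice, and a single measurement should be repeated many times and examined as a distribution.

```cpp
#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

// Measure the delay between notifying a sleeping thread and that thread
// actually running again. After a long idle gap the waiter's CPU may have
// entered a deep C-state, and the wake-up cost shows up in this delay
// (together with scheduling and lock-reacquisition overhead).
int main() {
    std::mutex m;
    std::condition_variable cv;
    bool go = false;
    std::chrono::steady_clock::time_point t_notify, t_woken;

    std::thread waiter([&] {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return go; });
        t_woken = std::chrono::steady_clock::now();
    });

    // Let the waiter block long enough for power management to act.
    std::this_thread::sleep_for(std::chrono::milliseconds(200));

    {
        std::lock_guard<std::mutex> lk(m);
        go = true;
        t_notify = std::chrono::steady_clock::now();
    }
    cv.notify_one();
    waiter.join();

    std::cout << "wakeup latency: "
              << std::chrono::duration_cast<std::chrono::microseconds>(
                     t_woken - t_notify).count()
              << " us\n";
    return 0;
}
```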
  17. 17. Ideal progression . . . (diagram: threads ThrA, ThrB, ThrC triggering each other in sequence over time: trigger ThrB, trigger ThrC, trigger ThrA, trigger ThrB.)
  18. 18. (Same diagram, threads ThrA, ThrB, ThrC.) Imagine that this CPU is transitioning out of a deep sleep state . . .
  19. 19. (Same diagram, annotated: delayed trigger for ThrA; ThrC runs slower; possibly this CPU has entered a deeper sleep as a result; ThrA runs slower as a result; delayed trigger for ThrB.) Imagine that this CPU is transitioning out of a deep sleep state . . . Cascading Sleep -> Wakeup -> Sleep transients can take time to fade out, and cause high peak latencies … … even though the impact on average latency gets amortized.
  20. 20. Detecting and untangling the causes of these intersections of issues is very challenging, particularly if a high degree of instrumentation, such as tracing or logging, interferes with and distorts the effects. These effects are not easily noticed through lightweight sampling or counting of events.
  21. 21. Challenges. Collecting traces: security; collection overheads at CPU, caches, and bandwidth to memory and storage/network. Analyzing traces: like searching for a needle in multiple haystacks without knowing if a needle is to be found at all. Scheduling collection and analysis: like figuring out when a crime is going to occur in order to launch crime-scene analysis.
  22. 22. Correlated Events That Can Be Collected at Low Overhead (good but circumstantial clues). Indicating power transitions: • Turbostat • Powertop • Runqueue lengths. Indicating cache coherence issues: • Sharp IPC drop with concurrency • No obvious mem/disk data bottleneck • High utilization, low runqueue lengths. • CoreFreq - https://github.com/cyring/CoreFreq
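One cheap source of such clues, complementary to turbostat and powertop, is the Linux cpuidle sysfs interface, which exposes per-CPU, per-C-state entry counts. The sketch below is not a tool from the talk; it assumes the standard /sys/devices/system/cpu/cpuN/cpuidle/stateM/{name,usage} layout, and sampling it periodically and diffing the counts yields a low-overhead "C-state entries per interval" signal.

```cpp
#include <cctype>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <string>

namespace fs = std::filesystem;

// Read a single-line sysfs attribute.
static std::string read_attr(const fs::path& p) {
    std::ifstream f(p);
    std::string s;
    std::getline(f, s);
    return s;
}

int main() {
    // For every cpuN directory, print how many times each idle (C-)state
    // has been entered since boot.
    for (const auto& entry : fs::directory_iterator("/sys/devices/system/cpu")) {
        const std::string name = entry.path().filename().string();
        if (name.size() < 4 || name.rfind("cpu", 0) != 0 ||
            !std::isdigit(static_cast<unsigned char>(name[3])))
            continue;

        const fs::path idle_dir = entry.path() / "cpuidle";
        if (!fs::exists(idle_dir)) continue;

        for (const auto& state : fs::directory_iterator(idle_dir)) {
            std::cout << name << " "
                      << read_attr(state.path() / "name")    // e.g. POLL, C1, C6
                      << ": entered "
                      << read_attr(state.path() / "usage")   // entry count since boot
                      << " times\n";
        }
    }
    return 0;
}
```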
  23. 23. Picking up on disruptive events Courtesy: bing.com/images
  24. 24. To see what is available: sudo perf list | grep cstate Example output: cstate_core/c3-residency/ cstate_core/c6-residency/ cstate_core/c7-residency/ cstate_pkg/c2-residency/ … Capturing C-State Transitions
  25. 25. Timecharting: sudo perf timechart record; sudo perf timechart. Monitoring sleep, wait, and run times per CPU. (Example timechart: 4 threads in intermittent sleeps, 4 threads with variable I/O wait durations.)
  26. 26. Visualizing scheduling events by time. 1. perf sched timehist (-Mw): migrations and wakeups; tracks scheduler latency by event, including time to wakeup and latency from wakeup to run (sched delay). 2. perf sched map: per-CPU timeline of processes as they are context switched.
  27. 27. Monitoring and controlling P-states. P-states (≡ frequencies): ‒ BIOS controlled ‒ OS controlled via scaling drivers ‒ HW controlled P-states ‒ Turbo. To monitor the P-states: ‒ Turbostat, CoreFreq ‒ Profiling tools (Perf, Vtune, etc.). To control P-states: OS-controlled P-states are manually configured through scaling drivers, such as ‒ Cpupower ‒ CoreFreq (https://github.com/cyring/CoreFreq) to control the performance governor for P-states. Credits: https://images.anandtech.com/doci/9582/43.jpg
  28. 28. Response time R, monitored at the application level → exponentially-weighted moving-window average vs. short-range average → detect upward heave. C-state and P-state monitoring → Tn = transition count over 250 ms windows → Tn > threshold. Detect overlap of the two conditions → snapshot runqlat and timechart activity from the last ‘n’ secs.
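The flow above can be written as a small trigger loop: smooth the application-level response time with an exponentially weighted moving average, count power-state transitions per 250 ms window (for example by diffing counters such as the cpuidle reader sketched earlier), and only when both signals exceed their thresholds at the same time kick off the expensive runqlat / timechart snapshot. The class below is an illustrative sketch; the parameter names, thresholds, and the snapshot hook are assumptions, not values from the talk.

```cpp
#include <cstdint>
#include <iostream>

// One detection step, called once per 250 ms monitoring window.
//   response_ms : latest application-level response-time sample R
//   transitions : Tn, count of C-/P-state transitions in this window
class LatencySpikeDetector {
public:
    LatencySpikeDetector(double alpha, double heave_factor, uint64_t tn_threshold)
        : alpha_(alpha), heave_factor_(heave_factor), tn_threshold_(tn_threshold) {}

    void observe(double response_ms, uint64_t transitions) {
        // Long-range EWMA vs. short-range sample: an "upward heave" is a
        // sample well above the smoothed baseline.
        ewma_ = (ewma_ == 0.0) ? response_ms
                               : alpha_ * response_ms + (1.0 - alpha_) * ewma_;
        const bool latency_heave    = response_ms > heave_factor_ * ewma_;
        const bool many_transitions = transitions > tn_threshold_;

        // Only the overlap of the two cheap signals triggers the
        // expensive collection.
        if (latency_heave && many_transitions) snapshot();
    }

private:
    void snapshot() {
        // Illustrative hook: capture the last few seconds of scheduler and
        // timechart activity (runqlat, perf timechart); replace as needed.
        std::cout << "trigger: snapshot runqlat + perf timechart\n";
    }

    double   alpha_;
    double   heave_factor_;
    uint64_t tn_threshold_;
    double   ewma_ = 0.0;
};
```

A cadence of one observation per 250 ms window, an alpha around 0.1, and a heave factor of 1.5–2x are plausible starting points; the slide does not pin down specific values.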
  29. 29. Clues for False-Sharing (With Low Overhead). Step 1: Establish whether sufficient clues exist to suspect false sharing. (a) Likelihood: higher concurrency + low scaling + higher response times, with low IPC despite low LLC misses/instruction. (b) Sensitivity: insensitive to runq-lengths (not sensitive to CPU subscription). (c) Clues: higher number of coherence misses in L1 and/or L2 (PMC snoop events S2I, M2I); increased inter-socket link utilization in a multi-socket system.
  30. 30. Drilling down for concrete evidence of false sharing. Step 2 (conditionally upon Step 1 indicating possible false sharing): collect perf c2c profiles identifying the data and code addresses producing the contention. perf c2c: sampling-based detection of cachelines where false sharing was likely, based on the HITM event: read or write accesses for which a different core’s cache reports a “hit” in “modified” state (HITM). Provides insights into the data addresses, code addresses, processes, and threads that generate sharing conflicts.
  31. 31. What perf c2c profiling looks like: cacheline summary; cacheline access details (look for HITMs).
  32. 32. Courtesy: bing.com/images
  33. 33. Solution Space. When power management actions are suspected to provoke high tail latencies: 1. Choose less extreme power-performance settings (power-save, energy-efficient, etc.). 2. Explore changes in scheduler tunings, such as – a. Quicker preemption (reducing wakeup -> onproc) b. Smaller time-slices c. Different (usually lower) migration thresholds.
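Beyond governor and scheduler tunings, Linux also exposes the PM QoS /dev/cpu_dma_latency interface, which caps how deep a C-state the CPUs may enter for as long as a process holds the file open with a latency bound written into it. This knob is not mentioned on the slide; the sketch below is a hedged illustration and assumes the device node exists and the process has permission to write it.

```cpp
#include <cstdint>
#include <cstdio>
#include <fcntl.h>
#include <unistd.h>

// Hold a PM QoS constraint forbidding idle states with an exit latency
// above `max_exit_latency_us`. The constraint lasts only while the file
// descriptor stays open, so keep it for the latency-sensitive phase.
int hold_cpu_latency_constraint(int32_t max_exit_latency_us) {
    int fd = open("/dev/cpu_dma_latency", O_RDWR);
    if (fd < 0) {
        perror("open /dev/cpu_dma_latency");
        return -1;
    }
    if (write(fd, &max_exit_latency_us, sizeof(max_exit_latency_us)) !=
        static_cast<ssize_t>(sizeof(max_exit_latency_us))) {
        perror("write /dev/cpu_dma_latency");
        close(fd);
        return -1;
    }
    return fd;  // caller keeps this open; close(fd) releases the constraint
}

int main() {
    // Example: disallow C-states with more than 10 us exit latency while
    // the latency-sensitive portion of the service runs.
    int fd = hold_cpu_latency_constraint(10);
    if (fd < 0) return 1;

    // ... latency-sensitive work here ...

    close(fd);  // release the constraint, allowing deep C-states again
    return 0;
}
```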
  34. 34. Solution Space. When false-sharing is suspected to provoke high tail latencies: 1. Some data structure layout possibilities: a. Data structure / global variable padding (if possible) b. Changing the affected data structure to better separate (quasi-)immutable from mutable cachelines c. Splitting the data structures in question into sub-structures. 2. Possible computation strategy changes: a. Rate-limiting writers to cachelines that are accessed frequently by readers (sketched below) b. Colocating to the same socket or sub-NUMA clusters c. Making the code bimodal: normal computation until a monitor signals a rise in coherence events, and one of (2a/2b) after.
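As promised in item 2a above, here is a minimal sketch of rate-limiting a writer: updates are folded into a thread-local tally and published to the shared, reader-hot counter only once per batch, so the contended cacheline is written far less often. The counter name and batch size are illustrative assumptions, not from the talk.

```cpp
#include <atomic>
#include <cstdint>

// Shared counter that many reader threads poll frequently. Publishing to
// it on every event would bounce its cacheline between cores.
std::atomic<uint64_t> g_published_count{0};

// Writer-side batching: fold updates into a thread-local tally and touch
// the shared cacheline only once every kBatch events.
constexpr uint64_t kBatch = 1024;  // illustrative rate limit

void on_event() {
    thread_local uint64_t local_count = 0;
    if (++local_count == kBatch) {
        g_published_count.fetch_add(local_count, std::memory_order_relaxed);
        local_count = 0;
    }
}
```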
  35. 35. Summary • Latency instrumentation needs to be made as close to real time as possible. • Tracing needs to be combined with sampling over short intervals, and triggered by good precursors, so overhead is kept to a minimum. • We outlined two issues, false sharing and power management transitions, that may not arise frequently but can have measurable effects on tail latencies and can be hard to detect. In this presentation we have shown the role these two issues play in application performance, their detectability, and possible solutions.
  36. 36. Thank You Stay in Touch kshitij.a.doshi@intel.com harshad.s.sane@intel.com
