Automating the Hunt for Non-Obvious Sources of Latency Spreads

False sharing references and power management can trigger wide latency spreads, but are neither directly observable nor easily traced to causes. This talk describes how to diagnose the problems quickly, and outlines several remedies.

  1. 1. Automating the Hunt for Non-Obvious Sources of Latency Spreads Kshitij Doshi, Sr. Principal Engr at Intel Harshad S. Sane, Principal Engr at Intel Datacenter & AI
  2. 2. Kshitij Doshi ■ Ph.D., Rice Univ – Comm. efficient parallel algorithms ■ Performance of Systems, DB, Cloud-native apps ■ Research interests in storage, memory, distributed systems ■ 20 y at Intel; previously, 13 y at Unix Systems Labs & Novell. Datacenter & AI
  3. 3. Harshad Sane ■ Harshad Sane is a Principal Engineer in Intel's Data Center and AI group ■ Deep technical expertise in system software, memory, and CPU architectures. ■ Specializes in Performance Engineering, with extensive experience in telemetry, observability, monitoring, and software optimization. Datacenter & AI
  4. 4. Agenda ■ Section 1 - About tail latency spreads ■ Section 2 - Two non-obvious causes of latency escapes ■ Section 3 - How to decide if either of them is hurting your application ■ Section 4 - Mitigations, if they are hurting your application ■ Section 5 - Summary
  5. 5. Hurdles are not always predictable. Courtesy: bing.com/images
  6. 6. ScyllaDB is engineered for use cases needing high throughput and predictable, low latencies... https://resources.scylladb.com/videos/build-low-latency-applications-in-rust-on-scylladb (diagram: Query / Commitlog / Compaction queues feeding a Userspace I/O Scheduler in front of the Disk; 0.5 msec)
  7. 7. (diagrams: a 3-tier architecture with Frontend, Database, and Services; a microservices architecture.) Latency landmines can be present, however, in other layers, in inter-service interactions, or in infrastructure services.
  8. 8. When Small Performance Fluctuations Magnify Into Sudden, Large Spikes in Response Times… frequently there is some issue that intersects in an unpredictable manner with the execution of normal hotspots.
  9. 9. Consider a Streamlined Flow of Execution, repeating over and over with minor perturbations in end-to-end latencies for each iteration.
  10. 10. Where Something Goes Out of Balance Momentarily and Causes a Hiccup. Such a hiccup . . . propagates and throws both timing and resource usage out of balance, for some period of time. But this period of non-streamlined flow can feed on itself and produce secondary spikes in end-to-end latencies, even as overall flow throughput evens out.
  11. 11. Consider two such issues . . .
  12. 12. First issue: A first module in an application (diagram: wait 100 ms; T0: set random seed S; T1–T5: use S) and a second module in the application (diagram: producer puts X into queue L; consumer gets Y from queue L). The producer frequently modifies the tail of queue L, while the consumer frequently modifies the head of queue L.
  13. 13. First issue: false sharing. (Same diagram as the previous slide.) The threads working in the two modules, which have no logical intersection, do however get cross-coupled if the variable S ends up on a cacheline that is also used for storing either or both of the head / tail pointers of queue L. Not a significant problem unless updates of queue L become frequent.
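To make the cross-coupling concrete, here is a minimal C++ sketch (not from the talk) of how a rarely-written seed and frequently-updated queue pointers can end up on one cacheline, and how alignas padding separates them; the struct and field names are illustrative assumptions.

```cpp
#include <atomic>
#include <cstdint>

// Unpadded layout: the seed used by the first module and the queue
// head/tail pointers used by the producer/consumer can share one
// 64-byte cacheline, so frequent queue updates keep invalidating the
// line that the seed readers have to pull back in (false sharing).
struct SharedStateUnpadded {
    std::atomic<uint64_t> seed;   // written once, read by T1..T5
    std::atomic<void*>    head;   // modified frequently by the consumer
    std::atomic<void*>    tail;   // modified frequently by the producer
};

// Padded layout: each hot field gets its own cacheline, so readers of
// `seed` no longer suffer coherence misses caused by queue traffic.
struct alignas(64) SharedStatePadded {
    alignas(64) std::atomic<uint64_t> seed;
    alignas(64) std::atomic<void*>    head;
    alignas(64) std::atomic<void*>    tail;
};

static_assert(sizeof(SharedStatePadded) >= 3 * 64,
              "each hot field should occupy its own cacheline");
```

Where the toolchain supports it, C++17's std::hardware_destructive_interference_size (in <new>) can replace the hard-coded 64.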
  14. 14. Second Issue: CPU Active (P-states) and Sleep (C-states) States. (diagram: high CPU frequency = high power consumption, more instructions/time; low CPU frequency = low power consumption, fewer instructions/time; one of several sleep states = very low power consumption, active idle.) Operating system and hardware algorithms, together with system configuration parameters, determine the conditions under which CPUs transition among different states of operation.
  15. 15. Second Issue: CPU Power Management Transitions. (Same diagram.) Transitions out of deeper sleep states into active execution take many microseconds up to the point of normal instruction execution, and also experience transient effects of colder caches.
  16. 16. Second issue: CPU Power (and Sleep) State Transitions. (Same diagram.) Transitions out of deeper sleep states into active execution take many microseconds up to the point of normal instruction execution, and also experience transient effects of colder caches. Transitions from low power states to normal execution go through a series of frequency step-ups, causing software actions that are dependency-chained to stall due to inter-thread or inter-process data/event waiting.
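A rough way to observe this cost on a live system is to time how long a notified thread takes to actually resume after it has been idle long enough for power management to act. The following is a minimal sketch, not the talk's methodology; the 200 ms idle gap is an arbitrary choice, and a single measurement should be repeated many times and examined as a distribution.

```cpp
#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

// Measure the delay between notifying a sleeping thread and that thread
// actually running again. After a long idle gap the waiter's CPU may have
// entered a deep C-state, and the wake-up cost shows up in this delay
// (together with scheduling and lock-reacquisition overhead).
int main() {
    std::mutex m;
    std::condition_variable cv;
    bool go = false;
    std::chrono::steady_clock::time_point t_notify, t_woken;

    std::thread waiter([&] {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return go; });
        t_woken = std::chrono::steady_clock::now();
    });

    // Let the waiter block long enough for power management to act.
    std::this_thread::sleep_for(std::chrono::milliseconds(200));

    {
        std::lock_guard<std::mutex> lk(m);
        go = true;
        t_notify = std::chrono::steady_clock::now();
    }
    cv.notify_one();
    waiter.join();

    std::cout << "wakeup latency: "
              << std::chrono::duration_cast<std::chrono::microseconds>(
                     t_woken - t_notify).count()
              << " us\n";
    return 0;
}
```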
  17. 17. Ideal progression . . . (diagram: threads ThrA, ThrB, ThrC triggering each other in sequence over time: trigger ThrB, trigger ThrC, trigger ThrA, trigger ThrB.)
  18. 18. (Same diagram, threads ThrA, ThrB, ThrC.) Imagine that this CPU is transitioning out of a deep sleep state . . .
  19. 19. (Same diagram, annotated: delayed trigger for ThrA; ThrC runs slower; possibly this CPU has entered a deeper sleep as a result; ThrA runs slower as a result; delayed trigger for ThrB.) Imagine that this CPU is transitioning out of a deep sleep state . . . Cascading Sleep -> Wakeup -> Sleep transients can take time to fade out, and cause high peak latencies … … even though the impact on average latency gets amortized.
  20. 20. Detecting and untangling the causes of these intersections of issues is very challenging, particularly if a high degree of instrumentation, such as tracing or logging, interferes with and distorts the effects. These effects are not easily noticed through lightweight sampling or counting of events.
  21. 21. Challenges. Collecting traces: security; collection overheads at CPU, caches, and bandwidth to memory and storage/network. Analyzing traces: like searching for a needle in multiple haystacks without knowing if a needle is to be found at all. Scheduling collection and analysis: like figuring out when a crime is going to occur in order to launch crime-scene analysis.
  22. 22. Correlated Events That Can Be Collected at Low Overhead (good but circumstantial clues). Indicating power transitions: • Turbostat • Powertop • Runqueue lengths. Indicating cache coherence issues: • Sharp IPC drop with concurrency • No obvious mem/disk data bottleneck • High utilization, low runqueue lengths. • CoreFreq - https://github.com/cyring/CoreFreq
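One cheap source of such clues, complementary to turbostat and powertop, is the Linux cpuidle sysfs interface, which exposes per-CPU, per-C-state entry counts. The sketch below is not a tool from the talk; it assumes the standard /sys/devices/system/cpu/cpuN/cpuidle/stateM/{name,usage} layout, and sampling it periodically and diffing the counts yields a low-overhead "C-state entries per interval" signal.

```cpp
#include <cctype>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <string>

namespace fs = std::filesystem;

// Read a single-line sysfs attribute.
static std::string read_attr(const fs::path& p) {
    std::ifstream f(p);
    std::string s;
    std::getline(f, s);
    return s;
}

int main() {
    // For every cpuN directory, print how many times each idle (C-)state
    // has been entered since boot.
    for (const auto& entry : fs::directory_iterator("/sys/devices/system/cpu")) {
        const std::string name = entry.path().filename().string();
        if (name.size() < 4 || name.rfind("cpu", 0) != 0 ||
            !std::isdigit(static_cast<unsigned char>(name[3])))
            continue;

        const fs::path idle_dir = entry.path() / "cpuidle";
        if (!fs::exists(idle_dir)) continue;

        for (const auto& state : fs::directory_iterator(idle_dir)) {
            std::cout << name << " "
                      << read_attr(state.path() / "name")    // e.g. POLL, C1, C6
                      << ": entered "
                      << read_attr(state.path() / "usage")   // entry count since boot
                      << " times\n";
        }
    }
    return 0;
}
```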
  23. 23. Picking up on disruptive events Courtesy: bing.com/images
  24. 24. To see what is available: sudo perf list | grep cstate Example output: cstate_core/c3-residency/ cstate_core/c6-residency/ cstate_core/c7-residency/ cstate_pkg/c2-residency/ … Capturing C-State Transitions
  25. 25. Timecharting: sudo perf timechart record; sudo perf timechart. Monitoring sleep, wait, and run times per CPU. (Example timechart: 4 threads in intermittent sleeps, 4 threads with variable I/O wait durations.)
  26. 26. Visualizing scheduling events by time. 1. perf sched timehist (-Mw): migrations and wakeups; tracks scheduler latency by event, including time to wakeup and latency from wakeup to run (sched delay). 2. perf sched map: per-CPU timeline of processes as they are context switched.
  27. 27. Monitoring and controlling P-states. P-states (≡ frequencies): ‒ BIOS controlled ‒ OS controlled via scaling drivers ‒ HW controlled P-states ‒ Turbo. To monitor the P-states: ‒ Turbostat, CoreFreq ‒ Profiling tools (Perf, Vtune, etc.). To control P-states: OS-controlled P-states are manually configured through scaling drivers, such as ‒ Cpupower ‒ CoreFreq (https://github.com/cyring/CoreFreq) to control the performance governor for P-states. Credits: https://images.anandtech.com/doci/9582/43.jpg
  28. 28. Response time R, monitored at the application level → exponentially-weighted moving-window average vs. short-range average → detect upward heave. C-state and P-state monitoring → Tn = transition count over 250 ms windows → Tn > threshold. Detect overlap of the two conditions → snapshot runqlat and timechart activity from the last ‘n’ secs.
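The flow above can be written as a small trigger loop: smooth the application-level response time with an exponentially weighted moving average, count power-state transitions per 250 ms window (for example by diffing counters such as the cpuidle reader sketched earlier), and only when both signals exceed their thresholds at the same time kick off the expensive runqlat / timechart snapshot. The class below is an illustrative sketch; the parameter names, thresholds, and the snapshot hook are assumptions, not values from the talk.

```cpp
#include <cstdint>
#include <iostream>

// One detection step, called once per 250 ms monitoring window.
//   response_ms : latest application-level response-time sample R
//   transitions : Tn, count of C-/P-state transitions in this window
class LatencySpikeDetector {
public:
    LatencySpikeDetector(double alpha, double heave_factor, uint64_t tn_threshold)
        : alpha_(alpha), heave_factor_(heave_factor), tn_threshold_(tn_threshold) {}

    void observe(double response_ms, uint64_t transitions) {
        // Long-range EWMA vs. short-range sample: an "upward heave" is a
        // sample well above the smoothed baseline.
        ewma_ = (ewma_ == 0.0) ? response_ms
                               : alpha_ * response_ms + (1.0 - alpha_) * ewma_;
        const bool latency_heave    = response_ms > heave_factor_ * ewma_;
        const bool many_transitions = transitions > tn_threshold_;

        // Only the overlap of the two cheap signals triggers the
        // expensive collection.
        if (latency_heave && many_transitions) snapshot();
    }

private:
    void snapshot() {
        // Illustrative hook: capture the last few seconds of scheduler and
        // timechart activity (runqlat, perf timechart); replace as needed.
        std::cout << "trigger: snapshot runqlat + perf timechart\n";
    }

    double   alpha_;
    double   heave_factor_;
    uint64_t tn_threshold_;
    double   ewma_ = 0.0;
};
```

A cadence of one observation per 250 ms window, an alpha around 0.1, and a heave factor of 1.5–2x are plausible starting points; the slide does not pin down specific values.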
  29. 29. Clues for False-Sharing (With Low Overhead). Step 1: Establish whether sufficient clues exist to suspect false sharing. (a) Likelihood: higher concurrency + low scaling + higher response times, with low IPC despite low LLC misses/instruction. (b) Sensitivity: insensitive to runq-lengths (not sensitive to CPU subscription). (c) Clues: higher number of coherence misses in L1 and/or L2 (PMC snoop events S2I, M2I); increased inter-socket link utilization in a multi-socket system.
  30. 30. Drilling down for concrete evidence of false sharing. Step 2 (conditionally upon Step 1 indicating possible false sharing): collect perf c2c profiles identifying the data and code addresses producing the contention. perf c2c: sampling-based detection of cachelines where false sharing was likely, based on the HITM event: read or write accesses for which a different core’s cache reports a “hit” in “modified” state (HITM). Provides insights into the data addresses, code addresses, processes, and threads that generate sharing conflicts.
  31. 31. What perf c2c profiling looks like: cacheline summary; cacheline access details (look for HITMs).
  32. 32. Courtesy: bing.com/images
  33. 33. Solution Space. When power management actions are suspected to provoke high tail latencies: 1. Choose less extreme power-performance settings (power-save, energy-efficient, etc.). 2. Explore changes in scheduler tunings, such as – a. Quicker preemption (reducing wakeup -> onproc) b. Smaller time-slices c. Different (usually lower) migration thresholds.
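Beyond governor and scheduler tunings, Linux also exposes the PM QoS /dev/cpu_dma_latency interface, which caps how deep a C-state the CPUs may enter for as long as a process holds the file open with a latency bound written into it. This knob is not mentioned on the slide; the sketch below is a hedged illustration and assumes the device node exists and the process has permission to write it.

```cpp
#include <cstdint>
#include <cstdio>
#include <fcntl.h>
#include <unistd.h>

// Hold a PM QoS constraint forbidding idle states with an exit latency
// above `max_exit_latency_us`. The constraint lasts only while the file
// descriptor stays open, so keep it for the latency-sensitive phase.
int hold_cpu_latency_constraint(int32_t max_exit_latency_us) {
    int fd = open("/dev/cpu_dma_latency", O_RDWR);
    if (fd < 0) {
        perror("open /dev/cpu_dma_latency");
        return -1;
    }
    if (write(fd, &max_exit_latency_us, sizeof(max_exit_latency_us)) !=
        static_cast<ssize_t>(sizeof(max_exit_latency_us))) {
        perror("write /dev/cpu_dma_latency");
        close(fd);
        return -1;
    }
    return fd;  // caller keeps this open; close(fd) releases the constraint
}

int main() {
    // Example: disallow C-states with more than 10 us exit latency while
    // the latency-sensitive portion of the service runs.
    int fd = hold_cpu_latency_constraint(10);
    if (fd < 0) return 1;

    // ... latency-sensitive work here ...

    close(fd);  // release the constraint, allowing deep C-states again
    return 0;
}
```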
  34. 34. Solution Space. When false-sharing is suspected to provoke high tail latencies: 1. Some data structure layout possibilities: a. Data structure / global variable padding (if possible) b. Changing the affected data structure to better separate (quasi-)immutable from mutable cachelines c. Splitting the data structures in question into sub-structures. 2. Possible computation strategy changes: a. Rate-limiting writers to cachelines that are accessed frequently by readers (sketched below) b. Colocating to the same socket or sub-NUMA clusters c. Making the code bimodal: normal computation until a monitor signals a rise in coherence events, and one of (2a/2b) after.
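As promised in item 2a above, here is a minimal sketch of rate-limiting a writer: updates are folded into a thread-local tally and published to the shared, reader-hot counter only once per batch, so the contended cacheline is written far less often. The counter name and batch size are illustrative assumptions, not from the talk.

```cpp
#include <atomic>
#include <cstdint>

// Shared counter that many reader threads poll frequently. Publishing to
// it on every event would bounce its cacheline between cores.
std::atomic<uint64_t> g_published_count{0};

// Writer-side batching: fold updates into a thread-local tally and touch
// the shared cacheline only once every kBatch events.
constexpr uint64_t kBatch = 1024;  // illustrative rate limit

void on_event() {
    thread_local uint64_t local_count = 0;
    if (++local_count == kBatch) {
        g_published_count.fetch_add(local_count, std::memory_order_relaxed);
        local_count = 0;
    }
}
```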
  35. 35. Summary • Latency instrumentation needs to be made as close to real time as possible. • Tracing needs to be combined with sampling over short intervals, and triggered by good precursors, so overhead is kept to a minimum. • We outlined two issues, false sharing and power management transitions, that may not arise frequently but can have measurable effects on tail latencies and can be hard to detect. In this presentation we have shown the role these two issues play in application performance, their detectability, and possible solutions.
  36. 36. Thank You Stay in Touch kshitij.a.doshi@intel.com harshad.s.sane@intel.com
