Computer Performance Microscopy with SHIM

  1. Computer Performance Microscopy with SHIM. Kathryn McKinley (Microsoft Research), Steve Blackburn (Australian National University), Xi Yang (Australian National University).
  2. 4 μops. Intel i7-4770, 3.4 GHz.
  3. Benchmark IPC [figure: IPC per benchmark, axis 0 to 4]. Lusearch is a DaCapo benchmark based on the widely used open source search engine framework Lucene. Plenty of room here!
  4. Interrupt Driven Profilers. Sampling at default 1 KHz, maximum 100 KHz.
  5. Method IPC, Lusearch [figure: IPC (axis 0 to 1.6) of the top 10 methods (74% of total execution time), sampled at default 1 KHz, maximum 100 KHz, and SHIM 10 MHz].
  6. Sampling IPC. Two counters: C (cycles) and R (retired instructions), read at successive times as (R0, C0), (R1, C1), (R2, C2), (R3, C3), which yield IPC1, IPC2, IPC3. $\mathrm{IPC}_t = (R_t - R_{t-1}) / (C_t - C_{t-1})$. IPC is a high frequency signal.
  7. Sampling Lusearch IPC [figure: three IPC traces (axis 0 to 2.5) sampled at SHIM 10 MHz, maximum 100 KHz, and default 1 KHz].
  8. #define DEFAULT_MAX_SAMPLE_RATE 100000
     /* perf samples are done in some very critical code paths (NMIs). If they take too much CPU time, the system can lock up and not get any real work done. This will drop the sample rate when ... */
     [table residue: profilers, SHIM, and simulators compared on HiFi, handy, and online; marks as extracted: ✓✗ ✓ ✗ ✗ ✓✓ ✓ ✓]
  9. insight
  10. Hardware and Software Generate Signals. Hardware signals: hardware performance counters. Software signals: A (x){ x.y = B(); x.z = C(); } [figure: A(), B(), C() along a time axis; memory locations].
  11. Signals [table residue: hardware signals and software signals against hardware counters and software tags; four ✓ marks].
  12. Observe Signals From Another Hardware Context
  13. SHIM design
  14. Observe Global Counters: LLC misses per cycle.
        while (true):
          for counter in LLC misses, cycles:
            buf[i++] = readCounter(counter)
  15. Observe Local Counters (HT1, HT2). HT1 IPC = Core IPC – HT2 SHIM IPC. [figure: HT1 IPC, Core IPC, and HT2 SHIM IPC traces, each on an axis of 0 to 4]
        while (true):
          for counter in HT2 SHIM, Core, Cycles:
            buf[i++] = readCounter(counter);
  16. Correlate Hardware and Software Signals. [figure: HT1 IPC, Core IPC, and HT2 SHIM IPC traces, each on an axis of 0 to 4, aligned with the HT1 stack running A(), B(), C() in intervals 1, 2, 3; labels HT1 and HT2]
        while (true):
          for counter in HT2 SHIM, Core, cycles:
            buf[i++] = readCounter(counter);
          tid = thread on HT1
          buf[i++] = tid.method;
  17. Fidelity
  18. Raw Samples [figure: % of samples (log scale) against IPC (log scale)].
  19. Problem: Samples Are Not Atomic. Counters: C (cycles) and R (retired instructions), read as (R0, C0), (R1, C1), (R2, C2), (R3, C3), which yield IPC1, IPC2, IPC3. $\mathrm{IPC}_t = (R_t - R_{t-1}) / (C_t - C_{t-1})$. [figure: two of the three IPC samples are marked ✓ and one ✗]
  20. Solution: Use Clock As Ground Truth. Each sample is bracketed by two cycle reads, Cs (start) and Ce (end): (Cs0, R0, C0, Ce0), (Cs1, R1, C1, Ce1), (Cs2, R2, C2, Ce2), (Cs3, R3, C3, Ce3), which yield IPC1, IPC2, IPC3. $\mathrm{CPC}_t = (Ce_t - Ce_{t-1}) / (Cs_t - Cs_{t-1})$; this should be 1! CPC1 = 1.0 +/- 1% (✓), CPC2 = 1.0 +/- 1% (✓), CPC3 != 1.0 +/- 1% (✗).
        while (true):
          buf[i++] = readCycle(); // read Cs
          for counter in HT2 SHIM, Core, cycles:
            buf[i++] = readCounter(counter);
          buf[i++] = readCycle(); // read Ce
          tid = thread on HT1
          buf[i++] = tid.method;
  21. Filter Lusearch Samples [figure: % of samples (log scale) for raw IPC, raw CPC, filtered IPC, and filtered CPC in [0.99, 1.01]].
  22. overheads
  23. Software Signal, Other Core. Observe method and loop IDs. [figure: overhead normalized to without SHIM (axis 0 to 4), bars labeled 30 cycles and 1213 cycles] Overheads are from write invalidate transactions. 3 MHz: more than an order of magnitude better than ‘maximum’. 113 MHz: more than three orders of magnitude better than ‘maximum’.
  24. Software Signal, Same Core. Observe method and loop IDs. [figure: overhead normalized to without SHIM (axis 0 to 2.5), bars labeled 15 cycles and 1505 cycles] Overheads are from sharing the core resources.
  25. Hardware and Software Signals, Same Core. Correlate IPC with method and loop IDs. [figure: overhead normalized to without SHIM (axis 0 to 1.8), labeled 495 cycles]
  26. Reducing Overheads
      • Bursty sampling
      • SMT priorities
      • Heterogeneous multicore
      • Globally visible per-thread performance counters
  27. Conclusion
      • High frequency sampling is important
      • SHIM observes signals directly, low overhead
      • Cycles per cycle filters samples
      • Opportunities for hardware analysis
      • Opportunities for hardware design
      Questions? https://github.com/ShimProfiler/SHIM
  28. Backup Slides
  29. 100 KHz (10 μs): high or low?
  30. 10 μs is not bad
  31. 10 μs is not bad? 25 μs! [figure: Simple Address Book; Name: Xi YANG; Email: xi.yang@anu.edu.au]
  32. 100 KHz (10 μs) won’t see this: the 25 μs life of the address_book.SerializeToOstream(&output). Sampling at 5 MHz, 608 cycles.
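
The IPC formula on slide 6 and the cycles-per-cycle (CPC) filter on slides 19 to 21 can be sketched in a few lines of Python. This is an illustrative sketch only, not SHIM's implementation: the `Sample` record, the function names, and the sample values are hypothetical, and the 1% tolerance is taken from the "+/- 1%" shown on slide 20.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    cs: int  # cycle counter read before the other counters (Cs)
    r: int   # retired instructions (R)
    c: int   # cycles (C)
    ce: int  # cycle counter read after the other counters (Ce)

def rates(prev, cur):
    """Return (IPC, CPC) for the interval between two consecutive samples."""
    ipc = (cur.r - prev.r) / (cur.c - prev.c)
    cpc = (cur.ce - prev.ce) / (cur.cs - prev.cs)
    return ipc, cpc

def filtered_ipc(samples, tolerance=0.01):
    """Keep the IPC of an interval only if its CPC is within 1.0 +/- tolerance."""
    kept = []
    for prev, cur in zip(samples, samples[1:]):
        # Skip degenerate intervals instead of dividing by zero.
        if cur.c <= prev.c or cur.cs <= prev.cs:
            continue
        ipc, cpc = rates(prev, cur)
        if abs(cpc - 1.0) <= tolerance:
            kept.append(ipc)
    return kept

# Hypothetical readings: the last sample was delayed between its Cs and Ce
# reads, so its CPC is far from 1 and its IPC is discarded.
samples = [
    Sample(cs=0,    r=0,    c=10,   ce=20),
    Sample(cs=1000, r=1500, c=1010, ce=1020),
    Sample(cs=2000, r=3500, c=2010, ce=2020),
    Sample(cs=3000, r=4000, c=3010, ce=3500),
]
print(filtered_ipc(samples))  # [1.5, 2.0]
```

The third interval has CPC = 1480 / 1000 = 1.48, so it falls outside [0.99, 1.01] and is dropped, mirroring the ✗ on slide 20.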
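
Slide 15 derives the IPC of the observed hardware thread as HT1 IPC = Core IPC – HT2 SHIM IPC. A minimal sketch of that subtraction follows, assuming all three counters were read over the same interval; the function name, parameter names, and numbers are made up for illustration.

```python
def ht1_ipc(core_instructions, shim_instructions, cycles):
    """IPC of the application thread on HT1 over one sampling interval.

    core_instructions: instructions retired by the whole core (HT1 + HT2)
    shim_instructions: instructions retired by the observer thread on HT2
    cycles:            cycles elapsed in the same interval

    Algebraically the same as Core IPC - HT2 SHIM IPC.
    """
    if cycles <= 0:
        raise ValueError("interval must span at least one cycle")
    return (core_instructions - shim_instructions) / cycles

# Hypothetical interval: the core retired 2600 instructions in 1000 cycles,
# 1000 of which belonged to the observer thread.
print(ht1_ipc(core_instructions=2600, shim_instructions=1000, cycles=1000))  # 1.6
```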
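
The sampling rates quoted on slide 23 and in the backup slides can be checked with simple arithmetic. The sketch below assumes that the "30 cycles" and "1213 cycles" labels are sampling periods on the 3.4 GHz machine from slide 2; under that assumption they reproduce the 113 MHz and roughly 3 MHz figures. The helper names are hypothetical.

```python
CLOCK_HZ = 3.4e9  # Intel i7-4770 clock from slide 2

def rate_hz(period_cycles):
    """Sampling frequency for one sample every `period_cycles` cycles."""
    return CLOCK_HZ / period_cycles

def samples_during(event_seconds, sample_rate_hz):
    """Expected number of samples that land inside an event of the given length."""
    return event_seconds * sample_rate_hz

print(round(rate_hz(30) / 1e6))       # 113 (MHz)
print(round(rate_hz(1213) / 1e6, 1))  # 2.8 (MHz)

# A 25 microsecond event, as in the address book example on slides 31 and 32:
print(round(samples_during(25e-6, 100e3), 1))  # 2.5 samples at 100 KHz
print(round(samples_during(25e-6, 5e6)))       # 125 samples at 5 MHz
```

At 100 KHz only two or three samples fall inside the 25 μs call, too few to show any structure within it, while 5 MHz yields over a hundred.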