SlideShare una empresa de Scribd logo
1 de 88
Descargar para leer sin conexión
@kuksenk0#JavaPerfAndHWC
Java Performance:
Speedup your applications with hardware counters
Sergey Kuksenko
@kuksenk0
@kuksenk0#JavaPerfAndHWC
The following is intended to outline our general product
direction. It is intended for information purposes only, and may
not be incorporated into any contract. It is not a commitment
to deliver any material, code, or functionality, and should not be
relied upon in making purchasing decisions. The development,
release, and timing of any features or functionality described for
Oracle’s products remains at the sole discretion of Oracle.
2/64
@kuksenk0#JavaPerfAndHWC
Intriguing Introductory Example
Gotcha, here is my hot method
or
On Algorithmic O⋆
ptimizations
⋆
just-O
3/64
@kuksenk0#JavaPerfAndHWC
Example 1: Very Hot Code
public Matrix multiply(Matrix other) {
int size = data.length;
Matrix result = new Matrix(size);
int [][] R = result.data;
int [][] A = this.data;
int [][] B = other.data;
for (int i = 0; i < size; i++) {
for (int j = 0; j < size; j++) {
int s = 0;
for (int k = 0; k < size; k++) {
s += A[i][k] * B[k][j];
}
R[i][j] = s;
}
}
return result;
}
4/64
@kuksenk0#JavaPerfAndHWC
Example 1: Very Hot Code
public Matrix multiply(Matrix other) {
int size = data.length;
Matrix result = new Matrix(size);
int [][] R = result.data;
int [][] A = this.data;
int [][] B = other.data;
for (int i = 0; i < size; i++) {
for (int j = 0; j < size; j++) {
int s = 0;
for (int k = 0; k < size; k++) {
s += A[i][k] * B[k][j];
}
R[i][j] = s;
}
}
return result;
}
𝑂( 𝑁3)⋆
⋆
big-O
4/64
@kuksenk0#JavaPerfAndHWC
Example 1: "Ok Google"
𝑂( 𝑁2.81)
5/64
@kuksenk0#JavaPerfAndHWC
Example 1: Very Hot Code
N multiply strassen
32 33 𝜇𝑠
64 252 𝜇𝑠
128 3019 𝜇𝑠 2119 𝜇𝑠
256 56 𝑚𝑠 15 𝑚𝑠
512 488 𝑚𝑠 111 𝑚𝑠
1024 3270 𝑚𝑠 785 𝑚𝑠
2048 32 𝑠 6 𝑠
4096 309 𝑠 42 𝑠
8192 2611 𝑠 293 𝑠
time/op
6/64
@kuksenk0#JavaPerfAndHWC
Try the other way
7/64
@kuksenk0#JavaPerfAndHWC
Example 1: JMH, -prof perfnorm, N=256⋆
per multiplication per iteration
CPI 1.06
cycles 146 × 106
8.7
instructions 137 × 106
8.2
L1-loads 68 × 106
4
L1-load-misses 33 × 106
2
L1-stores 0.56 × 106
L1-store-misses 38357
LLC-loads 26 × 106
1.6
LLC-stores 11595
⋆
∼ 256Kb per matrix
8/64
@kuksenk0#JavaPerfAndHWC
Example 1: Very Hot Code
public Matrix multiply(Matrix other) {
int size = data.length;
Matrix result = new Matrix(size);
int [][] R = result.data;
int [][] A = this.data;
int [][] B = other.data;
for (int i = 0; i < size; i++) {
for (int j = 0; j < size; j++) {
int s = 0;
for (int k = 0; k < size; k++) {
s += A[i][k] * B[k][j] ;
}
R[i][j] = s;
}
}
return result;
}
L1-load-misses
9/64
@kuksenk0#JavaPerfAndHWC
Example 1: Very Hot Code
public Matrix multiply(Matrix other) {
int size = data.length;
Matrix result = new Matrix(size);
int [][] R = result.data;
int [][] A = this.data;
int [][] B = other.data;
for (int i = 0; i < size; i++) {
for (int j = 0; j < size; j++) {
int s = 0;
for (int k = 0; k < size; k++) {
s += A[i][k] * B[k][j] ;
}
R[i][j] = s;
}
}
return result;
}
9/64
@kuksenk0#JavaPerfAndHWC
Example 1: Very Hot Code
public Matrix multiplyIKJ(Matrix other) {
int size = data.length;
Matrix result = new Matrix(size);
int [][] R = result.data;
int [][] A = this.data;
int [][] B = other.data;
for (int i = 0; i < size; i++) {
for (int k = 0; k < size; k++) {
int aik = A[i][k];
for (int j = 0; j < size; j++) {
R[i][j] += aik * B[k][j];
}
}
}
return result;
}
10/64
@kuksenk0#JavaPerfAndHWC
Example 1: Very Hot Code
N multIJK strassenIJK multIKJ
32 33 𝜇𝑠 13 𝜇𝑠
64 252 𝜇𝑠 80 𝜇𝑠
128 3019 𝜇𝑠 2119 𝜇𝑠 464 𝜇𝑠
256 56 𝑚𝑠 15 𝑚𝑠 4 𝑚𝑠
512 488 𝑚𝑠 111 𝑚𝑠 29 𝑚𝑠
1024 3270 𝑚𝑠 785 𝑚𝑠 369 𝑚𝑠
2048 32 𝑠 6 𝑠 3.4 𝑠
4096 309 𝑠 42 𝑠 25 𝑠
8192 2611 𝑠 293 𝑠 210 𝑠
time/op
11/64
@kuksenk0#JavaPerfAndHWC
Example 1: Very Hot Code
N multIJK strassenIJK multIKJ strassenIKJ
32 33 𝜇𝑠 13 𝜇𝑠
64 252 𝜇𝑠 80 𝜇𝑠
128 3019 𝜇𝑠 2119 𝜇𝑠 464 𝜇𝑠
256 56 𝑚𝑠 15 𝑚𝑠 4 𝑚𝑠
512 488 𝑚𝑠 111 𝑚𝑠 29 𝑚𝑠 25 𝑚𝑠
1024 3270 𝑚𝑠 785 𝑚𝑠 369 𝑚𝑠 192 𝑚𝑠
2048 32 𝑠 6 𝑠 3.4 𝑠 1.4 𝑠
4096 309 𝑠 42 𝑠 25 𝑠 10 𝑠
8192 2611 𝑠 293 𝑠 210 𝑠 73 𝑠
time/op
11/64
@kuksenk0#JavaPerfAndHWC
Example 1: JMH, -prof perfnorm, N=256
IJK IKJ
multiply iteration multiply iteration
CPI 1.06 0.51
cycles 146 × 106
8.7 9.7 × 106
0.6
instructions 137 × 106
8.2 19 × 106
1.1
L1-loads 68 × 106
4 5.4 × 106
0.3
L1-load-misses 33 × 106
2 1.1 × 106
0.1
L1-stores 0.56 × 106
2.7 × 106
0.2
L1-store-misses 38357 8959
LLC-loads 26 × 106
1.6 0.3 × 106
LLC-stores 11595 3532
12/64
@kuksenk0#JavaPerfAndHWC
Example 1: Free beer benefits!
cycles insts
...
6.42% 5.54% 0x00007ff6591d20c5: vmovdqu %ymm1 ,0x10(%r14 ,%rbx ,4) ;* iastore
9.49% 11.91% 0x00007ff6591d20cc: add $0x8 ,%ebx ;*iinc
0.15% 0.05% 0x00007ff6591d20cf: cmp %r11d ,%ebx
0x00007ff6591d20d2: jl 0x00007ff6591d20b2 ;* if_icmpge
...
13/64
@kuksenk0#JavaPerfAndHWC
Example 1: Free beer benefits!
cycles insts
...
6.42% 5.54% 0x00007ff6591d20c5: vmovdqu %ymm1 ,0x10(%r14 ,%rbx ,4) ;* iastore
9.49% 11.91% 0x00007ff6591d20cc: add $0x8 ,%ebx ;*iinc
0.15% 0.05% 0x00007ff6591d20cf: cmp %r11d ,%ebx
0x00007ff6591d20d2: jl 0x00007ff6591d20b2 ;* if_icmpge
...
Vectorization (SSE/AVX)!
13/64
@kuksenk0#JavaPerfAndHWC
Chapter 1
Performance Optimization Methodology:
A Very Short Introduction.
Three magic questions and the direction of the
journey
14/64
@kuksenk0#JavaPerfAndHWC
Three magic questions
15/64
@kuksenk0#JavaPerfAndHWC
Three magic questions
What?
∙ What prevents my application to work faster?
(monitoring)
15/64
@kuksenk0#JavaPerfAndHWC
Three magic questions
What? ⇒ Where?
∙ What prevents my application to work faster?
(monitoring)
∙ Where does it hide?
(profiling)
15/64
@kuksenk0#JavaPerfAndHWC
Three magic questions
What? ⇒ Where? ⇒ How?
∙ What prevents my application to work faster?
(monitoring)
∙ Where does it hide?
(profiling)
∙ How to stop it messing with performance?
(tuning/optimizing)
15/64
@kuksenk0#JavaPerfAndHWC
Top-Down Approach
∙ System Level
– Network, Disk, OS, CPU/Memory
∙ JVM Level
– GC/Heap, JIT, Classloading
∙ Application Level
– Algorithms, Synchronization, Threading, API
∙ Microarchitecture Level
– Caches, Data/Code alignment, CPU Pipeline Stalls
16/64
@kuksenk0#JavaPerfAndHWC
Methodology in Essence
17/64
@kuksenk0#JavaPerfAndHWC
Methodology in Essence
http://j.mp/PerfMindMap
17/64
@kuksenk0#JavaPerfAndHWC
Let’s go (Top-Down)
Do system monitoring (e.g. mpstat)
18/64
@kuksenk0#JavaPerfAndHWC
Let’s go (Top-Down)
Do system monitoring (e.g. mpstat)
∙ Lots of %sys ⇒ . . .
⇓
18/64
@kuksenk0#JavaPerfAndHWC
Let’s go (Top-Down)
Do system monitoring (e.g. mpstat)
∙ Lots of %sys ⇒ . . .
⇓
∙ Lots of %irq, %soft ⇒ . . .
⇓
18/64
@kuksenk0#JavaPerfAndHWC
Let’s go (Top-Down)
Do system monitoring (e.g. mpstat)
∙ Lots of %sys ⇒ . . .
⇓
∙ Lots of %irq, %soft ⇒ . . .
⇓
∙ Lots of %iowait ⇒ . . .
⇓
18/64
@kuksenk0#JavaPerfAndHWC
Let’s go (Top-Down)
Do system monitoring (e.g. mpstat)
∙ Lots of %sys ⇒ . . .
⇓
∙ Lots of %irq, %soft ⇒ . . .
⇓
∙ Lots of %iowait ⇒ . . .
⇓
∙ Lots of %idle ⇒ . . .
⇓
18/64
@kuksenk0#JavaPerfAndHWC
Let’s go (Top-Down)
Do system monitoring (e.g. mpstat)
∙ Lots of %sys ⇒ . . .
⇓
∙ Lots of %irq, %soft ⇒ . . .
⇓
∙ Lots of %iowait ⇒ . . .
⇓
∙ Lots of %idle ⇒ . . .
⇓
∙ Lots of %user
18/64
@kuksenk0#JavaPerfAndHWC
Chapter 2
High CPU Load
and
the main question:
«who/what is to blame?»
19/64
@kuksenk0#JavaPerfAndHWC
CPU Utilization
∙ What does ∼100% CPU Utilization mean?
20/64
@kuksenk0#JavaPerfAndHWC
CPU Utilization
∙ What does ∼100% CPU Utilization mean?
– OS has enough tasks to schedule
20/64
@kuksenk0#JavaPerfAndHWC
CPU Utilization
∙ What does ∼100% CPU Utilization mean?
– OS has enough tasks to schedule
Can profiling help?
20/64
@kuksenk0#JavaPerfAndHWC
CPU Utilization
∙ What does ∼100% CPU Utilization mean?
– OS has enough tasks to schedule
Can profiling help?
Profiler
⋆
shows «WHERE» application time is spent,
but there’s no answer to the question «WHY».
⋆
traditional profiler
20/64
@kuksenk0#JavaPerfAndHWC
Who/What is to blame?
Complex CPU microarchitecture:
∙ Inefficient algorithm ⇒ 100% CPU
∙ Pipeline stall due to memory load/stores ⇒ 100% CPU
∙ Pipeline flush due to mispredicted branch ⇒ 100% CPU
∙ Expensive instructions ⇒ 100% CPU
∙ Insufficient ILP
⋆
⇒ 100% CPU
∙ etc. ⇒ 100% CPU
⋆
Instruction Level Parallelism
21/64
@kuksenk0#JavaPerfAndHWC
Chapter 3
Hardware Counters
HWC, PMU - WTF?
22/64
@kuksenk0#JavaPerfAndHWC
PMU: Performance Monitoring Unit
Performance Monitoring Unit - profiles hardware activity,
built into CPU.
23/64
@kuksenk0#JavaPerfAndHWC
PMU: Performance Monitoring Unit
Performance Monitoring Unit - profiles hardware activity,
built into CPU.
PMU Internals (in less than 21 seconds) :
Hardware counters (HWC) count hardware performance events
(performance monitoring events)
23/64
@kuksenk0#JavaPerfAndHWC
Events
Vendor’s documentation! (e.g. 2 pages from 32)
24/64
@kuksenk0#JavaPerfAndHWC
Events: Issues
∙ Hundreds of events
∙ Microarchitectural experience is required
∙ Platform dependent
– vary from CPU vendor to vendor
– may vary when CPU manufacturer introduces new
microarchitecture
∙ How to work with HWC?
25/64
@kuksenk0#JavaPerfAndHWC
HWC
HWC modes:
∙ Counting mode
– if (event_happened) ++counter;
– general monitoring (answering «WHAT?» )
∙ Sampling mode
– if (event_happened)
if (++counter < threshold) INTERRUPT;
– profiling (answering «WHERE?»)
26/64
@kuksenk0#JavaPerfAndHWC
HWC: Issues
So many events, so few counters
e.g. “Nehalem”:
– over 900 events
– 7 HWC (3 fixed + 4 programmable)
∙ «multiple-running» (different events)
– Repeatability
∙ «multiplexing» (only a few tools are able to do that)
– Steady state
27/64
@kuksenk0#JavaPerfAndHWC
HWC: Issues (cont.)
Sampling mode:
∙ «instruction skid»
(hard to correlate event and instruction)
∙ Uncore events
(hard to bind event and execution thread;
e.g. shared L3 cache)
28/64
@kuksenk0#JavaPerfAndHWC
HWC: typical usages
∙ hardware validation
∙ performance analysis
29/64
@kuksenk0#JavaPerfAndHWC
HWC: typical usages
∙ hardware validation
∙ performance analysis
∙ run-time tuning (e.g. JRockit, etc.)
∙ security attacks and defenses
∙ test code coverage
∙ etc.
29/64
@kuksenk0#JavaPerfAndHWC
HWC: tools
∙ Oracle Solaris Studio Performance Analyzer
(http://www.oracle.com/technetwork/server-storage/solarisstudio)
∙ perf/perf_events (http://perf.wiki.kernel.org)
∙ JMH
(http://openjdk.java.net/projects/code-tools/jmh/)
-prof perf
-prof perfnorm = perf, normalized per operation
-prof perfasm = perf + -XX:+PrintAssembly
30/64
@kuksenk0#JavaPerfAndHWC
HWC: tools (cont.)
∙ AMD CodeXL
∙ Intel Vtune Amplifier XE
∙ etc. . .
31/64
@kuksenk0#JavaPerfAndHWC
perf events (e.g.)
∙ cycles
∙ instrustions
∙ cache-references
∙ cache-misses
∙ branches
∙ branch-misses
∙ bus-cycles
∙ ref-cycles
∙ dTLB-loads
∙ dTLB-load-misses
∙ L1-dcache-loads
∙ L1-dcache-load-misses
∙ L1-dcache-stores
∙ L1-dcache-store-misses
∙ LLC-loads
∙ etc...
32/64
@kuksenk0#JavaPerfAndHWC
Oracle Studio events (e.g.)
∙ cycles
∙ insts
∙ branch-instruction-retired
∙ branch-misses-retired
∙ dtlbm
∙ l1h
∙ l1m
∙ l2h
∙ l2m
∙ l3h
∙ l3m
∙ etc...
33/64
@kuksenk0#JavaPerfAndHWC
Chapter 4
I’ve got HWC data. What’s next?
or
Introduction to microarchitecture
performance analysis
34/64
@kuksenk0#JavaPerfAndHWC
Execution time
𝑡𝑖𝑚𝑒 = 𝑐𝑦𝑐𝑙𝑒𝑠
𝑓 𝑟𝑒𝑞𝑢𝑒𝑛𝑐𝑦
Optimization is. . .
reducing the (spent) cycle count!
⋆
⋆
everything else is overclocking
35/64
@kuksenk0#JavaPerfAndHWC
Microarchitecture Equation
𝑐𝑦𝑐𝑙𝑒𝑠 = 𝑃 𝑎𝑡ℎ𝐿𝑒𝑛𝑔𝑡ℎ * 𝐶𝑃 𝐼 = 𝑃 𝑎𝑡ℎ𝐿𝑒𝑛𝑔𝑡ℎ * 1
𝐼𝑃 𝐶
∙ PathLength - number of instructions
∙ CPI - cycles per instruction
∙ IPC - instructions per cycle
36/64
@kuksenk0#JavaPerfAndHWC
𝑃 𝑎𝑡ℎ𝐿𝑒𝑛𝑔𝑡ℎ * 𝐶𝑃 𝐼
∙ PathLength ∼ algorithm efficiency (the smaller the better)
∙ CPI ∼ CPU efficiency (the smaller the better)
– 𝐶𝑃 𝐼 = 4 – bad!
– 𝐶𝑃 𝐼 = 1
∙ Nehalem – just good enough!
∙ SandyBridge and later – not so good!
– 𝐶𝑃 𝐼 = 0.4 – good!
– 𝐶𝑃 𝐼 = 0.2 – ideal!
37/64
@kuksenk0#JavaPerfAndHWC
What to do?
∙ low CPI
– reduce PathLength → «tune algorithm»
∙ high CPI ⇒ stalls
– memory stalls → «tune data structures»
– branch stalls → «tune control logic»
– instruction dependency → «break dependency chains»
– long latency ops → «use more simple operations»
– etc. . .
38/64
@kuksenk0#JavaPerfAndHWC
High CPI: Memory bound
∙ dTLB misses
∙ L1,L2,L3,...,LN misses
∙ NUMA: non-local memory access
∙ memory bandwidth
∙ false/true sharing
∙ cache line split (not in Java world, except. . . )
∙ store forwarding (unlikely, hard to fix on Java level)
∙ 4k aliasing
39/64
@kuksenk0#JavaPerfAndHWC
High CPI: Core bound
∙ long latency operations: DIV, SQRT
∙ FP assist: floating points denormal, NaN, inf
∙ bad speculation (caused by mispredicted branch)
∙ port saturation (forget it)
40/64
@kuksenk0#JavaPerfAndHWC
High CPI: Front-End bound
∙ iTLB miss
∙ iCache miss
∙ branch mispredict
∙ LSD (loop stream decoder)
solvable by HotSpot tweaking
41/64
@kuksenk0#JavaPerfAndHWC @kuksenk0#JavaPerfAndHWC
Exam
p
les
@kuksenk0#JavaPerfAndHWC
Example 2
Some «large» standard server-side
Java benchmark
43/64
@kuksenk0#JavaPerfAndHWC
Example 2: perf stat -d
880851.993237 task -clock (msec)
39,318 context -switches
437 cpu -migrations
7,931 page -faults
2 ,277 ,063 ,113 ,376 cycles
1 ,226 ,299 ,634 ,877 instructions # 0.54 insns per cycle
229 ,500 ,265 ,931 branches # 260.544 M/sec
4 ,620 ,666 ,169 branch -misses # 2.01% of all branches
338 ,169 ,489 ,902 L1-dcache -loads # 383.912 M/sec
37 ,937 ,596 ,505 L1-dcache -load -misses # 11.22% of all L1-dcache hits
25 ,232 ,434 ,666 LLC -loads # 28.645 M/sec
7 ,307 ,884 ,874 L1-icache -load -misses # 0.00% of all L1 -icache hits
337 ,730 ,278 ,697 dTLB -loads # 382.846 M/sec
6 ,094 ,356 ,801 dTLB -load -misses # 1.81% of all dTLB cache hits
12 ,210 ,841 ,909 iTLB -loads # 13.863 M/sec
431 ,803 ,270 iTLB -load -misses # 3.54% of all iTLB cache hits
301.557213044 seconds time elapsed
44/64
@kuksenk0#JavaPerfAndHWC
Example 2: Processed perf stat -d
IPC 0.54
cycles ×109
2277
instructions ×109
1226
L1-dcache-loads ×109
338
L1-dcache-load-misses ×109
38
LLC-loads ×109
25
dTLB-loads ×109
338
dTLB-load-misses ×109
6
45/64
@kuksenk0#JavaPerfAndHWC
Example 2: Processed perf stat -d
IPC 0.54
cycles ×109
2277
instructions ×109
1226
L1-dcache-loads ×109
338
L1-dcache-load-misses ×109
38
LLC-loads ×109
25
dTLB-loads ×109
338
dTLB-load-misses ×109
6
Potential Issues?
45/64
@kuksenk0#JavaPerfAndHWC
Example 2: Processed perf stat -d
IPC 0.54
cycles ×109
2277
instructions ×109
1226
L1-dcache-loads ×109
338
L1-dcache-load-misses ×109
38
LLC-loads ×109
25
dTLB-loads ×109
338
dTLB-load-misses ×109
6
Let’s count:
4 × (338 × 109
)+
12 × ((38 − 25) × 109
)+
36 × (25 × 109
) ≈
2408 × 109
cycles ???
Potential Issues?
45/64
@kuksenk0#JavaPerfAndHWC
Example 2: Processed perf stat -d
IPC 0.54
cycles ×109
2277
instructions ×109
1226
L1-dcache-loads ×109
338
L1-dcache-load-misses ×109
38
LLC-loads ×109
25
dTLB-loads ×109
338
dTLB-load-misses ×109
6
Let’s count:
4 × (338 × 109
)+
12 × ((38 − 25) × 109
)+
36 × (25 × 109
) ≈
2408 × 109
cycles ???
Superscalar in Action!
45/64
@kuksenk0#JavaPerfAndHWC
Example 2: Processed perf stat -d
IPC 0.54
cycles ×109
2277
instructions ×109
1226
L1-dcache-loads ×109
338
L1-dcache-load-misses ×109
38
LLC-loads ×109
25
dTLB-loads ×109
338
dTLB-load-misses ×109
6
Issue!
45/64
@kuksenk0#JavaPerfAndHWC
Example 2: dTLB misses
TLB = Translation Lookaside Buffer
∙ a memory cache that stores recent translations of virtual
memory addresses to physical addresses
∙ each memory access → access to TLB
∙ TLB miss may take hundreds cycles
46/64
@kuksenk0#JavaPerfAndHWC
Example 2: dTLB misses
TLB = Translation Lookaside Buffer
∙ a memory cache that stores recent translations of virtual
memory addresses to physical addresses
∙ each memory access → access to TLB
∙ TLB miss may take hundreds cycles
How to check?
∙ dtlb_load_misses_miss_causes_a_walk
∙ dtlb_load_misses_walk_duration
46/64
@kuksenk0#JavaPerfAndHWC
Example 2: dTLB misses
IPC 0.54
cycles ×109
2277
instructions ×109
1226
L1-dcache-loads ×109
338
L1-dcache-load-misses ×109
38
LLC-loads ×109
25
dTLB-loads ×109
338
dTLB-load-misses ×109
6
dTLB-walks-duration
⋆
×109
296
13%of cycles (time)
⋆
in cycles
47/64
@kuksenk0#JavaPerfAndHWC
Example 2: dTLB misses
Fixing it:
∙ Enable -XX:+UseLargePages
∙ try to shrink application working set
48/64
@kuksenk0#JavaPerfAndHWC
Example 2: -XX:+UseLargePages
baseline large pages
IPC 0.54 0.64
cycles ×109
2277 2277
instructions ×109
1226 1460
L1-dcache-loads ×109
338 401
L1-dcache-load-misses ×109
38 38
LLC-loads ×109
25 25
dTLB-loads ×109
338 401
dTLB-load-misses ×109
6 0.24
dTLB-walks-duration ×109
296 2.6
Boost – 20%
49/64
@kuksenk0#JavaPerfAndHWC
Example 2: normalized per transaction
baseline large pages
IPC 0.54 0.64
cycles ×106
23.4 19.5
instructions ×106
12.5 12.5
L1-dcache-loads ×106
3.45 3.45
L1-dcache-load-misses ×106
0.39 0.33
LLC-loads ×106
0.26 0.21
dTLB-loads ×106
3.45 3.45
dTLB-load-misses ×106
0.06 0.002
dTLB-walks-duration ×106
3.04 0.022
50/64
@kuksenk0#JavaPerfAndHWC
Example 3
«A plague on both your houses»
«++ on both your threads»
51/64
@kuksenk0#JavaPerfAndHWC
Example 3: False Sharing
@State(Scope.Group)
public static class StateBaseline {
int field0;
int field1;
}
@Benchmark
@Group("baseline")
public int justDoIt(StateBaseline s) {
return s.field0 ++;
}
@Benchmark
@Group("baseline")
public int doItFromOtherThread(StateBaseline s) {
return s.field1 ++;
}
52/64
@kuksenk0#JavaPerfAndHWC
Example 3: Measure
resource
⋆
sharing padded
Same Core (HT) L1 9.5 4.9
Diff Cores (within socket) L3 (LLC) 10.6 2.8
Diff Sockets nothing 18.2 2.8
average time, ns/op
⋆
shared between threads
53/64
@kuksenk0#JavaPerfAndHWC
Example 3: What do HWC tell us?
Same Core Diff Cores Diff Sockets
sharing padded sharing padded sharing padded
CPI 1.3 0.7 1.4 0.4 1.7 0.4
cycles 33130 17536 36012 9163 46484 9608
instructions 26418 25865 26550 25747 26717 25768
L1-loads 12593 9467 9696 8973 9672 9016
L1-load-misses 10 5 12 4 33 3
L1-stores 4317 7838 7433 4069 6935 4074
L1-store-misses 5 2 161 2 55 1
LLС-loads 4 3 58 1 32 1
LLC-load-misses 1 1 53 ≈0 35 ≈0
LLC-stores 1 1 183 ≈0 49 ≈0
LLC-store-misses 1 ≈0 182 ≈0 48 ≈0
⋆
All values are normalized per 103
operations
54/64
@kuksenk0#JavaPerfAndHWC
Example 3: «on a core and a prayer»
in case of the single core(HT) we have to look into
MACHINE_CLEARS.MEMORY_ORDERING
Same Core Diff Cores Diff Sockets
sharing padded sharing padded sharing padded
CPI 1.3 0.7 1.4 0.4 1.7 0.4
cycles 33130 17536 36012 9163 46484 9608
instructions 26418 25865 26550 25747 26717 25768
CLEARS 238 ≈0 ≈0 ≈0 ≈0 ≈0
55/64
@kuksenk0#JavaPerfAndHWC
Example 3: Diff Cores
L2_STORE_LOCK_RQSTS - L2 RFOs breakdown
Diff Cores Diff Sockets
sharing padded sharing padded
CPI 1.4 0.4 1.7 0.4
LLC-stores 183 ≈0 49 ≈0
LLC-store-misses 182 ≈0 48 ≈0
L2_STORE_LOCK_RQSTS.MISS 134 ≈0 33 ≈0
L2_STORE_LOCK_RQSTS.HIT_E ≈0 ≈0 ≈0 ≈0
L2_STORE_LOCK_RQSTS.HIT_M ≈0 ≈0 ≈0 ≈0
L2_STORE_LOCK_RQSTS.ALL 183 ≈0 49 ≈0
56/64
@kuksenk0#JavaPerfAndHWC
Question!
To the audience
57/64
@kuksenk0#JavaPerfAndHWC
Example 3: Diff Cores
Diff Cores Diff Sockets
sharing padded sharing padded
CPI 1.4 0.4 1.7 0.4
LLC-stores 183 ≈0 49 ≈0
LLC-store-misses 182 ≈0 48 ≈0
L2_STORE_LOCK_RQSTS.MISS 134 ≈0 33 ≈0
L2_STORE_LOCK_RQSTS.ALL 183 ≈0 49 ≈0
Why 183 > 49 & 134 > 33,
but the same socket case is faster?
58/64
@kuksenk0#JavaPerfAndHWC
Example 3: Some events count duration
For example:
OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO
Diff Cores Diff Sockets
sharing padded sharing padded
CPI 1.4 0.4 1.7 0.4
cycles 36012 9163 46484 9608
instructions 26550 25747 26717 25768
O_R_O.CYCLES_W_D_RFO 21723 10 29601 56
59/64
@kuksenk0#JavaPerfAndHWC @kuksenk0#JavaPerfAndHWC
S
um
m
ary
@kuksenk0#JavaPerfAndHWC
Summary: Performance is easy
To achieve high performance:
∙ You have to know your Application!
∙ You have to know your Frameworks!
∙ You have to know your Virtual Machine!
∙ You have to know your Operating System!
∙ You have to know your Hardware!
61/64
@kuksenk0#JavaPerfAndHWC
Summary: «Here be dragons»
To achieve high performance:
∙ You have to know your Application!
∙ You have to know your Frameworks!
∙ You have to know your Virtual Machine!
∙ You have to know your Operating System!
∙ You have to know your Hardware!
61/64
@kuksenk0#JavaPerfAndHWC
Enlarge your knowledge with these simple tricks!
Reading list:
∙ “Computer Architecture: A Quantitative Approach”
John L. Hennessy, David A. Patterson
∙ CPU vendors documentation
∙ http://www.agner.org/optimize/
∙ http://www.google.com/search?q=Hardware+
performance+counter
∙ etc. . .
62/64
@kuksenk0#JavaPerfAndHWC @kuksenk0#JavaPerfAndHWC
Th
an
k
you
!
@kuksenk0#JavaPerfAndHWC @kuksenk0#JavaPerfAndHWC
Q
&
A

Más contenido relacionado

La actualidad más candente

Threaded Programming
Threaded ProgrammingThreaded Programming
Threaded ProgrammingSri Prasanna
 
Bridge TensorFlow to run on Intel nGraph backends (v0.4)
Bridge TensorFlow to run on Intel nGraph backends (v0.4)Bridge TensorFlow to run on Intel nGraph backends (v0.4)
Bridge TensorFlow to run on Intel nGraph backends (v0.4)Mr. Vengineer
 
Bridge TensorFlow to run on Intel nGraph backends (v0.5)
Bridge TensorFlow to run on Intel nGraph backends (v0.5)Bridge TensorFlow to run on Intel nGraph backends (v0.5)
Bridge TensorFlow to run on Intel nGraph backends (v0.5)Mr. Vengineer
 
Concurrency Concepts in Java
Concurrency Concepts in JavaConcurrency Concepts in Java
Concurrency Concepts in JavaDoug Hawkins
 
Verification of Concurrent and Distributed Systems
Verification of Concurrent and Distributed SystemsVerification of Concurrent and Distributed Systems
Verification of Concurrent and Distributed SystemsMykola Novik
 
Troubleshooting Linux Kernel Modules And Device Drivers
Troubleshooting Linux Kernel Modules And Device DriversTroubleshooting Linux Kernel Modules And Device Drivers
Troubleshooting Linux Kernel Modules And Device DriversSatpal Parmar
 
Global Interpreter Lock: Episode I - Break the Seal
Global Interpreter Lock: Episode I - Break the SealGlobal Interpreter Lock: Episode I - Break the Seal
Global Interpreter Lock: Episode I - Break the SealTzung-Bi Shih
 
Introduction to Debuggers
Introduction to DebuggersIntroduction to Debuggers
Introduction to DebuggersSaumil Shah
 
No instrumentation Golang Logging with eBPF (GoSF talk 11/11/20)
No instrumentation Golang Logging with eBPF (GoSF talk 11/11/20)No instrumentation Golang Logging with eBPF (GoSF talk 11/11/20)
No instrumentation Golang Logging with eBPF (GoSF talk 11/11/20)Pixie Labs
 
Non-blocking synchronization — what is it and why we (don't?) need it
Non-blocking synchronization — what is it and why we (don't?) need itNon-blocking synchronization — what is it and why we (don't?) need it
Non-blocking synchronization — what is it and why we (don't?) need itAlexey Fyodorov
 
Machine Learning Model Bakeoff
Machine Learning Model BakeoffMachine Learning Model Bakeoff
Machine Learning Model Bakeoffmrphilroth
 
Operating Systems - A Primer
Operating Systems - A PrimerOperating Systems - A Primer
Operating Systems - A PrimerSaumil Shah
 
Kernel Recipes 2013 - Automating source code evolutions using Coccinelle
Kernel Recipes 2013 - Automating source code evolutions using CoccinelleKernel Recipes 2013 - Automating source code evolutions using Coccinelle
Kernel Recipes 2013 - Automating source code evolutions using CoccinelleAnne Nicolas
 
Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUs
Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUsEarly Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUs
Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUsJeff Larkin
 
Production Time Profiling and Diagnostics on the JVM
Production Time Profiling and Diagnostics on the JVMProduction Time Profiling and Diagnostics on the JVM
Production Time Profiling and Diagnostics on the JVMMarcus Hirt
 
Eric Lafortune - The Jack and Jill build system
Eric Lafortune - The Jack and Jill build systemEric Lafortune - The Jack and Jill build system
Eric Lafortune - The Jack and Jill build systemGuardSquare
 

La actualidad más candente (20)

Threaded Programming
Threaded ProgrammingThreaded Programming
Threaded Programming
 
Bridge TensorFlow to run on Intel nGraph backends (v0.4)
Bridge TensorFlow to run on Intel nGraph backends (v0.4)Bridge TensorFlow to run on Intel nGraph backends (v0.4)
Bridge TensorFlow to run on Intel nGraph backends (v0.4)
 
Joel Falcou, Boost.SIMD
Joel Falcou, Boost.SIMDJoel Falcou, Boost.SIMD
Joel Falcou, Boost.SIMD
 
Bridge TensorFlow to run on Intel nGraph backends (v0.5)
Bridge TensorFlow to run on Intel nGraph backends (v0.5)Bridge TensorFlow to run on Intel nGraph backends (v0.5)
Bridge TensorFlow to run on Intel nGraph backends (v0.5)
 
Concurrency Concepts in Java
Concurrency Concepts in JavaConcurrency Concepts in Java
Concurrency Concepts in Java
 
Verification of Concurrent and Distributed Systems
Verification of Concurrent and Distributed SystemsVerification of Concurrent and Distributed Systems
Verification of Concurrent and Distributed Systems
 
Troubleshooting Linux Kernel Modules And Device Drivers
Troubleshooting Linux Kernel Modules And Device DriversTroubleshooting Linux Kernel Modules And Device Drivers
Troubleshooting Linux Kernel Modules And Device Drivers
 
Global Interpreter Lock: Episode I - Break the Seal
Global Interpreter Lock: Episode I - Break the SealGlobal Interpreter Lock: Episode I - Break the Seal
Global Interpreter Lock: Episode I - Break the Seal
 
Introduction to Debuggers
Introduction to DebuggersIntroduction to Debuggers
Introduction to Debuggers
 
TVM VTA (TSIM)
TVM VTA (TSIM) TVM VTA (TSIM)
TVM VTA (TSIM)
 
Return Oriented Programming (ROP) Based Exploits - Part I
Return Oriented Programming  (ROP) Based Exploits  - Part IReturn Oriented Programming  (ROP) Based Exploits  - Part I
Return Oriented Programming (ROP) Based Exploits - Part I
 
No instrumentation Golang Logging with eBPF (GoSF talk 11/11/20)
No instrumentation Golang Logging with eBPF (GoSF talk 11/11/20)No instrumentation Golang Logging with eBPF (GoSF talk 11/11/20)
No instrumentation Golang Logging with eBPF (GoSF talk 11/11/20)
 
Non-blocking synchronization — what is it and why we (don't?) need it
Non-blocking synchronization — what is it and why we (don't?) need itNon-blocking synchronization — what is it and why we (don't?) need it
Non-blocking synchronization — what is it and why we (don't?) need it
 
Ember
EmberEmber
Ember
 
Machine Learning Model Bakeoff
Machine Learning Model BakeoffMachine Learning Model Bakeoff
Machine Learning Model Bakeoff
 
Operating Systems - A Primer
Operating Systems - A PrimerOperating Systems - A Primer
Operating Systems - A Primer
 
Kernel Recipes 2013 - Automating source code evolutions using Coccinelle
Kernel Recipes 2013 - Automating source code evolutions using CoccinelleKernel Recipes 2013 - Automating source code evolutions using Coccinelle
Kernel Recipes 2013 - Automating source code evolutions using Coccinelle
 
Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUs
Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUsEarly Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUs
Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUs
 
Production Time Profiling and Diagnostics on the JVM
Production Time Profiling and Diagnostics on the JVMProduction Time Profiling and Diagnostics on the JVM
Production Time Profiling and Diagnostics on the JVM
 
Eric Lafortune - The Jack and Jill build system
Eric Lafortune - The Jack and Jill build systemEric Lafortune - The Jack and Jill build system
Eric Lafortune - The Jack and Jill build system
 

Similar a Java Performance: Speedup your application with hardware counters

Inside the JVM - Follow the white rabbit!
Inside the JVM - Follow the white rabbit!Inside the JVM - Follow the white rabbit!
Inside the JVM - Follow the white rabbit!Sylvain Wallez
 
Inside the JVM - Follow the white rabbit! / Breizh JUG
Inside the JVM - Follow the white rabbit! / Breizh JUGInside the JVM - Follow the white rabbit! / Breizh JUG
Inside the JVM - Follow the white rabbit! / Breizh JUGSylvain Wallez
 
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...Databricks
 
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORSAFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORScscpconf
 
Affect of parallel computing on multicore processors
Affect of parallel computing on multicore processorsAffect of parallel computing on multicore processors
Affect of parallel computing on multicore processorscsandit
 
Performance Tuning Oracle Weblogic Server 12c
Performance Tuning Oracle Weblogic Server 12cPerformance Tuning Oracle Weblogic Server 12c
Performance Tuning Oracle Weblogic Server 12cAjith Narayanan
 
HeapStats: Troubleshooting with Serviceability and the New Runtime Monitoring...
HeapStats: Troubleshooting with Serviceability and the New Runtime Monitoring...HeapStats: Troubleshooting with Serviceability and the New Runtime Monitoring...
HeapStats: Troubleshooting with Serviceability and the New Runtime Monitoring...Yuji Kubota
 
Sparklens: Understanding the Scalability Limits of Spark Applications with R...
 Sparklens: Understanding the Scalability Limits of Spark Applications with R... Sparklens: Understanding the Scalability Limits of Spark Applications with R...
Sparklens: Understanding the Scalability Limits of Spark Applications with R...Databricks
 
DARPA CGC and DEFCON CTF: Automatic Attack and Defense Technique
DARPA CGC and DEFCON CTF: Automatic Attack and Defense TechniqueDARPA CGC and DEFCON CTF: Automatic Attack and Defense Technique
DARPA CGC and DEFCON CTF: Automatic Attack and Defense TechniqueChong-Kuan Chen
 
YOW2020 Linux Systems Performance
YOW2020 Linux Systems PerformanceYOW2020 Linux Systems Performance
YOW2020 Linux Systems PerformanceBrendan Gregg
 
OpenACC Monthly Highlights: September 2021
OpenACC Monthly Highlights: September 2021OpenACC Monthly Highlights: September 2021
OpenACC Monthly Highlights: September 2021OpenACC
 
Introduction to Java Profiling
Introduction to Java ProfilingIntroduction to Java Profiling
Introduction to Java ProfilingJerry Yoakum
 
Deep learning with Keras
Deep learning with KerasDeep learning with Keras
Deep learning with KerasQuantUniversity
 
The power of linux advanced tracer [POUG18]
The power of linux advanced tracer [POUG18]The power of linux advanced tracer [POUG18]
The power of linux advanced tracer [POUG18]Mahmoud Hatem
 
Reactive programming with Rxjava
Reactive programming with RxjavaReactive programming with Rxjava
Reactive programming with RxjavaChristophe Marchal
 
Introduction to FPGA acceleration
Introduction to FPGA accelerationIntroduction to FPGA acceleration
Introduction to FPGA accelerationMarco77328
 
Pragmatic Optimization in Modern Programming - Ordering Optimization Approaches
Pragmatic Optimization in Modern Programming - Ordering Optimization ApproachesPragmatic Optimization in Modern Programming - Ordering Optimization Approaches
Pragmatic Optimization in Modern Programming - Ordering Optimization ApproachesMarina Kolpakova
 
Parallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMPParallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMPAnil Bohare
 
Java In-Process Caching - Performance, Progress and Pittfalls
Java In-Process Caching - Performance, Progress and PittfallsJava In-Process Caching - Performance, Progress and Pittfalls
Java In-Process Caching - Performance, Progress and Pittfallscruftex
 

Similar a Java Performance: Speedup your application with hardware counters (20)

Inside the JVM - Follow the white rabbit!
Inside the JVM - Follow the white rabbit!Inside the JVM - Follow the white rabbit!
Inside the JVM - Follow the white rabbit!
 
Inside the JVM - Follow the white rabbit! / Breizh JUG
Inside the JVM - Follow the white rabbit! / Breizh JUGInside the JVM - Follow the white rabbit! / Breizh JUG
Inside the JVM - Follow the white rabbit! / Breizh JUG
 
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
 
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORSAFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
 
Affect of parallel computing on multicore processors
Affect of parallel computing on multicore processorsAffect of parallel computing on multicore processors
Affect of parallel computing on multicore processors
 
Performance Tuning Oracle Weblogic Server 12c
Performance Tuning Oracle Weblogic Server 12cPerformance Tuning Oracle Weblogic Server 12c
Performance Tuning Oracle Weblogic Server 12c
 
HeapStats: Troubleshooting with Serviceability and the New Runtime Monitoring...
HeapStats: Troubleshooting with Serviceability and the New Runtime Monitoring...HeapStats: Troubleshooting with Serviceability and the New Runtime Monitoring...
HeapStats: Troubleshooting with Serviceability and the New Runtime Monitoring...
 
Sparklens: Understanding the Scalability Limits of Spark Applications with R...
 Sparklens: Understanding the Scalability Limits of Spark Applications with R... Sparklens: Understanding the Scalability Limits of Spark Applications with R...
Sparklens: Understanding the Scalability Limits of Spark Applications with R...
 
DARPA CGC and DEFCON CTF: Automatic Attack and Defense Technique
DARPA CGC and DEFCON CTF: Automatic Attack and Defense TechniqueDARPA CGC and DEFCON CTF: Automatic Attack and Defense Technique
DARPA CGC and DEFCON CTF: Automatic Attack and Defense Technique
 
Using AWR for SQL Analysis
Using AWR for SQL AnalysisUsing AWR for SQL Analysis
Using AWR for SQL Analysis
 
YOW2020 Linux Systems Performance
YOW2020 Linux Systems PerformanceYOW2020 Linux Systems Performance
YOW2020 Linux Systems Performance
 
OpenACC Monthly Highlights: September 2021
OpenACC Monthly Highlights: September 2021OpenACC Monthly Highlights: September 2021
OpenACC Monthly Highlights: September 2021
 
Introduction to Java Profiling
Introduction to Java ProfilingIntroduction to Java Profiling
Introduction to Java Profiling
 
Deep learning with Keras
Deep learning with KerasDeep learning with Keras
Deep learning with Keras
 
The power of linux advanced tracer [POUG18]
The power of linux advanced tracer [POUG18]The power of linux advanced tracer [POUG18]
The power of linux advanced tracer [POUG18]
 
Reactive programming with Rxjava
Reactive programming with RxjavaReactive programming with Rxjava
Reactive programming with Rxjava
 
Introduction to FPGA acceleration
Introduction to FPGA accelerationIntroduction to FPGA acceleration
Introduction to FPGA acceleration
 
Pragmatic Optimization in Modern Programming - Ordering Optimization Approaches
Pragmatic Optimization in Modern Programming - Ordering Optimization ApproachesPragmatic Optimization in Modern Programming - Ordering Optimization Approaches
Pragmatic Optimization in Modern Programming - Ordering Optimization Approaches
 
Parallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMPParallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMP
 
Java In-Process Caching - Performance, Progress and Pittfalls
Java In-Process Caching - Performance, Progress and PittfallsJava In-Process Caching - Performance, Progress and Pittfalls
Java In-Process Caching - Performance, Progress and Pittfalls
 

Último

Active Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfActive Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfCionsystems
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)OPEN KNOWLEDGE GmbH
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...OnePlan Solutions
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataBradBedford3
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsAndolasoft Inc
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️anilsa9823
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationkaushalgiri8080
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfjoe51371421
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideChristina Lin
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about usDynamic Netsoft
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AIABDERRAOUF MEHENNI
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...soniya singh
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 

Último (20)

Active Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfActive Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdf
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanation
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about us
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 

Java Performance: Speedup your application with hardware counters

  • 1. @kuksenk0#JavaPerfAndHWC Java Performance: Speedup your applications with hardware counters Sergey Kuksenko @kuksenk0
  • 2. @kuksenk0#JavaPerfAndHWC The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle. 2/64
  • 3. @kuksenk0#JavaPerfAndHWC Intriguing Introductory Example Gotcha, here is my hot method or On Algorithmic O⋆ ptimizations ⋆ just-O 3/64
  • 4. @kuksenk0#JavaPerfAndHWC Example 1: Very Hot Code public Matrix multiply(Matrix other) { int size = data.length; Matrix result = new Matrix(size); int [][] R = result.data; int [][] A = this.data; int [][] B = other.data; for (int i = 0; i < size; i++) { for (int j = 0; j < size; j++) { int s = 0; for (int k = 0; k < size; k++) { s += A[i][k] * B[k][j]; } R[i][j] = s; } } return result; } 4/64
  • 5. @kuksenk0#JavaPerfAndHWC Example 1: Very Hot Code public Matrix multiply(Matrix other) { int size = data.length; Matrix result = new Matrix(size); int [][] R = result.data; int [][] A = this.data; int [][] B = other.data; for (int i = 0; i < size; i++) { for (int j = 0; j < size; j++) { int s = 0; for (int k = 0; k < size; k++) { s += A[i][k] * B[k][j]; } R[i][j] = s; } } return result; } 𝑂( 𝑁3)⋆ ⋆ big-O 4/64
  • 6. @kuksenk0#JavaPerfAndHWC Example 1: "Ok Google" 𝑂( 𝑁2.81) 5/64
  • 7. @kuksenk0#JavaPerfAndHWC Example 1: Very Hot Code N multiply strassen 32 33 𝜇𝑠 64 252 𝜇𝑠 128 3019 𝜇𝑠 2119 𝜇𝑠 256 56 𝑚𝑠 15 𝑚𝑠 512 488 𝑚𝑠 111 𝑚𝑠 1024 3270 𝑚𝑠 785 𝑚𝑠 2048 32 𝑠 6 𝑠 4096 309 𝑠 42 𝑠 8192 2611 𝑠 293 𝑠 time/op 6/64
  • 9. @kuksenk0#JavaPerfAndHWC Example 1: JMH, -prof perfnorm, N=256⋆ per multiplication per iteration CPI 1.06 cycles 146 × 106 8.7 instructions 137 × 106 8.2 L1-loads 68 × 106 4 L1-load-misses 33 × 106 2 L1-stores 0.56 × 106 L1-store-misses 38357 LLC-loads 26 × 106 1.6 LLC-stores 11595 ⋆ ∼ 256Kb per matrix 8/64
  • 10. @kuksenk0#JavaPerfAndHWC Example 1: Very Hot Code public Matrix multiply(Matrix other) { int size = data.length; Matrix result = new Matrix(size); int [][] R = result.data; int [][] A = this.data; int [][] B = other.data; for (int i = 0; i < size; i++) { for (int j = 0; j < size; j++) { int s = 0; for (int k = 0; k < size; k++) { s += A[i][k] * B[k][j] ; } R[i][j] = s; } } return result; } L1-load-misses 9/64
  • 11. @kuksenk0#JavaPerfAndHWC Example 1: Very Hot Code public Matrix multiply(Matrix other) { int size = data.length; Matrix result = new Matrix(size); int [][] R = result.data; int [][] A = this.data; int [][] B = other.data; for (int i = 0; i < size; i++) { for (int j = 0; j < size; j++) { int s = 0; for (int k = 0; k < size; k++) { s += A[i][k] * B[k][j] ; } R[i][j] = s; } } return result; } 9/64
  • 12. @kuksenk0#JavaPerfAndHWC Example 1: Very Hot Code public Matrix multiplyIKJ(Matrix other) { int size = data.length; Matrix result = new Matrix(size); int [][] R = result.data; int [][] A = this.data; int [][] B = other.data; for (int i = 0; i < size; i++) { for (int k = 0; k < size; k++) { int aik = A[i][k]; for (int j = 0; j < size; j++) { R[i][j] += aik * B[k][j]; } } } return result; } 10/64
  • 13. @kuksenk0#JavaPerfAndHWC Example 1: Very Hot Code N multIJK strassenIJK multIKJ 32 33 𝜇𝑠 13 𝜇𝑠 64 252 𝜇𝑠 80 𝜇𝑠 128 3019 𝜇𝑠 2119 𝜇𝑠 464 𝜇𝑠 256 56 𝑚𝑠 15 𝑚𝑠 4 𝑚𝑠 512 488 𝑚𝑠 111 𝑚𝑠 29 𝑚𝑠 1024 3270 𝑚𝑠 785 𝑚𝑠 369 𝑚𝑠 2048 32 𝑠 6 𝑠 3.4 𝑠 4096 309 𝑠 42 𝑠 25 𝑠 8192 2611 𝑠 293 𝑠 210 𝑠 time/op 11/64
  • 14. @kuksenk0#JavaPerfAndHWC Example 1: Very Hot Code N multIJK strassenIJK multIKJ strassenIKJ 32 33 𝜇𝑠 13 𝜇𝑠 64 252 𝜇𝑠 80 𝜇𝑠 128 3019 𝜇𝑠 2119 𝜇𝑠 464 𝜇𝑠 256 56 𝑚𝑠 15 𝑚𝑠 4 𝑚𝑠 512 488 𝑚𝑠 111 𝑚𝑠 29 𝑚𝑠 25 𝑚𝑠 1024 3270 𝑚𝑠 785 𝑚𝑠 369 𝑚𝑠 192 𝑚𝑠 2048 32 𝑠 6 𝑠 3.4 𝑠 1.4 𝑠 4096 309 𝑠 42 𝑠 25 𝑠 10 𝑠 8192 2611 𝑠 293 𝑠 210 𝑠 73 𝑠 time/op 11/64
  • 15. @kuksenk0#JavaPerfAndHWC Example 1: JMH, -prof perfnorm, N=256 IJK IKJ multiply iteration multiply iteration CPI 1.06 0.51 cycles 146 × 106 8.7 9.7 × 106 0.6 instructions 137 × 106 8.2 19 × 106 1.1 L1-loads 68 × 106 4 5.4 × 106 0.3 L1-load-misses 33 × 106 2 1.1 × 106 0.1 L1-stores 0.56 × 106 2.7 × 106 0.2 L1-store-misses 38357 8959 LLC-loads 26 × 106 1.6 0.3 × 106 LLC-stores 11595 3532 12/64
  • 16. @kuksenk0#JavaPerfAndHWC Example 1: Free beer benefits! cycles insts ... 6.42% 5.54% 0x00007ff6591d20c5: vmovdqu %ymm1 ,0x10(%r14 ,%rbx ,4) ;* iastore 9.49% 11.91% 0x00007ff6591d20cc: add $0x8 ,%ebx ;*iinc 0.15% 0.05% 0x00007ff6591d20cf: cmp %r11d ,%ebx 0x00007ff6591d20d2: jl 0x00007ff6591d20b2 ;* if_icmpge ... 13/64
  • 17. @kuksenk0#JavaPerfAndHWC Example 1: Free beer benefits! cycles insts ... 6.42% 5.54% 0x00007ff6591d20c5: vmovdqu %ymm1 ,0x10(%r14 ,%rbx ,4) ;* iastore 9.49% 11.91% 0x00007ff6591d20cc: add $0x8 ,%ebx ;*iinc 0.15% 0.05% 0x00007ff6591d20cf: cmp %r11d ,%ebx 0x00007ff6591d20d2: jl 0x00007ff6591d20b2 ;* if_icmpge ... Vectorization (SSE/AVX)! 13/64
  • 18. @kuksenk0#JavaPerfAndHWC Chapter 1 Performance Optimization Methodology: A Very Short Introduction. Three magic questions and the direction of the journey 14/64
  • 20. @kuksenk0#JavaPerfAndHWC Three magic questions What? ∙ What prevents my application to work faster? (monitoring) 15/64
  • 21. @kuksenk0#JavaPerfAndHWC Three magic questions What? ⇒ Where? ∙ What prevents my application to work faster? (monitoring) ∙ Where does it hide? (profiling) 15/64
  • 22. @kuksenk0#JavaPerfAndHWC Three magic questions What? ⇒ Where? ⇒ How? ∙ What prevents my application to work faster? (monitoring) ∙ Where does it hide? (profiling) ∙ How to stop it messing with performance? (tuning/optimizing) 15/64
  • 23. @kuksenk0#JavaPerfAndHWC Top-Down Approach ∙ System Level – Network, Disk, OS, CPU/Memory ∙ JVM Level – GC/Heap, JIT, Classloading ∙ Application Level – Algorithms, Synchronization, Threading, API ∙ Microarchitecture Level – Caches, Data/Code alignment, CPU Pipeline Stalls 16/64
  • 26. @kuksenk0#JavaPerfAndHWC Let’s go (Top-Down) Do system monitoring (e.g. mpstat) 18/64
  • 27. @kuksenk0#JavaPerfAndHWC Let’s go (Top-Down) Do system monitoring (e.g. mpstat) ∙ Lots of %sys ⇒ . . . ⇓ 18/64
  • 28. @kuksenk0#JavaPerfAndHWC Let’s go (Top-Down) Do system monitoring (e.g. mpstat) ∙ Lots of %sys ⇒ . . . ⇓ ∙ Lots of %irq, %soft ⇒ . . . ⇓ 18/64
  • 29. @kuksenk0#JavaPerfAndHWC Let’s go (Top-Down) Do system monitoring (e.g. mpstat) ∙ Lots of %sys ⇒ . . . ⇓ ∙ Lots of %irq, %soft ⇒ . . . ⇓ ∙ Lots of %iowait ⇒ . . . ⇓ 18/64
  • 30. @kuksenk0#JavaPerfAndHWC Let’s go (Top-Down) Do system monitoring (e.g. mpstat) ∙ Lots of %sys ⇒ . . . ⇓ ∙ Lots of %irq, %soft ⇒ . . . ⇓ ∙ Lots of %iowait ⇒ . . . ⇓ ∙ Lots of %idle ⇒ . . . ⇓ 18/64
  • 31. @kuksenk0#JavaPerfAndHWC Let’s go (Top-Down) Do system monitoring (e.g. mpstat) ∙ Lots of %sys ⇒ . . . ⇓ ∙ Lots of %irq, %soft ⇒ . . . ⇓ ∙ Lots of %iowait ⇒ . . . ⇓ ∙ Lots of %idle ⇒ . . . ⇓ ∙ Lots of %user 18/64
  • 32. @kuksenk0#JavaPerfAndHWC Chapter 2 High CPU Load and the main question: «who/what is to blame?» 19/64
  • 33. @kuksenk0#JavaPerfAndHWC CPU Utilization ∙ What does ∼100% CPU Utilization mean? 20/64
  • 34. @kuksenk0#JavaPerfAndHWC CPU Utilization ∙ What does ∼100% CPU Utilization mean? – OS has enough tasks to schedule 20/64
  • 35. @kuksenk0#JavaPerfAndHWC CPU Utilization ∙ What does ∼100% CPU Utilization mean? – OS has enough tasks to schedule Can profiling help? 20/64
  • 36. @kuksenk0#JavaPerfAndHWC CPU Utilization ∙ What does ∼100% CPU Utilization mean? – OS has enough tasks to schedule Can profiling help? Profiler ⋆ shows «WHERE» application time is spent, but there’s no answer to the question «WHY». ⋆ traditional profiler 20/64
  • 37. @kuksenk0#JavaPerfAndHWC Who/What is to blame? Complex CPU microarchitecture: ∙ Inefficient algorithm ⇒ 100% CPU ∙ Pipeline stall due to memory load/stores ⇒ 100% CPU ∙ Pipeline flush due to mispredicted branch ⇒ 100% CPU ∙ Expensive instructions ⇒ 100% CPU ∙ Insufficient ILP ⋆ ⇒ 100% CPU ∙ etc. ⇒ 100% CPU ⋆ Instruction Level Parallelism 21/64
  • 39. @kuksenk0#JavaPerfAndHWC PMU: Performance Monitoring Unit Performance Monitoring Unit - profiles hardware activity, built into CPU. 23/64
  • 40. @kuksenk0#JavaPerfAndHWC PMU: Performance Monitoring Unit Performance Monitoring Unit - profiles hardware activity, built into CPU. PMU Internals (in less than 21 seconds) : Hardware counters (HWC) count hardware performance events (performance monitoring events) 23/64
  • 42. @kuksenk0#JavaPerfAndHWC Events: Issues ∙ Hundreds of events ∙ Microarchitectural experience is required ∙ Platform dependent – vary from CPU vendor to vendor – may vary when CPU manufacturer introduces new microarchitecture ∙ How to work with HWC? 25/64
  • 43. @kuksenk0#JavaPerfAndHWC HWC HWC modes: ∙ Counting mode – if (event_happened) ++counter; – general monitoring (answering «WHAT?» ) ∙ Sampling mode – if (event_happened) if (++counter < threshold) INTERRUPT; – profiling (answering «WHERE?») 26/64
  • 44. @kuksenk0#JavaPerfAndHWC HWC: Issues So many events, so few counters e.g. “Nehalem”: – over 900 events – 7 HWC (3 fixed + 4 programmable) ∙ «multiple-running» (different events) – Repeatability ∙ «multiplexing» (only a few tools are able to do that) – Steady state 27/64
  • 45. @kuksenk0#JavaPerfAndHWC HWC: Issues (cont.) Sampling mode: ∙ «instruction skid» (hard to correlate event and instruction) ∙ Uncore events (hard to bind event and execution thread; e.g. shared L3 cache) 28/64
  • 46. @kuksenk0#JavaPerfAndHWC HWC: typical usages ∙ hardware validation ∙ performance analysis 29/64
  • 47. @kuksenk0#JavaPerfAndHWC HWC: typical usages ∙ hardware validation ∙ performance analysis ∙ run-time tuning (e.g. JRockit, etc.) ∙ security attacks and defenses ∙ test code coverage ∙ etc. 29/64
  • 48. @kuksenk0#JavaPerfAndHWC HWC: tools ∙ Oracle Solaris Studio Performance Analyzer (http://www.oracle.com/technetwork/server-storage/solarisstudio) ∙ perf/perf_events (http://perf.wiki.kernel.org) ∙ JMH (http://openjdk.java.net/projects/code-tools/jmh/) -prof perf -prof perfnorm = perf, normalized per operation -prof perfasm = perf + -XX:+PrintAssembly 30/64
  • 49. @kuksenk0#JavaPerfAndHWC HWC: tools (cont.) ∙ AMD CodeXL ∙ Intel Vtune Amplifier XE ∙ etc. . . 31/64
  • 50. @kuksenk0#JavaPerfAndHWC perf events (e.g.) ∙ cycles ∙ instrustions ∙ cache-references ∙ cache-misses ∙ branches ∙ branch-misses ∙ bus-cycles ∙ ref-cycles ∙ dTLB-loads ∙ dTLB-load-misses ∙ L1-dcache-loads ∙ L1-dcache-load-misses ∙ L1-dcache-stores ∙ L1-dcache-store-misses ∙ LLC-loads ∙ etc... 32/64
  • 51. @kuksenk0#JavaPerfAndHWC Oracle Studio events (e.g.) ∙ cycles ∙ insts ∙ branch-instruction-retired ∙ branch-misses-retired ∙ dtlbm ∙ l1h ∙ l1m ∙ l2h ∙ l2m ∙ l3h ∙ l3m ∙ etc... 33/64
  • 52. @kuksenk0#JavaPerfAndHWC Chapter 4 I’ve got HWC data. What’s next? or Introduction to microarchitecture performance analysis 34/64
  • 53. @kuksenk0#JavaPerfAndHWC Execution time 𝑡𝑖𝑚𝑒 = 𝑐𝑦𝑐𝑙𝑒𝑠 𝑓 𝑟𝑒𝑞𝑢𝑒𝑛𝑐𝑦 Optimization is. . . reducing the (spent) cycle count! ⋆ ⋆ everything else is overclocking 35/64
  • 54. @kuksenk0#JavaPerfAndHWC Microarchitecture Equation 𝑐𝑦𝑐𝑙𝑒𝑠 = 𝑃 𝑎𝑡ℎ𝐿𝑒𝑛𝑔𝑡ℎ * 𝐶𝑃 𝐼 = 𝑃 𝑎𝑡ℎ𝐿𝑒𝑛𝑔𝑡ℎ * 1 𝐼𝑃 𝐶 ∙ PathLength - number of instructions ∙ CPI - cycles per instruction ∙ IPC - instructions per cycle 36/64
  • 55. @kuksenk0#JavaPerfAndHWC 𝑃 𝑎𝑡ℎ𝐿𝑒𝑛𝑔𝑡ℎ * 𝐶𝑃 𝐼 ∙ PathLength ∼ algorithm efficiency (the smaller the better) ∙ CPI ∼ CPU efficiency (the smaller the better) – 𝐶𝑃 𝐼 = 4 – bad! – 𝐶𝑃 𝐼 = 1 ∙ Nehalem – just good enough! ∙ SandyBridge and later – not so good! – 𝐶𝑃 𝐼 = 0.4 – good! – 𝐶𝑃 𝐼 = 0.2 – ideal! 37/64
  • 56. @kuksenk0#JavaPerfAndHWC What to do? ∙ low CPI – reduce PathLength → «tune algorithm» ∙ high CPI ⇒ stalls – memory stalls → «tune data structures» – branch stalls → «tune control logic» – instruction dependency → «break dependency chains» – long latency ops → «use more simple operations» – etc. . . 38/64
  • 57. @kuksenk0#JavaPerfAndHWC High CPI: Memory bound ∙ dTLB misses ∙ L1,L2,L3,...,LN misses ∙ NUMA: non-local memory access ∙ memory bandwidth ∙ false/true sharing ∙ cache line split (not in Java world, except. . . ) ∙ store forwarding (unlikely, hard to fix on Java level) ∙ 4k aliasing 39/64
  • 58. @kuksenk0#JavaPerfAndHWC High CPI: Core bound ∙ long latency operations: DIV, SQRT ∙ FP assist: floating points denormal, NaN, inf ∙ bad speculation (caused by mispredicted branch) ∙ port saturation (forget it) 40/64
  • 59. @kuksenk0#JavaPerfAndHWC High CPI: Front-End bound ∙ iTLB miss ∙ iCache miss ∙ branch mispredict ∙ LSD (loop stream decoder) solvable by HotSpot tweaking 41/64
  • 61. @kuksenk0#JavaPerfAndHWC Example 2 Some «large» standard server-side Java benchmark 43/64
  • 62. @kuksenk0#JavaPerfAndHWC Example 2: perf stat -d 880851.993237 task -clock (msec) 39,318 context -switches 437 cpu -migrations 7,931 page -faults 2 ,277 ,063 ,113 ,376 cycles 1 ,226 ,299 ,634 ,877 instructions # 0.54 insns per cycle 229 ,500 ,265 ,931 branches # 260.544 M/sec 4 ,620 ,666 ,169 branch -misses # 2.01% of all branches 338 ,169 ,489 ,902 L1-dcache -loads # 383.912 M/sec 37 ,937 ,596 ,505 L1-dcache -load -misses # 11.22% of all L1-dcache hits 25 ,232 ,434 ,666 LLC -loads # 28.645 M/sec 7 ,307 ,884 ,874 L1-icache -load -misses # 0.00% of all L1 -icache hits 337 ,730 ,278 ,697 dTLB -loads # 382.846 M/sec 6 ,094 ,356 ,801 dTLB -load -misses # 1.81% of all dTLB cache hits 12 ,210 ,841 ,909 iTLB -loads # 13.863 M/sec 431 ,803 ,270 iTLB -load -misses # 3.54% of all iTLB cache hits 301.557213044 seconds time elapsed 44/64
  • 63. @kuksenk0#JavaPerfAndHWC Example 2: Processed perf stat -d IPC 0.54 cycles ×109 2277 instructions ×109 1226 L1-dcache-loads ×109 338 L1-dcache-load-misses ×109 38 LLC-loads ×109 25 dTLB-loads ×109 338 dTLB-load-misses ×109 6 45/64
  • 64. @kuksenk0#JavaPerfAndHWC Example 2: Processed perf stat -d IPC 0.54 cycles ×109 2277 instructions ×109 1226 L1-dcache-loads ×109 338 L1-dcache-load-misses ×109 38 LLC-loads ×109 25 dTLB-loads ×109 338 dTLB-load-misses ×109 6 Potential Issues? 45/64
  • 65. @kuksenk0#JavaPerfAndHWC Example 2: Processed perf stat -d IPC 0.54 cycles ×109 2277 instructions ×109 1226 L1-dcache-loads ×109 338 L1-dcache-load-misses ×109 38 LLC-loads ×109 25 dTLB-loads ×109 338 dTLB-load-misses ×109 6 Let’s count: 4 × (338 × 109 )+ 12 × ((38 − 25) × 109 )+ 36 × (25 × 109 ) ≈ 2408 × 109 cycles ??? Potential Issues? 45/64
  • 66. @kuksenk0#JavaPerfAndHWC Example 2: Processed perf stat -d IPC 0.54 cycles ×109 2277 instructions ×109 1226 L1-dcache-loads ×109 338 L1-dcache-load-misses ×109 38 LLC-loads ×109 25 dTLB-loads ×109 338 dTLB-load-misses ×109 6 Let’s count: 4 × (338 × 109 )+ 12 × ((38 − 25) × 109 )+ 36 × (25 × 109 ) ≈ 2408 × 109 cycles ??? Superscalar in Action! 45/64
  • 67. @kuksenk0#JavaPerfAndHWC Example 2: Processed perf stat -d IPC 0.54 cycles ×109 2277 instructions ×109 1226 L1-dcache-loads ×109 338 L1-dcache-load-misses ×109 38 LLC-loads ×109 25 dTLB-loads ×109 338 dTLB-load-misses ×109 6 Issue! 45/64
  • 68. @kuksenk0#JavaPerfAndHWC Example 2: dTLB misses TLB = Translation Lookaside Buffer ∙ a memory cache that stores recent translations of virtual memory addresses to physical addresses ∙ each memory access → access to TLB ∙ TLB miss may take hundreds cycles 46/64
  • 69. @kuksenk0#JavaPerfAndHWC Example 2: dTLB misses TLB = Translation Lookaside Buffer ∙ a memory cache that stores recent translations of virtual memory addresses to physical addresses ∙ each memory access → access to TLB ∙ TLB miss may take hundreds cycles How to check? ∙ dtlb_load_misses_miss_causes_a_walk ∙ dtlb_load_misses_walk_duration 46/64
  • 70. @kuksenk0#JavaPerfAndHWC Example 2: dTLB misses IPC 0.54 cycles ×109 2277 instructions ×109 1226 L1-dcache-loads ×109 338 L1-dcache-load-misses ×109 38 LLC-loads ×109 25 dTLB-loads ×109 338 dTLB-load-misses ×109 6 dTLB-walks-duration ⋆ ×109 296 13%of cycles (time) ⋆ in cycles 47/64
  • 71. @kuksenk0#JavaPerfAndHWC Example 2: dTLB misses Fixing it: ∙ Enable -XX:+UseLargePages ∙ try to shrink application working set 48/64
  • 72. @kuksenk0#JavaPerfAndHWC Example 2: -XX:+UseLargePages baseline large pages IPC 0.54 0.64 cycles ×109 2277 2277 instructions ×109 1226 1460 L1-dcache-loads ×109 338 401 L1-dcache-load-misses ×109 38 38 LLC-loads ×109 25 25 dTLB-loads ×109 338 401 dTLB-load-misses ×109 6 0.24 dTLB-walks-duration ×109 296 2.6 Boost – 20% 49/64
  • 73. @kuksenk0#JavaPerfAndHWC Example 2: normalized per transaction baseline large pages IPC 0.54 0.64 cycles ×106 23.4 19.5 instructions ×106 12.5 12.5 L1-dcache-loads ×106 3.45 3.45 L1-dcache-load-misses ×106 0.39 0.33 LLC-loads ×106 0.26 0.21 dTLB-loads ×106 3.45 3.45 dTLB-load-misses ×106 0.06 0.002 dTLB-walks-duration ×106 3.04 0.022 50/64
  • 74. @kuksenk0#JavaPerfAndHWC Example 3 «A plague on both your houses» «++ on both your threads» 51/64
  • 75. @kuksenk0#JavaPerfAndHWC Example 3: False Sharing @State(Scope.Group) public static class StateBaseline { int field0; int field1; } @Benchmark @Group("baseline") public int justDoIt(StateBaseline s) { return s.field0 ++; } @Benchmark @Group("baseline") public int doItFromOtherThread(StateBaseline s) { return s.field1 ++; } 52/64
  • 76. @kuksenk0#JavaPerfAndHWC Example 3: Measure resource ⋆ sharing padded Same Core (HT) L1 9.5 4.9 Diff Cores (within socket) L3 (LLC) 10.6 2.8 Diff Sockets nothing 18.2 2.8 average time, ns/op ⋆ shared between threads 53/64
  • 77. @kuksenk0#JavaPerfAndHWC Example 3: What do HWC tell us? Same Core Diff Cores Diff Sockets sharing padded sharing padded sharing padded CPI 1.3 0.7 1.4 0.4 1.7 0.4 cycles 33130 17536 36012 9163 46484 9608 instructions 26418 25865 26550 25747 26717 25768 L1-loads 12593 9467 9696 8973 9672 9016 L1-load-misses 10 5 12 4 33 3 L1-stores 4317 7838 7433 4069 6935 4074 L1-store-misses 5 2 161 2 55 1 LLС-loads 4 3 58 1 32 1 LLC-load-misses 1 1 53 ≈0 35 ≈0 LLC-stores 1 1 183 ≈0 49 ≈0 LLC-store-misses 1 ≈0 182 ≈0 48 ≈0 ⋆ All values are normalized per 103 operations 54/64
  • 78. @kuksenk0#JavaPerfAndHWC Example 3: «on a core and a prayer» in case of the single core(HT) we have to look into MACHINE_CLEARS.MEMORY_ORDERING Same Core Diff Cores Diff Sockets sharing padded sharing padded sharing padded CPI 1.3 0.7 1.4 0.4 1.7 0.4 cycles 33130 17536 36012 9163 46484 9608 instructions 26418 25865 26550 25747 26717 25768 CLEARS 238 ≈0 ≈0 ≈0 ≈0 ≈0 55/64
  • 79. @kuksenk0#JavaPerfAndHWC Example 3: Diff Cores L2_STORE_LOCK_RQSTS - L2 RFOs breakdown Diff Cores Diff Sockets sharing padded sharing padded CPI 1.4 0.4 1.7 0.4 LLC-stores 183 ≈0 49 ≈0 LLC-store-misses 182 ≈0 48 ≈0 L2_STORE_LOCK_RQSTS.MISS 134 ≈0 33 ≈0 L2_STORE_LOCK_RQSTS.HIT_E ≈0 ≈0 ≈0 ≈0 L2_STORE_LOCK_RQSTS.HIT_M ≈0 ≈0 ≈0 ≈0 L2_STORE_LOCK_RQSTS.ALL 183 ≈0 49 ≈0 56/64
  • 81. @kuksenk0#JavaPerfAndHWC Example 3: Diff Cores Diff Cores Diff Sockets sharing padded sharing padded CPI 1.4 0.4 1.7 0.4 LLC-stores 183 ≈0 49 ≈0 LLC-store-misses 182 ≈0 48 ≈0 L2_STORE_LOCK_RQSTS.MISS 134 ≈0 33 ≈0 L2_STORE_LOCK_RQSTS.ALL 183 ≈0 49 ≈0 Why 183 > 49 & 134 > 33, but the same socket case is faster? 58/64
  • 82. @kuksenk0#JavaPerfAndHWC Example 3: Some events count duration For example: OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO Diff Cores Diff Sockets sharing padded sharing padded CPI 1.4 0.4 1.7 0.4 cycles 36012 9163 46484 9608 instructions 26550 25747 26717 25768 O_R_O.CYCLES_W_D_RFO 21723 10 29601 56 59/64
  • 84. @kuksenk0#JavaPerfAndHWC Summary: Performance is easy To achieve high performance: ∙ You have to know your Application! ∙ You have to know your Frameworks! ∙ You have to know your Virtual Machine! ∙ You have to know your Operating System! ∙ You have to know your Hardware! 61/64
  • 85. @kuksenk0#JavaPerfAndHWC Summary: «Here be dragons» To achieve high performance: ∙ You have to know your Application! ∙ You have to know your Frameworks! ∙ You have to know your Virtual Machine! ∙ You have to know your Operating System! ∙ You have to know your Hardware! 61/64
  • 86. @kuksenk0#JavaPerfAndHWC Enlarge your knowledge with these simple tricks! Reading list: ∙ “Computer Architecture: A Quantitative Approach” John L. Hennessy, David A. Patterson ∙ CPU vendors documentation ∙ http://www.agner.org/optimize/ ∙ http://www.google.com/search?q=Hardware+ performance+counter ∙ etc. . . 62/64