How shit works: the CPU

Tomer Gabel
BuildStuff 2016 Lithuania
Image: Telecarlos (CC BY-SA 3.0)

Full Disclosure
Bullshit ahead!
• I’m not an expert
• Explanations may be:
– Simplified
– Inaccurate
– Wrong :-)
• We’ll barely scratch the surface
Image: Public Domain

Are you ready for… A CONUNDRUM?
Image: Louis Reed (CC BY-SA 4.0)

Setting the Stage
import java.util.Arrays;
import java.util.Random;

// Generate a bunch of bytes
byte[] data = new byte[32768];
new Random().nextBytes(data);
Arrays.sort(data);  // <- with or without this line?
// Sum the non-negative elements
long sum = 0;
for (int i = 0; i < data.length; i++)
    if (data[i] >= 0)
        sum += data[i];
1. Which is faster: with the sort, or without it?
2. By how much?
3. And crucially… why?!

Surprise, Terror and Ruthless Efficiency
# Run complete. Total time: 00:00:32
Benchmark      Mode  Cnt    Score   Error  Units
Baseline.sum   avgt    6  115.666 ± 3.137  us/op
Presorted.sum  avgt    6   13.741 ± 0.524  us/op
* Ignoring setup cost
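To try the comparison at home, here is a minimal sketch that times the same loop over unsorted and presorted data. The table above looks like output from a benchmark harness; this sketch is not that harness, just a crude `System.nanoTime` loop, so treat its numbers as indicative only. The class name, the fixed seed and the round counts are assumptions made for illustration, and depending on the JVM the JIT may compile the branch away, in which case the gap shrinks or disappears.

```java
import java.util.Arrays;
import java.util.Random;

class BranchTimingSketch {
    // Sums the non-negative elements, same loop as on the slide.
    static long sumNonNegative(byte[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++)
            if (data[i] >= 0)
                sum += data[i];
        return sum;
    }

    // Average nanoseconds per call over the given number of rounds.
    static double averageNanos(byte[] data, int rounds) {
        long sink = 0;
        long start = System.nanoTime();
        for (int r = 0; r < rounds; r++)
            sink += sumNonNegative(data);
        long elapsed = System.nanoTime() - start;
        if (sink == 42) System.out.print("");  // use the result so the work is not discarded
        return (double) elapsed / rounds;
    }

    public static void main(String[] args) {
        byte[] unsorted = new byte[32768];
        new Random(1234).nextBytes(unsorted);  // fixed seed: an assumption, for repeatability
        byte[] sorted = unsorted.clone();
        Arrays.sort(sorted);

        // Warm up so the loop is JIT-compiled before we measure.
        averageNanos(unsorted, 1_000);
        averageNanos(sorted, 1_000);

        System.out.printf("unsorted:  %.1f ns/op%n", averageNanos(unsorted, 5_000));
        System.out.printf("presorted: %.1f ns/op%n", averageNanos(sorted, 5_000));
    }
}
```

Both arrays hold the same bytes, so the sums are identical; only the order in which the branch sees them differs.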

CPUS ARE COMPLEX BEASTS.
Image: Pauli Rautakorpi (CC BY 3.0)

It Is Known
• Your high-level code…
long sum = 0;
for (int i = 0; i < length; i++)
    if (data[i] >= 0)
        sum += data[i];
• Gets compiled down to…
movsx eax,BYTE PTR [rax+rdx*1+0x10]
cmp eax,0x0
movabs rdx,0x11f3a9f60
movabs rcx,0x128
jl 0x000000010679e077
movabs rcx,0x138
mov r8,QWORD PTR [rdx+rcx*1]
lea r8,[r8+0x1]
mov QWORD PTR [rdx+rcx*1],r8
jl 0x000000010679e092
movsxd rax,eax
add rax,rbx
mov rbx,rax
inc edi

It Is Less Known
• What happens then?
• The instruction goes through phases…
Instruction Stream → Fetch → Decode → Execute → Memory Access → Write-back

CPU Architecture 101
Image: Appaloosa (CC BY-SA 3.0)

• What does a CPU do?
– Reads the program
– Figures it out
– Executes it
– Talks to memory
– Performs I/O
• Immense complexity!

Execution Units
• Arithmetic-Logic Unit (ALU)
– Boolean algebra
– Arithmetic
– Memory accesses
– Flow control
• Floating Point Unit (FPU)
• Memory Management Unit (MMU)
– Memory mapping
– Paging
– Access control
Images: ALU by Dirk Oppelt (CC BY-SA 3.0), FPU by Konstantin Lanzet (CC BY-SA 3.0), MMU from unknown source

DESIGN CONSIDERATIONS
Image: William M. Plate Jr. (Public Domain)

Pipelining
[Diagram: instructions I0, I1 and I2 each pass through Fetch, Decode, Execute, Memory Access and Write-back]
Sequential Execution: Latency = 5 cycles, Throughput = 0.2 ops / cycle
Pipelined Execution: Latency = 5 cycles, Throughput = 1 op / cycle
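The latency and throughput figures follow from simple counting, sketched below for an idealised five-stage pipeline with no stalls. The class and method names are made up for illustration. Sequentially, every instruction occupies all five stages before the next one starts; pipelined, the first instruction still takes five cycles, but after that one instruction completes per cycle.

```java
class PipelineMath {
    static final int STAGES = 5;  // fetch, decode, execute, memory access, write-back

    // Sequential: an instruction finishes every stage before the next one begins.
    static long sequentialCycles(long instructions) {
        return instructions * STAGES;
    }

    // Ideal pipeline (no stalls): fill the pipe once, then retire one instruction per cycle.
    static long pipelinedCycles(long instructions) {
        return instructions == 0 ? 0 : STAGES + (instructions - 1);
    }

    static double throughput(long instructions, long cycles) {
        return (double) instructions / cycles;
    }

    public static void main(String[] args) {
        long n = 1_000_000;
        System.out.printf("sequential: %.3f ops/cycle%n", throughput(n, sequentialCycles(n)));
        System.out.printf("pipelined:  %.3f ops/cycle%n", throughput(n, pipelinedCycles(n)));
    }
}
```

For the three instructions in the diagram that is 15 cycles sequentially versus 7 pipelined; the pipelined throughput only approaches 1 op / cycle as the instruction count grows.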

Pipelining
• A pipeline can stall
• This happens with:
– Branches
if (i < 0) i++; else i--;
[Diagram: a memory load (F D E M W), a test (F D E M) and a conditional jump (F D E) are in flight; which instruction to fetch next is unknown until the jump resolves]
– Dependent Instructions
i++;
x = i + 1;
[Diagram: incrementing a memory address means load from memory, add +1, store in memory; the instruction that needs the result stalls after decode]
• A.K.A. pipeline bubbling
PRACTICAL RAMIFICATIONS
Image: Hangsna (CC BY-SA 3.0)

1. Memory is Slow
• RAM access is ~60ns
• Random access on a 4GHz, 64-bit CPU:
– 250 cycles / memory access
– 130MB / second bandwidth
• Surely we can do better!
Image: Noah Wieder (Public Domain)
Source: 7-cpu.com
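The two figures above are back-of-the-envelope arithmetic, reproduced in the sketch below. It assumes the 62ns main-memory latency quoted from 7-cpu.com elsewhere in this deck (the "~60ns" above, unrounded), one 8-byte (64-bit) word fetched per access, and no caching or overlap between accesses. The class and method names are made up for illustration.

```java
class MemoryLatencyMath {
    // Clock cycles that elapse while a single memory access is outstanding.
    static long cyclesPerAccess(double latencyNanos, double clockGhz) {
        return Math.round(latencyNanos * clockGhz);
    }

    // Bandwidth if every access fetches one word and accesses never overlap.
    static double megabytesPerSecond(double latencyNanos, int bytesPerAccess) {
        double accessesPerSecond = 1e9 / latencyNanos;
        return accessesPerSecond * bytesPerAccess / 1e6;
    }

    public static void main(String[] args) {
        double latencyNanos = 62.0;  // assumption: main-memory latency from the 7-cpu.com figures
        double clockGhz = 4.0;
        int wordBytes = 8;           // one 64-bit word per access

        System.out.println(cyclesPerAccess(latencyNanos, clockGhz) + " cycles / memory access");
        System.out.printf("%.0f MB / second%n", megabytesPerSecond(latencyNanos, wordBytes));
    }
}
```

That works out to 248 cycles and roughly 129 MB per second, which the slide rounds to 250 and 130.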

Enter: CPU Cache
Level        Size         Latency
L1           32KB + 32KB  1ns
L2           256KB        3ns
L3           8MB          11ns
Main Memory  n/a          62ns
Intel i7-6700 “Skylake” at 4 GHz
Image: Ferry24.Milan (CC BY-SA 3.0)
Source: 7-cpu.com

Enter: CPU Cache
• The unit of transfer is called a cache line
– 64 bytes on x86
– LRU-like eviction policy
• Why is sequential access fast?
– Neighbouring data shares a cache line
– Cache prefetching
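A rough way to feel the effect is to sum the same array twice: once in index order, and once hopping between distant elements so that nearly every read lands on a different cache line. The sketch below is an illustration under assumptions: the class name, array size and stride are arbitrary choices, the timing is a crude `System.nanoTime` measurement, and hardware prefetchers can recognise regular strides, so the gap varies from machine to machine.

```java
class StrideSketch {
    // Visits elements in index order, so each loaded cache line is used in full.
    static long sumSequential(int[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++)
            sum += data[i];
        return sum;
    }

    // Visits every element exactly once, but consecutive reads are `stride` elements apart.
    static long sumStrided(int[] data, int stride) {
        long sum = 0;
        for (int offset = 0; offset < stride; offset++)
            for (int i = offset; i < data.length; i += stride)
                sum += data[i];
        return sum;
    }

    public static void main(String[] args) {
        int[] data = new int[1 << 22];  // 16MB of ints: an arbitrary size for illustration
        for (int i = 0; i < data.length; i++)
            data[i] = i & 0xFF;

        long t0 = System.nanoTime();
        long a = sumSequential(data);
        long t1 = System.nanoTime();
        long b = sumStrided(data, 1024);  // 1024 ints = 4KB between consecutive reads
        long t2 = System.nanoTime();

        System.out.printf("sequential: %d ms (sum %d)%n", (t1 - t0) / 1_000_000, a);
        System.out.printf("strided:    %d ms (sum %d)%n", (t2 - t1) / 1_000_000, b);
    }
}
```

Both traversals do the same amount of arithmetic and return the same sum; only the memory access pattern changes.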

In Real Life
• Let’s rotate (well, transpose) an image!
for (int y = 0; y < height; y++)
    for (int x = 0; x < width; x++) {
        int from = y * width + x;   // row-major index into source
        int to = x * height + y;    // same pixel, transposed, in target
        target[to] = source[from];
    }
Image: EgoAltere (CC0 Public Domain)

In Real Life
• This is not efficient
• Reads are sequential
• Writes aren’t, though
• Different strides
– Worst case wins :-(
[Diagram: on a 10×10 grid, reads touch indices 0, 1, 2, 3, … 9 along a row, while writes touch 0, 10, 20, 30, … 90 down a column]
Cache-Friendly Algorithms
• Use blocking or tiling
for (int y = 0; y < height; y += blockHeight)
    for (int x = 0; x < width; x += blockWidth)
        // Clamp the inner loops so partial tiles at the edges stay in bounds
        for (int by = 0; by < blockHeight && y + by < height; by++)
            for (int bx = 0; bx < blockWidth && x + bx < width; bx++) {
                int from = (y + by) * width + (x + bx);
                int to = (x + bx) * height + (y + by);
                target[to] = source[from];
            }
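For anyone who wants to run the two variants side by side, here is a self-contained sketch that wraps the naive and the tiled loops in methods and checks that they produce the same output. The class name, the method names, the single square `block` parameter and the image dimensions are assumptions for illustration; this is not the benchmark code behind the talk.

```java
import java.util.Arrays;

class TransposeSketch {
    // Row-by-row reads, column-by-column writes.
    static void transposeNaive(int[] source, int[] target, int width, int height) {
        for (int y = 0; y < height; y++)
            for (int x = 0; x < width; x++)
                target[x * height + y] = source[y * width + x];
    }

    // Same result, but works through block x block tiles so reads and writes stay close together.
    static void transposeTiled(int[] source, int[] target, int width, int height, int block) {
        for (int y = 0; y < height; y += block)
            for (int x = 0; x < width; x += block) {
                int yEnd = Math.min(y + block, height);  // clamp partial tiles at the edges
                int xEnd = Math.min(x + block, width);
                for (int by = y; by < yEnd; by++)
                    for (int bx = x; bx < xEnd; bx++)
                        target[bx * height + by] = source[by * width + bx];
            }
    }

    public static void main(String[] args) {
        int width = 1030, height = 770;  // deliberately not multiples of the tile size
        int[] source = new int[width * height];
        for (int i = 0; i < source.length; i++)
            source[i] = i;

        int[] naive = new int[source.length];
        int[] tiled = new int[source.length];
        transposeNaive(source, naive, width, height);
        transposeTiled(source, tiled, width, height, 8);
        System.out.println("same result: " + Arrays.equals(naive, tiled));
    }
}
```

The `Math.min` clamps matter whenever the image dimensions are not exact multiples of the tile size; without them the inner loops would index past the end of the arrays.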

Cache-Friendly Algorithms
• The results?
Benchmark                          Mode  Cnt   Score   Error  Units
CachingShowcase.transposeNaive     avgt   10  43.851 ± 6.000  ms/op
CachingShowcase.transposeTiled8x8  avgt   10  20.641 ± 1.646  ms/op
x2.12 speedup! (43.851 / 20.641)

2. Those Pesky Branches
• Do I go left or right?
• Need input!
• … but can’t wait for it
• Maybe...
– Take a guess?
– Based on historic trends?
• Sounds speculative
Image: Michael Dolan (CC BY 2.0)

Those Pesky Branches
• Enter: Branch Prediction
• Concurrently:
– Speculate branch
– Evaluate condition
• It’s now a tradeoff
– Commit is fast
– Rollback is slow
Image: Alejandro C. (CC BY-NC 2.0)

Back to Our Conundrum
• Can you guess?
– 3…
– 2...
– 1...
• Here it is!
Arrays.sort(data);
long sum = 0;
for (int i = 0; i < data.length; i++)
    if (data[i] >= 0)   // <- the branch
        sum += data[i];

Catharsis
Original data array:
54 10 -4 -2 15 41 -37 13 0 -9 14 25 -61 40
After sorting:
-61 -37 -9 -4 -2 0 10 13 14 15 25 40 41 54
• Elements before 0: data[i] >= 0 is always false!
• Elements from 0 onward: data[i] >= 0 is always true!
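To make "guessing based on historic trends" concrete, the sketch below simulates a single 2-bit saturating counter, a classic textbook predictor, and counts how often it guesses `data[i] >= 0` wrong. This is a toy model chosen for illustration: real branch predictors are far more sophisticated, and the class name, the initial counter state and the fixed seed are assumptions.

```java
import java.util.Arrays;
import java.util.Random;

class TwoBitPredictorSketch {
    // Counts how often a 2-bit saturating counter mispredicts the outcome of `data[i] >= 0`.
    static int mispredictions(byte[] data) {
        int state = 0;   // 0 and 1 predict "false", 2 and 3 predict "true"
        int misses = 0;
        for (byte b : data) {
            boolean actual = b >= 0;
            boolean predicted = state >= 2;
            if (predicted != actual)
                misses++;
            // Nudge the counter toward the observed outcome, saturating at 0 and 3.
            state = actual ? Math.min(3, state + 1) : Math.max(0, state - 1);
        }
        return misses;
    }

    public static void main(String[] args) {
        byte[] unsorted = new byte[32768];
        new Random(1234).nextBytes(unsorted);  // fixed seed: an assumption, for repeatability
        byte[] sorted = unsorted.clone();
        Arrays.sort(sorted);

        System.out.println("unsorted mispredictions: " + mispredictions(unsorted));
        System.out.println("sorted mispredictions:   " + mispredictions(sorted));
    }
}
```

On sorted data the toy predictor is wrong only around the point where the values cross zero; on random data the outcome is close to a coin flip, so it keeps guessing wrong and the pipeline keeps paying for the rollback.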

QUESTIONS?
Thank you for listening
tomer@tomergabel.com
@tomerg
http://engineering.wix.com
Sources and Examples:
https://goo.gl/f7NfGT
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Further Reading
• Jason Robert Carey Patterson – Modern Microprocessors, a 90-Minute Guide
• Igor Ostrovsky – Gallery of Processor Cache Effects
• Piyush Kumar – Cache Oblivious Algorithms
