The beautiful thing about software engineering is that it gives you the warm and fuzzy illusion of total understanding: I control this machine because I know how it operates. This is the result of layers upon layers of successful abstractions, which hide immense sophistication and complexity. As with any abstraction, though, these sometimes leak, and that's when a good grounding in what's under the hood pays off.
The second talk in this series peels back a few layers of abstraction and takes a look under the hood of our "car engine", the CPU. While hardly anyone codes in assembly language anymore, your C# or JavaScript (or Scala or...) application still ends up executing machine code instructions on a processor; that is why Java has a memory model, why memory layout still matters at scale, and why you're usually free to ignore these considerations and go about your merry way.
You'll come away knowing a little bit about a lot of different moving parts under the hood; after all, isn't understanding how the machine operates what this is all about?
(From a talk given at BuildStuff 2016 in Vilnius, Lithuania.)
How shit works: the CPU
1. How shit works: the CPU
Tomer Gabel
BuildStuff 2016 Lithuania
Image: Telecarlos (CC BY-SA 3.0)
2. Full Disclosure
Bullshit ahead!
• I’m not an expert
• Explanations may be:
– Simplified
– Inaccurate
– Wrong :-)
• We’ll barely scratch the surface
Image: Public Domain
4. Setting the Stage
// Generate a bunch of bytes
byte[] data = new byte[32768];
new Random().nextBytes(data);
// Presorted variant only: sort the data first
Arrays.sort(data);
// Sum the non-negative elements
long sum = 0;
for (int i = 0; i < data.length; i++)
    if (data[i] >= 0)
        sum += data[i];
1. Which is faster?
2. By how much?
3. And crucially… why?!
5. Surprise, Terror and Ruthless Efficiency
# Run complete. Total time: 00:00:32
Benchmark      Mode  Cnt    Score   Error  Units
Baseline.sum   avgt    6  115.666 ± 3.137  us/op
Presorted.sum  avgt    6   13.741 ± 0.524  us/op
* Ignoring setup cost
15. CPU Architecture 101
• What does a CPU do?
– Reads the program
– Figures it out
– Executes it
– Talks to memory
– Performs I/O
• Immense complexity!
16. Execution Units
• Arithmetic-Logic Unit (ALU)
– Boolean algebra
– Arithmetic
– Memory accesses
– Flow control
• Floating Point Unit (FPU)
• Memory Management Unit (MMU)
– Memory mapping
– Paging
– Access control
Images: ALU by Dirk Oppelt (CC BY-SA 3.0), FPU by Konstantin Lanzet (CC BY-SA 3.0), MMU from unknown source
20. Pipelining
• A pipeline can stall
• This happens with:
– Branches
if (i < 0) i++; else i--;
Memory Load:       F D E M W
Test:                F D E M
Conditional Jump:      F D E
?:                       ? ? ? ?
21. Pipelining
• A pipeline can stall
• This happens with:
– Branches
– Dependent Instructions
• A.K.A. pipeline bubbling
i++;
x = i + 1;
[Pipeline diagram. Rows: "Increment memory address", "Load from memory", "Add +1", "Store in memory"; stages F D E M W, with later rows marked "Stall".]
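One way to picture the dependent-instruction stall: a single running total forces every addition to wait for the previous one, while several independent totals give the pipeline work that does not depend on a result still in flight. The sketch below is an illustration only and is not taken from the talk; the class and method names are made up, and whether the four-way version is actually faster depends on the JIT and the CPU, which may already apply a similar transformation.

```java
// Hypothetical illustration: one dependency chain versus four independent ones.
final class DependencyChains {
    // Every addition depends on the previous value of sum.
    static long sumSingle(int[] data) {
        long sum = 0;
        for (int v : data)
            sum += v;
        return sum;
    }

    // Four accumulators form four independent chains the CPU can overlap.
    static long sumFourWay(int[] data) {
        long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        int i = 0;
        int limit = data.length - 3;
        for (; i < limit; i += 4) {
            s0 += data[i];
            s1 += data[i + 1];
            s2 += data[i + 2];
            s3 += data[i + 3];
        }
        // Handle the leftover elements when the length is not a multiple of 4.
        for (; i < data.length; i++)
            s0 += data[i];
        return s0 + s1 + s2 + s3;
    }

    public static void main(String[] args) {
        int[] data = new int[1003];
        for (int i = 0; i < data.length; i++)
            data[i] = i;
        System.out.println(sumSingle(data) == sumFourWay(data));
    }
}
```

Both methods return the same total; only the shape of the dependency graph differs.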
23. 1. Memory is Slow
• RAM access is ~60ns
• Random access on a 4GHz, 64-bit CPU:
– 250 cycles / memory access
– 130MB / second bandwidth
• Surely we can do better!
Image: Noah Wieder (Public Domain)
Source: 7-cpu.com
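The figures on this slide follow from simple arithmetic, sketched below. The 4GHz clock, ~60ns latency and 64-bit (8-byte) word come from the slide; the class and method names are made up for illustration, and the model assumes every access pays the full latency and fetches a single word. It works out to roughly 240 cycles and 133MB per second, which the slide rounds to 250 and 130.

```java
// Back-of-the-envelope check of the slide's numbers (hypothetical helper class).
final class MemoryMath {
    // Clock cycles that elapse while one RAM access is in flight.
    static double cyclesPerAccess(double clockHz, double latencySeconds) {
        return clockHz * latencySeconds;
    }

    // Bytes per second when each access fetches one word and pays the full latency.
    static double randomAccessBytesPerSecond(int wordBytes, double latencySeconds) {
        return wordBytes / latencySeconds;
    }

    public static void main(String[] args) {
        double clockHz = 4e9;    // 4GHz, from the slide
        double latency = 60e-9;  // ~60ns, from the slide
        int wordBytes = 8;       // one 64-bit word
        System.out.printf("%.0f cycles per access%n",
                cyclesPerAccess(clockHz, latency));
        System.out.printf("%.0f MB per second%n",
                randomAccessBytesPerSecond(wordBytes, latency) / 1e6);
    }
}
```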
25. Enter: CPU Cache
• A unit of work is called a cache line
– 64 bytes on x86
– LRU eviction policy
• Why is sequential access fast?
– Cache prefetching
26. In Real Life
• Let’s rotate an image!
for (y = 0; y < height; y++)
    for (x = 0; x < width; x++) {
        int from = y * width + x;
        int to = x * height + y;
        target[to] = source[from];
    }
Image: EgoAltere (CC0 Public Domain)
29. In Real Life
• This is not efficient
• Reads are sequential
• Writes aren’t, though
• Different strides
– Worst case wins :-(
0 1 2 3 ... 9
0 0 1 2 3 … 9
1 10
2 20
3 30
… …
9 90
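The difference in strides can be made concrete by counting how often consecutive accesses land on a different cache line. The sketch below replays the index arithmetic of the rotation loop without touching any image data. It assumes 4-byte pixels, the 64-byte cache line mentioned earlier, and an array that starts exactly on a cache-line boundary (real Java arrays carry an object header, so this is a simplification); the class and method names are hypothetical.

```java
// Hypothetical model: counts cache-line switches for the naive rotation loop.
final class StrideDemo {
    static final int CACHE_LINE_BYTES = 64; // x86 figure from the slides
    static final int PIXEL_BYTES = 4;       // assumption: int pixels
    static final int PIXELS_PER_LINE = CACHE_LINE_BYTES / PIXEL_BYTES;

    // Returns {readSwitches, writeSwitches}: the number of accesses that fall
    // on a different cache line than the access immediately before them.
    static long[] countLineSwitches(int width, int height) {
        long readSwitches = 0, writeSwitches = 0;
        int lastReadLine = -1, lastWriteLine = -1;
        for (int y = 0; y < height; y++)
            for (int x = 0; x < width; x++) {
                int from = y * width + x;   // sequential: stride of 1
                int to = x * height + y;    // strided: jumps by height
                int readLine = from / PIXELS_PER_LINE;
                int writeLine = to / PIXELS_PER_LINE;
                if (readLine != lastReadLine) readSwitches++;
                if (writeLine != lastWriteLine) writeSwitches++;
                lastReadLine = readLine;
                lastWriteLine = writeLine;
            }
        return new long[] { readSwitches, writeSwitches };
    }

    public static void main(String[] args) {
        long[] switches = countLineSwitches(64, 64);
        System.out.println("read line switches:  " + switches[0]);
        System.out.println("write line switches: " + switches[1]);
    }
}
```

Under these assumptions a 64x64 image moves to a new cache line once every 16 reads, but on every single write.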
30. Cache-Friendly Algorithms
• Use blocking or tiling
for (y = 0; y < height; y += blockHeight)
    for (x = 0; x < width; x += blockWidth)
        // Clamp each block to the image bounds so partial edge blocks are safe
        for (by = 0; by < blockHeight && y + by < height; by++)
            for (bx = 0; bx < blockWidth && x + bx < width; bx++) {
                int from = (y + by) * width + (x + bx);
                int to = (x + bx) * height + (y + by);
                target[to] = source[from];
            }
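A self-contained version of the tiled loop might look like the sketch below, which checks that it produces the same output as the naive loop. The class name, method names, square block size of 16 and 100x37 test image are all assumptions made for illustration; a good block size depends on the cache and the pixel size, and should be measured rather than guessed.

```java
import java.util.Arrays;

// Hypothetical sketch: naive and tiled versions of the same index mapping.
final class TiledRotate {
    // The straightforward loop from the earlier slide.
    static int[] naive(int[] source, int width, int height) {
        int[] target = new int[source.length];
        for (int y = 0; y < height; y++)
            for (int x = 0; x < width; x++)
                target[x * height + y] = source[y * width + x];
        return target;
    }

    // Same mapping, processed one block-by-block tile at a time so that reads
    // and writes both stay within a small region of memory.
    static int[] tiled(int[] source, int width, int height, int block) {
        int[] target = new int[source.length];
        for (int y = 0; y < height; y += block)
            for (int x = 0; x < width; x += block) {
                // Clamp the tile to the image bounds for partial edge tiles.
                int yEnd = Math.min(y + block, height);
                int xEnd = Math.min(x + block, width);
                for (int by = y; by < yEnd; by++)
                    for (int bx = x; bx < xEnd; bx++)
                        target[bx * height + by] = source[by * width + bx];
            }
        return target;
    }

    public static void main(String[] args) {
        int width = 100, height = 37; // deliberately not multiples of the block size
        int[] source = new int[width * height];
        for (int i = 0; i < source.length; i++)
            source[i] = i;
        System.out.println(Arrays.equals(naive(source, width, height),
                                         tiled(source, width, height, 16)));
    }
}
```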
32. 2. Those Pesky Branches
• Do I go left or right?
• Need input!
• … but can’t wait for it
• Maybe...
– Take a guess?
– Based on historic trends?
• Sounds speculative
Image: Michael Dolan (CC BY 2.0)
33. Those Pesky Branches
• Enter: Branch Prediction
• Concurrently:
– Speculate branch
– Evaluate condition
• It’s now a tradeoff
– Commit is fast
– Rollback is slow
Image: Alejandro C. (CC BY-NC 2.0)
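Guessing "based on historic trends" can be modelled with a textbook two-bit saturating counter, sketched below. This is a deliberate simplification and an assumption on my part: the predictors in real CPUs are far more elaborate, and the class name, the starting state and the fixed seed are all arbitrary. The sketch replays the outcome of a "value is non-negative" branch over random bytes, first unsorted and then sorted, and counts the wrong guesses.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical model of a two-bit saturating-counter branch predictor.
final class PredictorSim {
    // Counts wrong guesses for the branch "b >= 0" over the given data.
    static int mispredictions(byte[] data) {
        int counter = 1; // states 0..3; 2 and 3 predict "taken" (arbitrary start)
        int misses = 0;
        for (byte b : data) {
            boolean taken = b >= 0;
            boolean predicted = counter >= 2;
            if (predicted != taken)
                misses++;
            // Nudge the counter towards the actual outcome, saturating at 0 and 3.
            if (taken) {
                if (counter < 3) counter++;
            } else {
                if (counter > 0) counter--;
            }
        }
        return misses;
    }

    public static void main(String[] args) {
        byte[] data = new byte[32768];
        new Random(42).nextBytes(data); // fixed seed so runs are repeatable
        System.out.println("unsorted: " + mispredictions(data));
        Arrays.sort(data);
        System.out.println("sorted:   " + mispredictions(data));
    }
}
```

In this model the sorted array misses only around the single point where the values change sign, while the unsorted array misses on roughly half of its elements.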
34. Back to Our Conundrum
• Can you guess?
– 3…
– 2…
– 1…
• Here it is!
// Generate a bunch of bytes
byte[] data = new byte[32768];
new Random().nextBytes(data);
// Presorted variant only: sort the data first
Arrays.sort(data);
// Sum the non-negative elements
long sum = 0;
for (int i = 0; i < data.length; i++)
    if (data[i] >= 0)
        sum += data[i];
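If the data-dependent branch is what hurts, one option is to remove it from the loop body altogether. The sketch below is not from the talk: it replaces the condition with a bit mask that is zero for negative values and all ones otherwise. The class and method names are made up, and the JIT may already compile the original loop to a conditional move, so any speedup is an assumption to verify with a benchmark.

```java
import java.util.Random;

// Hypothetical sketch: the same sum with and without a conditional branch.
final class BranchlessSum {
    // The loop from the slide: branches on every element.
    static long sumWithBranch(byte[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++)
            if (data[i] >= 0)
                sum += data[i];
        return sum;
    }

    // Same result, but the selection is done with arithmetic instead of a branch.
    static long sumBranchless(byte[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) {
            int v = data[i];        // byte is sign-extended to int
            int mask = ~(v >> 31);  // 0 when v is negative, all ones otherwise
            sum += v & mask;        // adds v, or 0 for negative values
        }
        return sum;
    }

    public static void main(String[] args) {
        byte[] data = new byte[32768];
        new Random().nextBytes(data);
        System.out.println(sumWithBranch(data) == sumBranchless(data));
    }
}
```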
37. QUESTIONS?
Thank you for listening
tomer@tomergabel.com
@tomerg
http://engineering.wix.com
Sources and Examples:
https://goo.gl/f7NfGT
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
38. Further Reading
• Jason Robert Carey Patterson – Modern Microprocessors, a 90-Minute Guide
• Igor Ostrovsky – Gallery of Processor Cache Effects
• Piyush Kumar – Cache Oblivious Algorithms