How shit works:
the CPU
Tomer Gabel
BuildStuff 2016 Lithuania
Image: Telecarlos (CC BY-SA 3.0)
Full Disclosure
Bullshit ahead!
• I’m not an expert
• Explanations may be:
– Simplified
– Inaccurate
– Wrong :-)
• We’ll barely scratch the surface
Image: Public Domain
Are you ready for…
A CONUNDRUM?
Image: Louis Reed (CC BY-SA 4.0)
Setting the Stage
// Generate a bunch of bytes
byte[] data = new byte[32768];
new Random().nextBytes(data);
Arrays.sort(data);
// Sum non-negative elements
long sum = 0;
for (int i = 0; i < data.length; i++)
    if (data[i] >= 0)
        sum += data[i];
1. Which is faster: with or without the sort?
2. By how much?
3. And crucially… why?!
Surprise, Terror and Ruthless Efficiency
# Run complete. Total time: 00:00:32
Benchmark      Mode  Cnt    Score   Error  Units
Baseline.sum   avgt    6  115.666 ± 3.137  us/op
Presorted.sum  avgt    6   13.741 ± 0.524  us/op
* Ignoring setup cost
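The table above looks like JMH output, which needs a build setup of its own. As a rough stand-in, here is a minimal sketch that runs the same loop over unsorted and sorted copies of one array and times it with `System.nanoTime`. The class name, the fixed seed and the round counts are assumptions made for illustration, and the timings are only indicative: depending on the JVM, the JIT may compile the branch away and shrink or erase the gap.

```java
import java.util.Arrays;
import java.util.Random;

final class SortedVsUnsortedSum {
    // Written to on every timing run so the JIT cannot discard the summing loop.
    static volatile long sink;

    // Sums the non-negative elements, using the same loop as the slide.
    static long sumNonNegative(byte[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++)
            if (data[i] >= 0)
                sum += data[i];
        return sum;
    }

    // Crude wall-clock average in microseconds per call; not a substitute for JMH.
    static double averageMicros(byte[] data, int rounds) {
        long total = 0;
        long start = System.nanoTime();
        for (int r = 0; r < rounds; r++)
            total += sumNonNegative(data);
        long elapsed = System.nanoTime() - start;
        sink = total;
        return elapsed / 1000.0 / rounds;
    }

    public static void main(String[] args) {
        byte[] unsorted = new byte[32768];
        new Random(42).nextBytes(unsorted);   // fixed seed (assumption) for repeatable data
        byte[] sorted = unsorted.clone();
        Arrays.sort(sorted);

        averageMicros(unsorted, 2_000);       // warm-up so the JIT compiles the loop
        averageMicros(sorted, 2_000);

        System.out.printf("unsorted: %.3f us/op%n", averageMicros(unsorted, 10_000));
        System.out.printf("sorted:   %.3f us/op%n", averageMicros(sorted, 10_000));
    }
}
```

Both runs compute the same sum over the same values; only the order of the elements differs.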
CPUS ARE
COMPLEX
BEASTS.
Image: Pauli Rautakorpi (CC BY 3.0)
It Is Known
• Your high-level code…
long sum = 0;
for (i = 0; i < length; i++)
    if (data[i] >= 0)
        sum += data[i];
• Gets compiled down to…
movsx eax,BYTE PTR [rax+rdx*1+0x10]
cmp eax,0x0
movabs rdx,0x11f3a9f60
movabs rcx,0x128
jl 0x000000010679e077
movabs rcx,0x138
mov r8,QWORD PTR [rdx+rcx*1]
lea r8,[r8+0x1]
mov QWORD PTR [rdx+rcx*1],r8
jl 0x000000010679e092
movsxd rax,eax
add rax,rbx
mov rbx,rax
inc edi
It Is Less Known
• What happens then?
• The instruction goes through phases…
Instruction Stream → Fetch → Decode → Execute → Memory Access → Write-back
CPU Architecture 101
Image: Appaloosa (CC BY-SA 3.0)
CPU Architecture 101
• What does a CPU do?
– Reads the program
– Figures it out
– Executes it
– Talks to memory
– Performs I/O
• Immense complexity!
Execution Units
• Arithmetic-Logic Unit (ALU)
– Boolean algebra
– Arithmetic
– Memory accesses
– Flow control
• Floating Point Unit (FPU)
• Memory Management Unit (MMU)
– Memory mapping
– Paging
– Access control
Images: ALU by Dirk Oppelt (CC BY-SA 3.0), FPU by Konstantin Lanzet (CC BY-SA 3.0), MMU from unknown source
DESIGN
CONSIDERATIONS
Image: William M. Plate Jr. (Public Domain)
Pipelining
[Diagram: instructions I0, I1 and I2 each pass through Fetch, Decode, Execute, Memory Access and Write-back. In sequential execution an instruction starts only after the previous one has finished; in pipelined execution each instruction starts one stage behind the previous one.]
Sequential Execution
Latency = 5 cycles
Throughput = 0.2 ops / cycle
Pipelined Execution
Latency = 5 cycles
Throughput = 1 op / cycle
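The two throughput figures can be checked with a little arithmetic. The sketch below assumes an idealized five-stage pipeline with one cycle per stage and no stalls; the class and method names are made up for illustration.

```java
final class PipelineMath {
    static final int STAGES = 5; // Fetch, Decode, Execute, Memory Access, Write-back

    // Sequential: every instruction occupies all five stages before the next starts.
    static long sequentialCycles(long instructions) {
        return instructions * STAGES;
    }

    // Pipelined: the first instruction takes STAGES cycles, and each following
    // instruction completes one cycle after its predecessor (assuming no stalls).
    static long pipelinedCycles(long instructions) {
        return instructions == 0 ? 0 : STAGES + (instructions - 1);
    }

    static double throughput(long instructions, long cycles) {
        return (double) instructions / cycles;
    }

    public static void main(String[] args) {
        long n = 1_000_000;
        System.out.printf("sequential: %.3f ops/cycle%n", throughput(n, sequentialCycles(n)));
        System.out.printf("pipelined:  %.3f ops/cycle%n", throughput(n, pipelinedCycles(n)));
    }
}
```

Latency per instruction stays at five cycles in both cases; pipelining only changes how many instructions are in flight at once, which is why throughput approaches 1 op / cycle for long instruction streams.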
Pipelining
• A pipeline can stall
• This happens with:
– Branches
if (i < 0) i++; else i--;
[Diagram: Memory Load (F D E M W), Test (F D E M), Conditional Jump (F D E); what to fetch next is unknown (?) until the jump resolves]
– Dependent Instructions
i++;
x = i + 1;
[Diagram: "Increment memory address" (load from memory, add +1, store in memory) runs F D E M W; the instruction that depends on it stalls after F D]
• A.K.A. pipeline bubbling
PRACTICAL
RAMIFICATIONS
Image: Hangsna (CC BY-SA 3.0)
1. Memory is Slow
• RAM access is ~60ns
• Random access on a 4GHz, 64-bit CPU:
– 250 cycles / memory access
– 130MB / second bandwidth
• Surely we can do better!
Image: Noah Wieder (Public Domain)
Source: 7-cpu.com
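The two figures on this slide follow from the latency and the clock speed. A minimal sketch of the arithmetic, assuming each access fetches one 8-byte word and waits for the full round trip, and taking 1 MB as 10^6 bytes; the class and method names are illustrative only.

```java
final class MemoryLatencyMath {
    // Cycles that elapse during one memory access: nanoseconds * cycles per nanosecond.
    static double stallCycles(double latencyNs, double clockGhz) {
        return latencyNs * clockGhz;
    }

    // One word per round trip: bytes per nanosecond, scaled to MB per second.
    // 1 byte/ns = 10^9 bytes/s = 1000 MB/s.
    static double randomAccessMBps(double latencyNs, int wordBytes) {
        return wordBytes / latencyNs * 1000.0;
    }

    public static void main(String[] args) {
        System.out.printf("cycles per access: %.0f%n", stallCycles(60, 4.0));
        System.out.printf("bandwidth: %.1f MB/s%n", randomAccessMBps(60, 8));
    }
}
```

With 60ns and 4GHz this gives 240 cycles and roughly 133 MB/s, in line with the rounded numbers on the slide.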
Enter: CPU Cache
Level        Size         Latency
L1           32KB + 32KB  1ns
L2           256KB        3ns
L3           8MB          11ns
Main Memory               62ns
Intel i7-6700 “Skylake” at 4 GHz
Image: Ferry24.Milan (CC BY-SA 3.0)
Source: 7-cpu.com
Enter: CPU Cache
• The unit of transfer is called a cache line
– 64 bytes on x86
– LRU eviction policy
• Why is sequential access fast?
– Cache prefetching
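One way to see why access patterns matter is to count how many distinct cache lines a loop touches. The sketch below assumes 64-byte lines, an array that starts on a line boundary, and elements that never straddle two lines; it models addresses only and does not measure real hardware. The names are illustrative.

```java
import java.util.HashSet;
import java.util.Set;

final class CacheLineCount {
    static final int LINE_BYTES = 64; // line size quoted on the slide for x86

    // Counts the distinct cache lines touched by `accesses` reads of
    // `elementBytes`-sized elements spaced `strideElements` apart.
    static int linesTouched(int accesses, int elementBytes, int strideElements) {
        Set<Long> lines = new HashSet<>();
        for (int i = 0; i < accesses; i++) {
            long address = (long) i * strideElements * elementBytes;
            lines.add(address / LINE_BYTES);
        }
        return lines.size();
    }

    public static void main(String[] args) {
        System.out.println("sequential ints: " + linesTouched(1024, 4, 1));
        System.out.println("stride of 16:    " + linesTouched(1024, 4, 16));
    }
}
```

Reading 1024 consecutive 4-byte ints touches 64 lines, while the same number of reads at a stride of 16 elements touches 1024 lines, one per read.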
In Real Life
• Let’s rotate an image!
for (y = 0; y < height; y++)
    for (x = 0; x < width; x++) {
        int from = y * width + x;
        int to = x * height + y;
        target[to] = source[from];
    }
Image: EgoAltere (CC0 Public Domain)
In Real Life
• This is not efficient
• Reads are sequential
• Writes aren’t, though
• Different strides
– Worst case wins :-(
[Diagram: a 10-column grid. Reads walk along row 0 at indices 0, 1, 2, 3, … 9; the matching writes go down a column at indices 0, 10, 20, 30, … 90.]
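The write stride can be reproduced by evaluating the `to` formula from the loop for one source row. A small sketch, with a hypothetical class and method name:

```java
import java.util.Arrays;

final class TransposeStrides {
    // Target indices written while the inner loop walks source row `y`.
    static int[] writeIndicesForRow(int y, int width, int height) {
        int[] to = new int[width];
        for (int x = 0; x < width; x++)
            to[x] = x * height + y; // same formula as in the loop on the slide
        return to;
    }

    public static void main(String[] args) {
        // 10x10 image, first row: consecutive reads map to writes `height` apart.
        System.out.println(Arrays.toString(writeIndicesForRow(0, 10, 10)));
    }
}
```

For a 10x10 image the first row writes to 0, 10, 20, … 90. With realistic image widths every write lands on a different cache line, so the write side sets the pace.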
Cache-Friendly Algorithms
• Use blocking or tiling
// Assumes width and height are exact multiples of the block size
for (y = 0; y < height; y += blockHeight)
    for (x = 0; x < width; x += blockWidth)
        for (by = 0; by < blockHeight; by++)
            for (bx = 0; bx < blockWidth; bx++) {
                int from = (y + by) * width + (x + bx);
                int to = (x + bx) * height + (y + by);
                target[to] = source[from];
            }
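As written, the tiled loop only works when the image dimensions are exact multiples of the block size. Below is a self-contained sketch that clamps each tile to the image edge and can be checked against the naive version. The class name, the `int[]` pixel type and the single square `block` parameter are assumptions, not taken from the talk's benchmark code.

```java
import java.util.Arrays;

final class TiledTranspose {
    // Row-by-row transpose, equivalent to the naive loop shown earlier.
    static int[] transposeNaive(int[] source, int width, int height) {
        int[] target = new int[source.length];
        for (int y = 0; y < height; y++)
            for (int x = 0; x < width; x++)
                target[x * height + y] = source[y * width + x];
        return target;
    }

    // Same result, but processed in block x block tiles so that the reads and
    // writes of one tile stay within a small set of cache lines.
    static int[] transposeTiled(int[] source, int width, int height, int block) {
        int[] target = new int[source.length];
        for (int y = 0; y < height; y += block)
            for (int x = 0; x < width; x += block) {
                int yEnd = Math.min(y + block, height); // clamp partial tiles
                int xEnd = Math.min(x + block, width);
                for (int ty = y; ty < yEnd; ty++)
                    for (int tx = x; tx < xEnd; tx++)
                        target[tx * height + ty] = source[ty * width + tx];
            }
        return target;
    }

    public static void main(String[] args) {
        int width = 37, height = 23;
        int[] source = new int[width * height];
        for (int i = 0; i < source.length; i++) source[i] = i;
        boolean same = Arrays.equals(
                transposeNaive(source, width, height),
                transposeTiled(source, width, height, 16));
        System.out.println("tiled matches naive: " + same);
    }
}
```

The best block size depends on the cache and the element size, so it is worth measuring rather than guessing.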
Cache-Friendly Algorithms
• The results?
Benchmark                            Mode  Cnt   Score   Error  Units
CachingShowcase.transposeNaive       avgt   10  43.851 ± 6.000  ms/op
CachingShowcase.transposeTiled8x8    avgt   10  20.641 ± 1.646  ms/op
CachingShowcase.transposeTiled16x16  avgt   10  18.515 ± 1.833  ms/op
CachingShowcase.transposeTiled48x48  avgt   10  21.941 ± 1.954  ms/op
x2.37 speedup!
2. Those Pesky Branches
• Do I go left or right?
• Need input!
• … but can’t wait for it
• Maybe...
– Take a guess?
– Based on historic trends?
• Sounds speculative
Image: Michael Dolan (CC BY 2.0)
Those Pesky Branches
• Enter: Branch Prediction
• Concurrently:
– Speculate branch
– Evaluate condition
• It’s now a tradeoff
– Commit is fast
– Rollback is slow
Image: Alejandro C. (CC BY-NC 2.0)
Back to Our Conundrum
• Can you guess?
– 3…
– 2...
– 1...
• Here it is!
// Generate a bunch of bytes
byte[] data = new byte[32768];
new Random().nextBytes(data);
Arrays.sort(data);
// Sum non-negative elements
long sum = 0;
for (int i = 0; i < data.length; i++)
    if (data[i] >= 0)
        sum += data[i];
Catharsis
Original data array:
54 10 -4 -2 15 41 -37 13 0 -9 14 25 -61 40
After sorting:
-61 -37 -9 -4 -2 0 10 13 14 15 25 40 41 54
• Elements below 0: data[i] >= 0 is always false!
• Elements from 0 up: data[i] >= 0 is always true!
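To put a number on "always false, then always true", the sketch below counts mispredictions of the `data[i] >= 0` branch using a single 2-bit saturating counter. This is a textbook toy model chosen for illustration, not a description of how any real CPU predicts branches; the class name, starting state and seed are assumptions.

```java
import java.util.Arrays;
import java.util.Random;

final class TwoBitPredictor {
    // States 0 and 1 predict "not taken", states 2 and 3 predict "taken".
    // Each outcome nudges the counter one step towards what actually happened.
    static int mispredictions(byte[] data) {
        int state = 1; // start weakly "not taken" (arbitrary choice)
        int misses = 0;
        for (byte b : data) {
            boolean taken = b >= 0;
            boolean predicted = state >= 2;
            if (predicted != taken) misses++;
            state = taken ? Math.min(3, state + 1) : Math.max(0, state - 1);
        }
        return misses;
    }

    public static void main(String[] args) {
        byte[] unsorted = new byte[32768];
        new Random(42).nextBytes(unsorted); // fixed seed (assumption) for repeatable data
        byte[] sorted = unsorted.clone();
        Arrays.sort(sorted);
        System.out.println("unsorted mispredictions: " + mispredictions(unsorted));
        System.out.println("sorted mispredictions:   " + mispredictions(sorted));
    }
}
```

In this model the sorted array mispredicts only around the single switch from negative to non-negative values, while random data mispredicts on roughly half of the elements, and each misprediction costs a rollback.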
QUESTIONS?
Thank you for listening
tomer@tomergabel.com
@tomerg
http://engineering.wix.com
Sources and Examples:
https://goo.gl/f7NfGT
This work is licensed under a Creative
Commons Attribution-ShareAlike 4.0
International License.
Further Reading
• Jason Robert Carey Patterson - Modern Microprocessors, a 90-Minute Guide
• Igor Ostrovsky - Gallery of Processor Cache Effects
• Piyush Kumar - Cache Oblivious Algorithms
