1. “Bobcat”
AMD’s New Low Power x86 Core Architecture
Brad Burgess, AMD Fellow
Chief Architect / Bobcat Core
August 24, 2010
1 | Bobcat | Hot Chips 2010
2. Two x86 Cores Tuned for Target Markets
“Bulldozer”
Performance &
Scalability Mainstream Client and Server Markets
Low Power Small Cloud Clients
“Bobcat” Markets Die Area Optimized
Flexible, Low
Power & Small
2 | Bobcat | Hot Chips 2010
3. Bobcat Design Goals
A small, efficient, low power
x86 core
Excellent performance
Synthesizable with small
number of custom arrays
Easily Portable across process
technologies
3 | Bobcat | Hot Chips 2010
4. Feature Set
64-bit AMD64 x86 ISA
SIMD extensions: SSE1, SSE2,
SSE3, SSSE3, SSE4A
Virtualization
Support for misaligned 128-bit
data types
Instruction Based Sampling
(for dynamic optimization)
C6 (with integrated power gating)
4 | Bobcat | Hot Chips 2010
5. Micro-architecture Overview
Dual x86 instruction decode
Out-of-Order instruction execution
Dual COP retirement
Complex microOPs
State of the art branch prediction
Aggressive OOO load/store engine w/ hazard
prediction
Advanced Virtualization w/ nested page tables,
ASIDs and world switch acceleration
Low power C6 state w/ core level power gating and
state save acceleration
5 | Bobcat | Hot Chips 2010
6. Bobcat ITLB 32KB Branch Predictor
Micro-Architecture ICACHE Branch Locator ConditionPredict
or
Return Stack Dynamic Target
Fetch Queue
uCode Dual x86 Decoder
Instr Queue FP Decode
ROB
Int Rename FP Rename
FP Sched
Scheduler Scheduler
FP PRF
Int PRF
ALU ALU LAGU SAGU MMX Alu MMX Alu
Table Walker Mul IntMul St Conv
32KB LdSt FP Logical FP Logical
DTLB
DCACHE Unit
FPAdd FPMul
Prefetch
512KB
BU
L2CACHE To/from Northbridge
6 | Bobcat | Hot Chips 2010
7. Bobcat ITLB 32KB Branch Predictor
Micro-Architecture
ICACHE Branch Locator ConditionPredict
or
Return Stack Dynamic Target
Fetch Queue
Icache
uCode Dual x86 Decoder
32Kbyte
2-way set associative Instr Queue FP Decode
ROB
64-byte line
Int Rename FP Rename
Parity Protected
FP Sched
512/8 entry ITLB Scheduler Scheduler
(4k/2m) FP PRF
Int PRF
Fetch up to MMX Alu MMX Alu
ALU ALU LAGU SAGU
32-bytes/cycle
Table Walker Mul IntMul St Conv
32KB LdSt FP Logical FP Logical
DTLB
DCACHE Unit
FPAdd FPMul
Prefetch
512KB
BU
L2CACHE To/from Northbridge
7 | Bobcat | Hot Chips 2010
8. Bobcat ITLB 32KB Branch Predictor
Micro-Architecture ICACHE Branch Locator ConditionPredict
or
Return Stack Dynamic Target
Fetch Queue
Branch Predictor:
uCode Dual x86 Decoder
Predicts up to two
branches per cycle
Instr Queue FP Decode
Remembers branch ROB
instruction locations Int Rename FP Rename
Return Stack Address FP Sched
Predictor Scheduler Scheduler
FP PRF
Indirect Dynamic Int PRF
Address Predictor MMX Alu MMX Alu
ALU ALU LAGU SAGU
State of the Art Mul
Table Walker IntMul St Conv
condition Predictor
Only necessary DTLB
32KB LdSt FP Logical FP Logical
DCACHE Unit
structures are clocked FPAdd FPMul
Prefetch
512KB
BU
L2CACHE To/from Northbridge
8 | Bobcat | Hot Chips 2010
9. Bobcat ITLB 32KB Branch Predictor
Micro-Architecture ICACHE Branch Locator ConditionPredict
or
Return Stack Dynamic Target
Fetch Queue
Dual x86 Decoder:
uCode Dual x86 Decoder
Scans up to 22 bytes
Decodes up to two x86 Instr Queue FP Decode
instructions per cycle ROB
Int Rename FP Rename
The decoder can directly
map 89% of x86 FP Sched
instructions to a single Scheduler Scheduler
microOp, an additional Int PRF
FP PRF
10% to a pair of
MMX Alu MMX Alu
microOps, and more ALU ALU LAGU SAGU
complicated x86 Mul IntMul St Conv
Table Walker
instructions (<1%) are
microcoded. (Dynamic DTLB
32KB LdSt FP Logical FP Logical
Instruction Counts) DCACHE Unit
FPAdd FPMul
Prefetch
512KB
BU
L2CACHE To/from Northbridge
9 | Bobcat | Hot Chips 2010
10. Bobcat ITLB 32KB Branch Predictor
Micro-Architecture ICACHE Branch Locator ConditionPredict
or
Return Stack Dynamic Target
Fetch Queue
Integer Execution:
uCode Dual x86 Decoder
A dual port integer
scheduler feeds two ALUs
Instr Queue FP Decode
A dual port address ROB
scheduler feeds a load Int Rename FP Rename
address unit, and a store
address unit. Scheduler Scheduler
FP Sched
Physical Register File uses Int PRF
FP PRF
maps and pointers to
MMX Alu MMX Alu
reduce power by ALU ALU LAGU SAGU
minimizing data Mul IntMul St Conv
Table Walker
copying/movement.
32KB LdSt FP Logical FP Logical
DTLB
DCACHE Unit
FPAdd FPMul
Prefetch
512KB
BU
L2CACHE To/from Northbridge
10 | Bobcat | Hot Chips 2010
11. Bobcat ITLB 32KB Branch Predictor
Micro-Architecture ICACHE Branch Locator ConditionPredict
or
Return Stack Dynamic Target
Fetch Queue
Floating Point Unit:
uCode Dual x86 Decoder
A centralized FP scheduler
feeds two 64-bit FP
Instr Queue
execution stacks FP Decode
ROB
MMX and Logical units are Int Rename FP Rename
replicated in both stacks
FP Sched
The FP Mul Unit can Scheduler Scheduler
perform two SP multiplies Int PRF
FP PRF
per cycle
ALU ALU LAGU SAGU MMX Alu MMX Alu
The FP Add Unit can
perform two SP additions Table Walker Mul IntMul St Conv
per cycle
32KB LdSt FP Logical FP Logical
DTLB
A physical register file is DCACHE Unit
used to reduce power Prefetch
FPAdd FPMul
512KB
BU
L2CACHE To/from Northbridge
11 | Bobcat | Hot Chips 2010
12. Bobcat ITLB 32KB Branch Predictor
Micro-Architecture ICACHE Branch Locator ConditionPredict
or
Return Stack Dynamic Target
Fetch Queue
Data Cache:
uCode Dual x86 Decoder
32-Kbyte
8-way set associative Instr Queue FP Decode
ROB
64-byte line
Int Rename FP Rename
Parity Protected
FP Sched
Copyback Scheduler Scheduler
40/8 entry L1DTLB Int PRF
FP PRF
(4k/2m) MMX Alu MMX Alu
ALU ALU LAGU SAGU
512/64 entry L2DTLB
Mul IntMul St Conv
(4k/2m) Table Walker
Advanced 8-stream DTLB
32KB LdSt FP Logical FP Logical
prefetcher DCACHE Unit
FPAdd FPMul
Prefetch
512KB
BU
L2CACHE To/from Northbridge
12 | Bobcat | Hot Chips 2010
13. Bobcat ITLB 32KB Branch Predictor
Micro-Architecture ICACHE Branch Locator ConditionPredict
or
Return Stack Dynamic Target
Fetch Queue
Out-of-Order Load
Store Unit: uCode Dual x86 Decoder
Loads bypassing loads Instr Queue FP Decode
Loads bypassing stores ROB
Int Rename FP Rename
Stores bypassing loads
Bypass tracking and Scheduler Scheduler
FP Sched
dependency correction FP PRF
Int PRF
Hazard predictor
ALU ALU LAGU SAGU MMX Alu MMX Alu
Fast store forwarding
Table Walker Mul IntMul St Conv
Fast critical word fill
forwarding 32KB LdSt FP Logical FP Logical
DTLB
DCACHE Unit
FPAdd FPMul
Prefetch
512KB
BU
L2CACHE To/from Northbridge
13 | Bobcat | Hot Chips 2010
14. Bobcat ITLB 32KB Branch Predictor
Micro-Architecture ICACHE Branch Locator ConditionPredict
or
Return Stack Dynamic Target
Fetch Queue
L2 Cache:
uCode Dual x86 Decoder
512Kbyte
16-way set associative Instr Queue FP Decode
ROB
64 byte lines
Int Rename FP Rename
ECC Protected
FP Sched
Half speed clocking for Scheduler Scheduler
power reduction FP PRF
Int PRF
ALU ALU LAGU SAGU MMX Alu MMX Alu
Table Walker Mul IntMul St Conv
32KB LdSt FP Logical FP Logical
DTLB
DCACHE Unit
FPAdd FPMul
Prefetch
512KB
BU
L2CACHE To/from Northbridge
14 | Bobcat | Hot Chips 2010
15. Bobcat ITLB 32KB Branch Predictor
Micro-Architecture ICACHE Branch Locator ConditionPredict
or
Return Stack Dynamic Target
Fetch Queue
Bus Unit:
uCode Dual x86 Decoder
8-outstanding data
accesses
Instr Queue FP Decode
2-outstanding fetch ROB
accesses Int Rename FP Rename
Eviction Buffers FP Sched
Scheduler Scheduler
Fill Buffers
FP PRF
Int PRF
Write combining buffers
ALU ALU LAGU SAGU MMX Alu MMX Alu
Coherency management
Table Walker Mul IntMul St Conv
32KB LdSt FP Logical FP Logical
DTLB
DCACHE Unit
FPAdd FPMul
Prefetch
512KB
BU
L2CACHE To/from Northbridge
15 | Bobcat | Hot Chips 2010
17. Core Floor Plan
Floating Point Unit
Test/Debug Data L2 TLB
X86 Decode Bus Unit
Instruction
Cache L2 Sub Array
Inst
TLB/Tag
L2 TAG
Branch
Predict
Ucode
ROM
ROB Data Cache
Integer Unit Data Tag/TLB
Load Store Unit
17 | Bobcat | Hot Chips 2010
18. Power Reduction
Use of physical Register files
Extensive use of non-shifting queues with
pointers
Fine grain clock gating
Integrated Core Power Gating
Only needed arrays are clocked
– i.e. Dtag hit before Dcache read
– Predicting the type of branch then clocking the
appropriate predictor(s)
Elimination of instruction marker bits in the
Icache
Finding the knee of the curve (scrutinize
performance gains against power costs)
Polishing speed paths to raise the Vt mix
and reduce leakage
18 | Bobcat | Hot Chips 2010
19. Bobcat Core Overview
Advanced Micro-architecture
Dual x86 Decode ICACHE
Advanced Branch Predictor Bobcat L2
Full OOO instruction execution Low Fetch
Full OOO load/store engine Power
High Performance Floating Point Core
AMD64 64-bit ISA Decode BU
SSE1,2,3, SSSE3 ISA
Secure Virtualization
32kb L1s, 512kb L2
Low Power Design Integer Address FP
Power Optimized Execution Scheduler Scheduler Scheduler
Micro-architecture that minimizes data movement
and unnecessary reads
I I Load Store A M
Clock gating, Power gating Pipe Pipe Pipe Pipe Pipe Pipe
System Low Power States
Small Core
DCACHE
Area efficient balance of high performance and low
power
19 | Bobcat | Hot Chips 2010
20. Summary
Estimated 90% of the performance of today’s
mainstream notebook CPU in half the area*
Sub-one watt capable
Highly portable across designs and
manufacturing technologies
20 | Bobcat | Hot Chips 2010 *Based on internal AMD modeling using benchmark simulations