“Bobcat”
AMD’s New Low Power x86 Core Architecture

Brad Burgess, AMD Fellow
Chief Architect / Bobcat Core

August 24, 2010

1 | Bobcat | Hot Chips 2010
Two x86 Cores Tuned for Target Markets

• “Bulldozer”: Performance & Scalability
  – Mainstream Client and Server Markets
• “Bobcat”: Flexible, Low Power & Small
  – Low Power Markets
  – Small Die Area
  – Cloud Clients Optimized

Bobcat Design Goals

• A small, efficient, low power x86 core
• Excellent performance
• Synthesizable with small number of custom arrays
• Easily portable across process technologies

Feature Set

• 64-bit AMD64 x86 ISA
• SIMD extensions: SSE1, SSE2, SSE3, SSSE3, SSE4A
• Virtualization
• Support for misaligned 128-bit data types
• Instruction Based Sampling (for dynamic optimization)
• C6 (with integrated power gating)

Micro-architecture Overview

• Dual x86 instruction decode
• Out-of-Order instruction execution
• Dual COP retirement
• Complex microOPs
• State of the art branch prediction
• Aggressive OOO load/store engine w/ hazard prediction
• Advanced Virtualization w/ nested page tables, ASIDs and world switch acceleration
• Low power C6 state w/ core level power gating and state save acceleration

Bobcat Micro-Architecture

[Block diagram] Blocks shown:
• Front end: ITLB; 32KB ICACHE; Branch Predictor (Branch Locator, Condition Predictor, Return Stack, Dynamic Target); Fetch Queue; uCode; Dual x86 Decoder; Instr Queue
• Integer: ROB; Int Rename; two Schedulers; Int PRF; ALU, ALU, Mul, LAGU, SAGU
• Floating point: FP Decode; FP Rename; FP Sched; FP PRF; two stacks of MMX Alu and FP Logical units, plus IntMul, St Conv, FPAdd and FPMul
• Memory: LdSt Unit; DTLB; Table Walker; 32KB DCACHE; Prefetch; 512KB L2CACHE; BU (to/from Northbridge)

Bobcat Micro-Architecture

Icache
• 32Kbyte
• 2-way set associative
• 64-byte line
• Parity Protected
• 512/8 entry ITLB (4k/2m)
• Fetch up to 32-bytes/cycle

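
The Icache parameters above (32 Kbyte, 2-way, 64-byte line) determine how an address splits into offset and index bits. The following is a minimal sketch of that arithmetic for a generic set-associative cache; the function name `cache_geometry` is an illustrative assumption, not something from the slides.

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Return (sets, offset_bits, index_bits) for a set-associative cache.

    Assumes size, ways and line size are powers of two.
    """
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1  # log2 of the line size
    index_bits = sets.bit_length() - 1         # log2 of the number of sets
    return sets, offset_bits, index_bits


# Icache figures from the slide: 32 Kbyte, 2-way, 64-byte line.
icache = cache_geometry(32 * 1024, ways=2, line_bytes=64)
print(icache)  # (256, 6, 8)
```

Under these figures the Icache has 256 sets, so 6 address bits select the byte within a line and 8 bits select the set.
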
Bobcat Micro-Architecture

Branch Predictor:
• Predicts up to two branches per cycle
• Remembers branch instruction locations
• Return Stack Address Predictor
• Indirect Dynamic Address Predictor
• State of the Art condition Predictor
• Only necessary structures are clocked

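
The slide names a return stack address predictor without giving its organization. As a rough illustration of the general technique only (not Bobcat's actual design), a return stack pushes the return address on each call and pops it to predict the target of the next return. The class name `ReturnStack`, its depth of 8, and the drop-oldest overflow policy are all assumptions made for this sketch.

```python
class ReturnStack:
    """Generic return-address stack sketch; depth and policy are hypothetical."""

    def __init__(self, depth=8):
        self.depth = depth
        self.entries = []

    def on_call(self, return_address):
        # On overflow, discard the oldest entry (an assumed policy).
        if len(self.entries) == self.depth:
            self.entries.pop(0)
        self.entries.append(return_address)

    def predict_return(self):
        # Predict the most recently pushed address; None if the stack is empty.
        return self.entries.pop() if self.entries else None


ras = ReturnStack()
ras.on_call(0x1004)  # outer call
ras.on_call(0x2008)  # nested call
first = ras.predict_return()   # predicts the nested call's return address
second = ras.predict_return()  # then the outer call's return address
```

Nested calls return in last-in, first-out order, which is why a small stack predicts return targets well.
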
Bobcat Micro-Architecture

Dual x86 Decoder:
• Scans up to 22 bytes
• Decodes up to two x86 instructions per cycle
• The decoder can directly map 89% of x86 instructions to a single microOp, an additional 10% to a pair of microOps, and more complicated x86 instructions (<1%) are microcoded. (Dynamic Instruction Counts)

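
The decode mix above implies an average number of microOps per x86 instruction. A small sketch of that weighted average follows; the slide does not say how many microOps a microcoded instruction expands to, so `microcoded_len` is a hypothetical parameter, and the microcoded share is taken at its stated upper bound of 1%.

```python
def avg_microops(microcoded_len, single=0.89, pair=0.10, microcoded=0.01):
    """Weighted average microOps per x86 instruction.

    single and pair come from the slide; microcoded is the slide's upper
    bound (under 1%); microcoded_len is an assumed expansion length.
    """
    return single * 1 + pair * 2 + microcoded * microcoded_len


# With an assumed 10 microOps per microcoded instruction: about 1.19.
estimate = avg_microops(microcoded_len=10)
print(round(estimate, 2))
```

Even with a long assumed microcode sequence, the average stays close to one microOp per instruction because the single-microOp case dominates the dynamic mix.
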
Bobcat Micro-Architecture

Integer Execution:
• A dual port integer scheduler feeds two ALUs
• A dual port address scheduler feeds a load address unit, and a store address unit.
• Physical Register File uses maps and pointers to reduce power by minimizing data copying/movement.

Bobcat Micro-Architecture

Floating Point Unit:
• A centralized FP scheduler feeds two 64-bit FP execution stacks
• MMX and Logical units are replicated in both stacks
• The FP Mul Unit can perform two SP multiplies per cycle
• The FP Add Unit can perform two SP additions per cycle
• A physical register file is used to reduce power

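
The stated throughput (two SP multiplies and two SP additions per cycle) gives a simple peak-rate estimate. The sketch below works through it; the clock frequency is a hypothetical input because the slides do not give one, and the packed-operation cycle count is an inference from the stated per-cycle rate rather than a figure from the deck.

```python
SP_MULS_PER_CYCLE = 2  # from the slide
SP_ADDS_PER_CYCLE = 2  # from the slide


def peak_sp_flops_per_cycle():
    """Peak single-precision FLOPs per cycle when both units are busy."""
    return SP_MULS_PER_CYCLE + SP_ADDS_PER_CYCLE


def peak_sp_gflops(clock_ghz):
    """Peak SP GFLOPS at an assumed clock frequency in GHz."""
    return peak_sp_flops_per_cycle() * clock_ghz


def cycles_for_packed_sp(lanes=4, lanes_per_cycle=2):
    """Cycles one unit is occupied by a packed SP operation (ceiling division)."""
    return -(-lanes // lanes_per_cycle)


print(peak_sp_flops_per_cycle())  # 4
print(peak_sp_gflops(1.0))        # 4.0 at a hypothetical 1.0 GHz
print(cycles_for_packed_sp())     # 2
```

At two SP results per cycle per unit, a four-lane 128-bit packed operation would occupy its unit for two cycles under this model.
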
Bobcat Micro-Architecture

Data Cache:
• 32-Kbyte
• 8-way set associative
• 64-byte line
• Parity Protected
• Copyback
• 40/8 entry L1DTLB (4k/2m)
• 512/64 entry L2DTLB (4k/2m)
• Advanced 8-stream prefetcher

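
The TLB entry counts above translate into "reach", the amount of memory that can be mapped without a table walk. This sketch assumes the notation "40/8 (4k/2m)" means 40 entries for 4 KB pages and 8 entries for 2 MB pages, and likewise for the L2 DTLB; the helper name `tlb_reach` is illustrative.

```python
KIB = 1024
MIB = 1024 * KIB


def tlb_reach(entries, page_bytes):
    """Bytes of address space covered when every entry maps a distinct page."""
    return entries * page_bytes


l1_4k = tlb_reach(40, 4 * KIB)   # 160 KiB
l1_2m = tlb_reach(8, 2 * MIB)    # 16 MiB
l2_4k = tlb_reach(512, 4 * KIB)  # 2 MiB
l2_2m = tlb_reach(64, 2 * MIB)   # 128 MiB

for name, reach in [("L1 4k", l1_4k), ("L1 2m", l1_2m),
                    ("L2 4k", l2_4k), ("L2 2m", l2_2m)]:
    print(name, reach // KIB, "KiB")
```

With 4 KB pages the L1 DTLB reach (160 KiB) is smaller than the 512 KB L2 cache, while the L2 DTLB reach (2 MiB) exceeds it.
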
Bobcat Micro-Architecture

Out-of-Order Load Store Unit:
• Loads bypassing loads
• Loads bypassing stores
• Stores bypassing loads
• Bypass tracking and dependency correction
• Hazard predictor
• Fast store forwarding
• Fast critical word fill forwarding

Bobcat Micro-Architecture

L2 Cache:
• 512Kbyte
• 16-way set associative
• 64 byte lines
• ECC Protected
• Half speed clocking for power reduction

Bobcat Micro-Architecture

Bus Unit:
• 8-outstanding data accesses
• 2-outstanding fetch accesses
• Eviction Buffers
• Fill Buffers
• Write combining buffers
• Coherency management

Bobcat Pipeline

[Pipeline diagram; cycles numbered 0 to 12] Stages shown:
• Fetch: Fetch0, Fetch1, Fetch2, Fetch3, Fetch4, Fetch5
• Decode and integer: Dec0, Dec1, Dec2, Pack, FDec, Dispatch, Schedule, RegRead, ALU, Writeback
• Microcode: uCode ROM, MDec
• Floating point: Transit, FpDec, RegRen, Schedule, RegRead, EXE (three EXE boxes shown), Writeback
• Load: AGU, DC1, DC2
• L2: Transit, L2Tag, L2Data

Latencies:
• Branch Mispredict Latency: 13 cycles
• Load Use Latency, L1 hit: 3 cycles
• Load Use Latency, L2 hit: 17 cycles

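
The latencies on the pipeline slide (13-cycle branch mispredict, 3-cycle L1 load-use, 17-cycle L2 load-use) can be plugged into simple first-order performance estimates. The sketch below does so; the hit rate, branch fraction and mispredict rate are hypothetical inputs chosen only for illustration, and the load model assumes every L1 miss hits in the L2.

```python
L1_LOAD_USE = 3          # cycles, from the slide
L2_LOAD_USE = 17         # cycles, from the slide
MISPREDICT_PENALTY = 13  # cycles, from the slide


def avg_load_use(l1_hit_rate):
    """Average load-use latency, assuming all L1 misses hit in the L2."""
    return l1_hit_rate * L1_LOAD_USE + (1 - l1_hit_rate) * L2_LOAD_USE


def mispredict_cpi(branch_fraction, mispredict_rate):
    """Extra cycles per instruction lost to branch mispredictions."""
    return branch_fraction * mispredict_rate * MISPREDICT_PENALTY


# Hypothetical workload: 95% L1 hit rate, 20% branches, 5% mispredicted.
print(round(avg_load_use(0.95), 2))          # about 3.7 cycles
print(round(mispredict_cpi(0.20, 0.05), 2))  # about 0.13 cycles per instruction
```

These are back-of-the-envelope figures; an out-of-order core overlaps much of this latency with other work.
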
Core Floor Plan

[Floor plan figure] Labeled blocks: Floating Point Unit; Test/Debug; Data L2 TLB; X86 Decode; Bus Unit; Instruction Cache; Inst TLB/Tag; L2 Sub Array; L2 TAG; Branch Predict; Ucode ROM; ROB; Integer Unit; Data Cache; Data Tag/TLB; Load Store Unit

Power Reduction

• Use of physical Register files
• Extensive use of non-shifting queues with pointers
• Fine grain clock gating
• Integrated Core Power Gating
• Only needed arrays are clocked
  – i.e. Dtag hit before Dcache read
  – Predicting the type of branch then clocking the appropriate predictor(s)
• Elimination of instruction marker bits in the Icache
• Finding the knee of the curve (scrutinize performance gains against power costs)
• Polishing speed paths to raise the Vt mix and reduce leakage

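
The "non-shifting queues with pointers" item can be illustrated with a software analogy: a queue that shifts entries on every dequeue rewrites its contents repeatedly, while a pointer-based (circular) queue writes each entry once and only moves head and tail pointers. The sketch below counts entry writes for both; the class names and the write counter are illustrative assumptions, and the write count is only a stand-in for switching activity, not a power model from the slides.

```python
class ShiftingQueue:
    """Entries move one slot toward the head on every pop."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.count = 0
        self.entry_writes = 0

    def push(self, item):
        if self.count == len(self.slots):
            raise IndexError("queue full")
        self.slots[self.count] = item
        self.count += 1
        self.entry_writes += 1

    def pop(self):
        if self.count == 0:
            raise IndexError("queue empty")
        item = self.slots[0]
        for i in range(1, self.count):  # shift every remaining entry
            self.slots[i - 1] = self.slots[i]
            self.entry_writes += 1
        self.count -= 1
        return item


class PointerQueue:
    """Entries stay in place; only head and tail pointers advance."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.head = self.tail = self.count = 0
        self.entry_writes = 0

    def push(self, item):
        if self.count == len(self.slots):
            raise IndexError("queue full")
        self.slots[self.tail] = item
        self.tail = (self.tail + 1) % len(self.slots)
        self.count += 1
        self.entry_writes += 1

    def pop(self):
        if self.count == 0:
            raise IndexError("queue empty")
        item = self.slots[self.head]
        self.head = (self.head + 1) % len(self.slots)
        self.count -= 1
        return item


def fill_and_drain(queue, n=8):
    for i in range(n):
        queue.push(i)
    return [queue.pop() for _ in range(n)]


shifting, pointer = ShiftingQueue(8), PointerQueue(8)
same_order = fill_and_drain(shifting) == fill_and_drain(pointer)
print(same_order, shifting.entry_writes, pointer.entry_writes)  # True 36 8
```

Both queues deliver the same order, but filling and draining eight entries costs 36 entry writes with shifting and 8 with pointers.
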
Bobcat Core Overview

Advanced Micro-architecture
• Dual x86 Decode
• Advanced Branch Predictor
• Full OOO instruction execution
• Full OOO load/store engine
• High Performance Floating Point
• AMD64 64-bit ISA
• SSE1,2,3, SSSE3 ISA
• Secure Virtualization
• 32KB L1s, 512KB L2

Low Power Design
• Power Optimized Execution
• Micro-architecture that minimizes data movement and unnecessary reads
• Clock gating, Power gating
• System Low Power States

Small Core
• Area efficient balance of high performance and low power

[Diagram: Bobcat Low Power Core] Blocks shown: ICACHE; Fetch; Decode; L2; BU; Integer Scheduler; Address Scheduler; FP Scheduler; I Pipe, I Pipe, Load Pipe, Store Pipe, A Pipe, M Pipe; DCACHE

Summary

• Estimated 90% of the performance of today’s mainstream notebook CPU in half the area*
• Sub-one watt capable
• Highly portable across designs and manufacturing technologies

*Based on internal AMD modeling using benchmark simulations
