Microprocessor Evolution:

              4004 to Pentium-4



                          Joel Emer
      Computer Science and Artificial Intelligence Laboratory
            Massachusetts Institute of Technology

               Based on the material prepared by

                   Krste Asanovic and Arvind

November 2, 2005
6.823 L15- 2


         First Microprocessor
                                                                  Emer



         Intel 4004, 1971


         Image removed due to copyright restrictions.
         To view image, visit
         http://news.com.com/Images+Moores+Law+turns+40/2009-1041_3-5649019-5.html

         • 4-bit accumulator architecture
         • 8µm pMOS
         • 2,300 transistors
         • 3 x 4 mm²
         • 750kHz clock
         • 8-16 cycles/inst.
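
As a rough worked example, the clock rate and cycles-per-instruction figures above imply the instruction throughput computed below. This is a back-of-the-envelope sketch that uses only the two numbers on the slide; the function name is chosen here for illustration.

```python
CLOCK_HZ = 750_000          # clock rate quoted on the slide
CYCLES_PER_INST = (8, 16)   # best and worst case quoted on the slide


def instructions_per_second(clock_hz, cycles_per_inst):
    """Throughput of a non-pipelined machine: clock rate / cycles per instruction."""
    return clock_hz / cycles_per_inst


fastest = instructions_per_second(CLOCK_HZ, CYCLES_PER_INST[0])
slowest = instructions_per_second(CLOCK_HZ, CYCLES_PER_INST[1])
print(f"{slowest:,.0f} to {fastest:,.0f} instructions per second")
```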




November 2, 2005
6.823 L15- 3
                                                                              Emer


   Microprocessors in the Seventies

             Initial target was embedded control
             • First micro, 4-bit 4004 from Intel, designed for a desktop
               printing calculator


             Constrained by what could fit on single chip
             •	 Single accumulator architectures

             8-bit micros used in hobbyist personal computers
             •	 Micral, Altair, TRS-80, Apple-II

             Little impact on conventional computer market until VISICALC
                spreadsheet for Apple-II (6502, 1MHz)
             • First “killer” business application for personal computers



November 2, 2005
6.823 L15- 4
                                                                Emer


   DRAM in the Seventies

              Dramatic progress in MOSFET memory
                technology

              1970, Intel introduces first DRAM (1Kbit
                1103)

              1979, Fujitsu introduces 64Kbit DRAM


              => By mid-Seventies, obvious that PCs
               would soon have > 64KBytes physical
               memory


November 2, 2005
6.823 L15- 5

                                                                                        Emer

    Microprocessor Evolution
     Rapid progress in size and speed through 70s
          – Fueled by advances in MOSFET technology and expanding markets


     Intel i432
          –   Most ambitious seventies’ micro; started in 1975 - released 1981
          –   32-bit capability-based object-oriented architecture
          –   Instructions variable number of bits long
          –   Severe performance, complexity, and usability problems


     Motorola 68000 (1979, 8MHz, 68,000 transistors)
          – Heavily microcoded (and nanocoded)
          – 32-bit general purpose register architecture (24 address pins)
          – 8 address registers, 8 data registers


     Intel 8086 (1978, 8MHz, 29,000 transistors)
          – “Stopgap” 16-bit processor, architected in 10 weeks
          – Extended accumulator architecture, assembly-compatible with 8080
          – 20-bit addressing through segmented addressing scheme
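
The 20-bit segmented addressing scheme can be sketched as follows. The 8086 forms a physical address by shifting a 16-bit segment value left by four bits and adding a 16-bit offset; the function name and the sample values are illustrative only.

```python
def physical_address(segment: int, offset: int) -> int:
    """Form a 20-bit physical address from a 16-bit segment and a 16-bit offset."""
    if not (0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("segment and offset must be 16-bit values")
    # segment * 16 + offset, truncated to the 20 address bits of the 8086
    return ((segment << 4) + offset) & 0xFFFFF


print(hex(physical_address(0x1234, 0x0010)))  # 0x12350
print(hex(physical_address(0x1235, 0x0000)))  # 0x12350, a different pair naming the same byte
print(hex(physical_address(0xFFFF, 0x0010)))  # 0x0, wraps past the 1 MB limit
```

Many segment:offset pairs alias the same physical byte, which is one reason the scheme was awkward for software.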
November 2, 2005
6.823 L15- 6

      Intel 8086                                                           Emer




       Class        Register       Purpose
       Data:        AX,BX          “general” purpose
                    CX             string and loop ops only
                    DX             mult/div and I/O only

       Address:     SP             stack pointer
                    BP             base pointer (can also use BX)
                    SI,DI          index registers

       Segment:     CS             code segment
                    SS             stack segment
                    DS             data segment
                    ES             extra segment

       Control:     IP             instruction pointer (low 16 bits of PC)
                    FLAGS          C, Z, N, B, P, V and 3 control bits

       • Typical format R <= R op M[X], many addressing modes
       • Not a GPR organization!

November 2, 2005
IBM PC, 1981
                                                                       6.823 L15- 7
                                                                              Emer


     Hardware
     •	 Team from IBM building PC prototypes in 1979
     •	 Motorola 68000 chosen initially, but 68000 was late
     • IBM builds “stopgap” prototypes using 8088 boards from the
       Displaywriter word processor

     •	 8088 is 8-bit bus version of 8086 => allows cheaper system
     •	 Estimated sales of 250,000
     •	 100,000,000s sold


     Software
     • Microsoft negotiates to provide OS for IBM. Later buys and modifies
       QDOS from Seattle Computer Products.



     Open System
     •	   Standard processor, Intel 8088
     •    Standard interfaces
     •	   Standard OS, MS-DOS
     •	   IBM permits cloning and third-party software
November 2, 2005
The Eighties:	
                                                                             6.823 L15- 8
                                                                                    Emer


       Personal Computer Revolution

      Personal computer market emerges
           – Huge business and consumer market for spreadsheets, word
             processing and games
           – Based on inexpensive 8-bit and 16-bit micros: Zilog Z80, MOS
             Technology 6502, Intel 8088/86, …



      Minicomputers replaced by workstations

           –	 Distributed network computing and high-performance graphics for
              scientific and engineering applications (Sun, Apollo, HP,…)
           – Based on powerful 32-bit microprocessors with virtual memory,
             caches, pipelined execution, hardware floating-point

           –	 Commercial RISC processors developed for workstation market


      Massively Parallel Processors (MPPs) appear
           –	 Use many cheap micros to approach supercomputer performance
              (Sequent, Intel, Parsytec)

November 2, 2005
6.823 L15- 9

      The Nineties                                                          Emer




      Advanced superscalar microprocessors appear
           • first superscalar microprocessor is IBM POWER in 1990

      MPPs have limited success in supercomputing market
           • Highest-end mainframes and vector supercomputers survive
           “killer micro” onslaught

      64-bit addressing becomes essential at high-end
           • In 2004, 4GB DRAM costs <$1,000

      Parallel microprocessor-based SMPs take over low-end server
      and supercomputer market

      Workstation and PC markets merge
           • By late ‘90s (except for Apple PowerPC-based systems) RISC
           vendors have tiny share of desktop market
           • CISC x86 ISA thrives!


November 2, 2005
Intel Pentium 4 (2000)
                                                                             6.823 L15- 10
                                                                                     Emer




                   Image removed due to copyright restrictions.
                   To view image, visit
                   http://www-vlsi.stanford.edu/group/chips_micropro_body.html




     This lecture contains figures and data taken from: “The microarchitecture of the
     Pentium 4 processor”, Intel Technology Journal, Q1, 2001

November 2, 2005
6.823 L15- 11

      Pentium 4 uOPs
                                                                Emer




       •	 During L1 instruction cache refill, translates complex
          x86 instructions into RISC-like micro-operations (uops)

            – e.g., “R ← R op Mem” translates into

                   load T, Mem   # Load from Mem into temp reg
                   R ← R op T    # Operate using value in temp

       •	 Execute uops using speculative out-of-order superscalar
          engine with register renaming

       •	 uop translation introduced in Pentium Pro family
          architecture (P6 family) in 1995
            –	 also used on Pentium-II and Pentium-III processors, and new
               Pentium M (Centrino) processors
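
The translation of “R ← R op Mem” into a load uop plus a register-register uop can be sketched as below. This is an illustrative model only: the `Mem` class, the tuple encoding of uops, and the temporary name `T` are assumptions made for the example, not Intel's internal uop format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mem:
    address: str  # symbolic memory operand, e.g. "X" (hypothetical encoding)


def translate(dst, op, src, temp="T"):
    """Translate 'dst <- dst op src' into a list of RISC-like uops."""
    if isinstance(src, Mem):
        # A memory source operand becomes a separate load into a temporary.
        return [("load", temp, src.address), (op, dst, dst, temp)]
    # A register-register operation maps to a single uop.
    return [(op, dst, dst, src)]


print(translate("R", "add", Mem("X")))  # two uops: load, then add
print(translate("R", "add", "S"))       # one uop
```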




November 2, 2005
6.823 L15- 12

      Instruction Set Translation:                                              Emer

      Convert a target ISA into a host machine’s ISA



              •	 Pentium Pro (P6 family)
                   –	 translation in hardware after instruction fetch
                   –	 also used in AMD x86 processors


              •	 Pentium-4 family
                   –	 translation in hardware at level 1 instruction
                      cache refill


              • Transmeta Crusoe 

                   –	 translation in software using “Code Morphing”
                      (see lecture 24)




November 2, 2005
Pentium 4 Block Diagram
                                                                                                6.823 L15- 13
                                                                                                        Emer




                           System Bus

                                                                      Level 1 Data Cache
                           Bus Unit

                                                                        Execution Units
                         Level 2 Cache
                                                                      INTEGER AND FP
                    MEMORY SUBSYSTEM                                 EXECUTION UNITS




                                  Trace Cache                     Out-of-Order
                   Fetch/Decode    Microcode                       Execution       Retirement
                                     ROM                             Logic


                                                              Branch History Update
                     BTB/Branch Prediction

                         FRONT END                                OUT-OF-ORDER ENGINE



November 2, 2005                             Figure by MIT OCW.
6.823 L15- 14

     Pentium 4 Front End
                                                           Emer




     L2 Cache
        | x86 instructions, 8 Bytes/cycle
        |   (Inst. Prefetch & TLB; Front End BTB, 4K Entries)
     Fetch Buffer
        | Single x86 instruction/cycle
     x86 Decoder
        | 4 uops/cycle
     Trace Cache Fill Buffer
        | 6 uops/line
     Trace Cache (12K uops)

     Translation from x86 instructions to internal uops only happens on
     trace cache miss, one x86 instruction per cycle. Translations are
     cached in trace cache.

November 2, 2005
6.823 L15- 15

          Trace Cache
                                                  Emer




          Key Idea: Pack multiple non-contiguous basic
            blocks into one contiguous trace cache line

                    BR                  BR                 BR




                           BR             BR          BR



     •	    Single fetch brings in multiple basic blocks
     •	    Trace cache indexed by start address and next n branch
           predictions




November 2, 2005
6.823 L15- 16

        Pentium 4 Trace Cache
                                       Emer




       •	 Holds decoded uops in predicted program flow order, 6
          uops per line
       Code in memory
            cmp
            br T1
            ...
       T1:  sub
            br T2
            ...
       T2:  mov
            sub
            br T3
            ...
       T3:  add
            sub
            mov
            br T4
            ...
       T4:  ...

       Code packed in trace cache (6 uops/line)
            cmp     br T1   sub     br T2   mov     sub
            br T3   add     sub     mov     br T4   T4:...

       Trace cache fetches one 6 uop line every 2 CPU clock cycles
       (runs at 1/2 main CPU rate)
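
The packing shown on this slide can be reproduced with a small sketch. It assumes every branch is predicted taken, and it encodes the slide's code as a hypothetical dictionary from label to basic block. A real trace cache builds lines from the dynamic uop stream and tags them by address, which this model does not attempt.

```python
UOPS_PER_LINE = 6  # line size quoted on the slide

# Hypothetical encoding of the slide's code: label -> uops in that basic block.
# A uop of the form "br X" ends the block and names the predicted-taken target.
PROGRAM = {
    "start": ["cmp", "br T1"],
    "T1": ["sub", "br T2"],
    "T2": ["mov", "sub", "br T3"],
    "T3": ["add", "sub", "mov", "br T4"],
    "T4": ["T4:..."],
}


def build_trace(program, entry):
    """Follow predicted-taken branches and return uops in predicted flow order."""
    trace, label, seen = [], entry, set()
    while label in program and label not in seen:
        seen.add(label)  # stop if the predicted path loops back on itself
        block = program[label]
        trace.extend(block)
        last = block[-1]
        label = last.split()[1] if last.startswith("br ") else None
    return trace


def pack(trace, width=UOPS_PER_LINE):
    """Cut the trace into fixed-width trace cache lines."""
    return [trace[i:i + width] for i in range(0, len(trace), width)]


for line in pack(build_trace(PROGRAM, "start")):
    print(line)
```

The two printed lines match the packed layout on the slide: four non-contiguous basic blocks end up in two contiguous lines.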
November 2, 2005
6.823 L15- 17


     Trace Cache Advantages                                                           Emer




     •	 Removes x86 decode from branch mispredict penalty
          –	 Parallel x86 decoder took 2.5 cycles in P6, would be 5 cycles in P-4
             design

     • Allows higher fetch bandwidth for correctly predicted
        taken branches
          –	 P6 had one cycle bubble for correctly predicted taken branches
          –	 P-4 can fetch a branch and its target in same cycle

     •	 Saves energy
          –	 x86 decoder only powered up on trace cache refill




November 2, 2005
P-4 Trace Cache Fetch

                                                                6.823 L15- 18
                                                                        Emer

     Pipeline stages: 1-2 TC Next IP (BTB), 3-4 TC Fetch, 5 Drive, 6 Alloc,
     7-8 Rename, 9 Queue, 10-12 Schedule 1-3, 13-14 Dispatch 1-2,
     15-16 Register File 1-2, 17 Execute, 18 Flags, 19 Branch Check, 20 Drive

     Figure labels:
       Trace IP
       Trace BTB (512 entries)
       16-entry subroutine return address stack
       Trace Cache (12K uops, 2K lines of 6 uops)
       Microcode ROM
       6 uops every two CPU cycles
       uop buffer
       3 uops/cycle
November 2, 2005
6.823 L15- 19

    Line Prediction
                                                                             Emer



    (Alpha 21[234]64)
                                             Instr
                                             Cache

                                             Branch
                                            Predictor
                        Line                                   PC
                      Predictor                               Calc
                                             Return
                                             Stack

                                            Indirect
                                             Branch
                                            Predictor

           • Line Predictor predicts line to fetch each cycle
                   – 21464 was to predict 2 lines per cycle
           • Icache fetches block, and predictors predict target
           • PC Calc checks accuracy of line prediction(s)



November 2, 2005
P-III vs. P-4 Renaming                                                             6.823 L15- 20
                                                                                                Emer

     Pipeline stages: 1-2 TC Next IP (BTB), 3-4 TC Fetch, 5 Drive, 6 Alloc,
     7-8 Rename, 9 Queue, 10-12 Schedule 1-3, 13-14 Dispatch 1-2,
     15-16 Register File 1-2, 17 Execute, 18 Flags, 19 Branch Check, 20 Drive

     [Figure by MIT OCW. Pentium® III: a single RAT (EAX, EBX, ECX, EDX, ESI,
     EDI, ESP, EBP) alongside a ROB holding both Data and Status, plus the RRF.
     NetBurst™: a Frontend RAT and a Retirement RAT (same eight registers),
     with a separate RF holding Data and a ROB holding Status.]

     • P-4 physical register file separated from ROB status.
     • ROB entries allocated sequentially as in P6 family.
     • One of 128 physical registers allocated from free list.
     • No data movement on retire, only Retirement RAT updated.
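
The renaming scheme described above (Frontend RAT, Retirement RAT, free list, no data movement on retire) can be sketched as follows. This is a simplified model under stated assumptions: one destination per uop, in-order retirement, no misprediction recovery, and an initial mapping of architectural register i to physical register i. The class and method names are hypothetical.

```python
from collections import deque

ARCH_REGS = ["EAX", "EBX", "ECX", "EDX", "ESI", "EDI", "ESP", "EBP"]
NUM_PHYS = 128  # physical register count quoted on the slide


class Renamer:
    def __init__(self):
        # Assumed reset state: architectural register i lives in physical register i.
        self.frontend_rat = {r: i for i, r in enumerate(ARCH_REGS)}
        self.retire_rat = dict(self.frontend_rat)
        self.free_list = deque(range(len(ARCH_REGS), NUM_PHYS))
        self.rob = deque()  # (arch_dst, phys_dst), allocated sequentially

    def rename(self, dst, srcs):
        """Rename one uop; returns (phys_dst, phys_srcs)."""
        phys_srcs = [self.frontend_rat[s] for s in srcs]
        phys_dst = self.free_list.popleft()  # IndexError here models an allocation stall
        self.frontend_rat[dst] = phys_dst
        self.rob.append((dst, phys_dst))
        return phys_dst, phys_srcs

    def retire(self):
        """Retire the oldest uop: only the Retirement RAT changes, no data is copied."""
        dst, phys_dst = self.rob.popleft()
        freed = self.retire_rat[dst]     # previous committed mapping is now dead
        self.retire_rat[dst] = phys_dst
        self.free_list.append(freed)


r = Renamer()
print(r.rename("EAX", ["EBX", "ECX"]))  # (8, [1, 2])
print(r.rename("EBX", ["EAX"]))         # (9, [8]), reads the renamed EAX
r.retire()
print(r.retire_rat["EAX"])              # 8
```

Because the value already sits in the physical register file, retirement is a table update rather than a copy from the ROB into a separate retirement register file as in the P6 design.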
November 2, 2005
6.823 L15- 21

    P-4 µop Queues and Schedulers                                         Emer

     Pipeline stages: 1-2 TC Next IP (BTB), 3-4 TC Fetch, 5 Drive, 6 Alloc,
     7-8 Rename, 9 Queue, 10-12 Schedule 1-3, 13-14 Dispatch 1-2,
     15-16 Register File 1-2, 17 Execute, 18 Flags, 19 Branch Check, 20 Drive

     Allocated/Renamed uops, 3 uops/cycle
       Memory uop Queue       ->  Memory Scheduler
       Arithmetic uop Queue   ->  Fast Scheduler (x2), General Scheduler,
                                  Simple FP Scheduler

     Ready uops compete for dispatch ports
     (Fast schedulers can each dispatch 2 ALU operations per cycle)
November 2, 2005
6.823 L15- 22

        P-4 Execution Ports                                                               Emer




       Port          Unit                  Operations
       Exec Port 0   ALU (double speed)    Add/Sub, Logic, Store Data, Branches
                     FP Move               FP/SSE Move, FP/SSE Store, FXCH
       Exec Port 1   ALU (double speed)    Add/Sub
                     Integer Operation     Shift/Rotate
                     FP Execute            FP/SSE-Add, FP/SSE-Mul, FP/SSE-Div, MMX
       Load Port     Memory Load           All Loads, LEA, SW Prefetch
       Store Port    Memory Store          Store Address

                                       Figure by MIT OCW.


        •    Schedulers compete for access to execution ports
        •    Loads and stores have dedicated ports
        •    ALUs can execute two operations per cycle
        •    Peak bandwidth of 6 uops per cycle
             – load, store, plus four double-pumped ALU operations
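
The peak figure of 6 uops per cycle follows from simple arithmetic over the ports. The dictionary below is an illustrative tally of the per-port maxima stated on this slide, not a model of scheduling conflicts.

```python
# Maximum uops each port can accept per main clock cycle, per the slide.
PORT_PEAK = {
    "exec port 0 (double-speed ALU)": 2,
    "exec port 1 (double-speed ALU)": 2,
    "load port": 1,
    "store port": 1,
}

peak = sum(PORT_PEAK.values())
print(peak)  # 6
```

This exceeds the 3 uops/cycle delivered by the front end, so the peak can only be reached briefly while the queues drain.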

November 2, 2005
6.823 L15- 23

     P-4 Fast ALUs and Bypass Path
                                           Emer




            Register
            File and
            Bypass
            Network




                                           L1 Data
                                            Cache
 • Fast ALUs and bypass network run at twice global clock speed
 •	 All “non-essential” circuit paths handled out of loop to reduce circuit
    loading (shifts, mults/divs, branches, flag/ops)
 •	 Other bypassing takes multiple clock cycles
November 2, 2005
6.823 L15- 24

     P-4 Staggered ALU Design

                                                           Emer




                    •	 Staggers 32-bit add and flag
                       compare into three 1/2 cycle
                       phases
                       – low 16 bits
                       – high 16 bits
                       –	 flag checks

                    •	 Bypass 16 bits around every ½
                       cycle
                       – back-to-back dependent 32-bit
                         adds at 3GHz in 0.18µm
                         (7.2GHz in 90nm)

                    •	 L1 Data Cache access starts
                       with bottom 16 bits as index,
                       top 16 bits used as tag check
                       later
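
The three-phase staggered add can be modeled functionally as below. This sketch captures only the data flow (low half, then high half with carry, then flags) and says nothing about timing or circuits; the choice of flags and all names are assumptions for illustration.

```python
MASK16 = 0xFFFF


def staggered_add(a: int, b: int):
    """32-bit add as three half-cycle phases; returns (result, flags, phases)."""
    phases = []
    # Phase 1: low 16 bits. The carry-out feeds the next phase.
    low = (a & MASK16) + (b & MASK16)
    carry = low >> 16
    low &= MASK16
    phases.append(("low 16 bits", low))
    # Phase 2: high 16 bits plus the carry from phase 1.
    high = ((a >> 16) & MASK16) + ((b >> 16) & MASK16) + carry
    carry_out = high >> 16
    high &= MASK16
    phases.append(("high 16 bits", high))
    # Phase 3: flags, computed once the full result is known.
    result = (high << 16) | low
    flags = {"carry": bool(carry_out), "zero": result == 0, "sign": bool(result >> 31)}
    phases.append(("flag checks", flags))
    return result, flags, phases


result, flags, _ = staggered_add(0x0001FFFF, 0x00000001)
print(hex(result), flags)
```

A dependent add only needs the low half of its operand to begin its own phase 1, which is why bypassing 16 bits every half cycle permits back-to-back dependent adds.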



November 2, 2005
6.823 L15- 25

     P-4 Load Schedule Speculation                                   Emer


     Pipeline stages: 1-2 TC Next IP, 3-4 TC Fetch, 5 Drive, 6 Alloc, 7-8 Rename,
     9 Queue, 10-12 Schedule 1-3, 13-14 Dispatch 1-2, 15-16 Register File 1-2,
     17-18 Load Execute 1-2, 19 Branch Check, 20 Drive

     Long delay from schedulers to load hit/miss

     • P-4 guesses that load will hit in L1 and schedules dependent operations
       to use value
     • If load misses, only dependent operations are replayed
November 2, 2005
6.823 L15- 26

      P-4 Branch Penalty                                          Emer

     Pipeline stages: 1-2 TC Next IP, 3-4 TC Fetch, 5 Drive, 6 Alloc, 7-8 Rename,
     9 Queue, 10-12 Schedule 1-3, 13-14 Dispatch 1-2, 15-16 Register File 1-2,
     17 Execute, 18 Flags, 19 Branch Check, 20 Drive

     20 cycle branch mispredict penalty

     • P-4 uses new “trade secret” branch prediction algorithm
     • Intel claims 1/3 fewer mispredicts than P6 algorithm
November 2, 2005
6.823 L15- 27

     Tournament Branch Predictor                                            Emer


     (Alpha 21264)

     Components:
       Local history table    1,024 x 10b   (indexed by PC)
       Local prediction       1,024 x 3b
       Global Prediction      4,096 x 2b
       Choice Prediction      4,096 x 2b
       Global History         12b
     Output: Prediction



    • Choice predictor learns whether best to use local or global
      branch history in predicting next branch
    • Global history is speculatively updated but restored on
      mispredict
    • Claim 90-100% success on range of applications
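
A tournament predictor with the table sizes listed on this slide can be sketched as follows. This is a simplified model, not the Alpha 21264 implementation. It assumes the PC index is `(pc >> 2) % 1024`, picks arbitrary initial counter values, and updates the global history at resolve time rather than speculatively (so it does not model the restore-on-mispredict behavior).

```python
def _bump(counter, up, maximum):
    """Saturating counter update."""
    return min(counter + 1, maximum) if up else max(counter - 1, 0)


class TournamentPredictor:
    def __init__(self):
        self.local_hist = [0] * 1024   # 10-bit history per entry, indexed by PC
        self.local_ctr = [3] * 1024    # 3-bit counters, indexed by local history
        self.global_ctr = [1] * 4096   # 2-bit counters, indexed by global history
        self.choice = [1] * 4096       # 2-bit counters; >= 2 means "use global"
        self.ghist = 0                 # 12-bit global history

    def _components(self, pc):
        lh = self.local_hist[(pc >> 2) % 1024]
        return lh, self.local_ctr[lh] >= 4, self.global_ctr[self.ghist] >= 2

    def predict(self, pc):
        _, local, glob = self._components(pc)
        return glob if self.choice[self.ghist] >= 2 else local

    def update(self, pc, taken):
        lh, local, glob = self._components(pc)
        g = self.ghist
        if local != glob:  # train the chooser only when the components disagree
            self.choice[g] = _bump(self.choice[g], glob == taken, 3)
        self.local_ctr[lh] = _bump(self.local_ctr[lh], taken, 7)
        self.global_ctr[g] = _bump(self.global_ctr[g], taken, 3)
        self.local_hist[(pc >> 2) % 1024] = ((lh << 1) | int(taken)) & 0x3FF
        self.ghist = ((g << 1) | int(taken)) & 0xFFF


# Usage: a branch that alternates taken / not taken is learned after warm-up.
p = TournamentPredictor()
for i in range(64):
    p.update(0x400, i % 2 == 0)
print(p.predict(0x400))  # next outcome in the pattern is "taken"
```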
November 2, 2005
6.823 L15- 28

     P-III vs. P-4 Pipelines                                                                                   Emer




     Basic Pentium® III Processor Misprediction Pipeline

        1      2      3       4       5       6       7       8        9         10
      Fetch  Fetch  Decode  Decode  Decode  Rename  ROB Rd  Rdy/Sch  Dispatch  Exec

     Basic Pentium® 4 Processor Misprediction Pipeline

      Stages 1-2    TC Nxt IP
      Stages 3-4    TC Fetch
      Stage  5      Drive
      Stage  6      Alloc
      Stages 7-8    Rename
      Stage  9      Que
      Stages 10-12  Sch
      Stages 13-14  Disp
      Stages 15-16  RF
      Stage  17     Ex
      Stage  18     Flgs
      Stage  19     Br Ck
      Stage  20     Drive



                                                 Figure by MIT OCW.


     • In same process technology, ~1.5x clock frequency

     • Performance Equation:
           Time / Program = (Instructions / Program) x (Cycles / Instruction) x (Time / Cycle)
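
The performance equation shows why a 1.5x clock does not translate directly into 1.5x performance: the deeper pipeline usually raises cycles per instruction. The sketch below uses made-up numbers (instruction count, CPI values, and the 1 GHz baseline are all hypothetical) purely to show how the terms trade off.

```python
def execution_time(instructions, cpi, clock_hz):
    """Time/Program = Instructions/Program x Cycles/Instruction x Time/Cycle."""
    return instructions * cpi / clock_hz


# Hypothetical baseline machine: 1e9 instructions, CPI 1.0, 1.0 GHz.
base = execution_time(1e9, 1.0, 1.0e9)
# Hypothetical deeper pipeline: 1.5x the clock, but 20% more cycles per instruction.
deep = execution_time(1e9, 1.2, 1.5e9)

print(f"speedup = {base / deep:.2f}")
```

With these assumed numbers the speedup is 1.25x rather than 1.5x. If CPI rose by the full 1.5x, the higher clock would buy nothing.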
November 2, 2005
6.823 L15- 29

     Deep Pipeline Design

                                                                                        Emer




     Greater potential throughput but:

     •	 Clock uncertainty and latch delays eat into cycle time
        budget
          – doubling pipeline depth gives less than twice frequency improvement

     •	 Clock load and power increases
          –	 more latches running at higher frequencies

     •	 More complicated microarchitecture needed to cover long
        branch mispredict penalties and cache miss penalties
          – from Little’s Law, need more instructions in flight to cover longer
            latencies ⇒ larger reorder buffers


     •	 P-4 has three major clock domains
          –	 Double pumped ALU (3 GHz), small critical area at highest speed
          –	 Main CPU pipeline (1.5 GHz in 0.18µm)
          –	 Trace cache (0.75 GHz), save power
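
Two of the points above lend themselves to quick arithmetic: Little's Law (occupancy = throughput x latency) and the ratios between the three clock domains. The sketch below uses the 3 uops/cycle and 20-cycle figures quoted elsewhere in this lecture. Applying the same width to a 10-cycle pipeline is an assumption made only for comparison.

```python
def uops_in_flight(uops_per_cycle, latency_cycles):
    """Little's Law: items in flight = throughput x latency."""
    return uops_per_cycle * latency_cycles


print(uops_in_flight(3, 20))  # 60 uops needed to keep a 20-cycle pipeline busy
print(uops_in_flight(3, 10))  # 30 for a 10-cycle pipeline of the same width

MAIN_CLOCK_GHZ = 1.5  # main pipeline clock in 0.18µm, from the slide
domains = {
    "double-pumped ALU": MAIN_CLOCK_GHZ * 2,
    "main CPU pipeline": MAIN_CLOCK_GHZ,
    "trace cache": MAIN_CLOCK_GHZ / 2,
}
print(domains)
```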




November 2, 2005
6.823 L15- 30

      Scaling of Wire Delay
                                                        Emer




      •	 Over time, transistors are getting relatively faster than
         long wires
           –	 wire resistance growing dramatically with shrinking width and height
           –	 capacitance roughly fixed for constant length wire
           –	 RC delays of fixed length wire rising

      •	 Chips are getting bigger
           –	 P-4 >2x size of P-III

      •	 Clock frequency rising faster than transistor speed
           –	 deeper pipelines, fewer logic gates per cycle
           –	 more advanced circuit designs (each gate goes faster)

      ⇒ Takes multiple cycles for signal to cross chip
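
The wire-delay argument can be illustrated with a first-order model. It assumes resistance proportional to length / (width x height) and, as the slide states, roughly constant capacitance for a fixed-length wire. The scale factors are illustrative and ignore effects such as repeaters or changes in material.

```python
def relative_rc_delay(scale):
    """Relative RC delay of a fixed-length wire whose width and height both shrink by `scale`."""
    resistance = 1.0 / (scale * scale)  # R = rho * L / (w * h), with L fixed
    capacitance = 1.0                   # roughly fixed for constant length, per the slide
    return resistance * capacitance


for s in (1.0, 0.7, 0.5):
    print(s, round(relative_rc_delay(s), 2))
```

Under these assumptions, halving the cross-section dimensions quadruples the delay of a wire of the same length, while transistors get faster over the same shrink.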




November 2, 2005
Visible Wire Delay in P-4 Design
                                                  6.823 L15- 31
                                                          Emer

     Pipeline stages: 1-2 TC Next IP, 3-4 TC Fetch, 5 Drive, 6 Alloc, 7-8 Rename,
     9 Queue, 10-12 Schedule 1-3, 13-14 Dispatch 1-2, 15-16 Register File 1-2,
     17 Execute, 18 Flags, 19 Branch Check, 20 Drive

     Pipeline stages dedicated to just driving signals across chip!
     (stages 5 and 20, both named Drive)
November 2, 2005
P-4 Microarchitecture
                                                                            6.823 L15- 32
                                                                                                                Emer



     Front end
       Front-End BTB (4K Entries)
       Instruction TLB/Prefetcher
       Instruction Decoder
       Microcode ROM
       Trace Cache BTB (512 Entries)
       Trace Cache (12K µops)
       µop Queue

     Out-of-order engine
       Allocator/Register Renamer
       Memory µop Queue; Integer/Floating Point µop Queue
       Schedulers: Memory, Fast, Slow/General FP, Simple FP
       Integer Register File/Bypass Network; FP Register/Bypass

     Execution units
       AGU (Load Address)
       AGU (Store Address)
       2x ALU (Simple Instr.)
       2x ALU (Simple Instr.)
       Slow ALU (Complex Instr.)
       FP MMX SSE SSE2
       FP Move

     Memory subsystem
       L1 Data Cache (8Kbyte 4-way)
       L2 Cache (256K byte 8-way), 256 bits, 48 GB/s
       Bus Interface Unit
       System Bus, 64 bits wide, Quad Pumped, 3.2 GB/s

                                                 Figure by MIT OCW.
November 2, 2005
6.823 L15- 33

     Microarchitecture Comparison
                                                           Emer



     In-Order Execution
        Diagram labels: Fetch, Br. Pred., Decode, Resolve, Execute, Commit
        (all stages In-Order)
        • Speculative fetch but not speculative execution - branch resolves
          before later instructions complete
        • Completed values held in bypass network until commit

     Out-of-Order Execution
        Diagram labels: Fetch, Br. Pred., Decode, Resolve (In-Order);
        ROB, Execute (Out-of-Order); Commit (In-Order)
        • Speculative execution, with branches resolved after later
          instructions complete
        • Completed values held in rename registers in ROB or unified physical
          register file until commit

       • Both styles of machine can use same branch predictors in front-end fetch
       pipeline, and both can execute multiple instructions per cycle
       • Common to have 10-30 pipeline stages in either style of design

November 2, 2005
6.823 L15- 34

     MIPS R10000 (1995)

     [Image removed due to copyright restrictions. To view the image, visit
     http://www-vlsi.stanford.edu/group/chips_micropro_body.html]

     •	 0.35µm CMOS, 4 metal layers
     •	 Four instructions per cycle
     •	 Out-of-order execution
     •	 Register renaming
     •	 Speculative execution past 4 branches
     •	 On-chip 32KB/32KB split I/D cache, 2-way set-associative
     •	 Off-chip L2 cache
     •	 Non-blocking caches

     Compare with simple 5-stage pipeline (R5K series)
     •	 ~1.6x performance SPECint95
     •	 ~5x CPU logic area
     •	 ~10x design effort
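
As a rough worked example of what the comparison above implies, the approximate ratios from the slide can be combined into performance per unit of area and per unit of design effort. The variable names are illustrative, and the inputs are only the slide's approximate figures, so the results are order-of-magnitude indications rather than measurements.

```python
# Approximate ratios of the R10000 relative to a simple 5-stage pipeline,
# taken from the slide.
speedup = 1.6   # ~1.6x SPECint95 performance
area = 5.0      # ~5x CPU logic area
effort = 10.0   # ~10x design effort

perf_per_area = speedup / area      # performance gained per unit of logic area
perf_per_effort = speedup / effort  # performance gained per unit of design effort

print(f"performance per unit logic area:    {perf_per_area:.2f}x the 5-stage pipeline")
print(f"performance per unit design effort: {perf_per_effort:.2f}x the 5-stage pipeline")
```

On these figures the out-of-order design delivers roughly a third of the simple pipeline's performance per unit of logic area and about a sixth per unit of design effort.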


