CS 211: Computer Architecture


     Instructor: Prof. Bhagi Narahari
       Dept. of Computer Science
              Course URL:
   www.seas.gwu.edu/~narahari/cs211/




                                        CS211 1
How to improve performance?

 • Recall performance is function of
    – CPI: cycles per instruction
    – Clock cycle
    – Instruction count
 • Reducing any of the 3 factors will lead to
   improved performance




                                                CS211 2
How to improve performance?

 • First step is to apply concept of pipelining to
   the instruction execution process
    – Overlap computations
 • What does this do?
    – Decrease clock cycle
    – Decrease effective CPU time compared to original
      clock cycle
 • Appendix A of Textbook
    – Also parts of Chapter 2



                                                         CS211 3
Pipeline Approach to Improve
System Performance
 • Analogous to fluid flow in pipelines and
   assembly line in factories
 • Divide process into “stages” and send tasks
   into a pipeline
    – Overlap computations of different tasks by
      operating on them concurrently in different stages




                                                           CS211 4
Instruction Pipeline

 • Instruction execution process lends itself
   naturally to pipelining
    – overlap the subtasks of instruction fetch, decode
      and execute




                                                          CS211 5
Linear Pipeline Processor

Linear pipeline processes a sequence of
subtasks with linear precedence

At a higher level - Sequence of processors

Data flowing in streams from stage S1 to the
final stage Sk

Control of data flow : synchronous or
asynchronous


     S1       S2      • • • •     Sk



                                               CS211 6
Synchronous Pipeline




All transfers are simultaneous, driven by a common clock

One task or operation enters the pipeline per cycle

Processors' reservation table : diagonal




                                                      CS211 7
Time Space Utilization of Pipeline

                         Full pipeline after 4 cycles

        Time (in pipeline cycles)
        1     2      3          4
  S1    T1    T2     T3         T4
  S2          T1     T2         T3
  S3                 T1         T2
                                                        CS211 8
Asynchronous Pipeline
Transfers performed when individual processors are ready

Handshaking protocol between processors

Mainly used in multiprocessor systems with message-passing




                                                           CS211 9
Pipeline Clock and Timing

[Diagram: stage Si feeds stage Si+1 through a latch; each stage has
 delay τm and each latch has delay d.]

Clock cycle of the pipeline : τ

Latch delay : d
                             τ = max {τm} + d


Pipeline frequency : f

                             f=1/τ                    CS211 10
Speedup and Efficiency
k-stage pipeline processes n tasks in k + (n-1) clock
cycles:

k cycles for the first task and n-1 cycles
for the remaining n-1 tasks

Total time to process n tasks

                                        Tk = [ k + (n-1)] τ

For the non-pipelined processor

                                        T1 = n k τ

Speedup factor
                Sk = T1 / Tk = n k τ / [ k + (n-1) ] τ = n k / [ k + (n-1) ]
                                                                 CS211 11
             Efficiency and Throughput




 Efficiency of the k-stages pipeline :

                  Ek = Sk / k = n / [ k + (n-1) ]


Pipeline throughput (the number of tasks per unit time) :
note equivalence to IPC


                  Hk = n / [ k + (n-1) ] τ = n f / [ k + (n-1) ]




                                                            CS211 12
Pipeline Performance: Example
•   Task has 4 subtasks with time: t1=60, t2=50, t3=90, and t4=80 ns
    (nanoseconds)
•   latch delay = 10
•   Pipeline cycle time = 90+10 = 100 ns
•   For non-pipelined execution
    – time = 60+50+90+80 = 280 ns
•   Speedup for above case is: 280/100 = 2.8 !!
•   Pipeline time for 1000 tasks = (1000 + 4 - 1) cycles = 1003 x 100 ns
•   Sequential time = 1000 x 280 ns
•   Throughput = 1000/1003 ≈ 1 task per cycle
•   What is the problem here ?
•   How to improve performance ?



                                                                  CS211 13
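As a quick numeric check of the example above, a sketch in Python (all figures come from the slide):

```python
# Check the pipeline-performance example: 4 stages, 100 ns cycle
# (max stage delay 90 ns + latch delay 10 ns), 1000 tasks.
k, n = 4, 1000
tau = 100                        # pipeline cycle time in ns
t_pipe = (k + (n - 1)) * tau     # Tk = [k + (n-1)] tau
t_seq = n * 280                  # non-pipelined: 60+50+90+80 = 280 ns per task
print(t_pipe, round(t_seq / t_pipe, 2))   # → 100300 2.79
```

With finite n the speedup is 280000/100300 ≈ 2.79, approaching the per-task ratio 280/100 = 2.8 as n grows. The slow 90 ns stage sets the cycle time for every stage, which is the "problem" the slide asks about.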
Non-linear pipelines and pipeline
control algorithms
 • Can have non-linear path in pipeline…
    – How to schedule instructions so they do not
      conflict for resources
 • How does one control the pipeline at the
   microarchitecture level
    – How to build a scheduler in hardware ?
    – How much time does scheduler have to make
      decision ?




                                                            CS211 14
Non-linear Dynamic Pipelines

 •   Multiple processors (k-stages) as linear pipeline
 •   Variable functions of individual processors
 •   Functions may be dynamically assigned
 •   Feedforward and feedback connections




                                                         CS211 15
Reservation Tables

 • Reservation table : displays the time-space flow of data
   through the pipeline; analogous to the opcode of the pipeline
    – Not diagonal, as it is in linear pipelines
 • Multiple reservation tables for different functions
    – Functions may be dynamically assigned


 • Feedforward and feedback connections

 • The number of columns in the reservation table :
   evaluation time of a given function



                                                          CS211 16

Reservation Tables (Examples)




                                CS211 17
Latency Analysis

 • Latency : the number of clock cycles
   between two initiations of the pipeline

 • Collision : an attempt by two initiations to use
   the same pipeline stage at the same time

 • Some latencies cause collisions; others do not




                                                      CS211 18
          Collisions (Example)

[Diagram: time-space table over 10 cycles showing initiations x1
 through x4 issued with latency 2; entries where two initiations
 occupy the same stage in the same cycle mark collisions.]

                    Latency = 2
                                                                    CS211 19
                                  Latency Cycle

[Diagram: initiations x1, x2, x3 issued over 18 cycles through three
 stages, repeating without collisions.]

 Latency cycle : a sequence of initiations that repeats the same
                 subsequence without collisions

 Latency sequence length : the number of time intervals
                           within the cycle

 Average latency : the sum of all latencies divided by
                   the number of latencies along the cycle
                                                                                        CS211 20
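The average-latency definition computes directly; the cycle below is a hypothetical example, not necessarily the one in the diagram:

```python
# Average latency of a latency cycle: sum of the latencies divided by
# the number of latencies. The cycle (1, 8) is an assumed example.
def average_latency(cycle):
    return sum(cycle) / len(cycle)

print(average_latency((1, 8)))   # → 4.5
```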
              Collision Free Scheduling

Goal : to find the shortest average latency

Latency bounds : for a reservation table with n columns, the maximum
       forbidden latency is m <= n - 1, and any permissible latency p
       satisfies 1 <= p <= m - 1

Ideal case : p = 1 (static pipeline)

Collision vector : C = (Cm Cm-1 . . . C2 C1)
                   [ Ci = 1 if latency i causes collision ]
                   [ Ci = 0 for permissible latencies ]

                                                                    CS211 21
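The forbidden latencies and the collision vector can be derived mechanically from a reservation table. A sketch in Python; the 3-stage table below is a made-up example, not the table on the next slide:

```python
# Build the collision vector C = (Cm ... C1) from a reservation table.
# table[s][t] == 1 means stage s is busy at time t for one initiation.
def collision_vector(table):
    forbidden = set()
    for row in table:
        times = [t for t, used in enumerate(row) if used]
        for i in range(len(times)):
            for j in range(i + 1, len(times)):
                forbidden.add(times[j] - times[i])   # this latency collides
    m = max(forbidden)                               # maximum forbidden latency
    return ''.join('1' if i in forbidden else '0' for i in range(m, 0, -1))

table = [
    [1, 0, 0, 0, 0, 1],   # S1 used at t = 0 and t = 5
    [0, 1, 0, 1, 0, 0],   # S2 used at t = 1 and t = 3
    [0, 0, 1, 0, 1, 0],   # S3 used at t = 2 and t = 4
]
print(collision_vector(table))   # → 10010 (latencies 5 and 2 forbidden)
```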
                     Collision Vector

                          Reservation Table

[Diagram: a 3-stage reservation table marked with x's, and its
 time-space evaluation for two initiations X1 and X2.]

                     C = (? ? . . . ? ?)
                                                                           CS211 22
Back to our focus: Computer
Pipelines
 • Execute billions of instructions, so
   throughput is what matters
 • MIPS desirable features:
    – all instructions are the same length,
    – registers are located in the same place in the
      instruction format,
    – memory operands appear only in loads and stores




                                                       CS211 23
Designing a Pipelined
Processor
• Go back and examine your datapath and control
  diagram
• associate resources with states
• ensure that flows do not conflict, or figure out how
  to resolve conflicts
• assert control in the appropriate stage




                                                    CS211 24
5 Steps of MIPS Datapath
What do we need to do to pipeline the process ?

[Datapath diagram: Instruction Fetch → Instr. Decode / Reg. Fetch →
 Execute / Addr. Calc → Memory Access → Write Back. Shows the Next PC
 logic (adder, MUX), instruction memory, register file (RS1, RS2, RD),
 sign-extended immediate, zero test, ALU, data memory, and the WB MUX.]

                                                                         CS211 25
5 Steps of MIPS/DLX Datapath

[Same five-stage datapath, now partitioned by the pipeline registers
 IF/ID, ID/EX, EX/MEM, and MEM/WB; the destination register RD is
 carried along with the instruction through each stage.]

• Data stationary control
    – local decode for each instruction phase / pipeline stage           CS211 26
Graphically Representing
Pipelines




• Can help with answering questions like:
   – how many cycles does it take to execute this code?
   – what is the ALU doing during cycle 4?
   – use this representation to help understand datapaths

                                                            CS211 27
Visualizing Pipelining

                          Time (clock cycles)
       Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7

[Diagram: four instructions, in program order, each passing through
 Ifetch → Reg → ALU → DMem → Reg, offset by one cycle so that their
 executions overlap.]

                                                                         CS211 28
Conventional Pipelined Execution
  Representation
 Time

IFetch Dcd     Exec   Mem    WB

       IFetch Dcd     Exec   Mem    WB

               IFetch Dcd    Exec   Mem    WB

                      IFetch Dcd    Exec   Mem    WB

                             IFetch Dcd    Exec   Mem    WB
Program Flow
                                    IFetch Dcd    Exec   Mem   WB




                                                               CS211 29
Single Cycle, Multiple Cycle, vs. Pipeline
      Cycle 1           Cycle 2
Clk

Single Cycle Implementation:
                       Load                                    Store          Waste


      Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Cycle 10
Clk

Multiple Cycle Implementation:
      Load                                     Store                            R-type
       Ifetch   Reg      Exec    Mem    Wr      Ifetch   Reg     Exec   Mem      Ifetch


Pipeline Implementation:
Load Ifetch     Reg      Exec    Mem    Wr

        Store Ifetch       Reg   Exec   Mem     Wr

                R-type Ifetch     Reg   Exec    Mem      Wr
                                                                                      CS211 30
The Five Stages of Load
     Cycle 1 Cycle 2 Cycle 3        Cycle 4 Cycle 5



         Load Ifetch   Reg/Dec   Exec    Mem      Wr



• Ifetch: Instruction Fetch
     – Fetch the instruction from the Instruction Memory
•   Reg/Dec: Registers Fetch and Instruction Decode
•   Exec: Calculate the memory address
•   Mem: Read the data from the Data Memory
•   Wr: Write the data back to the register file




                                                           CS211 31
The Four Stages of R-type
  Cycle 1 Cycle 2 Cycle 3 Cycle 4
      R-type Ifetch   Reg/Dec   Exec   Wr



• Ifetch: Instruction Fetch
   – Fetch the instruction from the Instruction Memory
• Reg/Dec: Registers Fetch and Instruction Decode
• Exec:
   – ALU operates on the two register operands
   – Update PC
• Wr: Write the ALU output back to the register file



                                                         CS211 32
Pipelining the R-type and Load Instruction
          Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9
Clock

R-type Ifetch      Reg/Dec    Exec      Wr                           Oops! We have a problem!

          R-type   Ifetch    Reg/Dec    Exec      Wr

                     Load    Ifetch    Reg/Dec    Exec      Mem      Wr

                              R-type Ifetch      Reg/Dec    Exec     Wr

                                        R-type Ifetch      Reg/Dec   Exec    Wr




 • We have a pipeline conflict, or structural hazard:
        – Two instructions try to write to the register file at the same
          time!
        – Only one write port

                                                                                        CS211 33
Important Observation
• Each functional unit can be used only once per
  instruction
• Each functional unit must be used at the same stage
  for all instructions:
     – Load uses Register File’s Write Port during its 5th stage
                       1       2         3     4     5
             Load   Ifetch   Reg/Dec   Exec   Mem   Wr



     – R-type uses Register File’s Write Port during its 4th stage
                       1        2        3     4
             R-type Ifetch   Reg/Dec   Exec   Wr

• There are 2 ways to solve this pipeline hazard.




                                                                     CS211 34
Solution 1: Insert “Bubble” into the Pipeline
         Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9
Clock
          Ifetch   Reg/Dec    Exec      Wr
           Load    Ifetch    Reg/Dec    Exec     Mem    Wr

                    R-type Ifetch      Reg/Dec   Exec        Wr

                              R-type Ifetch  Reg/Dec  Pipeline  Exec     Wr
                                      R-type Ifetch   Bubble    Reg/Dec  Exec    Wr
                                                                Ifetch   Reg/Dec Exec


 • Insert a “bubble” into the pipeline to prevent 2 writes
   at the same cycle
        – The control logic can be complex.
        – Lose instruction fetch and issue opportunity.
 • No instruction is started in Cycle 6!
                                                                                     CS211 35
Solution 2: Delay R-type’s Write by One Cycle
 •   Delay R-type’s register write by one cycle:
        – Now R-type instructions also use Reg File’s write port at Stage 5
        – Mem stage is a NOOP stage: nothing is being done.
                              1          2         3        4          5
                   R-type Ifetch     Reg/Dec      Exec     Mem       Wr



          Cycle 1 Cycle 2     Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9
Clock

R-type Ifetch       Reg/Dec       Exec    Mem       Wr

          R-type     Ifetch   Reg/Dec     Exec      Mem       Wr

                      Load    Ifetch     Reg/Dec    Exec      Mem          Wr

                                  R-type Ifetch    Reg/Dec    Exec         Mem    Wr

                                          R-type Ifetch      Reg/Dec       Exec   Mem   Wr

                                                                                             CS211 36
Why Pipeline?
• Suppose we execute 100 instructions
• Single Cycle Machine
   – 45 ns/cycle x 1 CPI x 100 inst = 4500 ns
• Multicycle Machine
   – 10 ns/cycle x 4.6 CPI (due to inst mix) x 100 inst = 4600 ns
• Ideal pipelined machine
   – 10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns




                                                                  CS211 37
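The three cases above, checked in Python (the numbers come from the slide):

```python
# Timing comparison for 100 instructions.
n = 100
single_cycle = 45 * 1 * n        # 45 ns/cycle, CPI = 1
multicycle = 10 * 4.6 * n        # 10 ns/cycle, CPI = 4.6 (instruction mix)
pipelined = 10 * (1 * n + 4)     # 10 ns/cycle, plus 4 cycles to drain
print(single_cycle, round(multicycle), pipelined)   # → 4500 4600 1040
```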
Why Pipeline? Because the resources
     are there!

     Time (clock cycles)

[Diagram: five instructions (Inst 0-4) overlapped in the pipeline,
 each using Im → Reg → ALU → Dm → Reg one cycle apart; in any given
 cycle every resource serves a different instruction.]

                                                                      CS211 38
Problems with Pipeline
processors?
• Limits to pipelining: Hazards prevent next instruction from
  executing during its designated clock cycle and introduce
  stall cycles which increase CPI
   – Structural hazards: HW cannot support this combination of
     instructions - two dogs fighting for the same bone
   – Data hazards: Instruction depends on result of prior
     instruction still in the pipeline
       » Data dependencies
   – Control hazards: Caused by delay between the fetching of
     instructions and decisions about changes in control flow
     (branches and jumps).
       » Control dependencies
• Can always resolve hazards by stalling
• More stall cycles = more CPU time = less performance
   – Increase performance = decrease stall cycles



                                                                 CS211 39
Back to our old friend: CPU time
equation
 • Recall equation for CPU time

       CPU time = Instruction Count x CPI x Clock cycle time

 • So what are we doing by pipelining the instruction
   execution process ?
    – Clock ?
    – Instruction Count ?
    – CPI ?
        » How is CPI effected by the various hazards ?


                                                         CS211 40
Speed Up Equation for Pipelining

 CPIpipelined = Ideal CPI + Average Stall cycles per Inst


             Ideal CPI x Pipeline depth       Cycle Time (unpipelined)
 Speedup =  ----------------------------  x  -------------------------
            Ideal CPI + Pipeline stall CPI    Cycle Time (pipelined)


For simple RISC pipeline, CPI = 1:

                Pipeline depth            Cycle Time (unpipelined)
 Speedup =  -----------------------  x  -------------------------
            1 + Pipeline stall CPI        Cycle Time (pipelined)

                                                            CS211 41
One Memory Port/Structural Hazards

      Time (clock cycles)
         Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7

[Diagram: a Load followed by Instr 1-4, one cycle apart; in Cycle 4
 the Load's DMem access and Instr 3's Ifetch both need the single
 memory port: a structural hazard.]

                                                                             CS211 42
One Memory Port/Structural Hazards

      Time (clock cycles)
         Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7

[Diagram: the same sequence, resolved by stalling; Instr 3 is held
 for one cycle, so a bubble moves down the pipeline and its Ifetch no
 longer competes with the Load's DMem access.]

                                                                             CS211 43
Example: Dual-port vs. Single-port
• Machine A: Dual ported memory (“Harvard
  Architecture”)
• Machine B: Single ported memory, but its
  pipelined implementation has a 1.05 times faster
  clock rate
• Ideal CPI = 1 for both
• Note - on Machine B, loads will cause a 1-cycle stall
• Recall our friend:
  – CPU = IC*CPI*Clk
     » CPI= ideal CPI + stalls

                                             CS211 44
Example…

 • Machine A: Dual ported memory (“Harvard
   Architecture”)
 • Machine B: Single ported memory, but its pipelined
   implementation has a 1.05 times faster clock rate
 • Ideal CPI = 1 for both
 • Loads are 40% of instructions executed
        SpeedUpA = Pipe. Depth/(1 + 0) x (clockunpipe/clockpipe)
                 = Pipeline Depth
        SpeedUpB = Pipe. Depth/(1 + 0.4 x 1) x (clockunpipe/(clockunpipe/1.05))
                 = (Pipe. Depth/1.4) x 1.05
                 = 0.75 x Pipe. Depth
        SpeedUpA / SpeedUpB = Pipe. Depth/(0.75 x Pipe. Depth) = 1.33
 • Machine A is 1.33 times faster
                                                                        CS211 45
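A quick check of the algebra, assuming a pipeline depth of 5 (the depth cancels in the ratio, so any value works):

```python
# Dual-port (A) vs. single-port (B) with a 1.05x faster clock on B.
depth = 5                                   # assumed; cancels in the ratio
speedup_a = depth / (1 + 0.0)               # no structural stalls
speedup_b = (depth / (1 + 0.4 * 1)) * 1.05  # 40% loads each stall 1 cycle
print(round(speedup_a / speedup_b, 2))      # → 1.33
```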
Data Dependencies

 • True dependencies and False dependencies
   – false implies we can remove the dependency
   – true implies we are stuck with it!
 • Three types of data dependencies defined in
   terms of how succeeding instruction depends
   on preceding instruction
   – RAW: Read after Write or Flow dependency
   – WAR: Write after Read or anti-dependency
   – WAW: Write after Write



                                                  CS211 46
Three Generic Data Hazards

• Read After Write (RAW)
  InstrJ tries to read operand before InstrI writes it


            I: add r1,r2,r3
            J: sub r4,r1,r3

• Caused by a “Dependence” (in compiler
  nomenclature). This hazard results from an actual
  need for communication.



                                                         CS211 47
RAW Dependency

 • Example program (a) with two instructions
    – i1: load r1, a;
    – i2: add r2, r1,r1;
 • Program (b) with two instructions
    – i1: mul r1, r4, r5;
    – i2: add r2, r1, r1;
 • In both cases we cannot read r1 in i2 until i1 has
   completed writing the result
    – In (a) this is due to a load-use dependency
    – In (b) this is due to a define-use dependency


                                                    CS211 48
Three Generic Data Hazards

• Write After Read (WAR)
  InstrJ writes operand before InstrI reads it
             I: sub r4,r1,r3
             J: add r1,r2,r3
             K: mul r6,r1,r7
• Called an “anti-dependence” by compiler writers.
  This results from reuse of the name “r1”.

• Can’t happen in MIPS 5 stage pipeline because:
   – All instructions take 5 stages, and
   – Reads are always in stage 2, and
   – Writes are always in stage 5
                                                     CS211 49
Three Generic Data Hazards
• Write After Write (WAW)
  InstrJ writes operand before InstrI writes it.
• Called an “output dependence” by compiler writers
  This also results from the reuse of name “r1”.
• Can’t happen in MIPS 5 stage pipeline because:
   – All instructions take 5 stages, and
   – Writes are always in stage 5
• Will see WAR and WAW in later more complicated
  pipes

                         I: sub r1,r4,r3
                         J: add r1,r2,r3
                         K: mul r6,r1,r7
                                                      CS211 50
WAR and WAW Dependency
• Example program (a):
  – i1: mul r1, r2, r3;
  – i2: add r2, r4, r5;
• Example program (b):
  – i1: mul r1, r2, r3;
  – i2: add r1, r4, r5;
• In both cases we have a dependence between i1 and i2
  – in (a), because r2 must be read by i1 before it is written by i2
  – in (b), because r1 must be written by i2 after it has been written
    by i1




                                                               CS211 51
What to do with WAR and WAW ?
 • Problem:
    – i1: mul r1, r2, r3;
    – i2: add r2, r4, r5;
 • Is this really a dependence/hazard ?




                                          CS211 52
What to do with WAR and WAW

 • Solution: Rename Registers
   – i1: mul r1, r2, r3;
   – i2: add r6, r4, r5;
 • Register renaming can solve many of these
   false dependencies
   – note the role that the compiler plays in this
   – specifically, the register allocation process--i.e., the
     process that assigns registers to variables




                                                                CS211 53
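Register renaming can be sketched as giving every write a fresh register and remapping later reads to the newest name. This is more aggressive than the single rename on the slide, and the free registers r6 and r7 are assumed:

```python
# SSA-style renaming: each destination gets a fresh register; later
# reads are remapped to the newest name. Removes WAR/WAW hazards.
def rename(instrs, fresh):
    mapping = {}                  # architectural name -> newest name
    out = []
    for dest, s1, s2 in instrs:
        s1, s2 = mapping.get(s1, s1), mapping.get(s2, s2)
        new = next(fresh)
        mapping[dest] = new
        out.append((new, s1, s2))
    return out

prog = [("r1", "r2", "r3"),       # i1: mul r1, r2, r3
        ("r2", "r4", "r5")]       # i2: add r2, r4, r5  (WAR on r2)
print(rename(prog, iter(["r6", "r7"])))
# → [('r6', 'r2', 'r3'), ('r7', 'r4', 'r5')]
```

Note that i1's read of r2 keeps the old name while i2's write gets r7, so the WAR hazard on r2 is gone.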
Hazard Detection in H/W

 • Suppose instruction i is about to be issued and a
   predecessor instruction j is in the instruction pipeline
 • How to detect and store potential hazard information
    – Note that hazards in machine code are based on register
      usage
    – Keep track of results in registers and their usage
        » Constructing a register data flow graph
 • For each instruction i construct set of Read registers
   and Write registers
    – Rregs(i) is set of registers that instruction i reads from
    – Wregs(i) is set of registers that instruction i writes to
    – Use these to define the 3 types of data hazards


                                                                   CS211 54
Hazard Detection in
   Hardware
• A RAW hazard exists on register ρ if ρ ∈ Rregs( i ) ∩ Wregs( j )
   – Keep a record of pending writes (for inst's in the pipe) and
     compare with operand regs of current instruction.
   – When instruction issues, reserve its result register.
   – When an operation completes, remove its write reservation.




• A WAW hazard exists on register ρ if ρ ∈ Wregs( i ) ∩ Wregs( j )
• A WAR hazard exists on register ρ if ρ ∈ Wregs( i ) ∩ Rregs( j )




                                                                 CS211 55
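The three set-intersection tests above translate directly into code; a sketch (the register sets are illustrative):

```python
# Hazards between issuing instruction i and in-flight predecessor j.
def hazards(rregs_i, wregs_i, rregs_j, wregs_j):
    return {
        "RAW": rregs_i & wregs_j,   # i reads a register j will write
        "WAW": wregs_i & wregs_j,   # i and j write the same register
        "WAR": wregs_i & rregs_j,   # i writes a register j still reads
    }

# j: add r1, r2, r3    i: sub r4, r1, r3
h = hazards(rregs_i={"r1", "r3"}, wregs_i={"r4"},
            rregs_j={"r2", "r3"}, wregs_j={"r1"})
print({k: v for k, v in h.items() if v})   # → {'RAW': {'r1'}}
```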
Internal Forwarding: Getting rid of
some hazards
 • In some cases the data needed by the next
   instruction at the ALU stage has already been
   computed by the ALU (or whatever stage defines
   it) but has not yet been written back to the
   registers
 • Can we “forward” this result by bypassing
   stages ?




                                                 CS211 56
Data Hazard on R1
      Time (clock cycles)

                       IF ID/RF EX               MEM      WB

[Diagram: add r1,r2,r3 followed by sub r4,r1,r3 / and r6,r1,r7 /
 or r8,r1,r9 / xor r10,r1,r11; each of the next three instructions
 reads r1 in its Reg stage before the add writes r1 back in WB.]

                                                                                     CS211 57
Forwarding to Avoid Data Hazard

                   Time (clock cycles)

[Diagram: the same sequence, with the add's ALU result forwarded
 directly to the ALU inputs of sub, and, and or; xor reads r1
 normally since the write-back has completed by then.]

                                                                                     CS211 58
Internal Forwarding of
Instructions
 • Forward result from ALU/Execute unit to
   execute unit in next stage
 • Also can be used in cases of memory access
 • in some cases, operand fetched from memory has
   been computed previously by the program
     – can we “forward” this result to a later stage thus
       avoiding an extra read from memory ?
     – Who does this ?
 • Internal forwarding cases
     – Stage i to Stage i+k in pipeline
     – store-load forwarding
     – load-load forwarding
     – store-store forwarding
                                                            CS211 59
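Store-load forwarding can be sketched in software terms as a store buffer that satisfies later loads to the same address. A minimal sketch (illustrative class and names, not from the slides):

```python
# Sketch of store-load forwarding: a load whose address matches a
# pending store takes its data from the store buffer, skipping memory.
class MemUnit:
    def __init__(self, memory):
        self.memory = memory        # address -> value, the "real" memory
        self.store_buffer = {}      # pending stores not yet committed

    def store(self, addr, value):
        # STO M,R: record the store; commit to memory later
        self.store_buffer[addr] = value

    def load(self, addr):
        # LD R,M: check pending stores first
        if addr in self.store_buffer:      # store-load forwarding:
            return self.store_buffer[addr]  # no memory read needed
        return self.memory[addr]

mu = MemUnit({0x100: 7})
mu.store(0x100, 42)        # STO M,R1
print(mu.load(0x100))      # LD R2,M becomes, in effect, MOVE R2,R1
```

The load returns 42 without touching memory, which is exactly the `MOVE R2,R1` replacement shown on the next slide.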
Internal Data Forwarding

                Store-load forwarding

  (Figure: memory M, access unit, registers R1 and R2, before and after.)

  Without forwarding:  STO M,R1 ; LD R2,M     — two memory accesses
  With forwarding:     STO M,R1 ; MOVE R2,R1  — the load becomes a register
                       move, eliminating the memory read
                                                                       CS211 60
Internal Data Forwarding

                Load-load forwarding

  (Figure: memory M, access unit, registers R1 and R2, before and after.)

  Without forwarding:  LD R1,M ; LD R2,M      — two memory reads
  With forwarding:     LD R1,M ; MOVE R2,R1   — the second read becomes a
                       register move
                                                                       CS211 61
Internal Data Forwarding

                Store-store forwarding

  (Figure: memory M, access unit, registers R1 and R2, before and after.)

  Without forwarding:  STO M,R1 ; STO M,R2    — two memory writes
  With forwarding:     STO M,R2 only          — the first store is overwritten
                       before it is read, so it can be eliminated
                                                                       CS211 62
HW Change for Forwarding

  (Datapath figure: NextPC and the Registers feed the ID/EX pipeline
  register; the ALU sits between ID/EX and EX/MEM; Data Memory sits between
  EX/MEM and MEM/WR. Muxes at the ALU inputs select among the register file
  values, the Immediate, and results forwarded back from the EX/MEM and
  MEM/WR pipeline registers.)
                                                                       CS211 63
What about memory operations?
º If instructions are initiated in order and operations always occur
  in the same stage, there can be no hazards between memory operations!
º What does delaying WB on arithmetic operations cost?
    – cycles ?
    – hardware ?
º What about data dependence on loads?
    R1 <- R4 + R5
    R2 <- Mem[ R2 + I ]
    R3 <- R2 + R1
  ⇒ “Delayed Loads”
º Can recognize this in decode stage and introduce bubble while
  stalling fetch stage
º Tricky situation:
    R1 <- Mem[ R2 + I ]
    Mem[R3+34] <- R1
  Handle with bypass in memory stage!

  (Datapath sketch: instruction fields op Rd Ra Rb, operand latches A and B,
  memory-stage latches D and R, Mem, and result T routed back to the reg file.)
                                                                       CS211 64
Data Hazard Even with Forwarding
      Time (clock cycles) →

  lw  r1, 0(r2)     Ifetch   Reg     ALU     DMem    Reg
  sub r4,r1,r6              Ifetch   Reg     ALU     DMem    Reg
  and r6,r1,r7                      Ifetch   Reg     ALU     DMem    Reg
  or  r8,r1,r9                              Ifetch   Reg     ALU     DMem    Reg

  (The load produces r1 only at the end of its DMem stage, but sub needs it
  at the start of its ALU stage one cycle earlier — forwarding cannot go
  backwards in time, so a stall is unavoidable.)
                                                                       CS211 65
Data Hazard Even with Forwarding
      Time (clock cycles) →

  lw  r1, 0(r2)     Ifetch   Reg     ALU     DMem    Reg
  sub r4,r1,r6              Ifetch   Reg    Bubble   ALU     DMem    Reg
  and r6,r1,r7                      Ifetch  Bubble   Reg     ALU     DMem    Reg
  or  r8,r1,r9                              Bubble  Ifetch   Reg     ALU     DMem

  (One bubble is inserted: sub waits a cycle so the loaded value can be
  forwarded from the MEM/WB register to its ALU input; and and or are each
  delayed by one cycle as well.)
                                                                       CS211 66
What can we (S/W) do?




                        CS211 67
Software Scheduling to Avoid Load
Hazards
  Try producing fast code for
        a = b + c;
        d = e – f;
  assuming a, b, c, d ,e, and f in memory.
  Slow code:              Fast code:
         LW    Rb,b               LW    Rb,b
         LW    Rc,c               LW    Rc,c
         ADD   Ra,Rb,Rc           LW    Re,e
         SW    a,Ra               ADD   Ra,Rb,Rc
         LW    Re,e               LW    Rf,f
         LW    Rf,f               SW    a,Ra
         SUB   Rd,Re,Rf           SUB   Rd,Re,Rf
         SW    d,Rd               SW    d,Rd
                                                          CS211 68
Control Hazards: Branches

• Instruction flow
   – Stream of instructions processed by Inst. Fetch unit
   – Speed of “input flow” puts bound on rate of
     outputs generated
• Branch instruction affects instruction flow
   – Do not know next instruction to be executed until
     branch outcome known
• When we hit a branch instruction
   – Need to compute target address (where to branch)
   – Resolution of branch condition (true or false)
   – Might need to ‘flush’ pipeline if other instructions
     have been fetched for execution

                                                            CS211 69
Control Hazard on Branches
   Three Stage Stall

  10: beq r1,r3,36   Ifetch   Reg     ALU     DMem    Reg
  14: and r2,r3,r5           Ifetch   Reg     ALU     DMem    Reg
  18: or  r6,r1,r7                   Ifetch   Reg     ALU     DMem    Reg
  22: add r8,r1,r9                           Ifetch   Reg     ALU     DMem    Reg
  36: xor r10,r1,r11                                 Ifetch   Reg     ALU     DMem    Reg

  (With the branch resolved in the MEM stage, three instructions after beq
  are fetched before the outcome is known; if the branch is taken they must
  be flushed and fetch restarts at address 36.)
                                                                       CS211 70
Branch Stall Impact

• If CPI = 1, 30% branch,
       Stall 3 cycles => new CPI = 1.9!
• Two part solution:
  – Determine branch taken or not sooner, AND
  – Compute taken branch address earlier
• MIPS branch tests if register = 0 or ≠ 0
• MIPS Solution:
  – Move Zero test to ID/RF stage
  – Adder to calculate new PC in ID/RF stage
  – 1 clock cycle penalty for branch versus 3

                                                CS211 71
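The slide's 1.9 figure follows directly from the CPI model. A quick check (the formula is from the slides; the helper name is mine):

```python
# CPI with branch stalls: CPI = ideal CPI + branch frequency x penalty.
def cpi_with_branches(ideal_cpi, branch_freq, penalty):
    return ideal_cpi + branch_freq * penalty

print(cpi_with_branches(1.0, 0.30, 3))  # 3-cycle stall: CPI = 1.9
print(cpi_with_branches(1.0, 0.30, 1))  # resolved in ID/RF: CPI = 1.3
```

Moving the zero test and target adder into ID/RF cuts the penalty from 3 cycles to 1, bringing CPI from 1.9 down to 1.3.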
Pipelined MIPS (DLX) Datapath

Instruction   Instr. Decode    Execute          Memory      Write
  Fetch        Reg. Fetch     Addr. Calc.       Access      Back

  (Datapath figure: the zero test and branch-target adder are moved into
  the Instr. Decode / Reg. Fetch stage — this is the correct 1-cycle branch
  latency implementation.)
                                                                       CS211 72
Four Branch Hazard Alternatives
#1: Stall until branch direction is clear – flushing pipe
#2: Predict Branch Not Taken
    –   Execute successor instructions in sequence
    –   “Squash” instructions in pipeline if branch actually taken
    –   Advantage of late pipeline state update
    –   47% DLX branches not taken on average
    –   PC+4 already calculated, so use it to get next instruction
#3: Predict Branch Taken
    – 53% DLX branches taken on average
    – But haven’t calculated branch target address in DLX
        » DLX still incurs 1 cycle branch penalty
        » Other machines: branch target known before outcome



                                                                     CS211 73
Four Branch Hazard Alternatives

#4: Delayed Branch
   – Define branch to take place AFTER a following instruction

     branch instruction
       sequential successor1
       sequential successor2
       ........
       sequential successorn             Branch delay of length n
     branch target if taken

   – 1 slot delay allows proper decision and branch target
     address in 5 stage pipeline
   – DLX uses this



                                                                 CS211 74
Delayed Branch
• Where to get instructions to fill branch delay slot?
   –   Before branch instruction
   –   From the target address: only valuable when branch taken
   –   From fall through: only valuable when branch not taken
   –   Cancelling branches allow more slots to be filled

• Compiler effectiveness for single branch delay slot:
   – Fills about 60% of branch delay slots
   – About 80% of instructions executed in branch delay slots useful
     in computation
   – About 50% (60% x 80%) of slots usefully filled
• Delayed Branch downside: as pipelines deepen to 7-8 stages and machines
  issue multiple instructions per clock (superscalar), a single delay slot
  no longer hides the branch latency and the slots become harder to fill




                                                                  CS211 76
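The "about 50%" figure is just the product of the two rates above, as the slide's (60% x 80%) parenthetical indicates:

```python
# Fraction of branch delay slots that end up doing useful work:
fill_rate   = 0.60   # compiler fills ~60% of delay slots
useful_rate = 0.80   # ~80% of the filled slots are useful in computation
print(fill_rate * useful_rate)  # 0.48, i.e. roughly 50% usefully filled
```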
Evaluating Branch Alternatives
                                      Pipeline depth
  Pipeline speedup = ---------------------------------------------
                     1 + Branch frequency × Branch penalty

Scheduling     Branch               CPI      speedup v.            speedup v.
  scheme       penalty                      unpipelined                  stall
Stall pipeline       3             1.42              3.5                  1.0
Predict taken        1             1.14              4.4                 1.26
Predict not taken    1             1.09              4.5                 1.29
Delayed branch     0.5             1.07              4.6                 1.31


Branch frequency (conditional & unconditional) = 14%; 65% of branches change the PC (taken)


                                                                                               CS211 77
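The table's CPI column can be reproduced from the speedup formula; a sketch (scheme names and the 0.65 taken fraction are from the slide; small differences in the last columns come from rounding and from how the predict-not-taken penalty is averaged):

```python
# CPI = 1 + branch_freq * average penalty; speedup vs. unpipelined is
# depth / CPI, and speedup vs. the stall scheme is CPI_stall / CPI.
depth, freq, taken = 5, 0.14, 0.65
penalty = {"stall pipeline": 3,
           "predict taken": 1,
           "predict not taken": taken * 1,   # pay 1 cycle only when taken
           "delayed branch": 0.5}
cpi = {name: 1 + freq * p for name, p in penalty.items()}
for name, c in cpi.items():
    print(f"{name:18s} CPI={c:.2f}  vs-unpipelined={depth / c:.1f}  "
          f"vs-stall={cpi['stall pipeline'] / c:.2f}")
```

This reproduces CPI = 1.42, 1.14, 1.09, and 1.07 for the four schemes.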
Branch Prediction based on
history
 • Can we use history of branch behaviour to predict
   branch outcome ?
 • Simplest scheme: use 1 bit of “history”
    – Set bit to Predict Taken (T) or Predict Not-taken (NT)
    – Pipeline checks bit value and predicts
        » If incorrect then need to invalidate instruction
    – Actual outcome used to set the bit value
 • Example: let initial value = T, actual outcomes of
   branches are NT, T, T, NT, T, T
    – Predictions are: T, NT, T, T, NT, T
        » 4 wrong, 2 correct ≈ 33% accuracy
 • In general, can have k-bit predictors: more when we
   cover superscalar processors.

                                                               CS211 78
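The 1-bit scheme is short enough to simulate directly. A sketch (function name is mine; the scheme — predict whatever the last actual outcome was — is the slide's):

```python
# One-bit branch predictor: the history bit is set to the last actual
# outcome, and each prediction is simply the current bit value.
def one_bit_predict(outcomes, initial="T"):
    bit, predictions = initial, []
    for actual in outcomes:
        predictions.append(bit)
        bit = actual            # update the bit with the actual outcome
    return predictions

outcomes = ["NT", "T", "T", "NT", "T", "T"]
preds = one_bit_predict(outcomes)
correct = sum(p == a for p, a in zip(preds, outcomes))
print(preds, f"{correct}/{len(outcomes)} correct")
```

On the slide's sequence the predictor is wrong on four of the six branches: every isolated NT costs two mispredictions, one when it occurs and one on the following T.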
Summary :
  Control and Pipelining
• Just overlap tasks; easy if tasks are independent
• Speed Up ≤ Pipeline Depth; if ideal CPI is 1, then:
                  Pipeline depth           Cycle Time (unpipelined)
   Speedup = ------------------------  ×  --------------------------
             1 + Pipeline stall CPI        Cycle Time (pipelined)

• Hazards limit performance on computers:
   – Structural: need more HW resources
   – Data (RAW,WAR,WAW): need forwarding, compiler
     scheduling
   – Control: delayed branch, prediction



                                                             CS211 79
Summary #1/2:
• What makes pipelining easy?
   – all instructions are the same length
   – just a few instruction formats
   – memory operands appear only in loads and stores

• What makes it hard? HAZARDS!
   – structural hazards: suppose we had only one memory
   – control hazards: need to worry about branch instructions
   – data hazards: an instruction depends on a previous
     instruction
• Pipelines pass control information down the pipe just
  as data moves down pipe
• Forwarding/Stalls handled by local control
• Exceptions stop the pipeline

                                                           CS211 80
Introduction to ILP

 • What is ILP?
    – Processor and Compiler design techniques that
      speed up execution by causing individual machine
      operations to execute in parallel
 • ILP is transparent to the user
    – Multiple operations executed in parallel even
      though the system is handed a single program
      written with a sequential processor in mind
 • Same execution hardware as a normal RISC
   machine
    – May be more than one of any given type of
      hardware
                                                         CS211 81
Compiler vs. Processor

               Compiler                                    Hardware

  (Figure: the sequence of ILP tasks — Frontend and Optimizer, Determine
  Dependences, Determine Independences, Bind Operations to Function Units,
  Bind Transports to Busses, Execute — with the compiler/hardware split
  drawn at a different point for each architecture style:)
     – Superscalar: hardware does everything from determining dependences on
     – Dataflow: compiler determines dependences; hardware does the rest
     – Indep. Arch.: compiler also determines independences
     – VLIW: compiler also binds operations to function units
     – TTA: compiler even binds transports to busses; hardware only executes

      B. Ramakrishna Rau and Joseph A. Fisher. Instruction-level parallelism:
      History, overview, and perspective. The Journal of Supercomputing,
      7(1-2):9-50, May 1993.

                                                                       CS211 82

 
In - Vivo and In - Vitro Correlation.pptx
In - Vivo and In - Vitro Correlation.pptxIn - Vivo and In - Vitro Correlation.pptx
In - Vivo and In - Vitro Correlation.pptx
 
Education and training program in the hospital APR.pptx
Education and training program in the hospital APR.pptxEducation and training program in the hospital APR.pptx
Education and training program in the hospital APR.pptx
 
CapTechU Doctoral Presentation -March 2024 slides.pptx
CapTechU Doctoral Presentation -March 2024 slides.pptxCapTechU Doctoral Presentation -March 2024 slides.pptx
CapTechU Doctoral Presentation -March 2024 slides.pptx
 
Ultra structure and life cycle of Plasmodium.pptx
Ultra structure and life cycle of Plasmodium.pptxUltra structure and life cycle of Plasmodium.pptx
Ultra structure and life cycle of Plasmodium.pptx
 
Quality Assurance_GOOD LABORATORY PRACTICE
Quality Assurance_GOOD LABORATORY PRACTICEQuality Assurance_GOOD LABORATORY PRACTICE
Quality Assurance_GOOD LABORATORY PRACTICE
 

Pipeline

  • 1. CS 211: Computer Architecture Instructor: Prof. Bhagi Narahari Dept. of Computer Science Course URL: www.seas.gwu.edu/~narahari/cs211/ CS211 1
  • 2. How to improve performance? • Recall performance is function of – CPI: cycles per instruction – Clock cycle – Instruction count • Reducing any of the 3 factors will lead to improved performance CS211 2
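The CPU time relationship behind these three factors can be sketched in a few lines; the instruction count, CPI, and clock values below are made-up illustrative numbers, not from the slides:

```python
# CPU time = Instruction Count x CPI x Clock cycle time.
# Reducing any one factor (with the others held fixed) reduces CPU time.

def cpu_time(instruction_count, cpi, clock_cycle_s):
    """Classic CPU performance equation."""
    return instruction_count * cpi * clock_cycle_s

# Illustrative numbers: 1e9 instructions, CPI of 2, a 1 ns clock.
base = cpu_time(1e9, 2.0, 1e-9)          # 2.0 seconds
better_cpi = cpu_time(1e9, 1.5, 1e-9)    # improving CPI alone helps
```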
  • 3. How to improve performance? • First step is to apply concept of pipelining to the instruction execution process – Overlap computations • What does this do? – Decrease clock cycle – Decrease effective CPU time compared to original clock cycle • Appendix A of Textbook – Also parts of Chapter 2 CS211 3
  • 4. Pipeline Approach to Improve System Performance • Analogous to fluid flow in pipelines and assembly line in factories • Divide process into “stages” and send tasks into a pipeline – Overlap computations of different tasks by operating on them concurrently in different stages CS211 4
  • 5. Instruction Pipeline • Instruction execution process lends itself naturally to pipelining – overlap the subtasks of instruction fetch, decode and execute CS211 5
  • 6. Linear Pipeline 3 Processor Linear pipeline processes a sequence of subtasks with linear precedence At a higher level - Sequence of processors Data flowing in streams from stage S1 to the final stage Sk Control of data flow : synchronous or asynchronous S1 S2 • • • • Sk CS211 6
  • 7. Synchronous Pipeline 4 All transfers simultaneous One task or operation enters the pipeline per cycle Processors reservation table : diagonal CS211 7
  • 8. Time Space Utilization of Pipeline Full pipeline after 4 cycles [space-time diagram: tasks T1-T4 entering stages S1-S3 over pipeline cycles 1-4] CS211 8
  • 9. Asynchronous 5 Pipeline Transfers performed when individual processors are ready Handshaking protocol between processors Mainly used in multiprocessor systems with message-passing CS211 9
  • 10. Pipeline Clock and Timing [figure: stages Si and Si+1 separated by a latch] Clock cycle of the pipeline: τ = max { τm } + d, where τm is the delay of stage m and d is the latch delay Pipeline frequency: f = 1 / τ CS211 10
  • 11. Speedup and Efficiency A k-stage pipeline processes n tasks in k + (n-1) clock cycles: k cycles for the first task and n-1 cycles for the remaining n-1 tasks Total time to process n tasks: Tk = [ k + (n-1) ] τ For the non-pipelined processor: T1 = n k τ Speedup factor: Sk = T1 / Tk = n k τ / ([ k + (n-1) ] τ) = n k / (k + (n-1)) CS211 11
  • 12. Efficiency and Throughput Efficiency of the k-stage pipeline: Ek = Sk / k = n / (k + (n-1)) Pipeline throughput (the number of tasks per unit time; note equivalence to IPC): Hk = n / ([ k + (n-1) ] τ) = n f / (k + (n-1)) CS211 12
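The speedup, efficiency, and throughput formulas from slides 11 and 12 translate directly into code; the function names below are mine:

```python
def speedup(k, n):
    """S_k = n*k / (k + n - 1) for a k-stage pipeline processing n tasks."""
    return n * k / (k + n - 1)

def efficiency(k, n):
    """E_k = S_k / k = n / (k + n - 1)."""
    return speedup(k, n) / k

def throughput(k, n, tau):
    """H_k = n / ((k + n - 1) * tau), tasks completed per unit time."""
    return n / ((k + n - 1) * tau)

# As n grows large relative to k, S_k approaches k and E_k approaches 1:
# a deep pipeline pays off on long instruction streams.
```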
  • 13. Pipeline Performance: Example • Task has 4 subtasks with times t1=60, t2=50, t3=90, and t4=80 ns (nanoseconds) • Latch delay = 10 ns • Pipeline cycle time = 90+10 = 100 ns • For non-pipelined execution, time per task = 60+50+90+80 = 280 ns • Speedup for the above case: 280/100 = 2.8 !! • Pipeline time for 1000 tasks = (1000 + 4 - 1) × 100 ns = 100,300 ns • Sequential time = 1000 × 280 ns = 280,000 ns • Throughput = 1000/1003 tasks per cycle • What is the problem here ? • How to improve performance ? CS211 13
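A quick sketch that reproduces the numbers on this slide:

```python
stage_times_ns = [60, 50, 90, 80]   # t1..t4 from the slide
latch_delay_ns = 10
k = len(stage_times_ns)             # 4 stages

cycle_ns = max(stage_times_ns) + latch_delay_ns            # 90 + 10 = 100 ns
sequential_ns = sum(stage_times_ns)                        # 280 ns per task
per_task_speedup = sequential_ns / cycle_ns                # 2.8

n = 1000
pipelined_total_ns = (k + n - 1) * cycle_ns                # 1003 * 100 ns
sequential_total_ns = n * sequential_ns                    # 280,000 ns
actual_speedup = sequential_total_ns / pipelined_total_ns  # slightly below 2.8
```

The per-task speedup of 2.8 is an upper bound; with the k-1 fill cycles included, the achieved speedup over 1000 tasks falls slightly short of it.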
  • 14. Non-linear pipelines and pipeline control algorithms • Can have non-linear path in pipeline… – How to schedule instructions so they do no conflict for resources • How does one control the pipeline at the microarchitecture level – How to build a scheduler in hardware ? – How much time does scheduler have to make decision ? CS211 14
  • 15. Non-linear Dynamic Pipelines • Multiple processors (k-stages) as linear pipeline • Variable functions of individual processors • Functions may be dynamically assigned • Feedforward and feedback connections CS211 15
  • 16. Reservation Tables • Reservation table : displays time-space flow of data through the pipeline analogous to opcode of pipeline – Not diagonal, as in linear pipelines • Multiple reservation tables for different functions – Functions may be dynamically assigned • Feedforward and feedback connections • The number of columns in the reservation table : evaluation time of a given function CS211 16
  • 18. Latency Analysis • Latency : the number of clock cycles between two initiations of the pipeline • Collision : an attempt by two initiations to use the same pipeline stage at the same time • Some latencies cause collision, some not CS211 18
  • 19. Collisions (Example) [reservation-table diagram, cycles 1-10: initiations x1 through x4 issued with latency 2 flow through the stages; several stages end up holding two initiations (x1x2, x2x3, ...) in the same cycle, i.e. collisions] Latency = 2 CS211 19
  • 20. 17 Latency Cycle 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 x1 x1 x2 x1 x2 x3 x2 x3 x1 x1 x2 x2 x3 x3 x1 x1 x1 x2 x2 x2 x3 x3 Cycle Cycle Latency cycle : the sequence of initiations which has repetitive subsequence and without collisions Latency sequence length : the number of time intervals within the cycle Average latency : the sum of all latencies divided by the number of latencies along the cycle CS211 20
  • 21. 18 Collision Free Scheduling Goal : to find the shortest average latency Lengths : for reservation table with n columns, maximum forbidden latency is m <= n – 1, and permissible latency p is 1 <= p <= m – 1 Ideal case : p = 1 (static pipeline) Collision vector : C = (CmCm-1 . . .C2C1) [ Ci = 1 if latency i causes collision ] [ Ci = 0 for permissible latencies ] CS211 21
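The forbidden latencies and the collision vector can be computed mechanically from a reservation table. A sketch with a made-up three-stage table (not the table on the next slide):

```python
# Reservation table: reservation[s] = set of clock cycles (0-based)
# in which stage s is used by one initiation. Made-up example.
reservation = {
    "S1": {0, 5},
    "S2": {1, 2},
    "S3": {3, 4},
}

# A latency p is forbidden iff some stage is used at two times that
# differ by p: a second initiation p cycles later would collide there.
forbidden = set()
for times in reservation.values():
    for a in times:
        for b in times:
            if a < b:
                forbidden.add(b - a)

n_columns = 1 + max(t for times in reservation.values() for t in times)
m = n_columns - 1                       # maximum possible forbidden latency
# Collision vector C = (Cm ... C1), Ci = 1 iff latency i causes a collision.
collision_vector = "".join("1" if i in forbidden else "0"
                           for i in range(m, 0, -1))
```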
  • 22. Collision Vector Reservation Table [reservation table marked with x's for two initiations X1 and X2 flowing through the stages] C = (? ? . . . ? ?) CS211 22
  • 23. Back to our focus: Computer Pipelines • Execute billions of instructions, so throughput is what matters • MIPS desirable features: – all instructions same length, – registers located in same place in instruction format, – memory operands only in loads or stores CS211 23
  • 24. Designing a Pipelined Processor • Go back and examine your datapath and control diagram • associated resources with states • ensure that flows do not conflict, or figure out how to resolve • assert control in appropriate stage CS211 24
  • 25. 5 Steps of MIPS Datapath What do we need to do to pipeline the process ? [datapath figure: Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc, Memory Access, Write Back, built from the PC and its adder, instruction memory, register file, sign extend, ALU, data memory, and MUXes] CS211 25
  • 26. 5 Steps of MIPS/DLX Datapath [pipelined datapath figure: the same five stages now separated by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers] • Data stationary control – local decode for each instruction phase / pipeline stage CS211 26
  • 27. Graphically Representing Pipelines • Can help with answering questions like: – how many cycles does it take to execute this code? – what is the ALU doing during cycle 4? – use this representation to help understand datapaths CS211 27
  • 28. Visualizing Pipelining [pipeline diagram: time in clock cycles 1-7 across the top; four instructions in program order, each flowing Ifetch Reg ALU DMem Reg, each starting one cycle after the previous] CS211 28
  • 29. Conventional Pipelined Execution Representation [diagram: six instructions along program flow, each IFetch Dcd Exec Mem WB, staggered by one cycle] CS211 29
  • 30. Single Cycle, Multiple Cycle, vs. Pipeline [timing diagram comparing the three implementations over the same Load, Store, R-type sequence] • Single cycle: one long cycle per instruction, sized for the slowest instruction (Load), so part of the cycle is wasted on shorter instructions like Store • Multiple cycle: each instruction takes several short cycles (Load 5, Store 4), but instructions still execute one after another • Pipeline: instructions overlap, so the sequence finishes far sooner CS211 30
  • 31. The Five Stages of Load Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Load Ifetch Reg/Dec Exec Mem Wr • Ifetch: Instruction Fetch – Fetch the instruction from the Instruction Memory • Reg/Dec: Registers Fetch and Instruction Decode • Exec: Calculate the memory address • Mem: Read the data from the Data Memory • Wr: Write the data back to the register file CS211 31
  • 32. The Four Stages of R-type Cycle 1 Cycle 2 Cycle 3 Cycle 4 R-type Ifetch Reg/Dec Exec Wr • Ifetch: Instruction Fetch – Fetch the instruction from the Instruction Memory • Reg/Dec: Registers Fetch and Instruction Decode • Exec: – ALU operates on the two register operands – Update PC • Wr: Write the ALU output back to the register file CS211 32
  • 33. Pipelining the R-type and Load Instruction Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Clock R-type Ifetch Reg/Dec Exec Wr R-type Ifetch Reg/Dec Exec Wr Load Ifetch Reg/Dec Exec Mem Wr R-type Ifetch Reg/Dec Exec Wr R-type Ifetch Reg/Dec Exec Wr Oops! We have a problem! • We have a pipeline conflict or structural hazard: – Two instructions try to write to the register file at the same time! – Only one write port CS211 33
  • 34. Important Observation • Each functional unit can only be used once per instruction • Each functional unit must be used at the same stage for all instructions: – Load uses the Register File’s Write Port during its 5th stage 1 2 3 4 5 Load Ifetch Reg/Dec Exec Mem Wr – R-type uses the Register File’s Write Port during its 4th stage 1 2 3 4 R-type Ifetch Reg/Dec Exec Wr • 2 ways to solve this pipeline hazard CS211 34
  • 35. Solution 1: Insert “Bubble” into the Pipeline Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Clock Load Ifetch Reg/Dec Exec Mem Wr R-type Ifetch Reg/Dec Exec Wr R-type Ifetch Reg/Dec Bubble Exec Wr R-type Ifetch Bubble Reg/Dec Exec Wr • Insert a “bubble” into the pipeline to prevent 2 writes in the same cycle – The control logic can be complex – We lose an instruction fetch and issue opportunity • No instruction is started in Cycle 6! CS211 35
  • 36. Solution 2: Delay R-type’s Write by One Cycle • Delay R-type’s register write by one cycle: – Now R-type instructions also use the Reg File’s write port at Stage 5 – The Mem stage becomes a NOOP stage: nothing is being done 1 2 3 4 5 R-type Ifetch Reg/Dec Exec Mem Wr [timing diagram, Cycle 1 through Cycle 9: R-type, R-type, Load, R-type, R-type each flowing Ifetch Reg/Dec Exec Mem Wr, one instruction completing per cycle] CS211 36
  • 37. Why Pipeline? • Suppose we execute 100 instructions • Single Cycle Machine – 45 ns/cycle x 1 CPI x 100 inst = 4500 ns • Multicycle Machine – 10 ns/cycle x 4.6 CPI (due to inst mix) x 100 inst = 4600 ns • Ideal pipelined machine – 10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns CS211 37
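The three totals on this slide check out:

```python
n_inst = 100

# Single cycle machine: 45 ns/cycle, CPI of 1.
single_cycle_ns = 45 * 1 * n_inst                 # 4500 ns
# Multicycle machine: 10 ns/cycle, 4.6 CPI due to the instruction mix.
multicycle_ns = 10 * 4.6 * n_inst                 # 4600 ns
# Ideal pipeline: 1 CPI plus 4 cycles to drain the 5-stage pipe.
pipelined_ns = 10 * (1 * n_inst + 4)              # 1040 ns
```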
  • 38. Why Pipeline? Because the resources are there! [pipeline diagram: Inst 0 through Inst 4 in program order, each flowing Im Reg ALU Dm Reg, offset by one clock cycle so every resource is busy every cycle] CS211 38
  • 39. Problems with Pipeline processors? • Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle and introduce stall cycles which increase CPI – Structural hazards: HW cannot support this combination of instructions - two dogs fighting for the same bone – Data hazards: Instruction depends on result of prior instruction still in the pipeline » Data dependencies – Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps). » Control dependencies • Can always resolve hazards by stalling • More stall cycles = more CPU time = less performance – Increase performance = decrease stall cycles CS211 39
  • 40. Back to our old friend: CPU time equation • Recall equation for CPU time • So what are we doing by pipelining the instruction execution process ? – Clock ? – Instruction Count ? – CPI ? » How is CPI effected by the various hazards ? CS211 40
  • 41. Speed Up Equation for Pipelining CPIpipelined = Ideal CPI + Average stall cycles per instruction Speedup = (Ideal CPI × Pipeline depth / (Ideal CPI + Pipeline stall CPI)) × (Cycle Timeunpipelined / Cycle Timepipelined) For simple RISC pipeline, Ideal CPI = 1: Speedup = (Pipeline depth / (1 + Pipeline stall CPI)) × (Cycle Timeunpipelined / Cycle Timepipelined) CS211 41
  • 42. One Memory Port/Structural Hazards [pipeline diagram, cycles 1-7: Load followed by Instr 1-4 in program order, each flowing Ifetch Reg ALU DMem Reg; with a single memory port, Load’s DMem access and a later instruction’s Ifetch need memory in the same cycle] CS211 42
  • 43. One Memory Port/Structural Hazards [same diagram with the conflict resolved: Instr 3 stalls one cycle, inserting a bubble, before its Ifetch Reg ALU DMem Reg] CS211 43
  • 44. Example: Dual-port vs. Single-port • Machine A: Dual ported memory (“Harvard Architecture”) • Machine B: Single ported memory, but its pipelined implementation has a 1.05 times faster clock rate • Ideal CPI = 1 for both • Note - Loads will cause stalls of 1 cycle • Recall our friend: – CPU = IC*CPI*Clk » CPI= ideal CPI + stalls CS211 44
  • 45. Example… • Machine A: Dual ported memory (“Harvard Architecture”) • Machine B: Single ported memory, but its pipelined implementation has a 1.05 times faster clock rate • Ideal CPI = 1 for both • Loads are 40% of instructions executed SpeedUpA = Pipe. Depth/(1 + 0) x (clockunpipe/clockpipe) = Pipeline Depth SpeedUpB = Pipe. Depth/(1 + 0.4 x 1) x (clockunpipe/(clockunpipe/1.05)) = (Pipe. Depth/1.4) x 1.05 = 0.75 x Pipe. Depth SpeedUpA / SpeedUpB = Pipe. Depth/(0.75 x Pipe. Depth) = 1.33 • Machine A is 1.33 times faster CS211 45
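The comparison can be verified numerically; the pipeline depth cancels in the final ratio, so any depth (5 below) gives the same answer:

```python
def pipelined_speedup(depth, stall_cpi, clock_ratio=1.0):
    """Speedup = depth / (ideal CPI + stall CPI) * (old clock / new clock)."""
    return depth / (1.0 + stall_cpi) * clock_ratio

depth = 5  # arbitrary; it cancels in the ratio below

speedup_a = pipelined_speedup(depth, stall_cpi=0.0)        # dual-ported memory
# Machine B: 40% loads, each stalling 1 cycle, but a 1.05x faster clock.
speedup_b = pipelined_speedup(depth, stall_cpi=0.4 * 1, clock_ratio=1.05)

ratio = speedup_a / speedup_b      # 1.4 / 1.05 = 1.333...: A is 1.33x faster
```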
  • 46. Data Dependencies • True dependencies and False dependencies – false implies we can remove the dependency – true implies we are stuck with it! • Three types of data dependencies defined in terms of how succeeding instruction depends on preceding instruction – RAW: Read after Write or Flow dependency – WAR: Write after Read or anti-dependency – WAW: Write after Write CS211 46
  • 47. Three Generic Data Hazards • Read After Write (RAW) InstrJ tries to read operand before InstrI writes it I: add r1,r2,r3 J: sub r4,r1,r3 • Caused by a “Dependence” (in compiler nomenclature). This hazard results from an actual need for communication. CS211 47
  • 48. RAW Dependency • Example program (a) with two instructions – i1: load r1, a; – i2: add r2, r1,r1; • Program (b) with two instructions – i1: mul r1, r4, r5; – i2: add r2, r1, r1; • Both cases we cannot read in i2 until i1 has completed writing the result – In (a) this is due to load-use dependency – In (b) this is due to define-use dependency CS211 48
  • 49. Three Generic Data Hazards • Write After Read (WAR) InstrJ writes operand before InstrI reads it I: sub r4,r1,r3 J: add r1,r2,r3 K: mul r6,r1,r7 • Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”. • Can’t happen in MIPS 5 stage pipeline because: – All instructions take 5 stages, and – Reads are always in stage 2, and – Writes are always in stage 5 CS211 49
  • 50. Three Generic Data Hazards • Write After Write (WAW) InstrJ writes operand before InstrI writes it. • Called an “output dependence” by compiler writers This also results from the reuse of name “r1”. • Can’t happen in MIPS 5 stage pipeline because: – All instructions take 5 stages, and – Writes are always in stage 5 • Will see WAR and WAW in later more complicated pipes I: sub r1,r4,r3 J: add r1,r2,r3 K: mul r6,r1,r7 CS211 50
  • 51. WAR and WAW Dependency • Example program (a): – i1: mul r1, r2, r3; – i2: add r2, r4, r5; • Example program (b): – i1: mul r1, r2, r3; – i2: add r1, r4, r5; • both cases we have dependence between i1 and i2 – in (a) due to r2 must be read before it is written into – in (b) due to r1 must be written by i2 after it has been written into by i1 CS211 51
  • 52. What to do with WAR and WAW ? • Problem: – i1: mul r1, r2, r3; – i2: add r2, r4, r5; • Is this really a dependence/hazard ? CS211 52
  • 53. What to do with WAR and WAW • Solution: Rename Registers – i1: mul r1, r2, r3; – i2: add r6, r4, r5; • Register renaming can solve many of these false dependencies – note the role that the compiler plays in this – specifically, the register allocation process--i.e., the process that assigns registers to variables CS211 53
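Renaming can be sketched as a single pass that gives every write a fresh register, which removes WAR and WAW (false) hazards while keeping RAW (true) dependences intact. The tuple encoding of instructions and the free-register pool are my own simplification:

```python
def rename(instructions, free_regs):
    """instructions: list of (dest, src1, src2) register-name tuples.
    Returns the renamed list; every destination gets a fresh register."""
    mapping = {}            # architectural name -> current physical name
    free = list(free_regs)
    out = []
    for dest, src1, src2 in instructions:
        # Sources read the latest mapping, preserving true RAW dependences.
        s1 = mapping.get(src1, src1)
        s2 = mapping.get(src2, src2)
        # Every write gets a fresh name, removing WAR/WAW on 'dest'.
        fresh = free.pop(0)
        mapping[dest] = fresh
        out.append((fresh, s1, s2))
    return out

# The slide's example: i2's write to r2 is renamed, so the WAR on r2 is gone.
prog = [("r1", "r2", "r3"),   # mul r1, r2, r3
        ("r2", "r4", "r5")]   # add r2, r4, r5  (WAR on r2 with i1)
renamed = rename(prog, ["r6", "r7", "r8"])
```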
  • 54. Hazard Detection in H/W • Suppose instruction i is about to be issued and a predecessor instruction j is in the instruction pipeline • How to detect and store potential hazard information – Note that hazards in machine code are based on register usage – Keep track of results in registers and their usage » Constructing a register data flow graph • For each instruction i construct set of Read registers and Write registers – Rregs(i) is set of registers that instruction i reads from – Wregs(i) is set of registers that instruction i writes to – Use these to define the 3 types of data hazards CS211 54
  • 55. Hazard Detection in Hardware • A RAW hazard exists on register ρ if ρ ∈ Rregs( i ) ∩ Wregs( j ) – Keep a record of pending writes (for inst's in the pipe) and compare with operand regs of current instruction. – When instruction issues, reserve its result register. – When on operation completes, remove its write reservation. • A WAW hazard exists on register ρ if ρ ∈ Wregs( i ) ∩ Wregs( j ) • A WAR hazard exists on register ρ if ρ ∈ Wregs( i ) ∩ Rregs( j ) CS211 55
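The three set-intersection tests on this slide map directly onto Python set operations; the helper below is a sketch, with Rregs and Wregs passed in as plain sets:

```python
def hazards(rregs_i, wregs_i, rregs_j, wregs_j):
    """Hazards between instruction i, about to issue, and an earlier
    in-flight instruction j, following the slide's set definitions."""
    return {
        "RAW": rregs_i & wregs_j,   # i reads a register j will write
        "WAW": wregs_i & wregs_j,   # i writes a register j writes
        "WAR": wregs_i & rregs_j,   # i writes a register j still reads
    }

# j: add r1, r2, r3   (reads r2, r3; writes r1)
# i: sub r4, r1, r3   (reads r1, r3; writes r4)  -> RAW hazard on r1
h = hazards(rregs_i={"r1", "r3"}, wregs_i={"r4"},
            rregs_j={"r2", "r3"}, wregs_j={"r1"})
```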
  • 56. Internal Forwarding: Getting rid of some hazards • In some cases the data needed by the next instruction at the ALU stage has been computed by the ALU (or some stage defining it) but has not been written back to the registers • Can we “forward” this result by bypassing stages ? CS211 56
  • 57. Data Hazard on R1 [pipeline diagram, stages IF ID/RF EX MEM WB: add r1,r2,r3 followed by sub r4,r1,r3, and r6,r1,r7, or r8,r1,r9, xor r10,r1,r11; the instructions close behind the add read r1 before the add has written it back] CS211 57
  • 58. Forwarding to Avoid Data Hazard [same instruction sequence with forwarding paths: the add’s ALU result is fed directly to the ALU inputs of the dependent instructions, removing the stalls] CS211 58
  • 59. Internal Forwarding of Instructions • Forward result from ALU/Execute unit to execute unit in next stage • Also can be used in cases of memory access • in some cases, operand fetched from memory has been computed previously by the program – can we “forward” this result to a later stage thus avoiding an extra read from memory ? – Who does this ? • Internal forwarding cases – Stage i to Stage i+k in pipeline – store-load forwarding – load-store forwarding – store-store forwarding CS211 59
  • 60. Internal Data Forwarding: store-load forwarding [figure: STO M,R1 followed by LD R2,M is replaced by STO M,R1 and MOVE R2,R1, avoiding the memory read] CS211 60
  • 61. Internal Data Forwarding: load-load forwarding [figure: LD R1,M followed by LD R2,M is replaced by LD R1,M and MOVE R2,R1, avoiding the second memory read] CS211 61
  • 62. Internal Data Forwarding: store-store forwarding [figure: two stores to the same location M are replaced by only the second store, since it overwrites the first] CS211 62
  • 63. HW Change for Forwarding [datapath figure: muxes at the ALU inputs select among the ID/EX register values and forwarded results from the EX/MEM and MEM/WB pipeline registers] CS211 63
  • 64. What about memory operations? [alongside a datapath fragment showing the op Rd Ra Rb fields and the memory stage] • If instructions are initiated in order and memory operations always occur in the same stage, there can be no hazards between memory operations! • What does delaying WB on arithmetic operations cost? – cycles ? – hardware ? • What about data dependence on loads? R1 <- R4 + R5 R2 <- Mem[ R2 + I ] R3 <- R2 + R1 ⇒ “Delayed Loads” • Can recognize this in decode stage and introduce bubble while stalling fetch stage • Tricky situation: R1 <- Mem[ R2 + I ] Mem[R3+34] <- R1 Handle with bypass in memory stage! CS211 64
  • 65. Data Hazard Even with Forwarding [pipeline diagram: lw r1, 0(r2) followed by sub r4,r1,r6, and r6,r1,r7, or r8,r1,r9; the loaded value is not available until after the load’s MEM stage, too late to forward to the sub’s EX stage] CS211 65
  • 66. Data Hazard Even with Forwarding [same sequence with a one-cycle bubble inserted before the sub so the load result can be forwarded in time] CS211 66
  • 67. What can we (S/W) do? CS211 67
  • 68. Software Scheduling to Avoid Load Hazards Try producing fast code for a = b + c; d = e – f; assuming a, b, c, d ,e, and f in memory. Slow code: Fast code: LW Rb,b LW Rb,b LW Rc,c LW Rc,c ADD Ra,Rb,Rc LW Re,e SW a,Ra ADD Ra,Rb,Rc LW Re,e LW Rf,f LW Rf,f SW a,Ra SUB Rd,Re,Rf SUB Rd,Re,Rf CS211 68 SW d,Rd SW d,Rd
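The gain from the reordering can be counted with a tiny model in which an instruction that uses the destination of the immediately preceding load pays one stall cycle (the load-use hazard from the previous slides). The tuple encoding of instructions is my own:

```python
def load_use_stalls(program):
    """program: list of (op, dest, srcs). Counts one stall whenever an
    instruction uses the destination of the load right before it."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        op_p, dest_p, _ = prev
        _, _, srcs_c = cur
        if op_p == "LW" and dest_p in srcs_c:
            stalls += 1
    return stalls

# The slide's "slow" order: both ADD and SUB consume a just-loaded value.
slow = [("LW", "Rb", []), ("LW", "Rc", []),
        ("ADD", "Ra", ["Rb", "Rc"]),        # uses Rc loaded just before
        ("SW", None, ["Ra"]),
        ("LW", "Re", []), ("LW", "Rf", []),
        ("SUB", "Rd", ["Re", "Rf"]),        # uses Rf loaded just before
        ("SW", None, ["Rd"])]

# The slide's "fast" order: loads are hoisted so no consumer is adjacent.
fast = [("LW", "Rb", []), ("LW", "Rc", []), ("LW", "Re", []),
        ("ADD", "Ra", ["Rb", "Rc"]),
        ("LW", "Rf", []),
        ("SW", None, ["Ra"]),
        ("SUB", "Rd", ["Re", "Rf"]),
        ("SW", None, ["Rd"])]
```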
  • 69. Control Hazards: Branches • Instruction flow – Stream of instructions processed by Inst. Fetch unit – Speed of “input flow” puts bound on rate of outputs generated • Branch instruction affects instruction flow – Do not know next instruction to be executed until branch outcome known • When we hit a branch instruction – Need to compute target address (where to branch) – Resolution of branch condition (true or false) – Might need to ‘flush’ pipeline if other instructions have been fetched for execution CS211 69
  • 70. Control Hazard on Branches: Three Stage Stall [pipeline diagram: 10: beq r1,r3,36 followed by 14: and r2,r3,r5, 18: or r6,r1,r7, 22: add r8,r1,r9, then 36: xor r10,r1,r11; the three instructions after the branch occupy the pipe before the branch outcome is known] CS211 70
  • 71. Branch Stall Impact • If CPI = 1, 30% branch, Stall 3 cycles => new CPI = 1.9! • Two part solution: – Determine branch taken or not sooner, AND – Compute taken branch address earlier • MIPS branch tests if register = 0 or ≠ 0 • MIPS Solution: – Move Zero test to ID/RF stage – Adder to calculate new PC in ID/RF stage – 1 clock cycle penalty for branch versus 3 CS211 71
  • 72. Pipelined MIPS (DLX) Datapath Instruction Instr. Decode Execute Memory Write Fetch Reg. Fetch Addr. Calc. Access Back This is the correct 1 cycle latency implementation! CS211 72
  • 73. Four Branch Hazard Alternatives #1: Stall until branch direction is clear – flushing pipe #2: Predict Branch Not Taken – Execute successor instructions in sequence – “Squash” instructions in pipeline if branch actually taken – Advantage of late pipeline state update – 47% DLX branches not taken on average – PC+4 already calculated, so use it to get next instruction #3: Predict Branch Taken – 53% DLX branches taken on average – But haven’t calculated branch target address in DLX » DLX still incurs 1 cycle branch penalty » Other machines: branch target known before outcome CS211 73
  • 74. Four Branch Hazard Alternatives #4: Delayed Branch – Define branch to take place AFTER a following instruction branch instruction sequential successor1 sequential successor2 ........ sequential successorn Branch delay of length n branch target if taken – 1 slot delay allows proper decision and branch target address in 5 stage pipeline – DLX uses this CS211 74
  • 76. Delayed Branch • Where to get instructions to fill branch delay slot? – Before branch instruction – From the target address: only valuable when branch taken – From fall through: only valuable when branch not taken – Cancelling branches allow more slots to be filled • Compiler effectiveness for single branch delay slot: – Fills about 60% of branch delay slots – About 80% of instructions executed in branch delay slots useful in computation – About 50% (60% x 80%) of slots usefully filled • Delayed Branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar) CS211 76
  • 77. Evaluating Branch Alternatives Pipeline speedup = Pipeline depth / (1 + Branch frequency × Branch penalty) Scheduling scheme / Branch penalty / CPI / speedup v. unpipelined / speedup v. stall: Stall pipeline: 3 / 1.42 / 3.5 / 1.0 Predict taken: 1 / 1.14 / 4.4 / 1.26 Predict not taken: 1 / 1.09 / 4.5 / 1.29 Delayed branch: 0.5 / 1.07 / 4.6 / 1.31 Conditional & unconditional branches = 14% of instructions; 65% of branches change the PC CS211 77
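The CPI column follows from CPI = 1 + branch frequency × effective branch penalty, with the slide's 14% branch frequency; the speedup columns divide a pipeline depth (assumed 5 here) or the stall scheme's CPI by each CPI. Treating predict-not-taken's effective penalty as 1 × 0.65, for the ~65% of branches that actually change the PC, is my reading of the slide:

```python
branch_freq = 0.14
depth = 5                        # assumed pipeline depth

# Effective penalty in cycles per branch for each scheme.
penalties = {"stall": 3, "predict taken": 1,
             "predict not taken": 1 * 0.65, "delayed branch": 0.5}

cpi = {name: 1 + branch_freq * pen for name, pen in penalties.items()}
speedup_vs_unpipelined = {name: depth / c for name, c in cpi.items()}
speedup_vs_stall = {name: cpi["stall"] / c for name, c in cpi.items()}
```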
  • 78. Branch Prediction based on history • Can we use the history of branch behaviour to predict branch outcome ? • Simplest scheme: use 1 bit of “history” – Set bit to Predict Taken (T) or Predict Not-taken (NT) – Pipeline checks bit value and predicts » If incorrect then need to invalidate instruction – Actual outcome used to set the bit value • Example: let initial value = T; actual outcomes of the branches are NT, T, T, NT, T, T – Predictions are: T, NT, T, T, NT, T » 4 wrong, 2 correct = 33% accuracy • In general, can have k-bit predictors: more when we cover superscalar processors. CS211 78
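The 1-bit scheme is easy to simulate; a sketch, run on the outcome sequence from the slide:

```python
def one_bit_predictor(outcomes, initial=True):
    """Predict with a single history bit; True = Taken.
    Returns (predictions, number_correct)."""
    state = initial
    predictions, correct = [], 0
    for actual in outcomes:
        predictions.append(state)
        if state == actual:
            correct += 1
        state = actual          # the bit is set to the real outcome
    return predictions, correct

# The slide's sequence NT, T, T, NT, T, T with initial prediction T.
outcomes = [False, True, True, False, True, True]
preds, correct = one_bit_predictor(outcomes, initial=True)
# Predictions come out T, NT, T, T, NT, T: isolated not-takens cost
# a 1-bit predictor two mispredictions each.
```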
  • 79. Summary : Control and Pipelining • Just overlap tasks; easy if tasks are independent • Speedup ≤ Pipeline Depth; if ideal CPI is 1, then: Speedup = (Pipeline depth / (1 + Pipeline stall CPI)) × (Cycle Timeunpipelined / Cycle Timepipelined) • Hazards limit performance on computers: – Structural: need more HW resources – Data (RAW,WAR,WAW): need forwarding, compiler scheduling – Control: delayed branch, prediction CS211 79
  • 80. Summary #1/2: What makes Pipelining easy? – all instructions are the same length – just a few instruction formats – memory operands appear only in loads and stores • What makes it hard? HAZARDS! – structural hazards: suppose we had only one memory – control hazards: need to worry about branch instructions – data hazards: an instruction depends on a previous instruction • Pipelines pass control information down the pipe just as data moves down the pipe • Forwarding/stalls handled by local control • Exceptions stop the pipeline CS211 80
  • 81. Introduction to ILP • What is ILP? – Processor and Compiler design techniques that speed up execution by causing individual machine operations to execute in parallel • ILP is transparent to the user – Multiple operations executed in parallel even though the system is handed a single program written with a sequential processor in mind • Same execution hardware as a normal RISC machine – May be more than one of any given type of hardware CS211 81
  • 82. Compiler vs. Processor [figure: the work of determining dependences, determining independences, binding operations to function units, and binding transports to busses is done entirely in hardware for Superscalar, and is moved progressively into the compiler frontend/optimizer for Dataflow, Independence architectures, VLIW, and TTA] B. Ramakrishna Rau and Joseph A. Fisher. Instruction-level parallel processing: History, overview, and perspective. The Journal of Supercomputing, 7(1-2):9-50, May 1993. CS211 82

Editor’s Notes

  2. Here are the timing diagrams showing the differences between the single cycle, multiple cycle, and pipeline implementations. For example, in the pipeline implementation, we can finish executing the Load, Store, and R-type instruction sequence in seven cycles. In the multiple clock cycle implementation, however, we cannot start executing the store until Cycle 6 because we must wait for the load instruction to complete. Similarly, we cannot start the execution of the R-type instruction until the store instruction has completed its execution in Cycle 9. In the Single Cycle implementation, the cycle time is set to accommodate the longest instruction, the Load instruction. Consequently, the cycle time for the Single Cycle implementation can be five times longer than the multiple cycle implementation. But may be more importantly, since the cycle time has to be long enough for the load instruction, it is too long for the store instruction so the last part of the cycle here is wasted.
  3. As shown here, each of these five steps will take one clock cycle to complete. And in pipeline terminology, each step is referred to as one stage of the pipeline.
  4. Let’s take a look at the R-type instructions. The R-type instruction does NOT access data memory so it only takes four clock cycles, or in our new pipeline terminology, four stages to complete. The Ifetch and Reg/Dec stages are identical to the Load instructions. Well they have to be because at this point, we do not know we have a R-type instruction yet. Instead of calculating the effective address during the Exec stage, the R-type instruction will use the ALU to operate on the register operands. The result of this ALU operation is written back to the register file during the Wr back stage.
  5. What happened if we try to pipeline the R-type instructions with the Load instructions? Well, we have a problem here!!! We end up having two instructions trying to write to the register file at the same time! Why do we have this problem (the write “bubble”)?
  6. The first solution is to insert a “bubble” into the pipeline AFTER the load instruction to push back every instruction after the load that is already in the pipeline by one cycle. At the same time, the bubble will delay the Instruction Fetch of the instruction that is about to enter the pipeline by one cycle. Needless to say, the control logic to accomplish this can be complex. Furthermore, this solution also has a negative impact on performance. Notice that due to the “extra” stage (Mem) the Load instruction has, we will not have one instruction finish every cycle (points to Cycle 5). Consequently, a mix of load and R-type instructions will NOT have an average CPI of 1 because in effect, the Load instruction has an effective CPI of 2. So this is not that hot an idea. Let’s try something else.
  7. Well, one thing we can do is to add a “Nop” stage to the R-type instruction pipeline to delay its register file write by one cycle. Now the R-type instruction ALSO uses the register file’s write port at its 5th stage, so we eliminate the write conflict with the load instruction. This is a much simpler solution as far as the control logic is concerned. As far as performance is concerned, we also get back to having one instruction complete per cycle. This is kind of like promoting socialism: by making each individual R-type instruction take 5 cycles instead of 4 cycles to finish, our overall performance is actually better off. The reason for this higher performance is that we end up having a more efficient pipeline.