SlideShare una empresa de Scribd logo
1 de 9
Descargar para leer sin conexión
Datorarkitektur                                            Fö 7/8 - 1         Datorarkitektur                                                            Fö 7/8 - 2




                                                                                                  What is a Superscalar Architecture?



                        SUPERSCALAR PROCESSORS

                                                                                        •      A superscalar architecture is one in which several
                                                                                               instructions can be initiated simultaneously and
                                                                                               executed independently.
                                                                                        •      Pipelining allows several instructions to be
                                                                                               executed at the same time, but they have to be in
                 1. What is a Superscalar Architecture?                                        different pipeline stages at a given moment.
                                                                                        •      Superscalar architectures include all features of
                 2. Superpipelining                                                            pipelining but, in addition, there can be several
                                                                                               instructions executing simultaneously in the same
                 3. Features of Superscalar Architectures                                      pipeline stage.
                                                                                               They have the ability to initiate multiple instructions
                                                                                               during the same clock cycle.
                 4. Data Dependencies

                 5. Policies for Parallel Instruction Execution                     There are two typical approaches today, in order to
                                                                                    improve performance:
                 6. Register Renaming                                                 1. Superpipelining
                                                                                      2. Superscalar




Petru Eles, IDA, LiTH                                                         Petru Eles, IDA, LiTH




      Datorarkitektur                                            Fö 7/8 - 3         Datorarkitektur                                                            Fö 7/8 - 4




                               Superpipelining                                      Pipelined execution
                                                                                    Clock cycle →    1 2                3     4     5     6     7    8     9 10 11

                                                                                    Instr. i                FI DI CO FO EI WO
                                                                                    Instr. i+1                    FI DI CO FO EI WO
                                                                                    Instr. i+2                          FI DI CO FO EI WO
          •      Superpipelining is based on dividing the stages of a
                 pipeline into substages and thus increasing the                    Instr. i+3                               FI DI CO FO EI WO
                 number of instructions which are supported by the                  Instr. i+4                                     FI DI CO FO EI WO
                 pipeline at a given moment.
                                                                                    Instr. i+5                                           FI DI CO FO EI WO
                                                                                    Superpipelined execution
          •      By dividing each stage into two, the clock cycle
                 period τ will be reduced to the half, τ/2; hence, at               Clock cycle →   1 2 3 4                         5     6     7    8     9 10 11
                 the maximum capacity, the pipeline produces a
                                                                                    Instr. i                FI FI DI DI COCO FO FO EI EI WOWO
                 result every τ/2 s.                                                                        1 2 1 2 1 2 1 2 1 2 1 2
                                                                                                               FI FI DI DI COCO FO FO EI EI WOWO
                                                                                    Instr. i+1                 1 2 1 2 1 2 1 2 1 2 1 2
                                                                                    Instr. i+2                    FI FI DI DI COCO FO FO EI EI WOWO
          •      For a given architecture and the corresponding                                                   1 2 1 2 1 2 1 2 1 2 1 2
                 instruction set there is an optimal number of                      Instr. i+3                       FI FI DI DI COCO FO FO EI EI WOWO
                                                                                                                     1 2 1 2 1 2 1 2 1 2 1 2
                 pipeline stages; increasing the number of stages                                                       FI FI DI DI COCO FO FO EI EI WOWO
                 over this limit reduces the overall performance.                   Instr. i+4                           1 2 1 2 1 2 1 2 1 2 1 2
                                                                                    Instr. i+5                             FI FI DI DI COCO FO FO EI EI WOWO
                                                                                                                            1 2 1 2 1 2 1 2 1 2 1 2

          •      A solution to further improve speed is the                         Superscalar execution
                 superscalar architecture.                                          Clock cycle →   1 2 3                     4     5     6     7    8     9 10 11

                                                                                    Instr. i                FI DI CO FO EI WO
                                                                                    Instr. i+1              FI DI CO FO EI WO
                                                                                    Instr. i+2                    FI DI CO FO EI WO
                                                                                    Instr. i+3                    FI DI CO FO EI WO
                                                                                    Instr. i+4                          FI DI CO FO EI WO
                                                                                    Instr. i+5                          FI DI CO FO EI WO


Petru Eles, IDA, LiTH                                                         Petru Eles, IDA, LiTH
Datorarkitektur                                           Fö 7/8 - 5         Datorarkitektur                                                                                                     Fö 7/8 - 6




                           Superscalar Architectures
                                                                                                                                                      Memory




                                                                                                                                                                                                       Commit
                                                                                                                                                      point unit
          •      Superscalar architectures allow several instructions




                                                                                                                                                                       Integer



                                                                                                                                                                                           Integer
                                                                                                                                                      Floating




                                                                                                                                                                         unit



                                                                                                                                                                                             unit
                 to be issued and completed per clock cycle.




                                                                                     Superscalar Architectures (cont’d)
          •      A superscalar architecture consists of a number of
                 pipelines that are working in parallel.




                                                                                                                                Instr. buffer
          •      Depending on the number and kind of parallel units




                                                                                                                                                                                                Register
                                                                                                                                                                     issuing




                                                                                                                                                                                                 Files
                 available, a certain number of instructions can be                                                                                                Instruction
                 executed in parallel.




                                                                                                                                                                    (queues, reservation
          •      In the following example a floating point and two in-
                 teger operations can be issued and executed simul-




                                                                                                                                                                       stations, etc.)
                                                                                                                                                                       Instr. window
                 taneously; each unit is pipelined and can execute




                                                                                                                                      Branch pred.
                                                                                                                                      Addr. calc. &
                 several operations in different pipeline stages.




                                                                                                                                        Fetch &
                                                                                                                                 cache
                                                                                                                               Instruction




                                                                                                                                                                    Rename &
                                                                                                                                                                    Decode &

                                                                                                                                                                     Dispatch
Petru Eles, IDA, LiTH                                                        Petru Eles, IDA, LiTH




      Datorarkitektur                                           Fö 7/8 - 7         Datorarkitektur                                                                                                     Fö 7/8 - 8




                        Limitations on Parallel Execution                                                                 Limitations on Parallel Execution (cont’d)

                                                                                                       •                  Three categories of limitations have to be considered:

                                                                                                                           1. Resource conflicts:
          •      The situations which prevent instructions to be                                                                - They occur if two or more instructions
                 executed in parallel by a superscalar architecture                                                               compete for the same resource (register,
                 are very similar to those which prevent an efficient                                                              memory, functional unit) at the same time;
                 execution on any pipelined architecture (see                                                                     they are similar to structural hazards dis-
                 pipeline hazards - lectures 3, 4).                                                                               cussed with pipelines. Introducing several
                                                                                                                                  parallel pipelined units, superscalar archi-
                                                                                                                                  tectures try to reduce a part of possible re-
          •      The consequences of these situations on                                                                          source conflicts.
                 superscalar architectures are more severe than
                 those on simple pipelines, because the potential of                                                       2. Control (procedural) dependency:
                 parallelism in superscalars is greater and, thus, a                                                            - The presence of branches creates major
                 greater opportunity is lost.                                                                                      problems in assuring an optimal parallel-
                                                                                                                                   ism. How to reduce branch penalties has
                                                                                                                                   been discussed in lectures 7&8.
                                                                                                                                - If instructions are of variable length, they
                                                                                                                                   cannot be fetched and issued in parallel;
                                                                                                                                   an instruction has to be decoded in order
                                                                                                                                   to identify the following one and to fetch it
                                                                                                                                   ⇒ superscalar techniques are efficiently
                                                                                                                                   applicable to RISCs, with fixed instruction
                                                                                                                                   length and format.

                                                                                                                           3. Data conflicts:
                                                                                                                                - Data conflicts are produced by data de-
                                                                                                                                   pendencies between instructions in the
                                                                                                                                   program. Because superscalar architec-
                                                                                                                                   tures provide a great liberty in the order in
                                                                                                                                   which instructions can be issued and
                                                                                                                                   completed, data dependencies have to be
                                                                                                                                   considered with much attention.

Petru Eles, IDA, LiTH                                                        Petru Eles, IDA, LiTH
Datorarkitektur                                           Fö 7/8 - 9          Datorarkitektur                                                         Fö 7/8 - 10




                   Limitations on Parallel Execution (cont’d)

                                                                                                 Limitations on Parallel Execution (cont’d)

      Instructions have to be issued as much as possible in
      parallel.
                                                                                        •      Because of data dependencies, only some part of
                                                                                               the instructions are potential subjects for parallel
                                                                                               execution.

          •      Superscalar architectures exploit the potential of
                 instruction level parallelism present in the program.

          •      An important feature of modern superscalar                             •      In order to find instructions to be issued in parallel,
                 architectures is dynamic instruction scheduling:                              the processor has to select from a sufficiently large
                   - instructions are issued for execution dynamical-                          instruction sequence.
                     ly, in parallel and out of order.
                   - out of order issuing: instructions are issued in-
                     dependent of their sequential order, based only
                     on dependencies and availability of resources.

                                                                                        •      A large window of execution is needed

      Results must be identical with those produced by
      strictly sequential execution.



          •      Data dependencies have to be considered carefully




Petru Eles, IDA, LiTH                                                         Petru Eles, IDA, LiTH




      Datorarkitektur                                           Fö 7/8 - 11         Datorarkitektur                                                         Fö 7/8 - 12




                           Window of Execution                                                         Window of Execution (cont’d)

                                                                                    for (i=0; i<last; i++) {
                                                                                           if (a[i] > a[i+1]) {
                                                                                                 temp = a[i];
                                                                                                 a[i] = a[i+1];
          •      Window of execution:                                                            a[i+1] = temp;
                                                                                                 change++;
                 The set of instructions that is considered for                            }
                 execution at a certain moment. Any instruction in                  }
                 the window can be issued for parallel execution,                   ...............................................................................
                 subject to data dependencies and resource                          r7: address of current element (a[i])
                 constraints.
                                                                                    r3: address for access to a[i], a[i+1]
                                                                                    r5: change; r4: last; r6: i
                                                                                    ................................................................................
          •      The number of instructions in the window should be                 L2 move             r3,r7
                 as large as possible.
                 Problems:
                                                                                       lw               r8,(r3)  r8 ← a[i]
                                                                                       add              r3,r3,4                               basic block 1
                  - Capacity to fetch instructions at a high rate
                                                                                       lw               r9,(r3)  r9 ← a[i+1]
                  - The problem of branches
                                                                                       ble              r8,r9,L3
                                                                                               move     r3,r7
                                                                                               sw       r9,(r3)       a[i] ← r9
                                                                                               add      r3,r3,4                               basic block 2
                                                                                               sw       r8,(r3)       a[i+1] ← r8
                                                                                               add      r5,r5,1       change++
                                                                                    L3 add              r6,r6,1 i++
                                                                                       add              r7,r7,4                               basic block 3
                                                                                       blt              r6,r4,L2

Petru Eles, IDA, LiTH                                                         Petru Eles, IDA, LiTH
Datorarkitektur                                           Fö 7/8 - 13         Datorarkitektur                                              Fö 7/8 - 14




                         Window of Execution (cont’d)                                                          Data Dependencies


          •      The window of execution is extended over basic                         •      All instructions in the window of execution may
                 block borders by branch prediction                                            begin execution, subject to data dependence (and
                                                                                               resource) constraints.


                         speculative execution



                                                                                        •      Three types of data dependencies can be
                                                                                               identified:
          •      With speculative execution, instructions of the
                 predicted path are entered into the window of                                        1. True data dependency
                 execution.
                                                                                                      2. Output dependency
                 Instructions from the predicted path are executed
                                                                                                                                artificial dependencies
                 tentatively.
                 If the prediction turns out to be correct the state                                  3. Antidependency
                 change produced by these instructions will become
                 permanent and visible (the instructions commit); if
                 not, all effects are removed.




Petru Eles, IDA, LiTH                                                         Petru Eles, IDA, LiTH




      Datorarkitektur                                           Fö 7/8 - 15         Datorarkitektur                                              Fö 7/8 - 16




                            True Data Dependency                                                              True Data Dependency


                                                                                    For the example on slide 12:
          •      True data dependency exists when the output of
                 one instruction is required as an input to a
                 subsequent instruction:

                                                                                    L2 move r3,r7
                        MUL R4,R3,R1 R4 ← R3 * R1
                        --------------                                                         lw           r8,(r3)
                        ADD R2,R4,R5 R2 ← R4 + R5
                                                                                               add          r3,r3,4
          •      True data dependencies are intrinsic features of the
                 user’s program. They cannot be eliminated by                                  lw           r9,(r3)
                 compiler or hardware techniques.
          •      True data dependencies have to be detected and                                ble          r8,r9,L3
                 treated: the addition above cannot be executed
                 before the result of the multiplication is available.
                   - The simplest solution is to stall the adder until
                     the multiplier has finished.
                   - In order to avoid the adder to be stalled, the
                     compiler or hardware can find other instructions
                     which can be executed by the adder until the re-
                     sult of the multiplication is available.




Petru Eles, IDA, LiTH                                                         Petru Eles, IDA, LiTH
Datorarkitektur                                            Fö 7/8 - 17         Datorarkitektur                                                Fö 7/8 - 18




                              Output Dependency
                                                                                                                 Antidependency

          •      An output dependency exists if two instructions are
                 writing into the same location; if the second                           •      An antidependency exists if an instruction uses a
                 instruction writes before the first one, an error                               location as an operand while a following one is
                 occurs:                                                                        writing into that location; if the first one is still using
                                                                                                the location when the second one writes into it, an
                        MUL R4,R3,R1 R4 ← R3 * R1                                               error occurs:
                        --------------
                        ADD R4,R2,R5 R4 ← R2 + R5                                                       MUL R4,R3,R1 R4 ← R3 * R1
                                                                                                        --------------
                                                                                                        ADD R3,R2,R5 R3 ← R2 + R5
      For the example on slide 12:

                                                                                     For the example on slide 12:
      L2 move r3,r7
                                                                                     L2 move r3,r7
                 lw       r8,(r3)
                                                                                                lw          r8,(r3)
                 add      r3,r3,4
                                                                                                add         r3,r3,4
                 lw       r9,(r3)
                                                                                                lw          r9,(r3)
                 ble      r8,r9,L3
                                                                                                ble         r8,r9,L3




Petru Eles, IDA, LiTH                                                          Petru Eles, IDA, LiTH




      Datorarkitektur                                            Fö 7/8 - 19         Datorarkitektur                                                Fö 7/8 - 20




      The Nature of Output Dependency and Antidependency                                               The Nature of Output Dependency and
                                                                                                             Antidependency (cont’d)

          •      Output dependencies and antidependencies are
                 not intrinsic features of the executed program; they                Example from slide 12:
                 are not real data dependencies but storage
                 conflicts.
          •      Output dependencies and antidependencies are
                 only the consequence of the manner in which the                                            L2 move      r3,r7
                 programmer or the compiler are using registers (or                                            lw        r8,(r3)
                 memory locations). They are produced by the com-                                              add       r3,r3,4
                 petition of several instructions for the same register.                                       lw        r9,(r3)
          •      In the previous examples the conflicts are produced                                            ble       r8,r9,L3
                 only because:
                   - the output dependency: R4 is used by both in-
                      structions to store the result;
                   - the antidependency: R3 is used by the second
                      instruction to store the result;
          •      The examples could be written without
                 dependencies by using additional registers:
                                                                                                            L2 move      R1,r7
                        MUL R4,R3,R1 R4 ← R3 * R1                                                              lw        r8,(R1)
                        --------------                                                                         add       R2,R1,4
                        ADD R7,R2,R5 R7 ← R2 + R5                                                              lw        r9,(R2)
      and                                                                                                      ble       r8,r9,L3
                        MUL R4,R3,R1 R4 ← R3 * R1
                        --------------
                        ADD R6,R2,R5 R6 ← R2 + R5




Petru Eles, IDA, LiTH                                                          Petru Eles, IDA, LiTH
Datorarkitektur                                                Fö 7/8 - 21         Datorarkitektur                                                Fö 7/8 - 22




              Policies for Parallel Instruction Execution
                                                                                            Policies for Parallel Instruction Execution (cont’d)


          •      The ability of a superscalar processor to execute
                 instructions in parallel is determined by:                                  •      The simplest policy is to execute and complete
                                                                                                    instructions in their sequential order. This, however,
                                                                                                    gives little chances to find instructions which can be
                        1. the number and nature of parallel pipelines
                                                                                                    executed in parallel.
                           (this determines the number and nature of in-
                           structions that can be fetched and executed at                    •      In order to improve parallelism the processor has to
                           the same time);                                                          look ahead and try to find independent instructions
                                                                                                    to execute in parallel.
                        2. the mechanism that the processor uses to find
                           independent instructions (instructions that can
                           be executed in parallel).

                                                                                                    Instructions will be executed in an order different
                                                                                                    from the strictly sequential one, with the
                                                                                                    restriction that the result must be correct.
          •      The policies used for instruction execution are
                 characterized by the following two factors:

                        1. the order in which instructions are issued for                    •      Execution policies:
                           execution;                                                                1. In-order issue with in-order completion.
                        2. the order in which instructions are completed                             2. In-order issue with out-of-order completion.
                           (they write results into registers and memory                             3. Out-of-order issue with out-of-order completion.
                           locations).




Petru Eles, IDA, LiTH                                                              Petru Eles, IDA, LiTH




      Datorarkitektur                                                Fö 7/8 - 23         Datorarkitektur                                                Fö 7/8 - 24




                                                                                                  In-Order Issue with In-Order Completion
         Policies for Parallel Instruction Execution (cont’d)
                                                                                             •      Instructions are issued in the exact order that would
                                                                                                    correspond to sequential execution;
                                                                                                    results are written (completion) in the same order.
      We consider the following instruction sequence:

      I1:     ADDF           R12,R13,R14    R12 ← R13 + R14 (float. pnt.)                                   - An instruction cannot be issued before the pre-
      I2:     ADD            R1,R8,R9       R1 ← R8 + R9                                                     vious one has been issued;
      I3:     MUL            R4,R2,R3       R4 ← R2 * R3                                                   - An instruction completes only after the previous
                                                                                                             one has completed.
      I4:     MUL            R5,R6,R7       R5 ← R6 * R7
      I5:     ADD            R10,R5,R7      R10 ← R5 + R7                                  Decode/                                      Writeback/
      I6:     ADD            R11,R2,R3      R11 ← R2 + R3                                                                Execute                       Cycle
                                                                                            Issue                                       Complete
                                                                                            I1             I2                                              1
                                                                                            I3             I4       I1     I2                              2
                                                                                            I5             I6       I1                                     3
          •      I1 requires two cycles to execute;
                                                                                                                                   I3    I1    I2          4
          •      I3 and I4 are in conflict for the same functional unit;
          •      I5 depends on the value produced by I4 (we have a                                                                 I4    I3                5
                 true data dependency between I4 and I5);                                                                  I5            I4                6
          •      I2, I5 and I6 are in conflict for the same functional                                                      I6            I5                7
                 unit;
                                                                                                                                         I6                8


                                                                                                           - To guarantee in-order completion, instruction is-
                                                                                                             suing stalls when there is a conflict and when
                                                                                                             the unit requires more than one cycle to execute;



Petru Eles, IDA, LiTH                                                              Petru Eles, IDA, LiTH
Datorarkitektur                                              Fö 7/8 - 25         Datorarkitektur                                                Fö 7/8 - 26




                                                                                            In-Order Issue with In-Order Completion (cont’d)

           In-Order Issue with In-Order Completion (cont’d)
                                                                                       If the compiler generates this sequence:

                                                                                       I1:     ADDF           R12,R13,R14       R12 ← R13 + R14 (float. pnt.)
                                                                                       I2:     ADD            R1,R8,R9          R1 ← R8 + R9
          •      The processor detects and handles (by stalling)
                 true data dependencies and resource conflicts.                         I6:     ADD            R11,R2,R3         R11 ← R2 + R3
                                                                                       I4:     MUL            R5,R6,R7          R5 ← R6 * R7
          •      As instructions are issued and completed in their                     I5:     ADD            R10,R5,R7         R10 ← R5 + R7
                 strict order, the resulting parallelism is very much                  I3:     MUL            R4,R2,R3          R4 ← R2 * R3
                 dependent on the way the program is written/
                 compiled.                                                             I6-I4 and I5-I3 could be executed in parallel

                                                                                         Decode/                                       Writeback/
                 If I3 and I6 switch position, the pairs I6-I4 and I5-I3                                              Execute                         Cycle
                                                                                          Issue                                        Complete
                 can be executed in parallel (see following slide).
                                                                                          I1             I2                                               1
          •      With superscalar processors we are interested                            I6             I4      I1     I2                                2
                 in techniques which are not compiler based but                           I5             I3      I1                                       3
                 allow the hardware alone to detect instructions
                                                                                                                        I6      I4      I1    I2          4
                 which can be executed in parallel and to issue
                 them.                                                                                                  I5      I3      I6    I4          5
                                                                                                                                        I5    I3          6
                                                                                                                                                          7
                                                                                                                                                          8

                                                                                           •      The sequence needs only 6 cycles instead of 8.


Petru Eles, IDA, LiTH                                                            Petru Eles, IDA, LiTH




      Datorarkitektur                                              Fö 7/8 - 27         Datorarkitektur                                                Fö 7/8 - 28




           In-Order Issue with In-Order Completion (cont’d)                            Out-of-Order Issue with Out-of-Order Completion

          •      With in-order issue&in-order completion the
                 processor has not to bother about output-
                 dependency and antidependency! It has only to                             •      With in-order issue, no new instruction can be
                 detect true data dependencies.                                                   issued when the processor has detected a conflict
                                                                                                  and is stalled, until after the conflict has been
      No one of the two dependencies will be violated if                                          resolved.
      instructions are issued/completed in-order:

      output dependency                                                                           The processor is not allowed to look ahead for
                                                                                                  further instructions, which could be executed in
                                                                                                  parallel with the current ones.
                        MUL R4,R3,R1 R4 ← R3 * R1
                        --------------
                                                                                           •      Out-of-order issue tries to resolve the above
                        ADD R4,R2,R5 R4 ← R2 + R5                                                 problem. Taking the set of decoded instructions the
                                                                                                  processor looks ahead and issues any instruction,
                                                                                                  in any order, as long as the program execution is
      Antidependency                                                                              correct.


                        MUL R4,R3,R1 R4 ← R3 * R1
                        --------------
                        ADD R3,R2,R5 R3 ← R2 + R5




Petru Eles, IDA, LiTH                                                            Petru Eles, IDA, LiTH
Datorarkitektur                                              Fö 7/8 - 29         Datorarkitektur                                          Fö 7/8 - 30




          Out-of-Order Issue with Out-of-Order Completion                                  Out-of-Order Issue with Out-of-Order Completion
                              (cont’d)                                                                         (cont’d)


                                                                                           •      With out-of-order issue&out-of-order completion
      We consider the instruction sequence in slide 15.                                           the processor has to bother about true data
                                                                                                  dependency and both about output-dependency
       • I6 can be now issued before I5 and in parallel with
                                                                                                  and antidependency!
          I4; the sequence takes only 6 cycles (compared to
          8 if we have in-order issue&in-order completion and
          to 7 with in-order issue&out-of-order completion).

                                                                                       Output dependency can be violated (the addition
                                                                                       completes before the multiplication):
        Decode/                                  Writeback/
                                  Execute                         Cycle
         Issue                                   Complete
                                                                                                         MUL R4,R3,R1 R4 ← R3 * R1
         I1             I2                                            1
                                                                                                         --------------
         I3             I4   I1     I2                                2
                                                                                                         ADD R4,R2,R5 R4 ← R2 + R5
         I5             I6   I1             I3     I2                 3
                                    I6      I4     I1    I3           4
                                    I5             I4    I6           5                Antidependency can be violated (the operand in R3 is
                                                   I5                 6                used after it has been over-written):
                                                                      7
                                                                                                         MUL R4,R3,R1 R4 ← R3 * R1
                                                                      8
                                                                                                         --------------
                                                                                                         ADD R3,R2,R5 R3 ← R2 + R5




Petru Eles, IDA, LiTH                                                            Petru Eles, IDA, LiTH




      Datorarkitektur                                              Fö 7/8 - 31         Datorarkitektur                                          Fö 7/8 - 32




                             Register Renaming                                                                 Final Comments

          •      Output dependencies and antidependencies can
                 be treated similarly to true data dependencies as                         •      The following main techniques are characteristic for
                 normal conflicts. Such conflicts are solved by                                     superscalar processors:
                 delaying the execution of a certain instruction until it
                                                                                                    1. additional pipelined units which are working in
                 can be executed.
                                                                                                       parallel;
          •      Parallelism could be improved by eliminating output
                                                                                                    2. out-of-order issue&out-of-order completion;
                 dependencies and antidependencies, which are
                 not real data dependencies (see slide 11).                                         3. register renaming.
          •      Output dependencies and antidependencies can
                 be eliminated by automatically allocating new                             •      All of the above techniques are aimed to enhance
                 registers to values, when such a dependency has                                  performance.
                 been detected. This technique is called register
                 renaming.
                                                                                           •      Experiments have shown:
                                                                                                   - without the other techniques, only adding addi-
      The output dependency is eliminated by allocating, for                                         tional units is not efficient;
      example, R6 to the value R2+R5:
                                                                                                   - out-of-order issue is extremely important; it al-
               MUL R4,R3,R1 R4 ← R3 * R1                                                             lows to look ahead for independent instructions;
               --------------                                                                      - register renaming can improve performance
               ADD R4,R2,R5 R4 ← R2 + R5                                                             with more than 30%; in this case performance
              (ADD     R6,R2,R5 R6 ← R2 + R5)                                                        is limited only by true dependencies.
                                                                                                   - it is important to provide a fetching/decoding
                                                                                                     capacity so that the window of execution is
      The same is true for the antidependency below:
                                                                                                     sufficiently large.
              MUL R4,R3,R1 R4 ← R3 * R1
              --------------
              ADD R3,R2,R5 R3 ← R2 + R5
             (ADD R6,R2,R5 R6 ← R2 + R5)



Petru Eles, IDA, LiTH                                                            Petru Eles, IDA, LiTH
Datorarkitektur                                    Fö 7/8 - 33




                        Some Architectures

      PowerPC 604
        • six independent execution units:
            - Branch execution unit
            - Load/Store unit
            - 3 Integer units
            - Floating-point unit
        • in-order issue
        • register renaming
      Power PC 620
        • provides in addition to the 604 out-of-order issue



      Pentium
        • three independent execution units:
            - 2 Integer units
            - Floating point unit
        • in-order issue

      Pentium II
        • provides in addition to the Pentium out-of-order
           issue
        • five instructions can be issued in one cycle




Petru Eles, IDA, LiTH

Más contenido relacionado

La actualidad más candente

Superscalar Architecture_AIUB
Superscalar Architecture_AIUBSuperscalar Architecture_AIUB
Superscalar Architecture_AIUBNusrat Mary
 
VLIW(Very Long Instruction Word)
VLIW(Very Long Instruction Word)VLIW(Very Long Instruction Word)
VLIW(Very Long Instruction Word)Pragnya Dash
 
Performance Enhancement with Pipelining
Performance Enhancement with PipeliningPerformance Enhancement with Pipelining
Performance Enhancement with PipeliningAneesh Raveendran
 
Remote core locking-Andrea Lombardo
Remote core locking-Andrea LombardoRemote core locking-Andrea Lombardo
Remote core locking-Andrea LombardoAndrea Lombardo
 
Training Slides: Basics 102: Introduction to Tungsten Clustering
Training Slides: Basics 102: Introduction to Tungsten ClusteringTraining Slides: Basics 102: Introduction to Tungsten Clustering
Training Slides: Basics 102: Introduction to Tungsten ClusteringContinuent
 
What is simultaneous multithreading
What is simultaneous multithreadingWhat is simultaneous multithreading
What is simultaneous multithreadingFraboni Ec
 
Cache coherence problem and its solutions
Cache coherence problem and its solutionsCache coherence problem and its solutions
Cache coherence problem and its solutionsMajid Saleem
 
OpenPOWER Application Optimization
OpenPOWER Application Optimization OpenPOWER Application Optimization
OpenPOWER Application Optimization Ganesan Narayanasamy
 
Fast switching of threads between cores - Advanced Operating Systems
Fast switching of threads between cores - Advanced Operating SystemsFast switching of threads between cores - Advanced Operating Systems
Fast switching of threads between cores - Advanced Operating SystemsRuhaim Izmeth
 
Performance tuning in mule
Performance tuning in mulePerformance tuning in mule
Performance tuning in muleSon Nguyen
 
Introduction 1
Introduction 1Introduction 1
Introduction 1Yasir Khan
 
Distributed Performance testing by funkload
Distributed Performance testing by funkloadDistributed Performance testing by funkload
Distributed Performance testing by funkloadAkhil Singh
 
Mule Script Transformer
Mule Script TransformerMule Script Transformer
Mule Script TransformerAnkush Sharma
 

La actualidad más candente (20)

Superscalar Architecture_AIUB
Superscalar Architecture_AIUBSuperscalar Architecture_AIUB
Superscalar Architecture_AIUB
 
VLIW(Very Long Instruction Word)
VLIW(Very Long Instruction Word)VLIW(Very Long Instruction Word)
VLIW(Very Long Instruction Word)
 
VLIW Processors
VLIW ProcessorsVLIW Processors
VLIW Processors
 
Vliw
VliwVliw
Vliw
 
Performance Enhancement with Pipelining
Performance Enhancement with PipeliningPerformance Enhancement with Pipelining
Performance Enhancement with Pipelining
 
Remote core locking-Andrea Lombardo
Remote core locking-Andrea LombardoRemote core locking-Andrea Lombardo
Remote core locking-Andrea Lombardo
 
Lecture1
Lecture1Lecture1
Lecture1
 
Vliw or epic
Vliw or epicVliw or epic
Vliw or epic
 
Training Slides: Basics 102: Introduction to Tungsten Clustering
Training Slides: Basics 102: Introduction to Tungsten ClusteringTraining Slides: Basics 102: Introduction to Tungsten Clustering
Training Slides: Basics 102: Introduction to Tungsten Clustering
 
Cache coherence
Cache coherenceCache coherence
Cache coherence
 
What is simultaneous multithreading
What is simultaneous multithreadingWhat is simultaneous multithreading
What is simultaneous multithreading
 
Lecture4
Lecture4Lecture4
Lecture4
 
Cache coherence problem and its solutions
Cache coherence problem and its solutionsCache coherence problem and its solutions
Cache coherence problem and its solutions
 
OpenPOWER Application Optimization
OpenPOWER Application Optimization OpenPOWER Application Optimization
OpenPOWER Application Optimization
 
Fast switching of threads between cores - Advanced Operating Systems
Fast switching of threads between cores - Advanced Operating SystemsFast switching of threads between cores - Advanced Operating Systems
Fast switching of threads between cores - Advanced Operating Systems
 
Performance tuning in mule
Performance tuning in mulePerformance tuning in mule
Performance tuning in mule
 
Introduction 1
Introduction 1Introduction 1
Introduction 1
 
Distributed Performance testing by funkload
Distributed Performance testing by funkloadDistributed Performance testing by funkload
Distributed Performance testing by funkload
 
Mule Script Transformer
Mule Script TransformerMule Script Transformer
Mule Script Transformer
 
Cache coherence ppt
Cache coherence pptCache coherence ppt
Cache coherence ppt
 

Similar a Superscalar processors

Instruction pipeline
Instruction pipelineInstruction pipeline
Instruction pipelineajay_a
 
Advanced pipelining
Advanced pipeliningAdvanced pipelining
Advanced pipelininga_spac
 
Speculation and Predication in Compilers
Speculation and Predication in CompilersSpeculation and Predication in Compilers
Speculation and Predication in CompilersBeatriz Aguilar Gallo
 
CCNA Lan Redundancy
CCNA Lan RedundancyCCNA Lan Redundancy
CCNA Lan RedundancyNetworkel
 

Similar a Superscalar processors (8)

Instruction pipeline
Instruction pipelineInstruction pipeline
Instruction pipeline
 
Instruction pipelining (i)
Instruction pipelining (i)Instruction pipelining (i)
Instruction pipelining (i)
 
Instruction pipelining (ii)
Instruction pipelining (ii)Instruction pipelining (ii)
Instruction pipelining (ii)
 
Advanced pipelining
Advanced pipeliningAdvanced pipelining
Advanced pipelining
 
Pipelining
PipeliningPipelining
Pipelining
 
Speculation and Predication in Compilers
Speculation and Predication in CompilersSpeculation and Predication in Compilers
Speculation and Predication in Compilers
 
CCNA Lan Redundancy
CCNA Lan RedundancyCCNA Lan Redundancy
CCNA Lan Redundancy
 
Pipelining In computer
Pipelining In computer Pipelining In computer
Pipelining In computer
 

Superscalar processors

  • 1. Datorarkitektur Fö 7/8 - 1 Datorarkitektur Fö 7/8 - 2 What is a Superscalar Architecture? SUPERSCALAR PROCESSORS • A superscalar architecture is one in which several instructions can be initiated simultaneously and executed independently. • Pipelining allows several instructions to be executed at the same time, but they have to be in 1. What is a Superscalar Architecture? different pipeline stages at a given moment. • Superscalar architectures include all features of 2. Superpipelining pipelining but, in addition, there can be several instructions executing simultaneously in the same 3. Features of Superscalar Architectures pipeline stage. They have the ability to initiate multiple instructions during the same clock cycle. 4. Data Dependencies 5. Policies for Parallel Instruction Execution There are two typical approaches today, in order to improve performance: 6. Register Renaming 1. Superpipelining 2. Superscalar Petru Eles, IDA, LiTH Petru Eles, IDA, LiTH Datorarkitektur Fö 7/8 - 3 Datorarkitektur Fö 7/8 - 4 Superpipelining Pipelined execution Clock cycle → 1 2 3 4 5 6 7 8 9 10 11 Instr. i FI DI CO FO EI WO Instr. i+1 FI DI CO FO EI WO Instr. i+2 FI DI CO FO EI WO • Superpipelining is based on dividing the stages of a pipeline into substages and thus increasing the Instr. i+3 FI DI CO FO EI WO number of instructions which are supported by the Instr. i+4 FI DI CO FO EI WO pipeline at a given moment. Instr. i+5 FI DI CO FO EI WO Superpipelined execution • By dividing each stage into two, the clock cycle period τ will be reduced to the half, τ/2; hence, at Clock cycle → 1 2 3 4 5 6 7 8 9 10 11 the maximum capacity, the pipeline produces a Instr. i FI FI DI DI COCO FO FO EI EI WOWO result every τ/2 s. 1 2 1 2 1 2 1 2 1 2 1 2 FI FI DI DI COCO FO FO EI EI WOWO Instr. i+1 1 2 1 2 1 2 1 2 1 2 1 2 Instr. i+2 FI FI DI DI COCO FO FO EI EI WOWO • For a given architecture and the corresponding 1 2 1 2 1 2 1 2 1 2 1 2 instruction set there is an optimal number of Instr. i+3 FI FI DI DI COCO FO FO EI EI WOWO 1 2 1 2 1 2 1 2 1 2 1 2 pipeline stages; increasing the number of stages FI FI DI DI COCO FO FO EI EI WOWO over this limit reduces the overall performance. Instr. i+4 1 2 1 2 1 2 1 2 1 2 1 2 Instr. i+5 FI FI DI DI COCO FO FO EI EI WOWO 1 2 1 2 1 2 1 2 1 2 1 2 • A solution to further improve speed is the Superscalar execution superscalar architecture. Clock cycle → 1 2 3 4 5 6 7 8 9 10 11 Instr. i FI DI CO FO EI WO Instr. i+1 FI DI CO FO EI WO Instr. i+2 FI DI CO FO EI WO Instr. i+3 FI DI CO FO EI WO Instr. i+4 FI DI CO FO EI WO Instr. i+5 FI DI CO FO EI WO Petru Eles, IDA, LiTH Petru Eles, IDA, LiTH
  • 2. Datorarkitektur Fö 7/8 - 5 Datorarkitektur Fö 7/8 - 6 Superscalar Architectures Memory Commit point unit • Superscalar architectures allow several instructions Integer Integer Floating unit unit to be issued and completed per clock cycle. Superscalar Architectures (cont’d) • A superscalar architecture consists of a number of pipelines that are working in parallel. Instr. buffer • Depending on the number and kind of parallel units Register issuing Files available, a certain number of instructions can be Instruction executed in parallel. (queues, reservation • In the following example a floating point and two in- teger operations can be issued and executed simul- stations, etc.) Instr. window taneously; each unit is pipelined and can execute Branch pred. Addr. calc. & several operations in different pipeline stages. Fetch & cache Instruction Rename & Decode & Dispatch Petru Eles, IDA, LiTH Petru Eles, IDA, LiTH Datorarkitektur Fö 7/8 - 7 Datorarkitektur Fö 7/8 - 8 Limitations on Parallel Execution Limitations on Parallel Execution (cont’d) • Three categories of limitations have to be considered: 1. Resource conflicts: • The situations which prevent instructions to be - They occur if two or more instructions executed in parallel by a superscalar architecture compete for the same resource (register, are very similar to those which prevent an efficient memory, functional unit) at the same time; execution on any pipelined architecture (see they are similar to structural hazards dis- pipeline hazards - lectures 3, 4). cussed with pipelines. Introducing several parallel pipelined units, superscalar archi- tectures try to reduce a part of possible re- • The consequences of these situations on source conflicts. superscalar architectures are more severe than those on simple pipelines, because the potential of 2. Control (procedural) dependency: parallelism in superscalars is greater and, thus, a - The presence of branches creates major greater opportunity is lost. problems in assuring an optimal parallel- ism. How to reduce branch penalties has been discussed in lectures 7&8. - If instructions are of variable length, they cannot be fetched and issued in parallel; an instruction has to be decoded in order to identify the following one and to fetch it ⇒ superscalar techniques are efficiently applicable to RISCs, with fixed instruction length and format. 3. Data conflicts: - Data conflicts are produced by data de- pendencies between instructions in the program. Because superscalar architec- tures provide a great liberty in the order in which instructions can be issued and completed, data dependencies have to be considered with much attention. Petru Eles, IDA, LiTH Petru Eles, IDA, LiTH
  • 3. Datorarkitektur Fö 7/8 - 9 Datorarkitektur Fö 7/8 - 10 Limitations on Parallel Execution (cont’d) Limitations on Parallel Execution (cont’d) Instructions have to be issued as much as possible in parallel. • Because of data dependencies, only some part of the instructions are potential subjects for parallel execution. • Superscalar architectures exploit the potential of instruction level parallelism present in the program. • An important feature of modern superscalar • In order to find instructions to be issued in parallel, architectures is dynamic instruction scheduling: the processor has to select from a sufficiently large - instructions are issued for execution dynamical- instruction sequence. ly, in parallel and out of order. - out of order issuing: instructions are issued in- dependent of their sequential order, based only on dependencies and availability of resources. • A large window of execution is needed Results must be identical with those produced by strictly sequential execution. • Data dependencies have to be considered carefully Petru Eles, IDA, LiTH Petru Eles, IDA, LiTH Datorarkitektur Fö 7/8 - 11 Datorarkitektur Fö 7/8 - 12 Window of Execution Window of Execution (cont’d) for (i=0; i<last; i++) { if (a[i] > a[i+1]) { temp = a[i]; a[i] = a[i+1]; • Window of execution: a[i+1] = temp; change++; The set of instructions that is considered for } execution at a certain moment. Any instruction in } the window can be issued for parallel execution, ............................................................................... subject to data dependencies and resource r7: address of current element (a[i]) constraints. r3: address for access to a[i], a[i+1] r5: change; r4: last; r6: i ................................................................................ • The number of instructions in the window should be L2 move r3,r7 as large as possible. Problems: lw r8,(r3) r8 ← a[i] add r3,r3,4 basic block 1 - Capacity to fetch instructions at a high rate lw r9,(r3) r9 ← a[i+1] - The problem of branches ble r8,r9,L3 move r3,r7 sw r9,(r3) a[i] ← r9 add r3,r3,4 basic block 2 sw r8,(r3) a[i+1] ← r8 add r5,r5,1 change++ L3 add r6,r6,1 i++ add r7,r7,4 basic block 3 blt r6,r4,L2 Petru Eles, IDA, LiTH Petru Eles, IDA, LiTH
  • 4. Datorarkitektur Fö 7/8 - 13 Datorarkitektur Fö 7/8 - 14 Window of Execution (cont’d) Data Dependencies • The window of execution is extended over basic • All instructions in the window of execution may block borders by branch prediction begin execution, subject to data dependence (and resource) constraints. speculative execution • Three types of data dependencies can be identified: • With speculative execution, instructions of the predicted path are entered into the window of 1. True data dependency execution. 2. Output dependency Instructions from the predicted path are executed artificial dependencies tentatively. If the prediction turns out to be correct the state 3. Antidependency change produced by these instructions will become permanent and visible (the instructions commit); if not, all effects are removed. Petru Eles, IDA, LiTH Petru Eles, IDA, LiTH Datorarkitektur Fö 7/8 - 15 Datorarkitektur Fö 7/8 - 16 True Data Dependency True Data Dependency For the example on slide 12: • True data dependency exists when the output of one instruction is required as an input to a subsequent instruction: L2 move r3,r7 MUL R4,R3,R1 R4 ← R3 * R1 -------------- lw r8,(r3) ADD R2,R4,R5 R2 ← R4 + R5 add r3,r3,4 • True data dependencies are intrinsic features of the user’s program. They cannot be eliminated by lw r9,(r3) compiler or hardware techniques. • True data dependencies have to be detected and ble r8,r9,L3 treated: the addition above cannot be executed before the result of the multiplication is available. - The simplest solution is to stall the adder until the multiplier has finished. - In order to avoid the adder to be stalled, the compiler or hardware can find other instructions which can be executed by the adder until the re- sult of the multiplication is available. Petru Eles, IDA, LiTH Petru Eles, IDA, LiTH
  • 5. Datorarkitektur Fö 7/8 - 17 Datorarkitektur Fö 7/8 - 18 Output Dependency Antidependency • An output dependency exists if two instructions are writing into the same location; if the second • An antidependency exists if an instruction uses a instruction writes before the first one, an error location as an operand while a following one is occurs: writing into that location; if the first one is still using the location when the second one writes into it, an MUL R4,R3,R1 R4 ← R3 * R1 error occurs: -------------- ADD R4,R2,R5 R4 ← R2 + R5 MUL R4,R3,R1 R4 ← R3 * R1 -------------- ADD R3,R2,R5 R3 ← R2 + R5 For the example on slide 12: For the example on slide 12: L2 move r3,r7 L2 move r3,r7 lw r8,(r3) lw r8,(r3) add r3,r3,4 add r3,r3,4 lw r9,(r3) lw r9,(r3) ble r8,r9,L3 ble r8,r9,L3 Petru Eles, IDA, LiTH Petru Eles, IDA, LiTH Datorarkitektur Fö 7/8 - 19 Datorarkitektur Fö 7/8 - 20 The Nature of Output Dependency and Antidependency The Nature of Output Dependency and Antidependency (cont’d) • Output dependencies and antidependencies are not intrinsic features of the executed program; they Example from slide 12: are not real data dependencies but storage conflicts. • Output dependencies and antidependencies are only the consequence of the manner in which the L2 move r3,r7 programmer or the compiler are using registers (or lw r8,(r3) memory locations). They are produced by the com- add r3,r3,4 petition of several instructions for the same register. lw r9,(r3) • In the previous examples the conflicts are produced ble r8,r9,L3 only because: - the output dependency: R4 is used by both in- structions to store the result; - the antidependency: R3 is used by the second instruction to store the result; • The examples could be written without dependencies by using additional registers: L2 move R1,r7 MUL R4,R3,R1 R4 ← R3 * R1 lw r8,(R1) -------------- add R2,R1,4 ADD R7,R2,R5 R7 ← R2 + R5 lw r9,(R2) and ble r8,r9,L3 MUL R4,R3,R1 R4 ← R3 * R1 -------------- ADD R6,R2,R5 R6 ← R2 + R5 Petru Eles, IDA, LiTH Petru Eles, IDA, LiTH
  • 6. Datorarkitektur Fö 7/8 - 21 Datorarkitektur Fö 7/8 - 22 Policies for Parallel Instruction Execution Policies for Parallel Instruction Execution (cont’d) • The ability of a superscalar processor to execute instructions in parallel is determined by: • The simplest policy is to execute and complete instructions in their sequential order. This, however, gives little chances to find instructions which can be 1. the number and nature of parallel pipelines executed in parallel. (this determines the number and nature of in- structions that can be fetched and executed at • In order to improve parallelism the processor has to the same time); look ahead and try to find independent instructions to execute in parallel. 2. the mechanism that the processor uses to find independent instructions (instructions that can be executed in parallel). Instructions will be executed in an order different from the strictly sequential one, with the restriction that the result must be correct. • The policies used for instruction execution are characterized by the following two factors: 1. the order in which instructions are issued for • Execution policies: execution; 1. In-order issue with in-order completion. 2. the order in which instructions are completed 2. In-order issue with out-of-order completion. (they write results into registers and memory 3. Out-of-order issue with out-of-order completion. locations). Petru Eles, IDA, LiTH Petru Eles, IDA, LiTH Datorarkitektur Fö 7/8 - 23 Datorarkitektur Fö 7/8 - 24 In-Order Issue with In-Order Completion Policies for Parallel Instruction Execution (cont’d) • Instructions are issued in the exact order that would correspond to sequential execution; results are written (completion) in the same order. We consider the following instruction sequence: I1: ADDF R12,R13,R14 R12 ← R13 + R14 (float. pnt.) - An instruction cannot be issued before the pre- I2: ADD R1,R8,R9 R1 ← R8 + R9 vious one has been issued; I3: MUL R4,R2,R3 R4 ← R2 * R3 - An instruction completes only after the previous one has completed. I4: MUL R5,R6,R7 R5 ← R6 * R7 I5: ADD R10,R5,R7 R10 ← R5 + R7 Decode/ Writeback/ I6: ADD R11,R2,R3 R11 ← R2 + R3 Execute Cycle Issue Complete I1 I2 1 I3 I4 I1 I2 2 I5 I6 I1 3 • I1 requires two cycles to execute; I3 I1 I2 4 • I3 and I4 are in conflict for the same functional unit; • I5 depends on the value produced by I4 (we have a I4 I3 5 true data dependency between I4 and I5); I5 I4 6 • I2, I5 and I6 are in conflict for the same functional I6 I5 7 unit; I6 8 - To guarantee in-order completion, instruction is- suing stalls when there is a conflict and when the unit requires more than one cycle to execute; Petru Eles, IDA, LiTH Petru Eles, IDA, LiTH
  • 7. Datorarkitektur Fö 7/8 - 25 Datorarkitektur Fö 7/8 - 26 In-Order Issue with In-Order Completion (cont’d) In-Order Issue with In-Order Completion (cont’d) If the compiler generates this sequence: I1: ADDF R12,R13,R14 R12 ← R13 + R14 (float. pnt.) I2: ADD R1,R8,R9 R1 ← R8 + R9 • The processor detects and handles (by stalling) true data dependencies and resource conflicts. I6: ADD R11,R2,R3 R11 ← R2 + R3 I4: MUL R5,R6,R7 R5 ← R6 * R7 • As instructions are issued and completed in their I5: ADD R10,R5,R7 R10 ← R5 + R7 strict order, the resulting parallelism is very much I3: MUL R4,R2,R3 R4 ← R2 * R3 dependent on the way the program is written/ compiled. I6-I4 and I5-I3 could be executed in parallel Decode/ Writeback/ If I3 and I6 switch position, the pairs I6-I4 and I5-I3 Execute Cycle Issue Complete can be executed in parallel (see following slide). I1 I2 1 • With superscalar processors we are interested I6 I4 I1 I2 2 in techniques which are not compiler based but I5 I3 I1 3 allow the hardware alone to detect instructions I6 I4 I1 I2 4 which can be executed in parallel and to issue them. I5 I3 I6 I4 5 I5 I3 6 7 8 • The sequence needs only 6 cycles instead of 8. Petru Eles, IDA, LiTH Petru Eles, IDA, LiTH Datorarkitektur Fö 7/8 - 27 Datorarkitektur Fö 7/8 - 28 In-Order Issue with In-Order Completion (cont’d) Out-of-Order Issue with Out-of-Order Completion • With in-order issue&in-order completion the processor has not to bother about output- dependency and antidependency! It has only to • With in-order issue, no new instruction can be detect true data dependencies. issued when the processor has detected a conflict and is stalled, until after the conflict has been No one of the two dependencies will be violated if resolved. instructions are issued/completed in-order: output dependency The processor is not allowed to look ahead for further instructions, which could be executed in parallel with the current ones. MUL R4,R3,R1 R4 ← R3 * R1 -------------- • Out-of-order issue tries to resolve the above ADD R4,R2,R5 R4 ← R2 + R5 problem. Taking the set of decoded instructions the processor looks ahead and issues any instruction, in any order, as long as the program execution is Antidependency correct. MUL R4,R3,R1 R4 ← R3 * R1 -------------- ADD R3,R2,R5 R3 ← R2 + R5 Petru Eles, IDA, LiTH Petru Eles, IDA, LiTH
  • 8. Datorarkitektur Fö 7/8 - 29 Datorarkitektur Fö 7/8 - 30 Out-of-Order Issue with Out-of-Order Completion Out-of-Order Issue with Out-of-Order Completion (cont’d) (cont’d) • With out-of-order issue&out-of-order completion We consider the instruction sequence in slide 15. the processor has to bother about true data dependency and both about output-dependency • I6 can be now issued before I5 and in parallel with and antidependency! I4; the sequence takes only 6 cycles (compared to 8 if we have in-order issue&in-order completion and to 7 with in-order issue&out-of-order completion). Output dependency can be violated (the addition completes before the multiplication): Decode/ Writeback/ Execute Cycle Issue Complete MUL R4,R3,R1 R4 ← R3 * R1 I1 I2 1 -------------- I3 I4 I1 I2 2 ADD R4,R2,R5 R4 ← R2 + R5 I5 I6 I1 I3 I2 3 I6 I4 I1 I3 4 I5 I4 I6 5 Antidependency can be violated (the operand in R3 is I5 6 used after it has been over-written): 7 MUL R4,R3,R1 R4 ← R3 * R1 8 -------------- ADD R3,R2,R5 R3 ← R2 + R5 Petru Eles, IDA, LiTH Petru Eles, IDA, LiTH Datorarkitektur Fö 7/8 - 31 Datorarkitektur Fö 7/8 - 32 Register Renaming Final Comments • Output dependencies and antidependencies can be treated similarly to true data dependencies as • The following main techniques are characteristic for normal conflicts. Such conflicts are solved by superscalar processors: delaying the execution of a certain instruction until it 1. additional pipelined units which are working in can be executed. parallel; • Parallelism could be improved by eliminating output 2. out-of-order issue&out-of-order completion; dependencies and antidependencies, which are not real data dependencies (see slide 11). 3. register renaming. • Output dependencies and antidependencies can be eliminated by automatically allocating new • All of the above techniques are aimed to enhance registers to values, when such a dependency has performance. been detected. This technique is called register renaming. • Experiments have shown: - without the other techniques, only adding addi- The output dependency is eliminated by allocating, for tional units is not efficient; example, R6 to the value R2+R5: - out-of-order issue is extremely important; it al- MUL R4,R3,R1 R4 ← R3 * R1 lows to look ahead for independent instructions; -------------- - register renaming can improve performance ADD R4,R2,R5 R4 ← R2 + R5 with more than 30%; in this case performance (ADD R6,R2,R5 R6 ← R2 + R5) is limited only by true dependencies. - it is important to provide a fetching/decoding capacity so that the window of execution is The same is true for the antidependency below: sufficiently large. MUL R4,R3,R1 R4 ← R3 * R1 -------------- ADD R3,R2,R5 R3 ← R2 + R5 (ADD R6,R2,R5 R6 ← R2 + R5) Petru Eles, IDA, LiTH Petru Eles, IDA, LiTH
  • 9. Datorarkitektur Fö 7/8 - 33 Some Architectures PowerPC 604 • six independent execution units: - Branch execution unit - Load/Store unit - 3 Integer units - Floating-point unit • in-order issue • register renaming Power PC 620 • provides in addition to the 604 out-of-order issue Pentium • three independent execution units: - 2 Integer units - Floating point unit • in-order issue Pentium II • provides in addition to the Pentium out-of-order issue • five instructions can be issued in one cycle Petru Eles, IDA, LiTH