# 3 Pipelining

Sep 3, 2016


1. Pipelining: Basic and Intermediate Concepts
2. Outline • What is pipelining? • The basic pipeline for a RISC instruction set • The major hurdle of pipelining: pipeline hazards – Structural hazards – Data hazards – Control hazards
3. What Is Pipelining?
4. Pipelining: It's Natural! Laundry Example • A, B, C, and D each have one load of clothes to wash, dry, and fold • Washer takes 30 minutes • Dryer takes 40 minutes • Folder takes 20 minutes
5. Sequential Laundry [Figure: timeline from 6 PM to midnight; loads A, B, C, D run back to back in task order, each taking 30 + 40 + 20 minutes] • Sequential laundry takes 6 hours for 4 loads • If they learned pipelining, how long would laundry take?
6. Pipelined Laundry: Start work ASAP [Figure: timeline from 6 PM; loads A, B, C, D overlap, with successive intervals of 30, 40, 40, 40, 40, and 20 minutes] • Pipelined laundry takes 3.5 hours for 4 loads
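
The two totals in the laundry example can be checked with a small schedule calculation. This is a minimal sketch, assuming a load may wait between machines and that each machine handles one load at a time; the function names are illustrative.

```python
# Stage times in minutes, taken from the laundry example above.
STAGE_MINUTES = [30, 40, 20]  # washer, dryer, folder


def sequential_minutes(loads, stages=STAGE_MINUTES):
    """Each load runs through every stage before the next load starts."""
    return loads * sum(stages)


def pipelined_minutes(loads, stages=STAGE_MINUTES):
    """A load enters a stage once it has left the previous stage
    and the previous load has left this stage."""
    free_at = [0] * len(stages)  # free_at[s]: time stage s becomes free
    for _ in range(loads):
        ready = 0  # time this load leaves the previous stage
        for s, minutes in enumerate(stages):
            start = max(ready, free_at[s])
            free_at[s] = start + minutes
            ready = free_at[s]
    return free_at[-1]


print(sequential_minutes(4) / 60, "hours sequential")
print(pipelined_minutes(4) / 60, "hours pipelined")
```

The dryer, the slowest stage, sets the pace: after the first load, one load finishes every 40 minutes.
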
7. Pipelining Lessons • Pipelining does not help the latency of a single task; it helps the throughput of the entire workload • Pipeline rate is limited by the slowest pipeline stage • Multiple tasks operate simultaneously • Potential speedup = number of pipe stages – Unbalanced lengths of pipe stages reduce speedup – Time to fill the pipeline and time to drain it reduce speedup
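
As a worked illustration of these lessons using the laundry numbers (90 minutes per load unpipelined, a 40-minute slowest stage), the speedup for $n$ loads can be written out. The closed form assumes the dryer is always the bottleneck once the pipeline is full.

$$
\text{Speedup}(n) = \frac{90\,n}{90 + 40\,(n-1)}, \qquad
\text{Speedup}(4) = \frac{360}{210} \approx 1.71, \qquad
\lim_{n \to \infty} \text{Speedup}(n) = \frac{90}{40} = 2.25
$$

Even with an unlimited number of loads, the speedup stays below the stage count of 3 because the stages are unbalanced. For only 4 loads, fill and drain time lowers it further.
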
8. What is Pipelining? • Pipelining is an implementation technique whereby multiple instructions are overlapped in execution. • In a computer pipeline, each step in the "pipe-line" completes a part of an instruction. Each of these steps is called a pipe stage or a pipe segment. • The stages are connected one to the next to form a pipe: instructions enter at one end, progress through the stages, and exit at the other end, just as cars would in an assembly line. • The throughput of an instruction pipeline is determined by how often an instruction exits the pipeline. • The time required to move an instruction one step down the pipeline is a processor cycle.
9. What is Pipelining? (Cont.) • Because all stages proceed at the same time, the length of a processor cycle is determined by the time required for the slowest pipe stage. • In a computer, this processor cycle is usually 1 clock cycle (sometimes it is 2, rarely more). • If the stages are perfectly balanced, then the time per instruction on the pipelined processor, assuming ideal conditions, is equal to $\dfrac{\text{Time per instruction on unpipelined machine}}{\text{Number of pipe stages}}$
10. MIPS Architecture • RISC, load-store architecture, simple addressing • 32-bit instructions, fixed format • 32 64-bit GPRs, R0–R31 – Really only 31: R0 is just a constant 0 • 32 64-bit FPRs, F0–F31 – Can also hold 32-bit floats (with the other half unused) – "SIMD" extensions operate on more floats in 1 FPR • A few special registers – Floating-point status register • Load/store 8-, 16-, 32-, 64-bit integers – All sign-extended to fill a 64-bit GPR – Also 32-bit floats and 64-bit doubles
11. MIPS Addressing Modes • Register (arithmetic/logical ops only) • Immediate (arithmetic/logical ops only) and displacement (loads/stores only) – 16-bit immediate / offset field – Register indirect: use 0 as the displacement offset – Direct (absolute): use R0 as the displacement base • Byte-addressed memory, 64-bit address • Software-settable big-endian/little-endian flag • Alignment required
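
The way displacement addressing covers register-indirect and absolute addressing can be sketched as follows. This is an illustrative model, not MIPS hardware: the function names and the register-file list are assumptions, and the offset is assumed to be sign-extended from 16 bits.

```python
MASK64 = (1 << 64) - 1  # addresses are 64 bits wide


def sign_extend16(imm):
    """Interpret the low 16 bits of imm as a signed two's-complement value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def effective_address(regs, base, offset):
    """Displacement addressing: regs[base] + sign-extended 16-bit offset."""
    base_value = 0 if base == 0 else regs[base]  # R0 always reads as 0
    return (base_value + sign_extend16(offset)) & MASK64


regs = [0] * 32
regs[1] = 0x1000

displacement = effective_address(regs, base=1, offset=12)      # 12(R1)
register_indirect = effective_address(regs, base=1, offset=0)  # 0(R1)
absolute = effective_address(regs, base=0, offset=0x0400)      # 0x400(R0)
```

One adder and one instruction format therefore serve all three modes.
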
12. Start with Unpipelined RISC (MIPS) • Every instruction can be executed in 5 steps – Every instruction takes at most 5 clock cycles • Each step's outputs are passed directly to the next step (no latches)
13. Implementation of RISC Instruction Set • Implementing the instruction set requires the introduction of several temporary registers that are not part of the architecture. • Every instruction takes at most 5 clock cycles: (1) IF: instruction fetch; (2) ID: instruction decode and register fetch; (3) EX: execution/effective address; (4) MEM: memory access/branch completion; (5) WB: write back
14. The Basic Pipeline for MIPS
15. Simple MIPS Pipeline • Stages now get executed 1 per cycle – Ideal result is the CPI reduced from 5 to 1 – Is it really this simple? Of course not, but it's a start • Different operations using the same resource in the same cycle? Structural hazard! – Separate instruction and data memories (IM, DM) – Register file: read in ID and written in WB (distinct uses) – The PC is written in IF with either the incremented PC or the branch target of an earlier branch (branch-handling problem) • Registers are needed between two adjacent stages to store intermediate results – Otherwise, they would be overwritten by the next instruction
16. Best Case Pipeline Scenario [Figure: pipeline fill, stable (5 times throughput), and drain phases]
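
The fill, stable, and drain phases of the best case can be reproduced with a short sketch. It assumes no hazards, so instruction i simply enters stage s in cycle i + s + 1; the function names are illustrative.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def total_cycles(num_instructions, num_stages=len(STAGES)):
    """Fill the pipe once, then finish one instruction per cycle."""
    return num_instructions + num_stages - 1


def ideal_schedule(num_instructions, stages=STAGES):
    """Row i, column c holds the stage that instruction i occupies in
    clock cycle c + 1 (empty string if it is not in the pipeline)."""
    n_cycles = total_cycles(num_instructions, len(stages))
    rows = []
    for i in range(num_instructions):
        row = [""] * n_cycles
        for s, name in enumerate(stages):
            row[i + s] = name  # no stalls: one stage per cycle
        rows.append(row)
    return rows


for i, row in enumerate(ideal_schedule(4)):
    print(f"i+{i}: " + " ".join(f"{stage:>3}" for stage in row))
```

For a long instruction stream the four fill cycles become negligible, and the cycle count per instruction approaches 1.
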
17. A pipeline can be thought of as a series of data paths (resources) shifted in time • Register writes are performed in the first half of a clock cycle and register reads in the second half
18. A pipeline showing the pipeline registers between successive pipeline stages
19. Important Pipeline Characteristics • Latency – Time it takes for an instruction to go through the pipe – Latency = number of stages × stage delay – Dominant feature if there are lots of exceptions • Throughput – Determined by the rate at which instructions can start/finish – Dominant feature if there are no exceptions
20. Basic Performance Issues • Pipelining improves CPU instruction throughput – It does not reduce the execution time of an individual instruction – It slightly increases the execution time of an individual instruction: overhead in the control of the pipeline; pipeline register delay + clock skew (Appendix A-10); this overhead limits the practical depth of a pipeline – A program runs faster and has lower total execution time, even though no single instruction runs faster
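
The effect of unbalanced stages and pipeline overhead on latency and speedup can be sketched numerically. The stage delays and the 1 ns overhead below are hypothetical values chosen for illustration, not figures from the slides, and the sketch assumes an ideal pipelined CPI of 1.

```python
def pipelined_cycle_ns(stage_delays_ns, overhead_ns):
    """Clock period: slowest stage plus latch delay and clock skew."""
    return max(stage_delays_ns) + overhead_ns


def speedup(stage_delays_ns, overhead_ns):
    """Unpipelined time per instruction over pipelined time per
    instruction, assuming no hazards (pipelined CPI of 1)."""
    unpipelined_ns = sum(stage_delays_ns)
    return unpipelined_ns / pipelined_cycle_ns(stage_delays_ns, overhead_ns)


# Hypothetical stage delays in ns; not taken from the slides.
delays = [10, 8, 10, 10, 7]
cycle = pipelined_cycle_ns(delays, overhead_ns=1)
latency_ns = len(delays) * cycle  # latency = number of stages x stage delay

print("unpipelined instruction time:", sum(delays), "ns")
print("pipelined instruction latency:", latency_ns, "ns")
print("speedup:", round(speedup(delays, overhead_ns=1), 2))
```

With these numbers a single instruction takes longer in the pipeline (55 ns against 45 ns), yet throughput improves by roughly a factor of 4 rather than the ideal 5.
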
21. Pipeline Hazards • Pipeline hazards prevent the next instruction in the instruction stream from executing during its designated clock cycle • Hazards reduce the pipeline performance from the ideal speedup
22. Pipeline Hazards • Structural hazards – Caused by resource conflicts – Possible to avoid by adding resources, but may be too costly • Data hazards – An instruction depends on the result of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline – Can be mitigated somewhat by a smart compiler • Control hazards – When the PC does not just get incremented – Branches and jumps: not too bad
23. Hazards Cause Stalls: Two Policy Choices • How about just stalling all stages? – The problem is usually a conflict between adjacent stages – Hence nothing moves and the stall condition never clears – Cheap option, but it does not work • Stall later instructions, let earlier ones progress – Instructions issued later than the stalled instruction are also stalled – Instructions issued earlier than the stalled instruction must continue
24. Structural Hazards • If some combination of instructions cannot be accommodated because of resource conflicts, the machine is said to have a structural hazard. – Some functional unit is not fully pipelined – Some resource has not been duplicated enough to allow all combinations of instructions in the pipeline to execute • Single-port register file: conflicts when multiple stages need it • Memory: may be needed in both the IF and MEM stages in the same cycle • The pipeline stalls instructions until the required unit is available – A stall is commonly called a pipeline bubble or just a bubble
25. Structural Hazard Example
26. Remove Structural Hazard • Only loads, stores, and branches use stage 4 • No real hazard if inst1 is not a load or store
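27. Pipeline Stalled for a Structural Hazard (Another View)

    Columns are clock cycle numbers.

    | Inst. | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
    |---|---|---|---|---|---|---|---|---|---|---|
    | i (Load) | IF | ID | EX | MEM | WB | | | | | |
    | i+1 | | IF | ID | EX | MEM | WB | | | | |
    | i+2 | | | IF | ID | EX | MEM | WB | | | |
    | i+3 | | | | STALL | IF | ID | EX | MEM | WB | |
    | i+4 | | | | | | IF | ID | EX | MEM | WB |
    | i+5 | | | | | | | IF | ID | EX | MEM |
    | i+6 | | | | | | | | IF | ID | EX |
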
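
The stall in this view can be reproduced with a small model of a single memory port shared by IF and MEM. This is a simplified sketch: the function name is illustrative, and it assumes an instruction never stalls after it has been fetched, so it reaches MEM exactly 3 cycles after its IF.

```python
def fetch_cycles(uses_memory):
    """Return the IF cycle of each instruction when IF and MEM share one
    single-ported memory. uses_memory[k] is True for loads and stores."""
    fetch = []
    busy = set()  # cycles in which a MEM stage holds the memory port
    cycle = 0
    for is_mem in uses_memory:
        cycle += 1
        while cycle in busy:  # port busy: insert a bubble and retry
            cycle += 1
        fetch.append(cycle)
        if is_mem:
            busy.add(cycle + 3)  # MEM is the fourth stage
    return fetch


# Instruction i is a load; i+1 .. i+6 do not access data memory.
print(fetch_cycles([True] + [False] * 6))
```

The load's MEM stage in cycle 4 pushes the fetch of i+3 to cycle 5, and every later instruction is delayed by the same single cycle.
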
28. Why Would a Designer Allow a Structural Hazard? • A machine without structural hazards will always have a lower CPI (if all other factors are equal) • So why allow one? – To reduce cost: pipelining or duplicating all the functional units may be too costly
29. Data Hazards
30. Introduction • Data hazards occur when the pipeline changes the order of read/write accesses to operands so that the order differs from the order seen by sequentially executing instructions on an unpipelined machine. – Example: a later instruction uses a result that an earlier instruction has not yet produced • Example – ADD R1, R2, R3 – SUB R4, R1, R5 – AND R6, R1, R7 – OR R8, R1, R9 – XOR R10, R1, R11 • R1 ← R2 + R3: R1 is produced by the first instruction and used by every subsequent instruction
31. The use of the result of ADD in the next three instructions causes a hazard, since the register is not written until after those instructions read it…
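
Which consumers are exposed to this hazard can be found by checking each instruction against its recent predecessors. In the sketch below, `window` is an assumed parameter: 3 models a register file written at the end of WB, and 2 models the split-cycle register file (write in the first half, read in the second). The sketch ignores intervening redefinitions of a register.

```python
def raw_hazards(program, window):
    """List (producer_index, consumer_index, register) triples where a
    consumer reads a register written by one of the `window` preceding
    instructions. program items are (dest, src1, src2) register names."""
    hazards = []
    for j, (_, *sources) in enumerate(program):
        for i in range(max(0, j - window), j):
            dest = program[i][0]
            if dest in sources:
                hazards.append((i, j, dest))
    return hazards


program = [
    ("R1", "R2", "R3"),    # ADD R1, R2, R3
    ("R4", "R1", "R5"),    # SUB R4, R1, R5
    ("R6", "R1", "R7"),    # AND R6, R1, R7
    ("R8", "R1", "R9"),    # OR  R8, R1, R9
    ("R10", "R1", "R11"),  # XOR R10, R1, R11
]

no_split = raw_hazards(program, window=3)    # SUB, AND, and OR are exposed
with_split = raw_hazards(program, window=2)  # only SUB and AND are exposed
```

XOR is never exposed: by the time it reads R1 in ID, ADD has already completed its write back.
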
32. Forwarding (also called bypassing, shorting, short-circuiting) • Key is to keep the ALU result around • Example – ADD R1, R2, R3 – SUB R4, R1, R5 – ADD produces the R1 value at the ALU output; SUB needs it at the ALU input • How do we handle this in general? – The forwarded value can be at the ALU output or at the MEM stage output
33. Forwarding (Cont.) • Use the previous example – Forward the result from where ADD produces it (EX/MEM register) to where SUB needs it (ALU input latch) – Forwarding works as follows: • The ALU result from the EX/MEM register is fed back to the ALU input latch • If the forwarding hardware detects that the previous ALU operation has written the register corresponding to a source for the current ALU operation, control logic selects the forwarded result as the ALU input rather than the value read from the register file • Generalization of forwarding – Pass a result directly to the functional unit that requires it: a result is forwarded from the pipeline register corresponding to the output of one unit to the input of another
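
The selection made by the forwarding control logic can be sketched as a priority check on the destination registers held in the pipeline registers. The function name and the string labels are illustrative assumptions, and the sketch covers only ALU source operands.

```python
def select_alu_input(src_reg, ex_mem_dest, mem_wb_dest):
    """Choose where one ALU source operand comes from.

    ex_mem_dest / mem_wb_dest: destination register number of the
    instruction whose result sits in that pipeline register, or None if
    that instruction writes no register.
    """
    if src_reg == 0:
        return "REGFILE"  # R0 is a constant 0; never forward it
    if ex_mem_dest == src_reg:
        return "EX/MEM"   # the most recent result wins
    if mem_wb_dest == src_reg:
        return "MEM/WB"
    return "REGFILE"


# SUB R4, R1, R5 right after ADD R1, R2, R3: R1 comes from EX/MEM.
print(select_alu_input(1, ex_mem_dest=1, mem_wb_dest=None))
```

Checking EX/MEM before MEM/WB matters when two in-flight instructions write the same register: the younger result is the one the consumer must see.
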
34. Result With Forwarding [Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) for ADD R1, R2, R3; SUB R4, R1, R5; AND R6, R1, R7; OR R8, R1, R9; XOR R10, R1, R11]
35. Another Forwarding Example • Example – ADD R1, R2, R3 – LW R4, 0(R1) – SW 12(R1), R4 • Forwarding result: see the next slide
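36. Forwarding result for the previous example • ADD R1, R2, R3: A ← R2, B ← R3; AO = A + B (produces R1); do nothing; R1 ← AO • LW R4, 0(R1): A ← R1, B ← R4, Imm ← 0; AO = A + Imm (uses R1); LMD = Mem[AO] (produces R4); R4 ← LMD • SW 12(R1), R4: A ← R1, B ← R4, Imm ← 12; AO = A + Imm (uses R1); Mem[AO] ← B (uses R4)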
37. When Forwarding Fails • DM: LMD ← MEM[ALUO]; RD: R1 ← LMD • RS: A ← R1, B ← R5; ALU: ALUO ← A − B • RS: A ← R1, B ← R7; ALU: ALUO ← A AND B • RS: A ← R1, B ← R5; ALU: ALUO ← A OR B
38. Stalls • Some latencies can't be absorbed, as in the case on the previous slide – Stalls are the result – Pipeline interlock circuits are needed: they detect a hazard and introduce bubbles until the hazard clears – The CPI for stalled instructions increases by the number of bubbles • Bubbles cause the forwarding paths to change • In MIPS, if the instruction after a load uses the load result, a one-clock-cycle stall occurs
39. Bubbles and New Forwarding Paths
40. Handling Stalls • Hardware: pipeline interlocks – Must detect when required data cannot be provided – Stall stages to create a bubble • Software: pipeline (instruction) scheduling – Performed by a smart compiler • Hardware vs. software, for A = B + C; D = E – F – Original code: LW RB, B; LW RC, C; ADD RA, RB, RC; SW A, RA; LW RE, E; LW RF, F; SUB RD, RE, RF; SW D, RD – After pipeline scheduling: LW RB, B; LW RC, C; LW RE, E; ADD RA, RB, RC; LW RF, F; SW A, RA; SUB RD, RE, RF; SW D, RD
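
The benefit of scheduling can be checked by counting load-use stalls in the two instruction orders from this slide. This is a minimal sketch: it assumes full forwarding, so only a load followed immediately by a consumer of its result costs a cycle, and it assumes store data is forwarded, so stores list no sources needed in EX. The tuple layout is illustrative.

```python
def load_use_stalls(program):
    """Count one-cycle stalls caused by a load whose result is needed in
    EX by the very next instruction.
    program items are (opcode, dest, sources_needed_in_EX)."""
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, curr_sources = curr
        if prev_op == "LW" and prev_dest in curr_sources:
            stalls += 1
    return stalls


# A = B + C; D = E - F, in the two orders shown on the slide.
original = [
    ("LW", "RB", ()), ("LW", "RC", ()), ("ADD", "RA", ("RB", "RC")),
    ("SW", None, ()), ("LW", "RE", ()), ("LW", "RF", ()),
    ("SUB", "RD", ("RE", "RF")), ("SW", None, ()),
]
scheduled = [
    ("LW", "RB", ()), ("LW", "RC", ()), ("LW", "RE", ()),
    ("ADD", "RA", ("RB", "RC")), ("LW", "RF", ()), ("SW", None, ()),
    ("SUB", "RD", ("RE", "RF")), ("SW", None, ()),
]

print(load_use_stalls(original), load_use_stalls(scheduled))
```

Under these assumptions the original order stalls twice (ADD after LW RC, SUB after LW RF), while the scheduled order places an independent instruction after each critical load and does not stall.
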
41. Data Hazard Forms (instruction i occurs before j in program execution order) • RAW (read after write) – j reads before i writes, so j gets the wrong (old) value – Most common form of data hazard – As we have seen, forwarding can overcome this one • WAW (write after write) – j writes before i writes, leaving the incorrect value in place – Can this happen in MIPS? Why? – WAW can happen only in pipelines that write in more than one pipe stage (or allow an instruction to proceed even when a previous instruction is stalled)
42. Data Hazard Forms (Cont.) • WAR (write after read) – i then j is the intended order – j writes before i reads, so i incorrectly gets the new value – Is this a problem in MIPS? Why? – May happen only when some instructions write results early in the pipeline and others read a source late in the pipeline • RAR (read after read) – Not a hazard
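
The three hazard forms follow directly from the read and write sets of two instructions, as the sketch below shows. The function name is illustrative, and it only names the dependences: whether one becomes an actual hazard depends on the pipeline, as noted above.

```python
def classify_hazards(i_reads, i_writes, j_reads, j_writes):
    """Name the dependences between instruction i and a later
    instruction j. Arguments are sets of register names."""
    kinds = []
    if i_writes & j_reads:
        kinds.append("RAW")  # j must not read before i writes
    if i_writes & j_writes:
        kinds.append("WAW")  # j must not write before i writes
    if i_reads & j_writes:
        kinds.append("WAR")  # j must not write before i reads
    return kinds


# ADD R1, R2, R3 followed by SUB R4, R1, R5
print(classify_hazards({"R2", "R3"}, {"R1"}, {"R1", "R5"}, {"R4"}))
```

Two instructions that only read the same register share no entry in any of the three intersections, which is why RAR is not a hazard.
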
43. Control Hazards
44. Introduction • Control hazards – How do branches influence the pipeline? • The problem is more complex: 2 things are needed – Branch target (taken: the next PC is not PC+4; not taken: the condition fails), available in MEM – CC valid: in the MIPS case, the result of the zero-detect unit, available in EX – Both happen late in the pipe • How to deal with branches? – Stall the pipeline as soon as we detect the branch (ID), and keep it stalled until the branch reaches the MEM stage • Three-cycle stall – The first IF is essentially a stall (when the branch is taken) – Consider a 30% branch frequency and an ideal CPI of 1…
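45. A branch causes a 3-cycle stall in the MIPS pipeline

    Columns are successive clock cycles.

    | Instruction | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
    |---|---|---|---|---|---|---|---|---|---|---|
    | Branch instruction | IF | ID | EX | MEM | WB | | | | | |
    | Branch successor (PC+4 or BTA, depending on CC) | | IF | Stall | Stall | IF | ID | EX | MEM | WB | |
    | Branch successor + 1 | | | | | | IF | ID | EX | MEM | WB |
    | Branch successor + 2 | | | | | | | IF | ID | EX | MEM |
    | Branch successor + 3 | | | | | | | | IF | ID | EX |
    | Branch successor + 4 | | | | | | | | | IF | ID |
    | Branch successor + 5 | | | | | | | | | | IF |
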
46. Control Hazard Avoidance • Simplest scheme – Freeze the pipe until you know the CC and the branch target – Cheap but slow – Too slow: with 2 or 3 bubbles per branch, we'd negate half of the pipeline speedup • Predict not taken (47% of MIPS branches are not taken on average) – Make sure any state change (the destructive phase) is deferred until you know whether you guessed right – If not, back out or flush • Predict taken (53% of MIPS branches are taken on average) – No use in MIPS (target address and branch outcome are known at the same stage) • Or let the compiler decide: same options
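
The cost of these schemes can be compared with a simple CPI estimate. The 30% branch frequency, the ideal CPI of 1, and the 53% taken rate come from the slides; applying the full 3-cycle penalty to every taken branch under predict-not-taken is an assumption of this sketch, and the function name is illustrative.

```python
def effective_cpi(ideal_cpi, branch_freq, stalled_fraction, penalty_cycles):
    """Ideal CPI plus the average branch stall cycles per instruction.
    stalled_fraction: share of branches that pay the penalty."""
    return ideal_cpi + branch_freq * stalled_fraction * penalty_cycles


# Freeze: every branch pays the 3-cycle stall.
freeze = effective_cpi(1.0, 0.30, 1.0, 3)

# Predict not taken: only taken branches (53%) pay it (assumed penalty: 3).
predict_not_taken = effective_cpi(1.0, 0.30, 0.53, 3)

print(round(freeze, 3), round(predict_not_taken, 3))
```

Freezing gives a CPI of 1 + 0.30 × 3 = 1.9, nearly double the ideal, while predicting not taken recovers the cycles for the 47% of branches that fall through.
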
47. Predict-Not-Taken [Figure annotation: a stall indeed]
48. What Makes Pipelining Hard to Implement? • Hazards prevent the next instruction from executing during its designated clock cycle. • Exceptions and interrupts add complexity to the pipeline and decrease its efficiency: – They are exceptional situations in which the normal execution order of instructions is changed in unexpected ways. – The terms interrupt, fault, and exception can all be used to describe such situations. – The occurrence of an event is usually signaled by an interrupt from either the hardware or the software. Hardware may trigger an interrupt at any time by sending a signal to the CPU. Software may trigger an interrupt by executing a special operation called a system call (exception or trap).
49. What Makes Pipelining Hard to Implement? • Examples: I/O device request, invoking an operating system service from a user program, integer arithmetic overflow, power failure, page fault, divide error. • Other instructions in the pipeline can raise exceptions that may force the CPU to abort the instructions in the pipeline before they complete. • The pipeline must be safely shut down and the state saved so that the instruction can be restarted in the correct state after the exception is serviced. • When an exception occurs, the pipeline control can take the following steps to save the pipeline state safely: – Force a trap instruction into the pipeline on the next IF.
50. Stopping and Restarting Execution • Until the trap is taken, turn off all writes (WB) for the faulting instruction and for all instructions that follow in the pipeline; this can be done by placing zeros into the pipeline latches of all instructions in the pipeline, starting with the instruction that generates the exception, but not those that precede that instruction. • After the exception-handling routine in the OS receives control, it immediately saves the PC of the faulting instruction (and other PCs). This value will be used to return from the exception later. • After the exception has been handled, special instructions return the processor from the exception by reloading the PCs and restarting the instruction stream. • If the pipeline can be stopped so that the instructions just before the faulting instruction are completed and those after it can be restarted from scratch, the pipeline is said to have precise exceptions.
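
The write-disabling step described above can be sketched as an operation on the list of in-flight instructions. The `InFlight` class and `take_exception` function are hypothetical names for illustration; the sketch models only the write-enable flag and the saved PC, not the trap instruction or the handler.

```python
from dataclasses import dataclass


@dataclass
class InFlight:
    pc: int
    write_enabled: bool = True


def take_exception(pipeline, faulting_index):
    """pipeline: in-flight instructions, oldest first.
    Disable writes for the faulting instruction and everything younger;
    older instructions are left to complete. Returns the PC to save so
    execution can restart at the faulting instruction."""
    for instr in pipeline[faulting_index:]:
        instr.write_enabled = False
    return pipeline[faulting_index].pc


pipeline = [InFlight(pc) for pc in (0x100, 0x104, 0x108, 0x10C, 0x110)]
saved_pc = take_exception(pipeline, faulting_index=2)
print(hex(saved_pc), [instr.write_enabled for instr in pipeline])
```

Because the two older instructions still write their results and nothing younger changes state, restarting from the saved PC gives the behavior the slide calls precise exceptions.
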

### Editor's notes

1. Review today, not so fast in future