Outline
• What is pipelining?
• The basic pipeline for a RISC instruction
set
• The major hurdle of pipelining – pipeline
hazards
– Structural hazards
– Data hazards
– Control hazards
Pipelining: It’s Natural!
Laundry Example
A, B, C and D
each have one load of clothes
to wash, dry, and fold
Washer takes 30 minutes
Dryer takes 40 minutes
Folder takes 20 minutes
Sequential Laundry
[Figure: timeline from 6 PM to midnight. Loads A, B, C, and D run one after another in task order, each taking 30 + 40 + 20 minutes.]
• Sequential laundry takes 6 hours for 4 loads
• If they learned pipelining, how long would laundry take?
Pipelined Laundry: Start work ASAP
[Figure: timeline starting at 6 PM. Loads A, B, C, and D overlap in task order; the time axis is divided into slots of 30, 40, 40, 40, 40, and 20 minutes.]
• Pipelined laundry takes
3.5 hours for 4 loads
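The two totals can be checked with a short Python sketch. It assumes an in-order pipeline in which a finished load may wait between machines until the next one is free; the function name `pipelined_finish_times` is illustrative only.

```python
def pipelined_finish_times(stage_minutes, n_loads):
    """Finish time (minutes after the start) of each load in an in-order pipeline."""
    # stage_free[s] = time at which stage s finishes the previous load
    stage_free = [0] * len(stage_minutes)
    finish = []
    for _ in range(n_loads):
        t = 0  # time at which this load is ready for its next stage
        for s, minutes in enumerate(stage_minutes):
            start = max(t, stage_free[s])  # wait for the load and for the machine
            t = start + minutes
            stage_free[s] = t
        finish.append(t)
    return finish


STAGES = [30, 40, 20]  # washer, dryer, folder (minutes)
sequential = 4 * sum(STAGES)
pipelined = pipelined_finish_times(STAGES, 4)[-1]
print(sequential / 60, pipelined / 60)  # 6.0 3.5
```

The dryer (40 minutes) is the stage every load ends up waiting for, which is why the pipelined total is 30 + 4 x 40 + 20 minutes.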
Pipelining Lessons
• Pipelining does not help
latency of single task, it
helps throughput of entire
workload
• Pipeline rate limited by
slowest pipeline stage
• Multiple tasks operating
simultaneously
• Potential speedup =
Number of pipe stages
– Unbalanced lengths of
pipe stages reduces
speedup
– Time to fill pipeline and
time to drain it reduces
speedup
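A small Python sketch of these lessons, using the laundry numbers. It assumes identical loads, so the pipelined time is one full pass plus one slowest-stage slot per extra load; the function names are illustrative.

```python
STAGES = [30, 40, 20]  # washer, dryer, folder (minutes)


def sequential_minutes(n_loads):
    return n_loads * sum(STAGES)


def pipelined_minutes(n_loads):
    # First load takes the full 90 minutes (fill and drain);
    # every further load adds one slot of the slowest stage.
    return sum(STAGES) + (n_loads - 1) * max(STAGES)


def speedup(n_loads):
    return sequential_minutes(n_loads) / pipelined_minutes(n_loads)


for n in (1, 4, 100, 10**6):
    print(n, round(speedup(n), 3))
```

With 4 loads the speedup is about 1.71. Even with very many loads it only approaches 90/40 = 2.25, below the 3 pipe stages, because the stages are unbalanced.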
What is Pipelining?
• Pipelining is an implementation technique whereby multiple
instructions are overlapped in execution.
• In a computer pipeline, each step in the "pipeline" completes a
part of an instruction. Each of these steps is called a pipe stage or a
pipe segment.
• The stages are connected one to the next to form a pipe: instructions
enter at one end, progress through the stages, and exit at the other
end, just as cars would in an assembly line.
• The throughput of an instruction pipeline is determined by how
often an instruction exits the pipeline.
• The time required to move an instruction one step down the
pipeline is a processor cycle.
What is Pipelining? (Cont.)
• Because all stages proceed at the same time, the length
of a processor cycle is determined by the time required
for the slowest pipe stage.
• In a computer, this processor cycle is usually 1 clock
cycle (sometimes it is 2, rarely more).
• If the stages are perfectly balanced, then the time per
instruction on the pipelined processor—assuming ideal
conditions—is equal to
$$\frac{\text{Time per instruction on unpipelined machine}}{\text{Number of pipe stages}}$$
MIPS Architecture
• RISC, load-store architecture, simple address
• 32-bit instructions, fixed format
• 32 64-bit GPRs, R0-R31.
– Really, only 31 – R0 is just a constant 0.
• 32 64-bit FPRs, F0-F31
– Can hold 32-bit floats also (with other ½ unused).
– “SIMD” extensions operate on more floats in 1 FPR
• A few special registers
– Floating-point status register
• Load/store 8-, 16-, 32-, 64-bit integers
– All sign-extended to fill 64-bit GPR
– Also 32-bit floats and 64-bit doubles
MIPS Addressing Modes
• Register (arith./logical ops only)
• Immediate (arith./logical only) & Displacement
(load/stores only)
– 16-bit immediate / offset field
– Register indirect: use 0 as displacement offset
– Direct (absolute): use R0 as displacement base
• Byte-addressed memory, 64-bit address
• Software-settable big-endian/little-endian flag
• Alignment required
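A minimal Python sketch of displacement addressing as described above: the effective address is the base register plus the sign-extended 16-bit offset. The helper names and the register-file list are assumptions for illustration.

```python
MASK64 = (1 << 64) - 1  # 64-bit addresses


def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def effective_address(regs, base, offset16):
    """Displacement mode: contents of the base register plus the signed offset."""
    return (regs[base] + sign_extend16(offset16)) & MASK64


regs = [0] * 32   # R0 is the constant 0
regs[1] = 0x1000

print(hex(effective_address(regs, 1, 12)))      # displacement: 12(R1)
print(hex(effective_address(regs, 1, 0)))       # register indirect: 0(R1)
print(hex(effective_address(regs, 0, 0x400)))   # direct (absolute): 0x400(R0)
```

Register indirect and direct addressing need no extra hardware: they are the displacement mode with a zero offset or with R0 as the base.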
Start with Unpipelined RISC (MIPS)
• Every instruction can be executed in 5 steps
– Every instruction takes at most 5 clock cycles
• Each step's outputs are passed directly to the next step (no latches)
Implementation of RISC Instruction Set
• Implementing the instruction set requires the
introduction of several temporary registers that are
not part of the architecture.
• Every instruction takes at most 5 clock cycles:
1. IF - instruction fetch
2. ID - instruction decode and register fetch
3. EX - execution/effective address
4. MEM - memory access/ branch completion
5. WB - write back
Simple MIPS Pipeline
Stages now get executed 1 per cycle
› Ideal result is the CPI reduced from 5 to 1
› Is it really this simple? Of course not but it’s a start
What if different operations use the same resource in the same
cycle?
Structural hazard!
Separate instruction and data memories (IM, DM)
Register files: read in ID and write in WB (distinct use)
› Write PC in IF and write either the incremented PC or
the value of the branch target of an earlier branch
(branch handling problem)
Registers are needed between two adjacent
stages for storing intermediate results
› Otherwise, they will be overwritten by the next instruction
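The ideal case, with one stage executed per cycle and no hazards, can be drawn with a short Python sketch. The function name `pipeline_chart` is illustrative.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_chart(n_instructions):
    """One row per instruction, one column per clock cycle (ideal, no stalls)."""
    n_cycles = n_instructions + len(STAGES) - 1
    rows = []
    for i in range(n_instructions):
        row = [""] * n_cycles
        for s, name in enumerate(STAGES):
            row[i + s] = name  # instruction i is in stage s during cycle i + s
        rows.append(row)
    return rows


for i, row in enumerate(pipeline_chart(4)):
    print(f"i+{i}  " + " ".join(f"{cell:<4}" for cell in row))
```

Under these assumptions n instructions finish in n + 4 cycles instead of 5n, so the CPI approaches 1 as n grows. In every column each stage name appears at most once, which is what the separate IM/DM and the pipeline registers make possible.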
A pipeline can be thought of as a series of data paths (resources) shifted in time
Register file: write in the first half of the clock cycle (CC), read in the second half
A pipeline showing the pipeline registers
between successive pipeline stages
Important Pipeline Characteristics
Latency
› Time it takes for an instruction to go through
the pipe
› Latency = # stages x stage-delay
› Dominant feature if there are lots of
exceptions
Throughput
› Determined by the rate at which instructions
can start/finish
› Dominant feature if no exceptions
Basic Performance Issues
Pipelining improves CPU instruction throughput
› Does not reduce the execution time of an
individual instruction
› Slightly increases the execution time of an individual
instruction
Overhead in the control of the pipeline
Pipeline register delay + clock skew (Appendix A-10)
Limit the practical depth of a pipeline
› A program runs faster and has lower total
execution time, even though no single instruction
runs faster
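A hedged numeric sketch of this trade-off in Python. The stage delays (five stages of 1.0 ns) and the 0.2 ns overhead for pipeline registers and clock skew are made-up numbers chosen only for illustration.

```python
def pipelined_cycle_ns(stage_ns, overhead_ns):
    """Clock period: slowest stage plus pipeline register delay and clock skew."""
    return max(stage_ns) + overhead_ns


stage_ns = [1.0, 1.0, 1.0, 1.0, 1.0]  # hypothetical balanced stages
overhead_ns = 0.2                      # hypothetical register delay + skew

unpipelined_latency = sum(stage_ns)                   # one instruction, no latches
cycle = pipelined_cycle_ns(stage_ns, overhead_ns)
pipelined_latency = len(stage_ns) * cycle             # latency = # stages x stage delay
throughput_speedup = unpipelined_latency / cycle      # one instruction finishes per cycle

print(unpipelined_latency, pipelined_latency, round(throughput_speedup, 2))
```

With these numbers a single instruction takes longer (6 ns instead of 5 ns), yet instructions complete about 4.17 times as often. The fixed overhead per stage is also what limits how deep a pipeline can usefully be.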
Pipeline Hazards
• Pipeline hazards prevent the next
instruction in the instruction stream
from executing during its designated
clock cycle
• Hazards reduce the pipeline
performance from the ideal speedup
Pipeline Hazards
Structural hazards
› Caused by resource conflict
› Possible to avoid by adding resources – but may be
too costly
Data hazards
› Instruction depends on the results of a previous
instruction in a way that is exposed by the
overlapping of instructions in the pipeline
› Can be mitigated somewhat by a smart compiler
Control hazards
› When the PC does not get just incremented
› Branches and jumps - not too bad
Hazards cause Stalls – Two Policy Choices
• How about just stalling all stages
– OK but problem is usually adjacent stage conflicts
– Hence nothing moves and stall condition never clears
– Cheap option but it does not work
• Stall later instructions, let earlier ones progress
– Instructions issued later than the stalled instructions are also
stalled
– Instructions issued earlier than the stalled instructions must
continue
Structural Hazards
• If some combination of instructions cannot be accommodated
because of resource conflicts, the machine is said to have a
structural hazard.
– Some functional unit is not fully pipelined
– Some resource has not been duplicated enough to allow all
combinations of instructions in the pipeline to execute
• Single port register file - conflict with multiple stage needs
• Memory fetch - may need one in both IF and MEM stages
• Pipeline stalls instructions until the required unit is available
– A stall is commonly called a pipeline bubble or just bubble
Pipeline Stalled for a Structural
Hazard (Another View)
Clock Cycle Number
Inst.      1      2      3      4      5      6      7      8      9      10
i (Load)   IF     ID     EX     MEM    WB
i+1               IF     ID     EX     MEM    WB
i+2                      IF     ID     EX     MEM    WB
i+3                             STALL  IF     ID     EX     MEM    WB
i+4                                           IF     ID     EX     MEM    WB
i+5                                                  IF     ID     EX     MEM
i+6                                                         IF     ID     EX
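The fetch cycles in this table can be reproduced with a Python sketch of a single memory port shared by IF and MEM. It assumes the data access always wins the port and that only the load touches data memory; the function name `fetch_cycles` is illustrative.

```python
def fetch_cycles(kinds, mem_stage_offset=3):
    """Return the IF cycle (1-based) of each instruction.

    kinds: list of 'load', 'store', or 'alu' in program order.
    MEM happens mem_stage_offset cycles after IF (IF, ID, EX, MEM).
    """
    busy = set()   # cycles in which a data access occupies the memory port
    cycles = []
    cycle = 0
    for kind in kinds:
        cycle += 1
        while cycle in busy:   # port taken by an earlier MEM: insert a bubble
            cycle += 1
        cycles.append(cycle)
        if kind in ("load", "store"):
            busy.add(cycle + mem_stage_offset)
    return cycles


program = ["load", "alu", "alu", "alu", "alu", "alu", "alu"]
print(fetch_cycles(program))  # [1, 2, 3, 5, 6, 7, 8]
```

Instruction i+3 is the one that loses a cycle, because its fetch would fall in cycle 4, the same cycle as the load's MEM stage.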
Why Would a Designer Allow Structural
Hazard?
• A machine without structural hazards will
always have a lower CPI (if all other
factors are equal)
• Why, then, would a designer allow a structural
hazard?
– To reduce cost
• Pipelining or duplicating all the functional units may be
too costly
Introduction
• Data hazards occur when the pipeline changes the order of
read/write accesses to operands so that the order differs from
the order seen sequentially executing instructions on an
unpipelined machine.
– Example: a later instruction uses a result that has not yet
been produced by an earlier instruction
• Example
– ADD R1, R2, R3
– SUB R4, R1, R5
– AND R6, R1, R7
– OR R8, R1, R9
– XOR R10, R1, R11
R1 ← R2 + R3
R1 gets produced in the first instruction,
and used in every subsequent instruction
The use of the result of ADD in the next three instructions causes a
hazard, since the register is not written until after those instructions
read it…
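A Python sketch of the timing argument, assuming the 5-stage pipeline with register read in ID and register write in WB, no forwarding, and no intervening write to the same register. The tuple layout and the function name `hazardous_readers` are illustrative.

```python
# (opcode, destination, sources) in program order
PROGRAM = [
    ("ADD", "R1", ("R2", "R3")),
    ("SUB", "R4", ("R1", "R5")),
    ("AND", "R6", ("R1", "R7")),
    ("OR", "R8", ("R1", "R9")),
    ("XOR", "R10", ("R1", "R11")),
]


def hazardous_readers(program, producer, split_cycle_regfile):
    """Opcodes that would read the producer's result too early."""
    dest = program[producer][1]
    write_cycle = producer + 5          # instruction k is fetched in cycle k + 1; WB is stage 5
    readers = []
    for j in range(producer + 1, len(program)):
        opcode, _, sources = program[j]
        if dest not in sources:
            continue
        read_cycle = j + 2              # ID is stage 2
        same_cycle_ok = split_cycle_regfile  # write in first half, read in second half
        if read_cycle < write_cycle or (read_cycle == write_cycle and not same_cycle_ok):
            readers.append(opcode)
    return readers


print(hazardous_readers(PROGRAM, 0, split_cycle_regfile=False))  # ['SUB', 'AND', 'OR']
print(hazardous_readers(PROGRAM, 0, split_cycle_regfile=True))   # ['SUB', 'AND']
```

Writing the register file in the first half of the cycle and reading it in the second half removes the hazard for OR; SUB and AND still need another remedy.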
Forwarding -- also called bypassing,
shorting, short-circuiting
• Key is to keep the ALU result around
• Example
– ADD R1,R2,R3
– SUB R4, R1,R5
• How do we handle this in general?
– Forwarded value can be at ALU output or
Mem stage output
ADD produces R1 value at ALU output
SUB needs it again at the ALU input
Forwarding (Cont.)
• Use the previous example
– Forward the result from where ADD produces (EX/MEM
register) to where SUB needs it (ALU input latch)
– Forwarding works as follows:
• ALU result from EX/MEM register is fed back to ALU input latch
• If the forwarding hardware detects the previous ALU operation
has written the register corresponding to a source for the current
ALU operation, control logic selects the forwarding result as the
ALU input rather than the value read from the register file
• Generalization of forward
– Pass a result directly to the functional unit that requires it:
a result is forwarded from the pipeline register
corresponding to the output of one unit to the input of
another
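The selection made by the control logic can be sketched in Python. This is a simplified model, not a full datapath: the dictionaries standing in for the EX/MEM and MEM/WB pipeline registers and their field names (`writes_reg`, `dest`, `value`) are assumptions.

```python
def select_alu_input(src_reg, regfile_value, ex_mem, mem_wb):
    """Pick the value for one ALU source operand.

    ex_mem / mem_wb: dicts describing the instruction currently held in that
    pipeline register: {'writes_reg': bool, 'dest': int, 'value': int}.
    """
    # R0 is the constant 0, so it is never forwarded.
    if src_reg != 0 and ex_mem["writes_reg"] and ex_mem["dest"] == src_reg:
        return ex_mem["value"]      # result of the immediately preceding instruction
    if src_reg != 0 and mem_wb["writes_reg"] and mem_wb["dest"] == src_reg:
        return mem_wb["value"]      # result from two instructions earlier
    return regfile_value            # no hazard: use the value read in ID


# ADD R1,R2,R3 has just left EX with result 7; SUB R4,R1,R5 is entering EX.
add_in_ex_mem = {"writes_reg": True, "dest": 1, "value": 7}
older_in_mem_wb = {"writes_reg": True, "dest": 1, "value": 3}
print(select_alu_input(1, 0, add_in_ex_mem, older_in_mem_wb))   # 7
print(select_alu_input(5, 42, add_in_ex_mem, older_in_mem_wb))  # 42
```

EX/MEM is checked first so that, when both pipeline registers hold a write to the same register, the more recent result wins.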
Result With Forwarding
[Figure: pipeline diagram with stages IM, Reg, ALU, DM, Reg for each instruction below]
ADD R1, R2, R3
SUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
XOR R10, R1, R11
Another Forwarding Example
• Example
– ADD R1, R2, R3
– LW R4, 0(R1)
– SW 12(R1), R4
• Forwarding Result – Next Slide
When Forwarding Fails
DM: LMD ← MEM[ALUO]
RD: R1 ← LMD
RS: A ← R1, B ← R5
ALU: ALUO ← A - B
RS: A ← R1, B ← R7
ALU: ALUO ← A AND B
RS: A ← R1, B ← R5
ALU: ALUO ← A OR B
Stalls
• Some latencies can’t be absorbed -- the case
in the previous slide
– Stalls are the result
– Need pipeline interlock circuits
• Detects a hazard and introduces bubbles until the
hazard clears
– CPI for stalled instructions will bloat by the
number of bubbles
• Bubbles cause the forwarding paths to change
• In MIPS, if the instruction after load uses the
load result, one clock-cycle stall will occur!
Handling Stalls
• Hardware: Pipeline Interlocks
– Must detect when required data cannot be provided
– Stall stages to create bubble
• Software: pipeline or instruction scheduling
– Performed by a smart compiler
Hardware vs. Software
A = B + C; D = E - F
Original code:
LW RB, B
LW RC, C
ADD RA, RB, RC
SW A, RA
LW RE, E
LW RF, F
SUB RD, RE, RF
SW D, RD
Scheduled code (pipeline scheduling):
LW RB, B
LW RC, C
LW RE, E
ADD RA, RB, RC
LW RF, F
SW A, RA
SUB RD, RE, RF
SW D, RD
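The benefit of scheduling can be counted with a Python sketch. It assumes the rule stated earlier, that an instruction using a load result immediately after the load costs one stall cycle, and that all other dependences are covered by forwarding. The tuple encoding is illustrative.

```python
def load_use_stalls(program):
    """Count one-cycle stalls: an instruction that uses the result of the load just before it."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        if prev_op == "LW" and prev_dest in cur[2]:
            stalls += 1
    return stalls


# (opcode, destination register or None, source operands)
ORIGINAL = [
    ("LW", "RB", ("B",)), ("LW", "RC", ("C",)),
    ("ADD", "RA", ("RB", "RC")), ("SW", None, ("RA",)),
    ("LW", "RE", ("E",)), ("LW", "RF", ("F",)),
    ("SUB", "RD", ("RE", "RF")), ("SW", None, ("RD",)),
]
SCHEDULED = [
    ("LW", "RB", ("B",)), ("LW", "RC", ("C",)), ("LW", "RE", ("E",)),
    ("ADD", "RA", ("RB", "RC")), ("LW", "RF", ("F",)),
    ("SW", None, ("RA",)), ("SUB", "RD", ("RE", "RF")), ("SW", None, ("RD",)),
]

print(load_use_stalls(ORIGINAL), load_use_stalls(SCHEDULED))  # 2 0
```

Both sequences contain the same eight instructions; moving the load of E ahead of the ADD and the load of F ahead of the first store separates each load from its first use.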
Data Hazard Forms
• RAW - read after write
– j reads before i writes - hence j gets wrong old value
– Most common form of data hazard problem
– As we have seen forwarding can overcome this one
• WAW - write after write
– instructions i then j
– j writes before i writes - leaving incorrect value
– can this happen in MIPS? Why?
• WAW can happen only in pipelines that write in more than
one pipe stage (or allow an instruction to proceed even when
a previous instruction is stalled)
i occurs before j: program execution order
Data Hazard Forms (Cont.)
• WAR - write after read
– i then j is intended order
– j writes before i reads - i ends up with
incorrect new value
– Is this a Problem in the MIPS? Why?
• May happen only when some instructions write
results early in pipe stages, and others read a
source late in stages
• RAR – read after read
– Not a hazard
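The four forms can be told apart from the registers each instruction reads and writes. The Python sketch below classifies the dependences between an earlier instruction i and a later instruction j; whether a dependence becomes an actual hazard depends on the pipeline. The `(dest, sources)` encoding and the sample instructions in the comments are illustrative.

```python
def classify(i, j):
    """i precedes j in program order; each is (dest, sources)."""
    i_dest, i_srcs = i
    j_dest, j_srcs = j
    kinds = []
    if i_dest is not None and i_dest in j_srcs:
        kinds.append("RAW")   # j reads what i writes
    if i_dest is not None and i_dest == j_dest:
        kinds.append("WAW")   # both write the same register
    if j_dest is not None and j_dest in i_srcs:
        kinds.append("WAR")   # j writes what i reads
    return kinds              # reading the same register (RAR) is never listed


add = ("R1", ("R2", "R3"))    # ADD R1, R2, R3
sub = ("R4", ("R1", "R5"))    # SUB R4, R1, R5
print(classify(add, sub))                                   # ['RAW']
print(classify(("R1", ("R2", "R3")), ("R1", ("R6", "R7")))) # ['WAW']
print(classify(("R4", ("R1", "R5")), ("R1", ("R6", "R7")))) # ['WAR']
```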
Introduction
• Control Hazards – How does branch influence the
pipeline?
• Problem is more complex - need 2 things
– Branch target (taken means the next PC is not PC+4; not taken
means the condition fails) (MEM)
– CC valid - in the MIPS case the result of the Zero
detect unit (EX)
– Both happen late in the pipe
• How to deal with branch?
– Stall the pipeline as soon as we detect the branch
(ID), and stall the pipeline until we reach the MEM
stage
• Three-cycle stall
– The first IF is essentially a stall (when taken branch)
– Consider a 30% branch frequency and an ideal CPI of 1…
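A worked version of this arithmetic in Python, assuming every branch costs the full three-cycle stall and nothing else stalls. The function name is illustrative.

```python
def cpi_with_branch_stalls(ideal_cpi, branch_freq, stall_cycles):
    """Average CPI when every branch adds a fixed number of stall cycles."""
    return ideal_cpi + branch_freq * stall_cycles


cpi = cpi_with_branch_stalls(ideal_cpi=1.0, branch_freq=0.30, stall_cycles=3)
print(round(cpi, 2))
```

Under these assumptions the CPI rises from 1 to about 1.9, so the pipeline delivers only about half of its ideal throughput.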
A branch causes a 3-cycle stall in the
MIPS pipeline
Branch instruction      IF     ID     EX     MEM    WB
Branch successor               IF     Stall  Stall  IF     ID     EX     MEM    WB
Branch successor + 1                                       IF     ID     EX     MEM    WB
Branch successor + 2                                              IF     ID     EX     MEM
Branch successor + 3                                                     IF     ID     EX
Branch successor + 4                                                            IF     ID
Branch successor + 5                                                                   IF
(Branch successor is PC+4 or BTA, depending on CC)
Control Hazard Avoidance
• Simplest Scheme
– Freeze pipe until you know the CC and branch target
– Cheap but slow
– Too slow, since the 2 or 3 bubbles per branch would negate half of the
pipeline speedup
• Predict not taken (47% MIPS branches not taken on average)
– Make sure any state change (destructive phase) is deferred until you
know whether you guessed right
– If not then back out or flush
• Predict taken (53% MIPS branches taken on average)
– No use in MIPS (target address and branch outcome are known at the
same stage)
• Or let the compiler decide - same options
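A rough Python comparison of freezing the pipe with predict-not-taken. It uses the 53% taken figure from this slide and assumes, for illustration only, a 30% branch frequency, a 3-cycle penalty for each taken (mispredicted) branch, and no penalty when the prediction is right.

```python
def cpi_freeze(ideal_cpi, branch_freq, stall_cycles):
    """Every branch stalls the pipe for stall_cycles."""
    return ideal_cpi + branch_freq * stall_cycles


def cpi_predict_not_taken(ideal_cpi, branch_freq, taken_fraction, flush_cycles):
    """Only taken branches are mispredicted and pay the flush penalty."""
    return ideal_cpi + branch_freq * taken_fraction * flush_cycles


freeze = cpi_freeze(1.0, 0.30, 3)
predicted = cpi_predict_not_taken(1.0, 0.30, 0.53, 3)
print(round(freeze, 3), round(predicted, 3))
```

With these assumed numbers the CPI drops from about 1.9 to about 1.48; the gain comes entirely from the 47% of branches that fall through.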
What Makes Pipelining Hard to
Implement?
• Hazards prevent next instruction from executing during its
designated clock cycle.
• Exceptions and interrupts add complexity to the pipelining unit and
decrease its efficiency.
• These terms describe exceptional situations where the normal execution
order of instructions is changed in unexpected ways.
• The terms interrupt, fault, and exception can be used to describe
exceptional situations.
• The occurrence of an event is usually signaled by an interrupt from
either the hardware or the software. Hardware may trigger an
interrupt at any time by sending a signal to the CPU. Software may
trigger an interrupt by executing a special operation called a system
call (exception or trap).
What Makes Pipelining Hard to
Implement?
• Examples: I/O device request, Invoking an operating system
service from a user program, Integer arithmetic overflow,
Power failure, page fault, divide error.
• Other instructions in the pipeline can raise exceptions that
may force the CPU to abort the instructions in the pipeline
before they complete.
• Pipeline must be safely shut down and the state saved so that the
instruction can be restarted in the correct state after the
exception is serviced.
• When an exception occurs, the pipeline control can take the
following steps to save the pipeline state safely:
• Force a trap instruction into the pipeline on the next IF.
Stopping and Restarting Execution
• Until the trap is taken, turn off all writes (WB) for the faulting instruction
and for all instructions that follow in the pipeline; this can be done by
placing zeros into the pipeline latches of all instructions in the pipeline,
starting with the instruction that generates the exception, but not those
that precede that instruction.
• After the exception-handling routine in the OS receives control, it
immediately saves the PC of the faulting instruction (and other PCs).
This value will be used to return from the exception later.
• After the exception has been handled, special instructions return the
processor from the exception by reloading the PCs and restarting the
instruction stream.
• If the pipeline can be stopped so that the instructions just before the
faulting instruction are completed and those after it can be restarted from
scratch, the pipeline is said to have precise exceptions.
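The precise-exception condition can be sketched in Python as a split of the in-flight instructions around the faulting one. This is a simplified model that tracks only program counters; the function name and the sample PC values are illustrative.

```python
def take_exception(in_flight_pcs, faulting_pc):
    """Split the pipeline contents at the faulting instruction.

    in_flight_pcs: PCs of the instructions in the pipeline, oldest first.
    Returns (completed, squashed, saved_pc).
    """
    idx = in_flight_pcs.index(faulting_pc)
    completed = in_flight_pcs[:idx]   # older instructions are allowed to finish
    squashed = in_flight_pcs[idx:]    # faulting and younger: writes turned off
    saved_pc = faulting_pc            # where execution restarts after the handler
    return completed, squashed, saved_pc


completed, squashed, saved_pc = take_exception([100, 104, 108, 112, 116], 108)
print(completed, squashed, saved_pc)  # [100, 104] [108, 112, 116] 108
```

After the handler returns, fetching resumes at the saved PC, so the squashed instructions are re-executed from scratch.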