Advanced Computer Architectures – Part 2.3
© V. De Florio
KULeuven 2002
Basic
Concepts
Computer
Design
Computer
Architectures
for AI
Computer
Architectures
In Practice
2.3/2
Course contents
• Basic Concepts
• Computer Design
• Computer Architectures for AI
• Computer Architectures in Practice
Computer Design
• Quantitative assessments
• Instruction sets
• Pipelining
• Parallelism
Parallelism
• Introduction to parallel processing
• Instruction level parallelism
• (Data level parallelism): Part 3
• (Task level parallelism): Part 3
Parallelism
• Introduction to parallel processing
Basic concepts: granularity, program,
process, thread, language aspects
Types of parallelism
• Instruction level parallelism
Parallelism
• Introduction to parallel processing
Basic concepts: granularity, program,
process, thread
Types of parallelism
• Instruction level parallelism
Granularity
• Definition:
granularity is the complexity/grain size of some item,
e.g. computation item (instruction),
data item (scalar, array, struct),
communication item (token granularity),
hardware building block (gate, RTL component)
• Granularity scale, from low to high:
RISC (e.g. add r1,r2,r4)
CISC (e.g. ld *a0++,r1)
High Level Languages, HLLs (e.g. x = sin(y))
Application-specific (e.g. edge-det.invert.image)
Granularity
• Deciding the granularity is an important
design choice
• E.g. grain size for the communication
tokens in a parallel computer:
coarse grain: less communication overhead
fine grain: less time penalty when two
communication packets compete for
transmission over the same channel and collide
Types of parallelism
• Functional parallelism (important for the exam!)
Different computations have to be performed on the same or different data
E.g. multiple users submit jobs to the same computer, or a single user submits multiple jobs to the same computer:
this is functional parallelism at the process level, taken care of at run-time by the OS
Types of parallelism
• Data parallelism (important for the exam!)
The same computations have to be performed on a whole set of data
E.g. 2D convolution of an image
This is data parallelism at the loop level: consecutive loop iterations are candidates for parallel execution, subject to inter-iteration data dependencies
Often leads to a massive amount of parallelism
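As an illustration of data parallelism at the loop level, here is a minimal C sketch of a 3x3 convolution. The function name convolve3x3, the row-major layout, the flat 9-element kernel and the choice to leave border pixels untouched are assumptions made for this example; they do not come from the course material. Each output pixel is computed from the input image only, so the iterations of the two outer loops carry no inter-iteration dependencies and are candidates for parallel execution.

```c
#include <assert.h>
#include <stddef.h>

/* 3x3 convolution over the interior of a rows x cols image stored row-major.
   kernel holds 9 coefficients, row-major. Border pixels of out are not written.
   (Hypothetical helper, for illustration only.) */
void convolve3x3(const float *in, float *out, size_t rows, size_t cols,
                 const float *kernel)
{
    assert(rows >= 3 && cols >= 3);
    for (size_t r = 1; r + 1 < rows; r++) {       /* iterations are independent */
        for (size_t c = 1; c + 1 < cols; c++) {   /* iterations are independent */
            float acc = 0.0f;
            for (size_t i = 0; i < 3; i++)
                for (size_t j = 0; j < 3; j++)
                    /* convolution flips the kernel:
                       kernel[i][j] multiplies in[r + 1 - i][c + 1 - j] */
                    acc += kernel[i * 3 + j]
                         * in[(r + 1 - i) * cols + (c + 1 - j)];
            out[r * cols + c] = acc;   /* each iteration writes its own pixel */
        }
    }
}
```

Because every iteration reads only in and writes a distinct element of out, the amount of available parallelism grows with the number of pixels.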
Levels of parallelism
• Instruction level parallelism (ILP)
Functional parallelism at the instruction level
Example: pipelining
• Data level parallelism (DLP)
Data parallelism at the loop level
• Process & thread level parallelism (TLP)
Functional parallelism at the thread and process level
Parallelism
• Introduction to parallel processing
• Instruction level parallelism
Introduction
VLIW
Advanced pipelining techniques
Super scalar
Type of Instruction Level Parallelism utilization
• Sequential instruction issuing, sequential instruction execution
von Neumann processors
[Figure: instruction word and a single EU]
• Sequential instruction issuing, parallel instruction execution
pipelined processors
[Figure: instruction word and EU1, EU2, EU3, EU4]
• Parallel instruction issuing (compile-time, determined by the compiler), parallel instruction execution
VLIW processors: Very Long Instruction Word
[Figure: instruction word and EU1, EU2, EU3, EU4]
• Parallel instruction issuing (run-time, determined by the HW dispatch unit), parallel instruction execution
super-scalar processors (to be seen later)
[Figure: instruction window and EU1, EU2, EU3, EU4]
Type of Instruction Level Parallelism utilization
• Most processors provide sequential execution semantics:
regardless of how the processor actually executes the instructions (sequentially or in parallel, in-order or out-of-order), the result is the same as that of sequential execution in the order in which they were written
• VLIW and IA-64 provide parallel execution semantics:
the assembly code explicitly indicates which instructions are executed in parallel
VLIW
[Figure: VLIW datapath. Main instruction memory; instruction cache (128 bit); instruction register (128 bit, four fields of 32 bit each); four decoders (Dec), 256 decoded bits each; four EUs; register file (32 bit each; 8 read ports, 4 write ports), i.e. 2 read ports and 1 write port (32 bit each) per EU; cache/RAM (32 bit; 1 bi-directional port); main data memory]
VLIW
• Properties
Multiple Execution Units: multiple instructions
issued in one clock cycle
Every EU requires 2 operands and delivers one
result every clock cycle: high data memory
bandwidth needed
Careful design of data memory hierarchy
Register file with many ports
Large register file: 64-256 registers
Carefully balanced cache/RAM hierarchy with
decreasing number of ports and increasing
memory size and access time for the higher
levels (IMEC research: DTSE)
VLIW
• Properties
The compiler should determine which instructions can be issued in a single cycle without control dependency or data dependency conflicts
Deterministic utilization of parallelism: good for hard real-time
Compile-time analysis of source code: worst case analysis instead of actual case
Requires very sophisticated compilers, especially when the EUs are pipelined! They have performed well since early 2000
VLIW
• Properties
The compiler should determine which instructions can be issued in a single cycle without control dependency or data dependency conflicts
Very difficult to write assembly: the programmer should resolve
all control flow conflicts
all data flow conflicts
all pipelining conflicts
and at the same time fit data accesses into the available data memory bandwidth
and all program accesses into the available program memory bandwidth
e.g. 2 weeks for a sum-of-products (3 lines of C code)
All high end DSP processors since 1999 are VLIW processors (examples: Philips TriMedia -- high end TV, TI TMS320C6x -- GSM base stations and ISP modem arrays)
Low power DSP
[Figure: the VLIW datapath (main instruction memory, 128 bit instruction cache, 128 bit instruction register with 32 bit fields, four decoders with 256 decoded bits each, four EUs, register file with 8 read ports and 4 write ports), annotated: too much power dissipation in fetching wide instructions]
Low power DSP
[Figure: main instruction memory (Main IMem) and ICache are 24 bit wide; an instruction expansion stage turns each 24 bit instruction into the 128 bit content of the instruction register, e.g. ADD4 is expanded into ADD || ADD || ADD || ADD; the four decoders, four EUs and register file are unchanged]
Low power DSP
• Properties
Power consumption in program memory is
reduced by specializing the instructions for the
application
Not all combinations of all instructions for the
EUs are possible, but only a limited set, i.e.
those combinations that lead to a substantial
speed-up of the application
Those relevant combinations are represented
by the smallest possible amount of bits to
reduce program memory width and hence
program memory power consumption
Can only be done for embedded DSP
applications: processor is specialized for 1
application (examples: TI TMS320C54x -- GSM
mobile phones, TI TMS320C55x -- UMTS mobile
phones)
Low power DSP for interactive multimedia
[Figure: main instruction memory (Main IMem) and ICache, 24 bit; reconfigurable instruction expansion from 24 bit to 128 bit; instruction register; four decoders; four REUs; register file]
Run-time reconfiguration makes it possible to adapt the specialization to changing application requirements
Advanced Pipelining
• Pipeline CPI is the result of many components
$\mathrm{CPUTIME}(p) = \dfrac{IC(p) \times CPI(p)}{\text{clock rate}}$
• A number of techniques act on one or more of these components:
Loop unrolling
Scoreboarding
Dynamic branch prediction
Speculation
…
• To be seen later
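The CPU time relation above can be sketched in C as follows. The function name cpu_time and the choice of double for all three quantities are assumptions made for this example only.

```c
#include <assert.h>

/* CPU time in seconds = instruction count * cycles per instruction / clock rate.
   clock_rate_hz is in cycles per second. (Hypothetical helper.) */
double cpu_time(double instruction_count, double cpi, double clock_rate_hz)
{
    assert(clock_rate_hz > 0.0);
    return instruction_count * cpi / clock_rate_hz;
}
```

For instance, under these assumptions a program of 10^9 instructions with a CPI of 2 on a 1 GHz clock takes cpu_time(1e9, 2.0, 1e9) seconds; a technique that lowers the CPI lowers the result proportionally.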
Advanced Pipelining
• Till now, Instruction-level parallelism was
searched within the boundaries of a basic
block (BB)
• A BB is 6-7 instructions on average
too small to reach the expected
performance
• What is worse, there’s a big chance that
these instructions have dependencies
Even less performance can be expected
Advanced Pipelining
• To obtain more, we need to go beyond the
BB limitation:
• We must exploit ILP across multiple BB’s
• Simplest way: loop level parallelism (LLP):
Exploiting the parallelism among iterations of a
loop
• Converting LLP into ILP
Loop unrolling
Statically (compiler-based)
Dynamically (HW-based)
• Using vector instructions
Does not require LLP -> ILP conversion
Advanced Pipelining
• The efficiency of the conversion depends
On the amount of ILP available
On latencies of the functional units in the
pipeline
On the ability to avoid pipeline stalls by
separating dependent instructions by a
“distance” (in terms of stages) equal to the
latency peculiar to the source instruction
LW x, …
INSTR …, x
a load must not be followed by the
immediate use of the load destination
register
Advanced Pipelining
Loop unrolling
Assumptions and steps
1. We assume the following latencies
Producer instruction   Consumer instruction   Latency
FP ALU OP              FP ALU OP              3
FP ALU OP              STORE DBL              2
LOAD DBL               FP ALU OP              1
LOAD DBL               STORE DBL              0
Advanced Pipelining
Loop unrolling
2. We assume we are working with a simple loop such as
for (I=1; I<=1000; I++)
    x[I] = x[I] + s;
• Note: each iteration is independent of the others
Very simple case
Advanced Pipelining
Loop unrolling
3. Translated into DLX, this simple loop looks like this (W = work, O = overhead):
; assumptions: R1 = &x[1000]
;              F2 = s
Loop: LD   F0, 0(R1)      ; F0 = x[I]                      (W)
      ADDD F4, F0, F2     ; F4 = F0 + s                    (W)
      SD   0(R1), F4      ; store result                   (W)
      SUBI R1, R1, #8     ; R1 = R1 - 8 bytes (one double) (O)
      BNEZ R1, Loop       ; if (R1) goto Loop              (O)
Advanced Pipelining
Loop unrolling
4. Tracing the loop (no scheduling!):
                          ; clock cycle
Loop: LD   F0, 0(R1)      ; 1
      stall               ; 2
      ADDD F4, F0, F2     ; 3
      stall               ; 4
      stall               ; 5
      SD   0(R1), F4      ; 6
      SUBI R1, R1, #8     ; 7
      BNEZ R1, Loop       ; 8
      stall               ; 9
• 9 clock cycles per iteration, with 4 stalls
Advanced Pipelining
Loop unrolling
5. With scheduling, we move from
Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      SUBI R1, R1, #8
      BNEZ R1, Loop
to
Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SUBI R1, R1, #8
      BNEZ R1, Loop
      SD   8(R1), F4      ; offset is now 8 because SUBI has already decremented R1
whose trace shows that fewer cycles are wasted:
Advanced Pipelining
Loop unrolling
6. Tracing the loop (with scheduling!):
                          ; clock cycle
Loop: LD   F0, 0(R1)      ; 1
      stall               ; 2
      ADDD F4, F0, F2     ; 3
      SUBI R1, R1, #8     ; 4  (O)
      BNEZ R1, Loop       ; 5  (O)
      SD   8(R1), F4      ; 6
• 6 clock cycles per iteration, with 1 stall
• 3 stalls fewer!
• Still, the useful cycles are just 3
• How to gain more?
Advanced Pipelining
Loop unrolling
7. With loop unrolling: replicating the body of the loop multiple times
Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4        ; skip SUBI and BNEZ
      LD   F6, -8(R1)       ; F6 vs. F0
      ADDD F8, F6, F2       ; F8 vs. F4
      SD   -8(R1), F8       ; skip SUBI and BNEZ
      LD   F10, -16(R1)     ; F10 vs. F0
      ADDD F12, F10, F2     ; F12 vs. F4
      SD   -16(R1), F12     ; skip SUBI and BNEZ
      LD   F14, -24(R1)     ; F14 vs. F0
      ADDD F16, F14, F2     ; F16 vs. F4
      SD   -24(R1), F16
      SUBI R1, R1, #32      ; R1 = R1 - 32 bytes (4 doubles)
      BNEZ R1, Loop
• Spared 3 x (SUBI + BNEZ)
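The same transformation can be sketched at the C level. This is an illustrative sketch only: the function names are hypothetical, the loop walks upward from index 0 (the DLX code above walks downward from x[1000]), and the unrolled version assumes that the element count is a multiple of 4, as is the case for the 1000-element loop of the example.

```c
#include <assert.h>
#include <stddef.h>

/* Rolled loop: one add per element, plus loop overhead
   (index update, test, branch) for every element. */
void add_scalar(double *x, size_t n, double s)
{
    for (size_t i = 0; i < n; i++)
        x[i] = x[i] + s;
}

/* Unrolled by 4: four adds per loop overhead.
   Assumption: n is a multiple of 4, so no clean-up loop is needed. */
void add_scalar_unrolled4(double *x, size_t n, double s)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        x[i]     = x[i]     + s;
        x[i + 1] = x[i + 1] + s;   /* offsets play the role of -8(R1), ... */
        x[i + 2] = x[i + 2] + s;
        x[i + 3] = x[i + 3] + s;
    }
}
```

Both functions leave the array in the same state; the unrolled one executes a quarter of the index updates and branches, which mirrors the 3 x (SUBI + BNEZ) spared per pass in the DLX version.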
Advanced Pipelining
Loop unrolling
• Loop unrolling: replicating the body of the loop multiple times
Some branches are eliminated
The ratio w/o (work over overhead) increases
The BB artificially increases its size
Higher probability of optimal scheduling
Requires a wider set of registers and adjusting the values of load and store registers
(In the given example,) every operation is followed by a dependent instruction
This will cause a stall
Trace of the unscheduled unrolled loop: 27 cycles
(2 per LD, 3 per ADDD, 2 per branch, 1 per any other)
27 / 4 ≈ 6.8 clock cycles per iteration
Pure scheduling is better! (6 cycles)
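The cycle counts quoted above can be reproduced with a few lines of C. This is only a sketch of the arithmetic on the slide: the per-instruction costs are the ones stated there (2 per LD, 3 per ADDD, 2 per branch, 1 per any other instruction), while the function and constant names are invented for the example.

```c
#include <assert.h>

/* Cycles charged to each instruction in the unscheduled trace,
   stalls included, as stated on the slide. */
enum { COST_LD = 2, COST_ADDD = 3, COST_BRANCH = 2, COST_OTHER = 1 };

/* Cycles for one pass through a loop body unrolled `copies` times,
   without scheduling: each copy has LD, ADDD and SD; one SUBI and
   one BNEZ close the body. */
int unscheduled_cycles(int copies)
{
    assert(copies >= 1);
    return copies * (COST_LD + COST_ADDD + COST_OTHER)   /* LD, ADDD, SD */
         + COST_OTHER                                     /* SUBI */
         + COST_BRANCH;                                   /* BNEZ */
}

/* Cycles per original loop iteration. */
double cycles_per_iteration(int copies)
{
    return (double)unscheduled_cycles(copies) / copies;
}
```

With one copy the model gives the 9 cycles of the unscheduled trace; with four copies it gives 27 cycles, i.e. 6.75 (rounded to 6.8 on the slide) cycles per iteration.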
Advanced Pipelining
Loop unrolling
• Unrolled loop plus scheduling: starting from the unrolled loop of step 7, the loads are moved to the top of the loop one at a time
First, the second LD is moved up:
Loop: LD   F0, 0(R1)
      LD   F6, -8(R1)       ; F6 vs. F0
      ADDD F4, F0, F2
      SD   0(R1), F4
      ADDD F8, F6, F2       ; F8 vs. F4
      SD   -8(R1), F8
      LD   F10, -16(R1)     ; F10 vs. F0
      ADDD F12, F10, F2     ; F12 vs. F4
      SD   -16(R1), F12
      LD   F14, -24(R1)     ; F14 vs. F0
      ADDD F16, F14, F2     ; F16 vs. F4
      SD   -24(R1), F16
      SUBI R1, R1, #32      ; R1 = R1 - 32 bytes (4 doubles)
      BNEZ R1, Loop
Then the third LD:
Loop: LD   F0, 0(R1)
      LD   F6, -8(R1)       ; F6 vs. F0
      LD   F10, -16(R1)     ; F10 vs. F0
      ADDD F4, F0, F2
      SD   0(R1), F4
      ADDD F8, F6, F2       ; F8 vs. F4
      SD   -8(R1), F8
      ADDD F12, F10, F2     ; F12 vs. F4
      SD   -16(R1), F12
      LD   F14, -24(R1)     ; F14 vs. F0
      ADDD F16, F14, F2     ; F16 vs. F4
      SD   -24(R1), F16
      SUBI R1, R1, #32      ; R1 = R1 - 32 bytes (4 doubles)
      BNEZ R1, Loop
Then the fourth LD:
Loop: LD   F0, 0(R1)
      LD   F6, -8(R1)       ; F6 vs. F0
      LD   F10, -16(R1)     ; F10 vs. F0
      LD   F14, -24(R1)     ; F14 vs. F0
      ADDD F4, F0, F2
      SD   0(R1), F4
      ADDD F8, F6, F2       ; F8 vs. F4
      SD   -8(R1), F8
      ADDD F12, F10, F2     ; F12 vs. F4
      SD   -16(R1), F12
      ADDD F16, F14, F2     ; F16 vs. F4
      SD   -24(R1), F16
      SUBI R1, R1, #32      ; R1 = R1 - 32 bytes (4 doubles)
      BNEZ R1, Loop
Advanced Pipelining
Loop unrolling
• Unrolled loop plus scheduling
Loop: LD   F0, 0(R1)
      LD   F6, -8(R1)       ; F6 vs. F0
      LD   F10, -16(R1)     ; F10 vs. F0
      LD   F14, -24(R1)     ; F14 vs. F0
      ADDD F4, F0, F2
      ADDD F8, F6, F2       ; F8 vs. F4
      ADDD F12, F10, F2     ; F12 vs. F4
      ADDD F16, F14, F2     ; F16 vs. F4
      SD   0(R1), F4
      SD   -8(R1), F8
      SD   -16(R1), F12
      SD   -24(R1), F16
      SUBI R1, R1, #32      ; R1 = R1 - 32 bytes (4 doubles)
      BNEZ R1, Loop
Between each producer and its consumer there is now enough distance to prevent the dependency from turning into a hazard
• 14 clock cycles, or 3.5 clock cycles / iteration
Advanced Pipelining
Loop unrolling
• Unrolling the loop exposes more computation that can be scheduled to minimize the stalls
• Unrolling increases the size of the BB; as a result, a better choice can be made for scheduling
• A useful technique with two key requirements:
1. Understanding how an instruction depends on another
2. Understanding how to change or reorder the instructions, given the dependencies
• In what follows we concentrate on the first requirement.
Loop unrolling: dependencies
• Again, let $(I_k)_{1 \le k \le IC(p)}$ be the ordered series of instructions executed during the run of program $p$
• Given two instructions, $I_i$ and $I_j$, with $i < j$, we say that $I_j$ is dependent on $I_i$ ($I_i \to I_j$) iff
$R(I_i) \cap D(I_j) \neq \emptyset$
($R$ is the range and $D$ the domain of a given instruction: $I_i$ produces a result which is consumed by $I_j$)
or
$\exists\, n \in \{1, \dots, IC(p)\}$ and $\exists\, k_1 < k_2 < \dots < k_n$ such that $I_i \to I_{k_1} \to I_{k_2} \to \dots \to I_{k_n} \to I_j$
Loop unrolling: dependencies
• $(I_i, I_{k_1}, I_{k_2}, \dots, I_{k_n}, I_j)$ is called a (transitive) dependency chain
• Note that a dependency chain can be as long as the entire execution of $p$
• A hazard implies dependency
• Dependency does not imply a hazard!
• Scheduling tries to place dependent instructions in places where no hazard can occur
Loop unrolling: dependencies
• For instance:
      SUBI R1, R1, #8
      BNEZ R1, Loop
• This is clearly a dependence, but it does not result in a hazard
Forwarding eliminates the hazard
• Another example:
      LD   F0, 0(R1)
      ADDD F4, F0, F2
• This is a data dependency which does lead to a hazard and a stall
Loop unrolling: dependencies
• Dealing with data dependencies
• Two classes of methods:
1. Keeping the dependence while avoiding the hazard (via scheduling)
2. Eliminating a dependence by transforming the code
Loop unrolling: dependencies
• Class 2 implies more work
• These are optimization methods used by compilers
• Detecting dependencies when only registers are used is easy; the difficulties come from detecting dependencies in memory:
• For instance, 100(R4) and 20(R6) may point to the same memory location
• The opposite situation may also take place:
      LD  20(R4), R2
      …
      ADD R3, R1, 20(R4)
• If R4 changes in between, this is not a dependency
Loop unrolling: dependencies
• $I_i \to I_j$ means that $I_i$ produces a result that is consumed by $I_j$
• When there is no such production, but $I_i$ and $I_j$ use the same name (register or memory location), we call this a name dependency
• Two types of name dependencies:
Antidependence
Corresponds to WAR hazards
$I_j$ writes x; $I_i$ reads x (reordering implies an error)
Output dependence
Corresponds to WAW hazards
$I_j$ writes x; $I_i$ writes x (reordering implies an error)
• No value is transferred between the instructions
• Register renaming solves the problem
Loop unrolling: dependencies
• Register renaming: if the register name is changed, the conflict disappears
• This technique can be either static (done by the compiler) or dynamic (done by the HW)
• Let us consider again the following loop:
Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      SUBI R1, R1, #8
      BNEZ R1, Loop
• Let us perform unrolling without renaming:
Loop unrolling: dependencies
Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      LD   F0, -8(R1)
      ADDD F4, F0, F2
      SD   -8(R1), F4
      LD   F0, -16(R1)
      ADDD F4, F0, F2
      SD   -16(R1), F4
      LD   F0, -24(R1)
      ADDD F4, F0, F2
      SD   -24(R1), F4
      SUBI R1, R1, #32
      BNEZ R1, Loop
The reuse of F0 and F4 in every copy of the body creates name dependencies (the yellow arrows in the original figure). To solve them, we perform renaming
Loop unrolling: dependencies
Renaming the registers of the second copy:
Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      LD   F6, -8(R1)
      ADDD F8, F6, F2
      SD   -8(R1), F8
      LD   F0, -16(R1)
      ADDD F4, F0, F2
      SD   -16(R1), F4
      LD   F0, -24(R1)
      ADDD F4, F0, F2
      SD   -24(R1), F4
      SUBI R1, R1, #32
      BNEZ R1, Loop
Loop unrolling: dependencies
Renaming the registers of all the copies:
Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      LD   F6, -8(R1)
      ADDD F8, F6, F2
      SD   -8(R1), F8
      LD   F10, -16(R1)
      ADDD F12, F10, F2
      SD   -16(R1), F12
      LD   F14, -24(R1)
      ADDD F16, F14, F2
      SD   -24(R1), F16
      SUBI R1, R1, #32
      BNEZ R1, Loop
The dependencies that remain (the yellow arrows in the original figure) are data dependencies. To solve them, we reorder the instructions
Loop unrolling: dependencies
• A third class of dependencies is that of control dependencies
• Example:
if (p1) s1;
if (p2) s2;
then
$p_1 \to_c s_1$ (s1 is control dependent on p1)
$p_2 \to_c s_2$ (s2 is control dependent on p2)
• Clearly $\neg (p_1 \to_c s_2)$, that is, s2 is not control dependent on p1
Loop unrolling: dependencies
• Two properties are critical to control dependency:
Exception behaviour
Data flow
• Exception behaviour: suppose we have the following excerpt:
      BEQZ R2, L1
      DIVI R1, 8(R2)
L1:   …
• We may be able to move the DIVI to before the BEQZ without violating the sequential semantics of the program
• Suppose the branch is taken. Normally one would simply need to undo the DIVI
• What if DIVI triggers a DIVBYZERO exception?
Loop unrolling: dependencies
• Two properties are critical to control dependency:
Exception behaviour
Data flow
• Data flow must be preserved
• Let us consider the following excerpt:
      ADD  R1, R2, R3
      BEQZ R4, L
      SUB  R1, R5, R6
L:    OR   R7, R1, R8
• The value of R1 depends on the control flow
• The OR depends on both ADD and SUB
• It also depends on the outcome of the branch
• R1 = (taken)? ADD.. : SUB..
Loop Level Parallelism
• Let us consider the following loop:
for (I=1; I<=100; I++) {
A[I+1] = A[I] + C[I]; /* S1 */
B[I+1] = B[I] + A[I+1]; /* S2 */ }
• S1 is a loop-carried dependency (LCD):
iteration I+1 is dependent on iteration I:
A’ = f(A)
• S2 is
B’ = f(B,A’)
• If a loop has only non-LCD’s, then it is
possible to execute more than one loop
iteration in parallel – as long as the
dependencies within each iteration are
not violated
Loop Level Parallelism
• What to do in the presence of LCD’s?
• Loop transformations. Example:
for (I=1; I<=100; I++) {
    A[I+1] = A[I] + B[I]; /* S1 */
    B[I+1] = C[I] + D[I]; /* S2 */ }
• A’ = f(A, B)
  B’ = f(C, D)
• Note: no dependencies except LCD’s
Instructions can be swapped!
• Note: the flow, i.e., A0 B0, C0 D0, A1 B1, C1 D1, A2 B2, C2 D2, ... can be changed into a different grouping of the same operations
[Figure: the two flows side by side]
Loop Level Parallelism
for (i=1; i <= 100; i=i+1) {
    A[i] = A[i] + B[i];   /* S1 */
    B[i+1] = C[i] + D[i]; /* S2 */
}
becomes
A[1] = A[1] + B[1];
for (i=1; i <= 99; i=i+1) {
    B[i+1] = C[i] + D[i];
    A[i+1] = A[i+1] + B[i+1];
}
B[101] = C[100] + D[100];
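One way to convince oneself that the two versions compute the same result is to run both on the same data and compare. The following C sketch does that; the Arrays struct, the function names and the test values are hypothetical and only serve the illustration. Arrays are indexed from 1 to 101 as in the loops above, so element 0 is unused.

```c
#include <assert.h>
#include <string.h>

#define N 100

/* Four arrays with valid indices 0..N+1; index 0 is unused. */
typedef struct { double A[N + 2], B[N + 2], C[N + 2], D[N + 2]; } Arrays;

/* Original loop: S1 of iteration i+1 reads B[i+1], which S2 wrote in
   iteration i (a loop-carried dependency). */
void original_loop(Arrays *a)
{
    for (int i = 1; i <= N; i++) {
        a->A[i] = a->A[i] + a->B[i];        /* S1 */
        a->B[i + 1] = a->C[i] + a->D[i];    /* S2 */
    }
}

/* Transformed loop: the first S1 and the last S2 are peeled off, and the
   statements are swapped, so B[i+1] is produced and consumed within the
   same iteration. */
void transformed_loop(Arrays *a)
{
    a->A[1] = a->A[1] + a->B[1];
    for (int i = 1; i <= N - 1; i++) {
        a->B[i + 1] = a->C[i] + a->D[i];
        a->A[i + 1] = a->A[i + 1] + a->B[i + 1];
    }
    a->B[N + 1] = a->C[N] + a->D[N];
}

/* Runs both versions on identical, arbitrary data; returns 1 if the
   resulting arrays are identical. */
int loops_agree(void)
{
    Arrays x, y;
    memset(&x, 0, sizeof x);
    for (int i = 0; i < N + 2; i++) {
        x.A[i] = i;
        x.B[i] = 2 * i;
        x.C[i] = 3 * i + 1;
        x.D[i] = 0.5 * i;
    }
    y = x;
    original_loop(&x);
    transformed_loop(&y);
    return memcmp(&x, &y, sizeof x) == 0;
}
```

Both versions perform exactly the same additions on the same operands, only grouped differently into iterations, which is why a bytewise comparison of the results is sufficient here.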
Loop Level Parallelism
• Before: A’ = f(A, B); B’ = f(C, D)
• After: B’ = f(C, D); A’ = f(A’, B’)
• Now we have dependencies but no more LCD’s!
It is possible to execute more than one loop iteration in parallel, as long as the dependencies within each iteration are not violated
Dependency avoidance
1. “Batch” approaches: at compile time, the
compiler schedules the instructions in
order to minimize the dependencies
(static scheduling)
2. “Interactive” approaches: at run-time, the
HW rearranges the instructions in order
to minimize the stalls (dynamic
scheduling)
• Advantages of 2:
Only approach when dependencies are only
known at run-time (pointers etc.)
The compiler can be simpler
Given an executable compiled for a machine
with machine-level X and pipeline organization
Y, it can run efficiently on another machine
with the same machine level but a different
pipeline organization Z
Dynamic Scheduling
• Static scheduling: compiler techniques for scheduling (rearranging) the instructions
so as to separate dependent instructions
and hence minimize unsolvable hazards causing unavoidable stalls
• Dynamic scheduling: HW-based, run-time techniques
• A dynamically scheduled processor does not try to remove true data dependencies (which would be impossible): it tries to avoid stalling when dependencies are present
• The two techniques can both be used
Dynamic Scheduling: General Idea
• If an instruction is stalled in the pipeline,
no later instruction can proceed
• A dependence between two instructions
close to each other causes a stall
• A stall means that, even though there
may be idle functional units that could
potentially serve other instructions, those
units have to stay idle
• Example:
DIVD F0, F2, F4
ADDD F10, F0, F8
SUBD F12, F8, F14
• ADDD depends on DIVD; but SUBD does
not. Despite this, it is not issued!
Dynamic Scheduling: General Idea
• So SUBD is not issued even though there might
be a functional unit ready to perform the
requested operation
• Big performance limitation!
• What are the reasons that lead to this
problem?
• In-order instruction issuing and
execution: instructions issue and execute
one at a time, one after the other
Dynamic Scheduling: General Idea
• Example: in DLX, the issue of an
instruction occurs at ID (instruction
decode)
• In DLX, ID checks for absence of
structural hazards and waits for the
absence of data hazards
• These two steps may be made distinct
Dynamic Scheduling: General Idea
• The issue process gets divided into two
parts:
1. Checking the presence of structural
hazards
2. Waiting for the absence of a data hazard
• Instructions are issued in order, but they
execute and complete as soon as their
data operands are available
• Data flow approach
Dynamic Scheduling: General Idea
• The ID pipeline stage is divided into two
sub-stages:
• ID.1 (Issue) : decode the instruction,
check for structural hazards
• ID.2 (read operands) : wait until no data
hazards, then read operands
Dynamic Scheduling: General Idea
• In the DLX floating point pipeline, the EX
stage of instructions may take multiple
cycles
• For each issued instruction I, depending
on the resolution of structural and data
hazards, I may be waiting for
resources or data, or in execution, or
completed
• More than a single instruction can be in
execution at the same time
Scoreboarding
• Scoreboard (CDC 6600, 1964): a technique
to allow instructions to execute out of
order when there are sufficient resources
and no data dependencies
• Goal: execution rate of 1 instruction per
clock cycle in the absence of structural
hazards
• Large set of FUs:
4 FPUs,
5 units for memory references
7 integer FUs
Highly redundant (parallel) system
• Four steps replace the ID, EX, WB stages
Scoreboarding
• Issue (avoids WAWs):
IF (a FU is available && no active
instruction has the same destination reg) {
issue I to the FU; update state;
}
• Read operands: as soon as the two source operands are
available in the registers {
read operands;
manage RAW stalls;
}
• Execution: for each FU, as soon as its operands are available
{ start EX; at end of execution, alert the scoreboard; }
• Write result (avoids WARs): when at WB
{ wait until there are no WAR hazards;
store output to destination reg; }
Scoreboarding
• In eliminating stalls, a scoreboard is
limited by several factors:
Amount of parallelism available among the
instructions
(in the presence of many dependencies there’s
not much that one can do…)
Number of scoreboard entries
(How far ahead the pipeline can look for
independent instructions)
Number and types of FUs
Number of WARs and WAWs
Scoreboarding
• The effectiveness of the scoreboard
heavily depends on the register file
• All operands are read from registers, all
outputs go to destination registers
The availability of registers influences the
capability to eliminate stalls
Tomasulo’s approach
• Tomasulo’s approach (IBM 360/91, 1967) :
An improvement of scoreboarding when a
limited number of registers is allowed by
a machine architecture
• Based on virtual registers
• The IBM 360/91 had two key design goals:
To be faster than its predecessors
To be machine level compatible with its
predecessors
• Problem: the 360 family had only 4 FP
registers
• Tomasulo combined the key ideas of
scoreboarding with register renaming
Tomasulo’s approach
• IBM 360/91 FUs and buffers:
3 ADDD/SUBD and 2 MULD reservation stations, 6 load buffers, 3 store buffers
• Key element: the reservation station (RS):
a buffer which holds the operands of the
instructions waiting to issue
• Key concept:
A RS fetches and buffers an operand as soon as
it is available, eliminating the need to get that
operand from a register
Instead of tracking the source and destination
registers, we track source and destination RS’s
[Figure: reservation stations RSa, RSb and RSc around an operation OP]
Tomasulo’s approach
• A reservation station holds either:
Static data, read from a register
“Live” data (future data) that will be
produced by another RS and FU
• Hazard detection and execution control
are not centralised into a scoreboard
• They are distributed in each RS, which,
independently:
Controls a FU attached to it,
And starts that FU the moment the operands
become available
Tomasulo’s approach
• The operands go to the FUs through the
(wide set of) RS’s, not through the (small)
register file
• This is managed through a broadcast that
makes use of a
common result bus, or common data bus (CDB)
• All units waiting for an operand can load
it at the same time:
[Figure: the result of an operation OP is broadcast to several waiting reservation stations at once (RSa to RSe, operations OP and OP2)]
Tomasulo’s approach
• The execution is driven by a graph of
dependencies
[Figure: dependency graph in which reservation stations RSa to RSg feed SUBD and MULTD operations]
• A “live data structure” approach (similar
to LINDA): a tuple is made available in the
future, when a thread will have finished
producing it
Major Advantages of Tomasulo’s Approach
• Distributed approach: the RS’s
independently control the FU’s
• Distributed hazard detection logic
• The CDB broadcasts results -> all pending
instructions depending on that result are
unblocked simultaneously
The CDB, being a bus, reaches many
destinations in a single clock cycle
If the waiting instructions get their missing
operand in that clock cycle, they can all begin
execution on the next clock cycle
• WAR and WAW are eliminated by
renaming registers using the RS’s
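How renaming removes WAR and WAW hazards can be illustrated in a few lines. This is only a sketch under simplifying assumptions: every instruction gets a fresh, hypothetical tag `RSn` for its result, and later readers refer to that tag instead of the architectural register. The instruction tuples and the `rename` helper are illustrative names, not part of the 360/91 design.

```python
def rename(program):
    """Give every result a fresh tag, in the spirit of reservation stations."""
    latest = {}    # architectural register -> tag of its most recent producer
    renamed = []
    for n, (op, dest, src1, src2) in enumerate(program):
        # Sources read the newest producer's tag, or the register itself
        s1 = latest.get(src1, src1)
        s2 = latest.get(src2, src2)
        tag = f"RS{n}"             # hypothetical tag name
        latest[dest] = tag
        renamed.append((op, tag, s1, s2))
    return renamed

program = [
    ("DIVD", "F0", "F2", "F4"),
    ("ADDD", "F6", "F0", "F8"),
    ("SUBD", "F8", "F10", "F14"),  # WAR with ADDD on F8
    ("MULD", "F6", "F10", "F8"),   # WAW with ADDD on F6, RAW with SUBD on F8
]

for instr in rename(program):
    print(instr)
```

After renaming, every destination is unique, so only the true (RAW) dependences remain: ADDD waits for RS0 and MULD waits for RS2.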
Reducing branch penalties
• Static Approaches
Dynamic Approaches
Reducing branch penalties:
Dynamic Branch Prediction
• A branch history table
[Figure: branches at addresses 0xA0B2DF37 (BNEZ …), 0xA0B2F02A (BEQ …), 0xA0B30504 and 0xA0B30537 (a further BNEZ … and a BGT …), each with its nature (taken/untaken), mapped by the low-order address byte to branch history table entries 04, 2A and 37]
Dynamic Branch Prediction
Branch History Table Algorithm
/* before the branch is evaluated */
if (current instruction is a branch) {
entry = PC & 0x000000FF;
predict branch as ( BHT [ entry ] );
}
/* after the branch */
if (branch was mispredicted)
BHT [ entry ] = 1 - BHT [ entry ];
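The algorithm above can be turned into a small runnable sketch. The class name `OneBitBHT` and its methods are assumptions made for illustration; the table is indexed by the low-order bits of the PC, as in the pseudocode, and starts with every entry at "untaken".

```python
class OneBitBHT:
    """One-bit branch history table indexed by the low-order PC bits."""

    def __init__(self, entries=256):
        # entries is assumed to be a power of two so that a mask can be used
        self.mask = entries - 1
        self.table = [0] * entries      # 0 = untaken, 1 = taken

    def index(self, pc):
        return pc & self.mask

    def predict(self, pc):
        return self.table[self.index(pc)]

    def update(self, pc, taken):
        entry = self.index(pc)
        if self.table[entry] != taken:  # misprediction: flip the bit
            self.table[entry] = 1 - self.table[entry]

bht = OneBitBHT()
bht.update(0xA0B2DF37, 1)               # the branch at ...37 was taken
print(bht.predict(0xA0B30537))          # a different branch, same entry 0x37
```

The last line shows the aliasing effect: two branches whose addresses share the low-order byte also share the prediction bit.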
Dynamic Branch Prediction
Branch History Table Algorithm
• Just one bit is enough for coding the
Boolean value “taken” vs. “untaken”
• Note: the function associating addresses
to entries in the BHT is not guaranteed to
be a bijection (one-to-one relationship):
• The algorithm records the most recent
behaviour of one or more branches
For instance, entry 37 corresponds to two branches
• Despite this, the scheme works well…
• …though in some cases, the performance
of the scheme is not that satisfactory:
Dynamic Branch Prediction
Branch History Table Accuracy
• for (i=0; i<BIGN; i++)
for (j=0; j<9; j++)
{ do_stg(); }
• Loop is
taken nine times in a row
then not taken once
• Taken 90%, Untaken 10%
• What is the prediction accuracy?
Dynamic Branch Prediction
Branch History Table Accuracy
[Figure: timeline of the loop branch with the one-bit predictor. Each run of nine Taken outcomes followed by one Untaken yields 8 successful predictions and 2 mispredictions]
S.S. (steady-state) prediction accuracy is just 80%!
Dynamic Branch Prediction
Branch History Table Accuracy
• Loop branches (taken n-1 times in a row,
untaken once)
• Performance of this dynamic branch
predictor (based on a single-bit prediction
entry):
Misprediction rate: 2 x 1 / n = 2 / n
Twice the rate of untaken branches
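The 2/n figure can be reproduced with a short simulation. This is a sketch with hypothetical helper names; the predictor bit is assumed to start at "untaken".

```python
def loop_pattern(n, repeats):
    """Outcomes of a loop branch: taken n-1 times, then untaken once."""
    return ([1] * (n - 1) + [0]) * repeats

def one_bit_mispredictions(outcomes, state=0):
    """Count mispredictions of a single one-bit predictor entry."""
    misses = 0
    for taken in outcomes:
        if state != taken:
            misses += 1
            state = taken    # the bit always records the last outcome
    return misses

n, repeats = 10, 100
misses = one_bit_mispredictions(loop_pattern(n, repeats))
print(misses / (n * repeats))
```

Each pass through the loop costs two mispredictions: one when the loop exits and one when it is entered again.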
Dynamic Branch Prediction
Two-bit Prediction Scheme
• Use a two bit field as a “branch behaviour
recorder”
• Allow the prediction to change only when two
mispredictions in a row occur:
[Figure: state diagram with two “Predict taken” states and two “Predict not taken” states, connected by Taken and Not taken transitions]
Dynamic Branch Prediction
Branch History Table Accuracy
[Figure: timeline of the loop branch with the two-bit predictor. Starting from the strongest “untaken” state, the first pass has 2 mispredictions followed by 7 successful predictions; in the steady state each run of nine Taken outcomes gives 9 successful predictions]
S.S. prediction accuracy is now 90%
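The 90% steady-state figure can be checked with a small simulation. The sketch below assumes a two-bit saturating counter (0 and 1 predict untaken, 2 and 3 predict taken), which is one common realisation of the two-bit scheme; its state transitions may differ in detail from the diagram on the slide, but it behaves the same way on this loop pattern.

```python
def two_bit_mispredictions(outcomes, counter=0):
    """Count mispredictions of one two-bit saturating counter."""
    misses = 0
    for taken in outcomes:
        predicted = 1 if counter >= 2 else 0
        if predicted != taken:
            misses += 1
        # Move one step towards the observed outcome, saturating at 0 and 3
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses

pattern = [1] * 9 + [0]          # taken nine times, then untaken once
warmup = two_bit_mispredictions(pattern)
total = two_bit_mispredictions(pattern * 100)
print(warmup, total)
```

The first pass pays two extra mispredictions while the counter warms up; afterwards only the loop exit is mispredicted, i.e. one branch in ten.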
Dynamic Branch Prediction
Branch History Table Accuracy
Prediction accuracy with programs from
SPEC89, using a 2-bit prediction buffer of 4096
entries
SPEC89 benchmark   Frequency of mispredictions
nasa7              1%
matrix300          0%
tomcatv            1%
doduc              5%
spice              9%
fpppp              9%
gcc                12%
espresso           5%
eqntott            18%
li                 10%
Dynamic Branch Prediction
General Scheme
• In the general case, one could use an
n-bit branch behaviour recorder and a
branch history table of $2^m$ entries
• In this case
A change occurs every $2^{n-1}$ mispredictions
With more entries, it is less likely that many
branch addresses are associated with the same
BHT entry
Larger memory cost
Dynamic Branch Prediction: Comparing the 2-bit with
the General Case
Dynamic Branch Prediction
Schemes
• One-bit prediction buffer
Good, but with limited accuracy
• Two-bit prediction buffer
Very good, greater accuracy, slightly higher
overhead
• Infinite-bit prediction buffer
As good as the two-bit one, but with a very
large overhead
• Correlating predictors
Dynamic Branch Prediction
Correlated predictors
• Two-level predictors
• If the behaviour of a branch is correlated
to the behaviour of another branch,
no single-level predictor would be able to
capture its behaviour
• Example:
if (aa == 2)
aa = 0;
if (bb == 2)
bb = 0;
if (aa != bb) {
…
• If we keep track of the recent behaviour
of other previous branches, our accuracy
may increase
Dynamic Branch Prediction
Correlated predictors
• A simpler example:
if (d == 0) d = 1;
if (d == 1) …
• In DLX, this is
BNEZ R1, L1 ; b1 ( d != 0 )
MOV R1, #1
L1: SUBI R3, R1, #1
BNEZ R3, L2 ; b2 ( d != 1)
...
L2: . . .
Dynamic Branch Prediction
Correlated predictors
• In DLX, this is
BNEZ R1, L1 ; b1 ( d != 0 )
MOV R1, #1
L1: SUBI R3, R1, #1
BNEZ R3, L2 ; b2 ( d != 1)
...
L2: . . .
• Let us assume that d is 0, 1 or 2
Initial value of d   d==0?   b1        Value of d before b2   d==1?   b2
0                    Yes     Untaken   1                      Yes     Untaken
1                    No      Taken     1                      Yes     Untaken
2                    No      Taken     2                      No      Taken
Dynamic Branch Prediction
Correlated predictors
Initial value of d   d==0?   b1        Value of d before b2   d==1?   b2
0                    Yes     Untaken   1                      Yes     Untaken
1                    No      Taken     1                      Yes     Untaken
2                    No      Taken     2                      No      Taken
• This means that
(b1 == untaken) ⇒ (b2 == untaken)
• A one-bit predictor may not be able to
capture this property and may behave very
badly
Dynamic Branch Prediction
Correlated predictors
• Let us suppose that d alternates between 2 and 0
• This is the table for the one-bit predictor:
d   b1 pred   b1 action   new b1 pred   b2 pred   b2 action   new b2 pred
2   NT        T           T             NT        T           T
0   T         NT          NT            T         NT          NT
2   NT        T           T             NT        T           T
0   T         NT          NT            T         NT          NT
• ALL branches are mispredicted!
Dynamic Branch Prediction
Correlated predictors
• Correlated predictor: example:
• Every branch, say branch number j>1, has
two separate prediction bits
First bit: predictor used if branch j-1 was NT
Second bit: otherwise
• At the end of branch j-1:
Behaviour_j_min_1 = (taken) ? 1 : 0;
• At the beginning of branch j:
predict branch as (
BHT [ Behaviour_j_min_1 ] [ entry ] );
• At the end of branch j:
if (branch was mispredicted)
BHT [ Behaviour_j_min_1 ] [ entry ] = 1 - BHT [ Behaviour_j_min_1 ] [ entry ];
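To see the effect of the extra history bit, the two-branch example (b1: d != 0, b2: d != 1, with d alternating between 2 and 0) can be simulated with and without correlation. This is a sketch under stated assumptions: all prediction bits start at "untaken", the history before the first branch is "untaken", and the helper names are hypothetical.

```python
def outcomes_for(d):
    """Outcomes (1 = taken) of b1 and b2 for one execution of the fragment."""
    b1 = 1 if d != 0 else 0
    if d == 0:
        d = 1                       # the MOV executed when b1 is not taken
    b2 = 1 if d != 1 else 0
    return [("b1", b1), ("b2", b2)]

def simulate(d_values, correlated):
    # table[branch][history] is a one-bit predictor; without correlation
    # only the history-0 slot is ever used
    table = {"b1": [0, 0], "b2": [0, 0]}
    last = 0                        # outcome of the previously executed branch
    misses = 0
    for d in d_values:
        for name, taken in outcomes_for(d):
            history = last if correlated else 0
            if table[name][history] != taken:
                misses += 1
                table[name][history] = taken
            last = taken
    return misses

d_values = [2, 0] * 4               # 8 executions, 16 branches
print(simulate(d_values, correlated=False), simulate(d_values, correlated=True))
```

Under these assumptions the plain one-bit predictor mispredicts all 16 branches, while the correlated version mispredicts only the two branches of the first execution.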
Dynamic Branch Prediction
Correlated predictors
• The behaviour of the previous branch
selects a one-bit branch predictor
• If the prediction is wrong, the state of the
selected predictor is flipped
Dynamic Branch Prediction
Correlated predictors
• We may also consider the last TWO
branches
The behaviour of these two branches selects,
e.g., a one-bit predictor
(NT NT, NT T, T NT, T T) → (0-3) → BHT [0..3]
This is called a (2,1) predictor
Or, the behaviour of the last two branches
selects an n-bit predictor
This is a (2, n) predictor
Dynamic Branch Prediction
Correlated predictors
A (2,2) predictor: A 2-bit branch history entry selects
a 2-bit predictor
Dynamic Branch Prediction
Correlated predictors
• General case: (m, n) predictors
Consider the last m branches and their $2^m$
possible values
This m-tuple selects an n-bit predictor
A change in the prediction only occurs after $2^{n-1}$
mispredictions
Dynamic Branch Prediction
Branch-Target Buffer
• A run-time technique to reduce the
branch penalty
• In DLX, it is possible to “predict” the new
PC, via a branch prediction buffer, during
the second stage of the pipeline
• With a Branch-Target Buffer (BTB), the
new PC can be derived during the first
stage of the pipeline
Dynamic Branch Prediction
Branch-Target Buffer
• The BTB is a branch-prediction cache that
stores the target addresses of taken branches
• An associative array which works as
follows:
(instruction address) → (branch target address)
• In case of a hit, we know the predicted
instruction address one cycle earlier w.r.t.
the branch prediction buffer
• Fetching begins immediately at the
predicted PC
Dynamic Branch Prediction
Branch-Target Buffer
• Design issues:
The entire address must be used
(correspondence must be one-to-one)
Limited number of entries in the BTB
(it should hold the most frequently used branches)
BTB requires a number of actions to be
executed during the first pipeline stage, also in
order to update the state of the buffer
The pipeline management gets more complex and
the clock cycle duration may have to be
increased
Dynamic Branch Prediction
Branch-Target Buffer
• Total branch penalty for a BTB
• Assumptions: penalties are as follows
Instruction is in buffer   Prediction   Actual branch   Penalty cycles
Yes                        Taken        Taken           0
Yes                        Taken        Untaken         2
No                         *            Taken           2
• Prediction accuracy: 90%
• Hit rate in buffer: 90%
• Taken branch frequency: 60%
Dynamic Branch Prediction
Branch-Target Buffer
• Branch penalty =
$$
\begin{aligned}
\text{Branch penalty} &= \text{Percent buffer hit rate} \times \text{Percent incorrect predictions} \times \text{Penalty} \\
&\quad + (1 - \text{Percent buffer hit rate}) \times \text{Percent taken branches} \times \text{Penalty} \\
&= 90\% \times 10\% \times 2 + 10\% \times 60\% \times 2 = 0.18 + 0.12 = 0.30 \text{ clock cycles}
\end{aligned}
$$
(vs. 0.50 for delayed branches)
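The same arithmetic as a small function, so that other hit rates or accuracies can be tried. The function name and parameter names are hypothetical; the formula and the 2-cycle penalty are the ones used on the slide.

```python
def btb_branch_penalty(hit_rate, accuracy, taken_freq, penalty_cycles=2):
    """Average branch penalty (in clock cycles) for a branch-target buffer."""
    # Branch found in the buffer but predicted incorrectly
    mispredicted_hits = hit_rate * (1 - accuracy) * penalty_cycles
    # Branch not in the buffer and actually taken
    taken_misses = (1 - hit_rate) * taken_freq * penalty_cycles
    return mispredicted_hits + taken_misses

print(round(btb_branch_penalty(hit_rate=0.90, accuracy=0.90, taken_freq=0.60), 2))
```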
Dynamic Branch Prediction
Branch-Target Buffer
• The same approach can be applied to
procedure return addresses
• Example:
0x4ABC CALL 0x30A0
0x4AC0 …
…
0x4CF4 CALL 0x30A0
0x4CF8 …
…
[Figure: the buffer entry for procedure 0x30A0 holds the return addresses 0x4AC0 and 0x4CF8]
• Associative arrays of stacks
• If cache is large enough, all return
addresses are predicted correctly
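A minimal sketch of return-address prediction follows. It is a simplification: a single bounded stack rather than the associative array of stacks mentioned above. The class name, the default depth and the 4-byte instruction size are assumptions made for illustration.

```python
from collections import deque

class ReturnAddressStack:
    """Bounded stack of predicted return addresses."""

    def __init__(self, depth=8):
        # When full, pushing a new address silently drops the oldest one
        self.stack = deque(maxlen=depth)

    def on_call(self, call_pc, instr_size=4):
        # The return address is the instruction following the CALL
        self.stack.append(call_pc + instr_size)

    def predict_return(self):
        # None means "no prediction available"
        return self.stack.pop() if self.stack else None

ras = ReturnAddressStack()
ras.on_call(0x4ABC)
print(hex(ras.predict_return()))
ras.on_call(0x4CF4)
print(hex(ras.predict_return()))
```

With the two calls of the example, the predicted return addresses are 0x4AC0 and 0x4CF8. If calls nest deeper than the stack, the oldest return addresses are lost and those returns cannot be predicted.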
Parallelism
• Introduction to parallel processing
• Instruction level parallelism
Introduction
VLIW
Advanced pipelining techniques
Superscalar
Superscalar architectures
• So far, the goal was to reach the ideal
CPI = 1
• Further increasing performance by having
CPI < 1 is the goal of
superscalar processors (SP)
• To reach this goal, SP issue multiple
instructions in the same clock cycle
• Multiple-issue processors
VLIW (seen already)
SP
Statically scheduled (compiler)
Dynamically scheduled (HW;
Scoreboarding/Tomasulo)
• In SP, a varying # of instructions is
issued, depending on structural limits and
dependencies
Superscalar architectures
• Superscalar version of DLX
• At most two instructions per clock cycle
can be issued:
1. One of: load, store (integer or FP), branch,
integer ALU operation
2. A FP ALU operation
• IF and ID operate on 64 bits of
instructions
• Multiple independent FPUs are available
Superscalar architectures
• The superscalar DLX is indeed a sort of
“bidimensional pipeline”:
Instruction type   Pipe stages (one column per clock cycle)
Integer Instr.     IF   ID   EX   MEM  WB
FP Instr.          IF   ID   EX   MEM  WB
Integer Instr.          IF   ID   EX   MEM  WB
FP Instr.               IF   ID   EX   MEM  WB
Integer Instr.               IF   ID   EX   MEM  WB
FP Instr.                    IF   ID   EX   MEM  WB
Integer Instr.                    IF   ID   EX   MEM  WB
FP Instr.                         IF   ID   EX   MEM  WB
Superscalar architectures
• Every new solution breeds new problems...
• Latencies!
• When the latency of the load is 1:
In the “monodimensional pipeline”, one cannot
use the result of the load in the current and
next cycle:
[Figure: single pipeline P: LD, NOP, LDc]
In the bidimensional pipeline of SP, this means
a loss of three instruction slots:
[Figure: FP pipeline Pfp: NOP, NOP, LDc; integer pipeline Pi: LD, NOP, LDc’]
Superscalar architectures
• Let us consider again the following loop:
Loop: LD F0, 0(R1)
ADDD F4, F0, F2
SD 0(R1), F4
SUBI R1, R1, #8
BNEZ R1, Loop
• Let us perform unrolling (x5) + scheduling
on the Superscalar DLX:
Superscalar architectures
Integer                   FP                    Cycle
Loop: LD F0, 0(R1)                              1
LD F6, -8(R1)                                   2
LD F10, -16(R1)           ADDD F4,F0,F2         3
LD F14, -24(R1)           ADDD F8,F6,F2         4
LD F18, -32(R1)           ADDD F12,F10,F2       5
SD 0(R1), F4              ADDD F16,F14,F2       6
SD -8(R1), F8             ADDD F20,F18,F2       7
SD -16(R1), F12                                 8
SD -24(R1), F16                                 9
SUBI R1, R1, #40                                10
BNEZ R1, Loop                                   11
SD 8(R1), F20                                   12
• 12 clock cycles per 5 iterations = 2.4 cc/i
Superscalar architectures
• Superscalar = 2.4 cc/i vs normal = 3.5 cc/i
• But in the example there were not enough
FP instructions to keep the FP pipeline in
use
From cycle 8 to cycle 12 and for the first two
cycles, each cycle holds just one instruction
• How to get more?
Dynamic scheduling for SP
Multicycle extension of the Tomasulo algorithm
Superscalar architectures and the
Tomasulo algorithm
• Idea: employing separate data structures
for the Integer and the FP registers
Integer Reservation Stations (IRS)
FP Reservation Stations (FRS)
• In the same cycle, issue a FP (to a FRS)
and an integer instruction (to a IRS)
• Note: issuing does not mean executing!
Possible dependencies might serialize the two
instructions issued in parallel
• Dual issue is obtained by
pipelining the instruction-issue stage
so that it runs twice as fast
Superscalar architectures
• Multiple issue strategy’s inherent
limitations:
The amount of ILP may be limited (see loop
p.134)
Extra HW is required
Multiple FPU and IU
More complex (-> slower) design
Extra need for large memory and register-file
bandwidth
Increase in code size due to heavy loop unrolling
Recall: $\text{CPUTIME}(p) = \dfrac{\text{IC}(p) \times \text{CPI}(p)}{\text{clock rate}}$
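The recalled formula shows why a more complex (and hence slower-clocked) design only pays off if the CPI drops enough. A tiny sketch with made-up numbers, which are assumptions for illustration and not measurements:

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU time in seconds: IC x CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz

# Hypothetical figures: same program, single-issue vs. dual-issue design
scalar = cpu_time(1_000_000_000, cpi=1.0, clock_rate_hz=500_000_000)
superscalar = cpu_time(1_000_000_000, cpi=0.6, clock_rate_hz=400_000_000)
print(scalar, superscalar)
```

With these numbers the lower CPI outweighs the slower clock (about 1.5 s against 2.0 s); with a CPI of 0.8 at the same 400 MHz the two designs would tie.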
Superscalar architectures:
compiler support
• Symbolic loop unrolling
The loop is not physically unrolled, but
reorganized so as to eliminate dependencies
• Software pipelining:
Dependencies are eliminated by interleaving
instructions from different iterations of the loop
Loop is not unrolled
Original loop:
Loop: LD F0, 0(R1)
ADDD F4, F0, F2
SD 0(R1), F4
SUBI R1, R1, #8
BNEZ R1, Loop
RAW: problematic
Software-pipelined loop:
<startup>
Loop: SD 0(R1), F4
ADDD F4, F0, F2
LD F0, -16(R1)
SUBI R1, R1, #8
BNEZ R1, Loop
<clean-up>
WAR: HW removable
Superscalar architectures:
compiler support
• Trace scheduling
• Aim: tackling the problem of too short
basic blocks
• Method:
Trace selection
Trace compaction
Superscalar architectures:
compiler support
• Trace selection:
A number of contiguous basic blocks are put
together into a “trace”
Using static branch prediction, the conditional
branches are chosen as taken/untaken, while
loop branches are considered as taken
[Figure: control-flow graph with block A, a test, blocks B and X, and block C, shown next to the selected trace A, B, C followed by bookkeeping code]
Superscalar architectures:
compiler support
• Trace compaction:
The resulting trace is a longer straight-line of
code
Trace compaction: global code scheduling
[Figure: trace A, B, C followed by bookkeeping code]
Code scheduling with
a basic block whose size
is that of A + B + C
• Speculative movement of code
Superscalar architectures:
HW support
• Conditional instructions: instructions like
CMOVZ R2, R3, R1
which means
if (R1 == 0) R2 = R3;
or
(R1 == 0)? R2 = R3 : /* NOP */;
• The instruction turns into a NOP if the
condition is not met
This also means that no exceptions are raised!
• Using conditional instructions we convert
a control dependence (due to a branch)
into a data dependence
• Speculative transformation in a two-issue
superscalar with conditional instructions:
129. © V. De Florio
KULeuven 2002
Superscalar architectures: HW
support : conditional instructions
Integer
FP
LW R1, 40(R2)
ADDD R3,R4,R5
1
ADDD R6,R3,R7
Basic
Concepts
Cycle
2
Computer
Architectures
for AI
Computer
Architectures
In Practice
BEQZ R10, L
3
LW R8, 20(R10)
4
LW R9,0(R8)
Computer
Design
5
LW R1, 40(R2)
ADDD R3,R4,R5
1
LWC R8,20(R10),R10 ADDD R6,R3,R7
2
BEQZ R10, L
3
LW R9,0(R8)
4
We speculate on the outcome of the branch. If the
condition is not met, we don’t slow down the execution,
because we had used a slot that would otherwise be lost
2.3/145
Superscalar architectures: HW
support : conditional instructions
• Conditional instructions are useful to
implement short alternative control flows
• Their usefulness though is limited by
several factors:
Conditional instructions that are annulled
still take execution time, unless they are
scheduled into waste slots
They are good only in limited cases, when
there’s a simple alternative sequence
Moving an instruction across multiple branches
would require double-conditional instructions!
LWCC R1, R2, R10, R12
(makes no sense)
They require extra work w.r.t. their
“regular” version
The extra time required for the test may require
more cycles than the regular versions
Superscalar architectures: HW
support : conditional instructions
• Most architectures support a few
conditional instructions (conditional
move)
• The HP PA architecture allows any
register-register instruction to turn the
next instruction into a NOP – which
makes that a conditional instruction
• Exceptions
Superscalar architectures: HW
support : conditional instructions
• Exceptions:
Fatal (normally causing termination; e.g.,
memory protection violation)
Resumable exceptions (causing a delay, but no
termination; e.g., page fault exception)
• Resumable exceptions can be processed
for speculative instructions just as if they
were normal instructions
Corresponding time penalty is not considered
as incorrect
• Fatal exceptions cannot be handled by
speculative instructions, hence must be
deferred to the next non-speculative
instructions
Superscalar architectures: HW
support : conditional instructions
• Moving instructions across a branch
must not affect
The (fatal) exception behaviour
The data dependences
• How to obtain this?
1. The fatal exceptions triggered by speculative
instructions are ignored by HW and OS
The HW and OS still handle resumable exceptions, but
return an undefined value for any fatal
exception. The program is allowed to continue,
though this will almost certainly lead to
incorrect results
Note: scheme 1. can never cause a correct
program to fail, regardless of whether
speculation is used or not
Superscalar architectures: HW
support : conditional instructions
2. Poison bits: A speculative instruction that would raise
an exception does not trigger it, but turns a bit on in
the involved result register. The next “normal”
(non-speculative) instruction using that
register will be “poisoned” -> it will cause an
exception
3. Boosting: Renaming and buffering in the HW
(similar to the Tomasulo approach)
• Speculation can be used, e.g., to
optimize an if-then-else such as
if (a==0) a = b; else a = a + 4
or, equivalently,
a = (a==0)? b : a + 4
Superscalar architectures: HW
support : conditional instructions
• Suppose A is in 0(R3) and B in 0(R2)
• Example:
LW R1, 0(R3) ; load A
BNEZ R1, L1 ; A != 0 ? GOTO L1
LW R1, 0(R2) ; load B
J L2 ; skip ELSE
L1: ADD R1, R1, 4 ; ELSE part
L2: SW 0(R3), R1 ; store A
• Speculation:
LW R1, 0(R3) ; load A
LW R9, 0(R2) ; load speculatively B
BEQZ R1, L3 ; A == 0 ? keep B
ADD R9, R1, 4 ; here R9 is A+4
L3: SW 0(R3), R9 ; here R9 is A+4 or B
• In this case, a temporary register is used
• Method 1: speculation is transparent
Superscalar architectures: HW
support : conditional instructions
• Method 2 applied to the previous code
fragment:
    LW   R1, 0(R3)    ; load A
    LW*  R9, 0(R2)    ; load speculatively B
    BEQZ R1, L3       ; A == 0 ? keep B in R9
    ADD  R9, R1, 4    ; here R9 is A+4
L3: SW   0(R3), R9    ; here R9 is A+4 or B
• LW* is a speculative version of LW
• LW* is an opcode that, if the load faults,
turns on the poison bit of register R9
instead of raising the exception
• The next non-speculative instruction using R9
will be “poisoned”: it will cause an
exception
• If another speculative instruction uses
R9, the poison bit will be inherited
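The poison-bit rules above can be sketched as a small simulation. This is only an illustrative model, not the mechanism of any real processor: the class `PoisonRegisterFile`, the helper `run_fragment`, and the use of a Python dict as memory (a missing key stands for a faulting address) are all assumptions made for the example. The fragment computes a = (a==0) ? b : a + 4 with a speculative load of b.

```python
class PoisonRegisterFile:
    """Toy register file with one poison bit per register (hypothetical model)."""

    def __init__(self, n_regs=16):
        self.value = [0] * n_regs
        self.poison = [False] * n_regs

    def load(self, dst, memory, addr, speculative=False):
        """LW or LW*: a faulting speculative load poisons dst instead of trapping."""
        if addr in memory:
            self.value[dst] = memory[addr]
            self.poison[dst] = False
        elif speculative:
            self.poison[dst] = True          # exception deferred
        else:
            raise MemoryError(f"fault at address {addr}")

    def add_imm(self, dst, src, imm, speculative=False):
        """ADD with an immediate: traps on a poisoned source unless speculative."""
        if self.poison[src]:
            if not speculative:
                raise RuntimeError(f"use of poisoned register R{src}")
            self.poison[dst] = True          # poison bit is inherited
            return
        self.value[dst] = self.value[src] + imm
        self.poison[dst] = False             # fresh result overwrites the poison

    def store(self, memory, addr, src):
        """SW is never speculative here, so a poisoned source raises."""
        if self.poison[src]:
            raise RuntimeError(f"use of poisoned register R{src}")
        memory[addr] = self.value[src]


def run_fragment(memory, addr_a, addr_b):
    """a = (a == 0) ? b : a + 4, with B loaded speculatively into R9."""
    rf = PoisonRegisterFile()
    rf.load(1, memory, addr_a)                    # load A
    rf.load(9, memory, addr_b, speculative=True)  # speculative load of B
    if rf.value[1] != 0:                          # ADD is skipped when A == 0
        rf.add_imm(9, 1, 4)                       # R9 = A + 4
    rf.store(memory, addr_a, 9)                   # store A
    return memory[addr_a]
```

In this model a faulting speculative load of B is harmless when A != 0, because R9 is overwritten before it is used; the exception surfaces only when the store really needs the poisoned value.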
Superscalar architectures: HW
support : conditional instructions
• Combining speculation with dynamic
scheduling
An attribute bit is added to each instruction
(1: speculative, 0: normal)
When that bit is 1, the instruction is allowed
to execute, but cannot enter the commit (WB) stage
The instruction then has to wait until the end of
the speculated code
It will be allowed to modify the register file /
memory only at the end of speculative mode
• Hence: instructions execute out-of-order,
but are forced to commit in order
• A special set of buffers holds the results
that have finished execution but have not
committed yet (reorder buffers)
Superscalar architectures: HW
support : conditional instructions
• As neither the register values nor the
memory values are actually WRITTEN
until an instruction commits,
the processor can easily undo its
speculative actions when a branch is
found to be mispredicted
• If a speculated instruction raises an
exception, this is recorded in the reorder
buffer
• In case of branch misprediction such that
a certain speculative instruction should
not have been executed, the exception is
flushed along with the instruction when
the reorder buffer is cleared
Superscalar architectures: HW
support : conditional instructions
• Reorder buffers:
An additional set of virtual registers that hold
the results of the instructions that have
finished execution but have not committed yet
Issue: only when both a Reservation Station
and a reorder buffer are available
As soon as an instruction completes, its output
goes into its reorder buffer
Until the instruction has committed, dependent
instructions receive their input from the
reorder buffer
(the Reservation Station is freed, the reorder
buffer is not)
The actual updating of registers takes place
when the instruction reaches the head of the list
of reorder buffers
Superscalar architectures: HW
support : conditional instructions
• At this point the commit phase takes
place:
Either the result is written into the register file,
Or, in case of a mispredicted branch, the
reorder buffer is flushed and execution restarts
at the correct successor of the branch
• Assumption: when a branch with
incorrect prediction reaches the head of
the buffer, it means that the speculation
was wrong
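The commit phase described above can be sketched as follows. This is a minimal model under stated assumptions, not a description of real hardware: the function `commit_in_order` and the dictionary layout of the entries (`kind`, `dst`, `value`, `mispredicted`, `exception`) are hypothetical names chosen for the example, and every entry is assumed to have already finished execution.

```python
from collections import deque


def commit_in_order(rob_entries, registers):
    """Commit reorder-buffer entries from the head; flush on a mispredicted branch.

    Returns the updated register file and the number of flushed entries.
    """
    rob = deque(rob_entries)
    committed = dict(registers)      # architectural state, updated only at commit
    flushed = 0
    while rob:
        entry = rob.popleft()        # only the head of the list may commit
        if entry["kind"] == "branch":
            if entry["mispredicted"]:
                flushed = len(rob)   # speculative results (and their recorded
                rob.clear()          # exceptions) are discarded together
            continue
        if entry.get("exception"):
            # The exception is raised only when the instruction really commits.
            raise RuntimeError(f"exception at commit of {entry['dst']}")
        committed[entry["dst"]] = entry["value"]
    return committed, flushed
```

For instance, an entry that recorded an exception after a mispredicted branch is silently dropped, while the same entry after a correctly predicted branch raises at commit time.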
Superscalar architectures: HW
support : conditional instructions
• This technique also makes it possible to tackle situations like
if (cond) do_this ; else do_that ;
• One may “bet” on the outcome of the branch and
say, e.g., it will be a taken one
• Even unlikely events do happen, so sooner or later
a misprediction occurs
• Idea: let the instructions in the else part (do_that)
issue and execute, with a separate list of reorder
buffers (list2)
• This second list is simpler: we don’t check for the
current head-of-list. Elements in there need to be
explicitly removed
• In case of a misprediction, in the second list we
have already executed the do_that part, and we
just need to perform its commit
• In case of positive prediction, the ELSE part is
purged off list2
Superscalar architectures
• If a processor A has a lower CPI than
another processor B, will A always run
faster than B?
• Not always!
A higher clock rate is a deterministic
source of performance improvement
A multiple issue (superscalar) architecture
cannot guarantee its improvements (stochastic
improvements)
Pushing towards a low CPI means adopting
sophisticated (= complex) techniques… which
slow down the clock rate!
Improving one aspect of a multiple issue
processor does not necessarily lead to overall
performance improvements
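A short numeric sketch of the point above, using the usual relation CPU time = instruction count × CPI / clock rate. The two processors and all their figures are made-up values for illustration only, and `cpu_time` is a hypothetical helper.

```python
def cpu_time(instruction_count, cpi, clock_hz):
    """Execution time in seconds: instructions x cycles per instruction / clock rate."""
    return instruction_count * cpi / clock_hz


# Hypothetical processors running the same 1e9-instruction program.
time_a = cpu_time(1e9, cpi=0.5, clock_hz=400e6)   # lower CPI, slower clock
time_b = cpu_time(1e9, cpi=0.8, clock_hz=800e6)   # higher CPI, faster clock
print(f"A: {time_a:.2f} s, B: {time_b:.2f} s")
```

With these assumed numbers A needs 1.25 s and B about 1.00 s, so the processor with the lower CPI is the slower one.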
Superscalar architectures
• A simple question:
“how much ILP exists in a program?”
or, in other words, “how much can we
expect from techniques that are based on
the exploitation of the ILP?”
• How to proceed:
Make a set of very optimistic assumptions
and measure how much parallelism is
available under those assumptions
Superscalar architectures
• Assumptions (HW model of an ideal
processor):
1. Infinite # of virtual registers (-> no WAW or
WAR hazard can stall the pipeline)
2. All conditional branches are predicted exactly
(!!)
3. All computed jumps and returns are perfectly
predicted
4. All memory addresses are known exactly, so a
store can be moved before a load – provided
that the addresses are not identical
5. Infinite issue processor
6. No restriction about the types of instructions
to be executed in a cycle (no structural
hazards)
7. All latencies are 1
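Under these assumptions only true (read-after-write) dependences limit the schedule, so the available parallelism of a trace can be estimated with a few lines of code. The sketch below is an illustration, not the simulator used for the measurements: `ideal_ipc` and the trace format (a list of `(destination, sources)` pairs over register names) are assumptions, and memory dependences are ignored.

```python
def ideal_ipc(trace):
    """Instructions per cycle on the ideal processor for a register-only trace.

    trace: list of (dst, srcs) pairs in program order; dst may be None.
    Each instruction issues one cycle after the latest producer of its sources
    (unit latency, unlimited issue width, perfect prediction, full renaming).
    """
    if not trace:
        return 0.0
    ready_cycle = {}     # register -> cycle in which its latest value is produced
    depth = 0            # length of the longest dependence chain so far
    for dst, srcs in trace:
        cycle = 1 + max((ready_cycle.get(s, 0) for s in srcs), default=0)
        if dst is not None:
            ready_cycle[dst] = cycle   # renaming: WAW and WAR impose no delay
        depth = max(depth, cycle)
    return len(trace) / depth


# Five instructions, dependence chains of length 2: IPC = 5 / 2.
example = [("r1", []), ("r2", []), ("r3", ["r1", "r2"]), ("r1", []), ("r4", ["r1"])]
print(ideal_ipc(example))
```

Reusing `r1` as a destination in the fourth instruction costs nothing here, which is exactly the effect of assumption 1.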
Superscalar architectures
• How to match these assumptions??
• Gambling!
• We run a program and produce a trace
with the outcomes of all the instances of
each branch
Taken, Taken, Taken, Untaken, Taken, …
Each corresponding target address is recorded
and assumed to be available
Then we use a simulator to mimic, e.g., a
machine with infinite virtual registers, etc.
• Results are depicted in the next picture
• Parallelism is expressed in IPC:
instruction issues per clock cycle
Superscalar architectures
[Figure: instruction issues per cycle (axis 0 to 160) for the SPEC benchmarks on the ideal processor]
gcc       54.8
espresso  62.6
li        17.9
fpppp     75.2
doduc     118.7
tomcatv   150.1
• Tomcatv reaches 150 IPC (for a particular
run)
Superscalar architectures
• Then we can relax the above
assumptions and introduce limitations
that represent what current computer
design techniques for ILP can achieve
Window size: the actual range of instructions
we inspect when looking for candidates for
simultaneous issue
Realistic branch prediction
Finite # of registers
• See images 4-39 and 4-40
Superscalar architectures
[Figure: instruction issues per cycle (axis 0 to 160) versus window size (Infinite, 2k, 512, 128, 32, 8, 4) for gcc, espresso, li, fpppp, doduc, and tomcatv]
Superscalar architectures
[Figure: instruction issues per cycle (axis 0 to 160) per benchmark, for different window sizes]
Benchmark   Infinite   512   128   32   8   4
gcc         55         10    10    8    4   3
espresso    63         15    13    8    4   3
li          18         12    11    9    4   3
fpppp       75         49    35    14   5   3
doduc       119        16    15    9    4   3
tomcatv     150        45    34    14   6   3
Superscalar architectures:
concluding notes
• In the next 10 years it is realistic to reach
an architecture that looks like this:
64 instruction issues per clock cycle
Selective predictor, 1K entries, 16-entry return
predictor
Perfect disambiguation of memory references
Register renaming with 64 + 64 extra registers
• Computer architectures in practice:
Section 4.8 (PowerPC 620)
Superscalar architectures:
concluding notes
• Reachable
performance
[Figure: instruction issues per cycle (axis 0 to 60) versus window size (Infinite, 256, 128, 64, 32, 16, 8, 4) for gcc, espresso, li, fpppp, doduc, and tomcatv]
Pipelining and communications
• Suppose that N+1 processes need to
communicate a private value to all the
others
• They use all the values to produce next
output (e.g., for voting)
• Communication is fully synchronous and
needs to be repeated m times, m large
...
Pipelining and communications
• Let us assume that no bus is available
• Point-to-point communication
• Processes are numbered p0…pN
• Two instructions are available
Send (pj, value)
Receive (pj, &value)
• Blocking functions
• If the receiver is ready to receive, they
last one stage time, otherwise they block
the caller for a multiple of the stage time
• Sending and receiving occur at discrete
time steps
Pipelining and communications
• At each time t, process pi may be
Sending data (in the next stage pi is unblocked)
Receiving data (in the next stage pi is unblocked)
Blocked in a Receive()
Blocked in a Send()
• Slot = time corresponding to an entire
stage time
• At each time t we have N+1 slots (one slot per
process)
• If pi is blocked, its slot is wasted
(it’s a “bubble”)
• Otherwise the slot is used
Pipelining and communications
• At each time t, process pi may be in
State S(j) : Sending data to process pj
State R(j) : Receiving data from pj
State WR(j) : Blocked in a Receive( pj, … )
State WS(j) : Blocked in a Send( pj, … )
• We use formalism:
proc st proc’
to indicate that, at time t,
proc is in state s with proc’
• For instance
p1 WR(4)21 p3
means that the 21st slot of p1 is wasted
waiting for p3 to send its value to it
Pipelining and communications
• The following algorithm is executed by
process j:
Before gaining the right to
broadcast, process j needs to go
through j couples of states (WR, R)
Ordered broadcast: the k-th message to be sent
goes to process pk
Finally, process j goes through N-j
couples of states (WR, R)
Pipelining and communications
• p is a vector of indices
• For process j, p can be any arrangement
of the integers 0, 1, …, j-1, j+1, … N
• Whatever the arrangement, the algorithm
works correctly
• For instance, if N = 4 (5 processes) and
j = 1, then p can be any permutation of
0, 2, 3, and 4
• p determines the order in which process j
sends its value to its neighbours
• Example: p[] = [ 3, 2, 0, 4]. Then p1
executes:
send (p3), send(p2), send(p0), send(p4)
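The behaviour of this scheme can be explored with a small synchronous simulation. It is a sketch under explicit assumptions, not the author's code: process j first receives from p0 … p(j-1) in increasing order, then sends according to its vector p, then receives from p(j+1) … pN in increasing order; a Send and the matching Receive both complete in one slot; finished processes still count towards the total number of slots. The names `build_program`, `simulate`, `ordered`, and `rotated` are hypothetical.

```python
def build_program(j, n, perm):
    """Operations of process j among p0..pn: receive, broadcast, receive."""
    ops = [("recv", i) for i in range(j)]                # j couples (WR, R)
    ops += [("send", k) for k in perm]                   # ordered broadcast
    ops += [("recv", i) for i in range(j + 1, n + 1)]    # n - j couples (WR, R)
    return ops


def ordered(j, n):
    """p = [0, ..., j-1, j+1, ..., n]."""
    return [k for k in range(n + 1) if k != j]


def rotated(j, n):
    """p = [j+1, ..., n, 0, ..., j-1]."""
    return [(j + d) % (n + 1) for d in range(1, n + 1)]


def simulate(n, perm_of):
    """Return (duration in slots, fraction of used slots, received sets)."""
    programs = [build_program(j, n, perm_of(j, n)) for j in range(n + 1)]
    pc = [0] * (n + 1)
    received = [set() for _ in range(n + 1)]
    duration = used = 0
    while any(pc[j] < len(programs[j]) for j in range(n + 1)):
        head = [programs[j][pc[j]] if pc[j] < len(programs[j]) else None
                for j in range(n + 1)]
        # A send from a to b succeeds only if b is currently receiving from a.
        matched = [a for a in range(n + 1)
                   if head[a] is not None and head[a][0] == "send"
                   and head[head[a][1]] == ("recv", a)]
        if not matched:
            raise RuntimeError("deadlock: no process can make progress")
        for a in matched:
            b = head[a][1]
            pc[a] += 1
            pc[b] += 1
            received[b].add(a)
            used += 2                # one used slot for sender, one for receiver
        duration += 1                # every other process wasted its slot
    return duration, used / ((n + 1) * duration), received


for name, perm in (("ordered", ordered), ("rotated", rotated)):
    d, frac, _ = simulate(4, perm)
    print(f"{name}: duration = {d} slots, used slots = {frac:.0%}")
```

Running the loop for several values of N gives a quick way to compare how different arrangements of p affect the duration and the share of wasted slots.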
Pipelining and communications
• Example:
p[] = ordered permutation, i.e., for process pj,
p = [ 0, …, j-1, j+1, …, N ]
[Figure: slot diagram for N = 5 showing the duration and the frequencies of used slots; legend: slot wasted in send, slot wasted in receive]
Pipelining and communications
• Case N = 20,
p[] = ordered permutation
• Gray = wasted slots
• Black = used slots
• In general, duration is
• Used slots / total # of slots
• Average # used slots during
one stage time
• This image reminds us of another one:
Pipelining and communications
[Figure: timeline from 6 PM to 2 AM in 30-minute slots for jobs A, B, C, and D]
No pipelining: many slots are wasted!
Pipelining and communications
• Let us now consider the case in which
process k uses
p[] = [ k+1, k+2, …, N, 0, 1, …, k-1 ]
Pipelining and communications
• Duration: first case vs. second case
Pipelining and communications
• Efficiency: first case vs. second case
Pipelining and communications
• Algorithm of pipelined broadcast
Every 10 slots, 5 mark the
completion of a broadcast
Beginning of steady state
Throughput = 1 / (2 t) (t = 1 slot)
A full broadcast is finished every 2 t
• The image may remind us of another one…
Pipelining (slide P2.2/20)
[Figure: timeline from 6 PM to 2 AM in 30-minute slots for jobs A, B, C, and D executed in a pipeline]
Between 7.30 and 9.30pm, a whole job
is completed every 30’
During that period, each worker is
permanently at work…
…but a new input must arrive within 30’