An Introduction to RISC
Processors
We are going to describe how microprocessor manufacturers took a new look at
processor architectures in the 1980s and started designing simpler but faster
processors. We begin by explaining why chip designers turned their backs on
conventional complex instruction set computers (CISCs) such as the 68K and the Intel
80x86 families and started producing reduced instruction set computers (RISCs) such as
MIPS and the PowerPC. RISC processors have simpler instruction sets than CISC
processors (although this is a rather crude distinction between these families, as we
shall soon see).

By the mid-1990s many of these so-called RISC processors were considerably more
complex than some of the CISCs they replaced. That isn't a paradox. The RISC
processor isn't really a cut-down computer architecture—it represents a new approach
to architecture design. In fact, the distinction between CISC and RISC is now so
blurred that virtually all processors now have both RISC and CISC features.

The RISC Revolution
Before we look at the ARM, we describe the history and characteristics of RISC
architecture. From the introduction of the microprocessor in the 1970s to the
mid-1980s there was an almost unbroken trend towards more and more
complex (you might even say Baroque) architectures. Some of these architectures
developed rather like a snowball rolling downhill. Each advance in chip fabrication
technology allowed designers to add more and more layers to the microprocessor's
central core. Intel's 8086 family illustrates this trend particularly well, because Intel
took their original 16-bit processor and added more features in each successive
generation. This approach to chip design leads to cumbersome architectures and
inefficient instruction sets, but it has the tremendous commercial advantage that end
users don't have to pay for new software when they buy the latest reincarnation of a
microprocessor.

A reaction against the trend toward greater architectural complexity began at IBM
with their 801 architecture and continued at Berkeley where Patterson and Ditzel
coined the term RISC to describe a new class of architectures that reversed earlier
trends in microcomputer design. According to popular wisdom RISC architectures are
streamlined versions of traditional complex instruction set computers. This notion is
both misleading and dangerous, because it implies that RISC processors are in some
way cruder versions of existing architectures. In brief, RISC architectures re-deploy to
better effect some of the silicon real estate used to implement complex instructions
and elaborate addressing modes in conventional microprocessors of the 68000 and
8086 generation. The mnemonic "RISC" should really stand for regular instruction set
computer.

Two factors influencing the architecture of first- and second-generation
microprocessors were microprogramming and the desire to help compiler writers by
providing ever more complex instruction sets. The latter is called closing the semantic
gap (i.e., reducing the difference between high-level and low-level languages). By
complex instructions we mean instructions like MOVE 12(A3,D0),D2 and ADD -(A6),D3
that carry out multi-step operations in a single machine-level instruction. The
instruction MOVE 12(A3,D0),D2 generates an effective address by adding the contents
of A3 to the contents of D0 plus the literal 12. The resulting address is used to
access the source operand that is loaded into register D2.
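As a sketch of the address arithmetic involved (this is not a 68000 emulator, and the register values below are invented for illustration):

```python
# Effective-address calculation for MOVE 12(A3,D0),D2:
# EA = [A3] + [D0] + 12 (address register indirect with index
# and displacement).

def effective_address(a3: int, d0: int, displacement: int = 12) -> int:
    """Return the source operand address for 12(A3,D0)."""
    return a3 + d0 + displacement

# With the assumed values A3 = 0x1000 and D0 = 0x20, the source
# operand is read from 0x1000 + 0x20 + 12 = 0x102C.
ea = effective_address(0x1000, 0x20)
```

The operand at that address is then loaded into D2; the CISC processor performs this whole sequence as a single instruction.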

Microprogramming achieved its highpoint in the 1970s when ferrite core memory had
a long access time of 1 µs or more and semiconductor high-speed random access
memory was very expensive. Quite naturally, computer designers used the slow main
store to hold the complex instructions that made up the machine-level program. These
machine-level instructions are interpreted by microcode in the much faster
microprogram control store within the CPU. Today, main stores use semiconductor
memory with an access time of 50 ns or less, and most of the advantages of
microprogramming have evaporated. Indeed, the goal of a RISC architecture is to
execute an instruction in a single machine cycle. A corollary of this statement is that
complex instructions can't be executed by RISC architectures. Before we look at RISC
architectures, we have to describe some of the research that led to the search for better
architectures.

Instruction Usage
Computer scientists carried out extensive research over a decade or more in the late
1970s into the way in which computers execute programs. Their studies demonstrated
that the relative frequency with which different classes of instructions are executed is
not uniform and that some types of instruction are executed far more frequently than
other types. Fairclough divided machine-level instructions into eight groups according
to type and compiled the statistics shown in Table 1. The "mean value of instruction
use" gives the percentage of times that instructions in that group are executed
averaged over both program types and computer architecture. These figures relate to
early 8-bit processors.
Table 1 Instruction usage as a function of instruction type

Instruction Group             1           2        3          4        5        6          7      8
Mean value of instruction use 45.28       28.73    10.75      5.92     3.91     2.93       2.05   0.4

These eight instruction groups in Table 1 are:

1. Data movement
2. Program flow control (i.e., branch, call, return)
3. Arithmetic
4. Compare
5. Logical
6. Shift
7. Bit manipulation
8. Input/output and miscellaneous

Table 1 convincingly demonstrates that the most common instruction type is the data
movement primitive of the form P := Q in a high-level language or MOVE P,Q in a
low-level language. Similarly, the program flow control group that includes both
conditional and unconditional branches (together with subroutine calls and returns)
forms the second most common group of instructions. Taken together, the data
movement and program flow control groups account for 74% of all instructions. A
corollary of this statement is that we can expect a large program to contain only 26%
of instructions that are not data movement or program flow control primitives.

An inescapable inference from such results is that processor designers might be better
employed devoting their time to optimizing the way in which machines handle
instructions in groups one and two, than in seeking new powerful instructions that are
seldom used. In the early days of the microprocessor, chip manufacturers went out of
their way to provide special instructions that were unique to their products. These
instructions were then heavily promoted by the company's sales force. Today, we can
see that their efforts should have been directed towards the goal of optimizing the
most frequently used instructions. RISC architectures have been designed to exploit
the programming environment in which most instructions are data movement or
program control instructions.

Another aspect of computer architecture that was investigated was the optimum size
of literal operands (i.e., constants). Tanenbaum reported the remarkable result that
56% of all constant values lie in the range -15 to +15 and that 98% of all constants lie
in the range -511 to +511. Consequently, the inclusion of a 5-bit constant field in an
instruction would cover over half the occurrences of a literal. RISC architectures have
sufficiently long instruction lengths to include a literal field as part of the instruction
that caters for the majority of literals.
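To see why a short literal field covers most cases, a signed n-bit field can be tested as follows (a minimal sketch; the helper name is ours, and the field widths come from the figures above):

```python
def fits_signed_field(value: int, bits: int) -> bool:
    """True if value fits an n-bit two's-complement field,
    i.e. -2**(n-1) <= value <= 2**(n-1) - 1."""
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1

# A 5-bit field covers -16..+15, enough for the range -15..+15 that
# accounts for 56% of constants; the Berkeley RISC's 13-bit field
# covers -4096..+4095, far more than the -511..+511 range quoted.
```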

Programs use subroutines heavily, and an effective architecture should optimize the
way in which subroutines are called, parameters passed to and from subroutines, and
workspace allocated to local variables created by subroutines. Research showed that
in 95% of cases twelve words of storage are sufficient for parameter passing and local
storage. A computer with twelve registers should be able to handle all the operands
required by most subroutines without accessing main store. Such an arrangement
would reduce the processor-memory bus traffic associated with subroutine calls.

Characteristics of RISC Architectures
Having described the ingredients that go into an efficient architecture, we now look at
the attributes of first generation RISCs before covering RISC architectures in more
detail. The characteristics of an efficient RISC architecture are:

1. RISC processors have sufficient on-chip registers to overcome the worst effects of
the processor-memory bottleneck. Registers can be accessed more rapidly than off-chip
main store. Although today's processors rely heavily on fast on-chip cache memory to
increase throughput, registers still offer the highest performance.

2. RISC processors have three-address, register-to-register architectures with
instructions in the form OPERATION Ra,Rb,Rc, where Ra, Rb, and Rc are
general-purpose registers.

3. Because subroutine calls are so frequently executed, (some) RISC architectures
make provision for the efficient passing of parameters between subroutines.

4. Instructions that modify the flow of control (e.g., branch instructions) are
implemented efficiently because they comprise about 20 to 30% of a typical program.

5. RISC processors aim to execute one instruction per clock cycle. This goal imposes
a limit on the maximum complexity of instructions.

6. RISC processors don't attempt to implement infrequently used instructions. Complex
instructions waste silicon real-estate and conflict with the requirements of point 8.
Moreover, the inclusion of complex instructions increases the time taken to design,
fabricate, and test a processor.

7. A corollary of point 5 is that an efficient architecture should not be
microprogrammed, because microprogramming interprets a machine-level instruction
by executing microinstructions. In the limit, a RISC processor is close to a
microprogrammed architecture in which the distinction between machine code and
microcode has vanished.

8. An efficient processor should have a single instruction format (or at least very
few formats). A typical CISC processor such as the 68000 has variable-length
instructions (e.g., from 2 to 10 bytes). By providing a single instruction format,
the decoding of a RISC instruction into its component fields can be performed by a
minimum level of decoding logic. It follows that a RISC's instruction length should
be sufficient to accommodate the operation code field and one or more operand fields.
Consequently, a RISC processor may not utilize memory space as efficiently as does a
conventional CISC microprocessor.

Two fundamental aspects of the RISC architecture that we cover later are its register
set and the use of pipelining. Multiple overlapping register windows were
implemented by the Berkeley RISC to reduce the overhead incurred by transferring
parameters between subroutines. Pipelining is a mechanism that permits the
overlapping of instruction execution (i.e., internal operations are carried out in
parallel). Many of the features of RISC processors are not new, and have been
employed long before the advent of the microprocessor. The RISC revolution
happened when all these performance-enhancing techniques were brought together
and applied to microprocessor design.

The Berkeley RISC
Although many CISC processors were designed by semiconductor manufacturers, one
of the first RISC processors came from the University of California at Berkeley. The
Berkeley RISC wasn't a commercial machine, although it had a tremendous impact on
the development of later RISC architectures. Figure 1 describes the format of a
Berkeley RISC instruction. Each of the 5-bit operand fields (Destination, Source 1,
Source 2) permits one of 32 internal registers to be accessed.

Figure 1 Format of the Berkeley RISC instruction
The single-bit set condition code field, Scc, determines whether the condition code
bits are updated after the execution of an instruction. The 14-bit Source 2 field has
two functions. If the IM bit (immediate) is 0, the Source 2 field specifies one of 32
registers. If the IM bit is 1, the Source 2 field provides a 13-bit literal operand.
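The IM-bit selection can be sketched as a decoder for the 14-bit Source 2 field. The exact bit positions inside the instruction word are an assumption here; only the IM-bit behaviour described above is taken from the text:

```python
def decode_source2(field14: int):
    """Decode a 14-bit Source 2 field: bit 13 is IM, bits 0-12 the rest."""
    im = (field14 >> 13) & 1
    low13 = field14 & 0x1FFF
    if im == 0:
        return ("register", low13 & 0x1F)          # one of 32 registers
    # IM = 1: sign-extend the 13-bit literal
    literal = low13 - 0x2000 if low13 & 0x1000 else low13
    return ("literal", literal)
```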

Since five bits are allocated to each operand field, you might conclude that this
RISC has 2^5 = 32 internal registers. In fact, the Berkeley RISC has 138
user-accessible general-purpose internal registers. The discrepancy between the
number of registers directly addressable and the actual number of registers is
explained by a mechanism called windowing that gives the programmer a view of only
a subset of all registers at any instant. Register R0 is
hardwired to contain the constant zero. Specifying R0 as an operand is the same as
specifying the constant 0.

Register Windows
An important feature of the Berkeley RISC architecture is the way in which it
allocates new registers to subroutines; that is, when you call a subroutine, you get
some new registers. If you can create 12 registers out of thin air when you call a
subroutine, each subroutine will have its own workspace for temporary variables,
thereby avoiding relatively slow accesses to main store.
Although only 12 or so registers are required by each invocation of a subroutine, the
successive nesting of subroutines rapidly increases the total number of on-chip
registers assigned to subroutines. You might think that any attempt to dedicate a set of
registers to each new procedure is impractical, because the repeated calling of nested
subroutines will require an unlimited amount of storage. Subroutines can indeed be
nested to any depth, but research has demonstrated that on average subroutines are not
nested to any great depth over short periods. Consequently, it is feasible to adopt a
modest number of local register sets for a sequence of nested subroutines.

Figure 2 provides a graphical representation of the execution of a typical program in
terms of the depth of nesting of subroutines as a function of time. The trace goes up
each time a subroutine is called and down each time a return is made. If subroutines
were never called, the trace would be a horizontal line. This figure demonstrates
that even though subroutines may be nested to considerable depths, there are periods
or runs of subroutine calls and returns that do not require a nesting level of greater
than about five.

Figure 2 Depth of subroutine nesting as a function of time
A mechanism for implementing local variable work space for subroutines adopted by
the designers of the Berkeley RISC is to support up to eight nested subroutines by
providing on-chip work space for each subroutine. Any further nesting forces the CPU
to dump registers to main memory, as we shall soon see.

Memory space used by subroutines can be divided into four types:

Global space Global space is directly accessible by all subroutines and holds
constants and data that may be required from any point within the program. Most
conventional microprocessors have only global registers.
Local space Local space is private to the subroutine. That is, no other subroutine can
access the current subroutine's local address space from outside the subroutine. Local
space is employed as working space by the current subroutine.

Imported parameter space Imported parameter space holds the parameters imported
by the current subroutine from its parent that called it. In Berkeley RISC terminology
these are called the high registers.

Exported parameter space Exported parameter space holds the parameters exported
by the current subroutine to its child. In RISC terminology these are called the low
registers.




Windows and Parameter Passing
One of the reasons for the high frequency of data movement operations is the need to
pass parameters to subroutines and to receive them from subroutines.

The Berkeley RISC architecture deals with parameter passing by means of multiple
overlapped windows. A window is the set of registers visible to the current subroutine.
Figure 3 illustrates the structure of the Berkeley RISC's overlapping windows. Only
three consecutive windows (i-1, i, i+1) of the 8 windows are shown in Figure 3. The
vertical columns represent the registers seen by the corresponding window. Each
window sees 32 registers, but they aren't all the same 32 registers.

The Berkeley RISC has a special-purpose register called the window pointer, WP, that
indicates the current active window. Suppose that the processor is currently using
the ith window set. In this case the WP contains the value i. The registers in each of
the 8 windows are divided into four groups shown in Table 2.

Table 2 Berkeley RISC register types

Register name           Register type
R0 to R9                The global register set is always accessible.
R10 to R15              Six registers used by the subroutine to receive parameters from its parent and to return results to that parent.
R16 to R25              Ten local registers accessed only by the current subroutine; they cannot be accessed by any other subroutine.
R26 to R31              Six registers used by the subroutine to pass parameters to and from its own child (i.e., any subroutine called by itself).
All windows consist of 32 addressable registers, R0 to R31. A Berkeley RISC
instruction of the form ADD R3,R12,R25 implements [R25] ← [R3] + [R12], where
R3 lies within the window's global address space, R12 lies within its import from (or
export to) parent subroutine space, and R25 lies within its local address space. RISC
arithmetic and logical instructions always involve 32-bit values (there are no 8-bit or
16-bit operations).

The Berkeley RISC's subroutine call is CALLR Rd,<address> and is similar to a
typical CISC instruction BSR <address>. Whenever a subroutine is invoked
by CALLR Rd,<address>, the contents of the window pointer are incremented by
1 and the current value of the program counter is saved in register Rd of the new
window. The Berkeley RISC doesn't employ a conventional stack in external main
memory to save subroutine return addresses.

Figure 3 Berkeley windowed register sets
Once a new window has been invoked (in Figure 3 this is window i), the new
subroutine sees a different set of registers to the previous window. Global registers R0
to R9 are an exception because they are common to all windows. Register R10 of the
child (i.e., called) subroutine corresponds to (i.e., is the same as) register R26 of the
calling (i.e., parent) subroutine. Suppose you wish to send a parameter to a subroutine.
If the parameter is in R26 and you call a subroutine, register R10 in that subroutine
will contain the parameter. There hasn't been a physical transfer of data because
register R10 in the current window is simply register R26 in the previous window.

Figure 4 Relationship between register number, window number, and register address
The physical arrangement of the Berkeley RISC's window system is given in Figure 4.
On the left hand side of the diagram is the actual register array that holds all the on-
chip general-purpose registers. The eight columns associated with windows 0 to 7
demonstrate how each window is mapped onto the physical memory array on the chip
and how the overlapping regions are organized. The windows are logically arranged
in a circular fashion so that window 0 follows window 7 and window 7 precedes
window 0. For example, if the current window pointer is 3 and you access register
R25, location 74 is accessed in the register file. However, if you access register R25
when the window pointer is 7, you access location 137.

The total number of physical registers required to implement the Berkeley windowed
register set is:

10 global + 8 x 10 local + 8 x 6 parameter transfer registers = 138 registers.
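This arithmetic can be checked with a sketch of the logical-to-physical register mapping. The stride of 16 physical registers per window and the circular wrap are reconstructed from the description above, so the layout details are assumptions rather than the documented Berkeley numbering:

```python
N_GLOBALS, N_WINDOWS, STRIDE = 10, 8, 16
N_PHYSICAL = N_GLOBALS + N_WINDOWS * STRIDE        # 10 + 128 = 138

def physical_register(window: int, reg: int) -> int:
    """Map register R<reg>, as seen from <window>, to a physical register."""
    if reg < N_GLOBALS:                            # R0-R9: shared globals
        return reg
    offset = (window * STRIDE + (reg - N_GLOBALS)) % (N_WINDOWS * STRIDE)
    return N_GLOBALS + offset

# The overlap falls out of the mapping: the caller's low registers
# (R26-R31) are physically the callee's high registers (R10-R15).
assert physical_register(3, 26) == physical_register(4, 10)
```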

Window Overflow

Unfortunately, the total quantity of on-chip resources of any processor is finite and, in
the case of the Berkeley RISC, the registers are limited to 8 windows. If subroutines
are nested to a depth greater than or equal to 7, window overflow is said to occur, as
there is no longer a new window for the next subroutine invocation. When an
overflow takes place, the only thing left to do is to employ external memory to hold
the overflow data. In practice the oldest window is saved rather than the new window
created by the subroutine just called.

If, after a window overflow, subroutine returns reduce the nesting depth to the point
at which the current window is no longer on the chip, window underflow takes place.
Window underflow is the converse of window overflow, and the youngest window saved in
main store must be restored to an on-chip window.
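A toy model of the overflow/underflow bookkeeping, under the assumption that one of the eight windows is always kept free (the real processor uses traps and a handler to do the saving and restoring):

```python
N_WINDOWS = 8
ON_CHIP_CAPACITY = N_WINDOWS - 1   # assumption: one window kept free

class WindowFile:
    """Toy model of window overflow/underflow bookkeeping."""
    def __init__(self):
        self.depth = 0        # current subroutine nesting depth
        self.saved = 0        # windows currently spilled to main memory
        self.overflows = 0
        self.underflows = 0

    def call(self):
        self.depth += 1
        if self.depth - self.saved > ON_CHIP_CAPACITY:
            self.saved += 1          # spill the oldest window to memory
            self.overflows += 1

    def ret(self):
        self.depth -= 1
        if self.depth > 0 and self.depth == self.saved:
            self.saved -= 1          # restore the youngest spilled window
            self.underflows += 1
```

Nesting nine calls deep spills two windows; unwinding all the way back restores both.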

A considerable amount of research was carried out into dealing with window overflow
efficiently. However, the imaginative use of windowed register sets in the Berkeley
RISC was not adopted by many of the later RISC architectures. Modern RISC
processors generally have a single set of 32 general-purpose registers.

RISC Architecture and Pipelining
We now describe pipelining, one of the most important techniques for increasing the
throughput of a digital system that uses the regular structure of a RISC to carry out
internal operations in parallel.
Figure 5 illustrates the machine cycle of a hypothetical microprocessor executing an
ADD P instruction (i.e., [A] ← [A] + [M(P)]), where A is an on-chip general-purpose
register and P is a memory location. The instruction is executed in five phases:

Instruction fetch Read the instruction from the system memory and increment the
program counter.

Instruction decode Decode the instruction read from memory during the previous
phase. The nature of the instruction decode phase is dependent on the complexity of
the instruction encoding. A regularly encoded instruction might be decoded in a few
nanoseconds with two levels of gating whereas a complex instruction format might
require ROM-based look-up tables to implement the decoding.

Operand fetch The operand specified by the instruction is read from the system
memory or an on-chip register and loaded into the CPU.

Execute The operation specified by the instruction is carried out.

Operand store The result obtained during the execution phase is written into the
operand destination. This may be an on-chip register or a location in external memory.

Figure 5 Instruction Execution




Each of these five phases may take a specific time (although the time taken would
normally be an integer multiple of the system's master clock period). Some
instructions require less than five phases; for example, CMP R1,R2 compares R1 and
R2 by subtracting R1 from R2 to set the condition codes and does not need an operand
store phase.

The inefficiency in the arrangement of Figure 5 is immediately apparent. Consider
the execution phase of instruction interpretation. This phase might take one fifth of an
instruction cycle leaving the instruction execution unit idle for the remaining 80% of
the time. The same rule applies to the other functional units of the processor, which
also lie idle for 80% of the time. A technique called instruction pipelining can be
employed to increase the effective speed of the processor by overlapping in time the
various stages in the execution of an instruction. In the simplest of terms, a pipelined
processor executes instruction i while fetching instruction i + 1 at the same time.

The way in which a RISC processor implements pipelining is described in Figure 6.
The RISC processor executes the instruction in four steps or phases: instruction fetch
from external memory, operand fetch, execute, and operand store (we're using a 4-
stage system because a separate "instruction decode" phase isn't normally necessary).
The internal phases take approximately the same time as the instruction fetch, because
these operations take place within the CPU itself and operands are fetched from and
stored in the CPU's own register file. Instruction 1 in Figure 6 begins in time slot 1
and is completed at the end of time slot 4.

Figure 6 Pipelining and instruction overlap
In a non-pipelined processor, the next instruction doesn't begin until the current
instruction has been completed. In the pipelined system of Figure 6, the instruction
fetch phase of instruction 2 begins in time slot 2, at the same time that the operand is
being fetched for instruction 1. In time slot 3, different phases of instructions 1, 2, and
3 are being executed simultaneously. In time slot 4, all functional units of the system
are operating in parallel and an instruction is completed in every time slot thereafter.
An n-stage pipeline can increase throughput by up to a factor of n.
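The cycle counts behind this claim can be sketched directly, assuming a perfectly regular pipeline with no stalls or bubbles:

```python
def cycles(n_instructions: int, k_stages: int, pipelined: bool) -> int:
    """Total cycles to run a straight-line program with no stalls."""
    if pipelined:
        # k cycles to fill the pipeline, then one completion per cycle
        return k_stages + (n_instructions - 1)
    return k_stages * n_instructions

# With k = 4 stages and 1000 instructions: 4000 cycles unpipelined
# versus 1003 pipelined, a speedup approaching the factor of 4.
```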

Pipeline Bubbles
A pipeline is an ordered structure that thrives on regularity. At any stage in the
execution of a program, a pipeline contains components of two or more instructions at
varying stages in their execution. Consider Figure 7 in which a sequence of
instructions is being executed in a 4-stage pipelined processor. When the processor
encounters a branch instruction, the following instruction is no longer found at the
next sequential address but at the target address in the branch instruction. The
processor is forced to reload its program counter with the value provided by the
branch instruction. This means that all the useful work performed by the pipeline must
now be thrown away, since the instructions immediately following the branch are not
going to be executed.

When information in a pipeline is rejected or the pipeline is held up by the
introduction of idle states, we say that a bubble has been introduced.

Figure 7 The pipeline bubble caused by a branch
As we have already stated, program control instructions are very frequent.
Consequently, any realistic processor using pipelining must do something to
overcome the problem of bubbles caused by instructions that modify the flow of
control (branch, subroutine call and return). The Berkeley RISC reduces the effect of
bubbles by refusing to throw away the instruction following a branch. This
mechanism is called a delayed jump or a branch-and-execute technique because the
instruction immediately after a branch is always executed. Consider the effect of the
following sequence of instructions:

ADD   R1,R2,R3       [R3] ← [R1] + [R2]
JMPX  N              [PC] ← N            Goto address N
ADD   R2,R4,R5       [R5] ← [R2] + [R4]  This is executed
ADD   R7,R8,R9       Not executed because the branch is taken

The processor calculates R5 := R2 + R4 before executing the branch. This sequence of
instructions is most strange to the eyes of a conventional assembly language
programmer, who is not accustomed to seeing an instruction executed after a branch
has been taken.
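The sequence above can be mimicked with a toy interpreter for a one-instruction delay slot; the tuple encoding and branch target are invented for illustration and carry none of the real RISC semantics:

```python
def run(program):
    """Execute a list of (label, is_branch, target) tuples, always
    running the instruction in the slot after a taken branch."""
    executed, pc = [], 0
    while 0 <= pc < len(program):
        label, is_branch, target = program[pc]
        executed.append(label)
        if is_branch:
            delay_label, _, _ = program[pc + 1]   # delay slot: always runs
            executed.append(delay_label)
            pc = target
        else:
            pc += 1
    return executed

prog = [
    ("ADD R1,R2,R3", False, None),
    ("JMPX N",       True,  4),      # branch to index 4 (label N)
    ("ADD R2,R4,R5", False, None),   # delay slot: executed anyway
    ("ADD R7,R8,R9", False, None),   # skipped: the branch is taken
    ("N: ...",       False, None),
]
```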
Unfortunately, it's not always possible to arrange a program in such a way as to
include a useful instruction immediately after a branch. Whenever this happens, the
compiler must introduce a no operation instruction, NOP, after the branch and accept
the inevitability of a bubble. Figure 8 demonstrates how a RISC processor implements
a delayed jump. The branch described in Figure 8 is a computed branch whose target
address is calculated during the execute phase of the instruction cycle.

Figure 8 Delayed branch




Another problem caused by pipelining is data dependency in which certain sequences
of instructions run into trouble because the current operation requires a result from the
previous operation and the previous operation has not yet left the pipeline. Figure 9
demonstrates how data dependency occurs.

Figure 9 Data dependency
Suppose a programmer wishes to carry out the apparently harmless calculation

X := (A + B) AND (A + B - C).

Assuming that A, B, C, X, and two temporary values, T1 and T2, are in registers in
the current window, we can write:

ADD A,B,T1        [T1] ← [A] + [B]
SUB T1,C,T2       [T2] ← [T1] - [C]
AND T1,T2,X       [X]  ← [T1] AND [T2]

Instruction i + 1 in Figure 9 begins execution during the operand fetch phase of the
previous instruction. However, instruction i + 1 cannot continue on to its operand
fetch phase, because the very operand it requires does not get written back to the
register file for another two clock cycles. Consequently a bubble must be introduced
in the pipeline while instruction i + 1 waits for its data. In a similar fashion, the logical
AND operation also introduces a bubble as it too requires the result of a previous
operation which is in the pipeline.

Figure 10 demonstrates a technique called internal forwarding designed to overcome
the effects of data dependency. The following sequence of operations is to be
executed.

1. ADD R1,R2,R3    [R3] ← [R1] + [R2]
2. ADD R4,R5,R6    [R6] ← [R4] + [R5]
3. ADD R3,R4,R7    [R7] ← [R3] + [R4]
4. ADD R7,R1,R8    [R8] ← [R7] + [R1]

Figure 10 Internal forwarding
In this example, instruction 3 (i.e., ADD R3,R4,R7) uses an operand generated by
instruction 1 (i.e., the contents of register R3). Because of the intervening instruction
2, the destination operand generated by instruction 1 has time to be written into the
register file before it is read as a source operand by instruction 3.

Instruction 3 generates a destination operand R7 that is required as a source operand
by the next instruction. If the processor were to read the source operand requested by
instruction 4 from the register file, it would see the old value of R7. By means of
internal forwarding the processor transfers R7 from instruction 3's execution unit
directly to the execution unit of instruction 4 (see Figure 10).
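The hazard check that triggers internal forwarding can be sketched over the four instructions above; the (dest, src1, src2) tuple form is our own representation:

```python
def needs_forwarding(prev, curr) -> bool:
    """True if curr reads the register its immediate predecessor writes,
    before writeback has reached the register file."""
    dest, _, _ = prev
    _, src1, src2 = curr
    return dest in (src1, src2)

seq = [
    ("R3", "R1", "R2"),   # 1. ADD R1,R2,R3
    ("R6", "R4", "R5"),   # 2. ADD R4,R5,R6
    ("R7", "R3", "R4"),   # 3. ADD R3,R4,R7  (R3 written back in time)
    ("R8", "R7", "R1"),   # 4. ADD R7,R1,R8  (R7 must be forwarded)
]
hazards = [needs_forwarding(a, b) for a, b in zip(seq, seq[1:])]
```

Only the pair (3, 4) needs forwarding; instruction 3's read of R3 is safe because of the intervening instruction 2.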

Accessing External Memory in RISC
Systems
Conventional CISC processors have a wealth of addressing modes that are used in
conjunction with memory reference instructions. For example, the 68020 implements
ADD D0,-(A5) which adds the contents of D0 to the top of the stack pointed at by A5
and then pushes the result on to this stack.

In their ruthless pursuit of efficiency, the designers of the Berkeley RISC severely
restricted the way in which it accesses external memory. The Berkeley RISC permits
only two types of reference to external memory: a load and a store. All arithmetic and
logical operations carried out by the RISC apply only to source and destination
operands in registers. Similarly, the Berkeley RISC provides a limited number of
addressing modes with which to access an operand in the main store. It's not hard to
find the reason for these restrictions on external memory accesses—an external
memory reference takes longer than an internal operation. We now discuss some of
the general principles of Berkeley RISC load and store instructions.

Consider the load register operation of the form LDXW (Rx)S2,Rd that has the effect
[Rd] ← [M([Rx] + S2)]. The operand address is the contents of register Rx plus the
offset S2. Figure 11 demonstrates the sequence of
actions performed during the execution of this instruction. During the source fetch
phase, register Rx is read from the register file and used to calculate the effective
address of the operand in the execute phase. However, the processor can't progress
beyond the execute phase to the store operand phase, because the operand hasn't been
read from the main store. Therefore the main store must be accessed to read the
operand and a store operand phase executed to load the operand into destination
register Rd. Because memory accesses introduce bubbles into the pipeline, they are
avoided wherever possible.

Figure 11 The load operation
The Berkeley RISC implements two basic addressing modes: indexed and program
counter relative. All other addressing modes can (and must) be synthesized from these
two primitives. The effective address in the indexed mode is given by:

EA = [Rx] + S2

where Rx is the index register (one of the 32 general purpose registers accessible by
the current subroutine) and S2 is an offset. The offset can be either a general-purpose
register or a 13-bit constant.

The effective address in the program counter relative mode is given by:

EA = [PC] + S2

where PC represents the contents of the program counter and S2 is an offset as above.

These addressing modes provide quite a powerful toolbox: zero, one, or two pointers
and a constant offset. If you wonder how we can use an addressing mode without an
index (i.e., pointer) register, remember that R0 in the global register set permanently
contains the constant 0. For example, LDXW (R12)R0,R3 uses simple address
register indirect addressing, whereas LDXW (R0)123,R3 uses absolute addressing
(i.e., memory location 123).
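The R0 trick can be sketched as a tiny software model of effective-address calculation. This is an illustrative sketch, not Berkeley RISC hardware; the register array `reg` and the function `effective_address` are hypothetical names, with `reg[0]` simply never written so that it behaves like the hard-wired zero the text describes.

```c
#include <stdint.h>

/* Hypothetical model of Berkeley RISC effective-address calculation.
   reg[0] is never written, so it acts as the hard-wired constant 0. */
static uint32_t reg[32];

uint32_t effective_address(int rx, uint32_t s2)
{
    return reg[rx] + s2;  /* EA = [Rx] + S2 */
}
```

With rx = 12 and s2 = 0 (i.e., S2 = R0) this gives register indirect addressing via R12; with rx = 0 and s2 = 123 it gives absolute addressing of location 123.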

There's a difference between addressing modes permitted by load and store
operations. A load instruction permits the second source, S2, to be either an
immediate value or a second register, whereas a store instruction permits S2 to be a
13-bit immediate value only. This lack of symmetry between the load and store
addressing modes is because a "load base+index" instruction requires a register file
with two ports, whereas a "store base+index" instruction requires a register file
with three ports. Two-ported memory allows two simultaneous accesses. Three-ported
memory allows three simultaneous accesses and is harder to design.

Figure 1 defines just two basic Berkeley RISC instruction formats. The short
immediate format provides a 5-bit destination, a 5-bit source 1 operand and a 14-bit
short source 2 operand. The short immediate format has two variations: one that
specifies a 13-bit literal for source 2 and one that specifies a 5-bit source 2 register
address. Bit 13 specifies whether the source 2 operand is a 13-bit literal or a 5-bit
register pointer.

The long immediate format provides a 19-bit source operand by concatenating the two
source operand fields. Thirteen-bit and 19-bit immediate fields may sound a little
strange at first sight. However, since 13 + 19 = 32, the Berkeley RISC permits a full
32-bit value to be loaded into a window register in two operations. In the next section
we will discover that the ARM processor deals with literals in a different way. A
typical CISC microprocessor might take the same number of instruction bits to
perform the same action (i.e., a 32-bit operation code field followed by a 32-bit
literal).
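The two-operation literal load can be mimicked in C. The sketch below assumes the field widths stated above (a 19-bit upper part followed by a 13-bit lower part); the function name is hypothetical.

```c
#include <stdint.h>

/* Assemble a full 32-bit constant in two steps, mirroring a 19-bit
   long-immediate load followed by OR-ing in a 13-bit short immediate. */
uint32_t load_literal(uint32_t value)
{
    uint32_t r;
    r = (value >> 13) << 13;  /* step 1: load the upper 19 bits */
    r |= value & 0x1FFF;      /* step 2: OR in the lower 13 bits */
    return r;
}
```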

The following describes some of the addressing modes that can be synthesized from
the RISC's basic addressing modes.

1. Absolute addressing
EA = 13-bit offset
Implemented by setting Rx = R0 = 0, S2 = 13-bit constant.

2. Register indirect
EA = [Rx]
Implemented by setting S2 = R0 = 0.

3. Indexed addressing
EA = [Rx] + Offset
Implemented by setting S2 = 13-bit constant.

4. Two-dimensional byte addressing (i.e., byte array access)
EA = [Rx] + [Ry]
Implemented by setting S2 = [Ry].
This mode is available only for load instructions.

Conditional instructions (i.e., branch operations) do not require a destination address
and therefore the five bits, 19 to 23, normally used to specify a destination register are
used to specify the condition (one of 16 since bit 23 is not used by conditional
instructions).

Reducing the Branch Penalty
If we're going to reduce the effect of branches on the performance of RISC
processors, we need to determine the effect of branch instructions on the performance
of the system. Because we cannot know how many branches a given program will
contain, or how likely each branch is to be taken, we have to construct a probabilistic
model to describe the system's performance. We will make the following assumptions:

1. Each non-branch instruction is executed in one cycle.
2. The probability that a given instruction is a branch is pb.
3. The probability that a branch instruction will be taken is pt.
4. If a branch is taken, the additional penalty is b cycles; if a branch is not taken,
there is no penalty.

If pb is the probability that an instruction is a branch, then 1 - pb is the probability
that it is not a branch.

The average number of cycles executed during the execution of a program is the sum
of the cycles taken for non-branch instructions, plus the cycles taken by branch
instructions that are taken, plus the cycles taken by branch instructions that are not
taken. We can derive an expression for the average number of cycles per instruction
as:

Tave = (1 - pb) × 1 + pb × pt × (1 + b) + pb × (1 - pt) × 1 = 1 + pb × pt × b.

This expression, 1 + pb × pt × b, tells us that the number of branch instructions, the
probability that a branch is taken, and the overhead per branch instruction all
contribute to the branch penalty. We are now going to examine some of the ways in
which the value of pb × pt × b can be reduced.
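The algebra can be checked numerically. The sketch below evaluates both the full three-term sum and the simplified form; the function names and the probability values in the usage are arbitrary illustrative choices.

```c
/* Average cycles per instruction: the full expression and its
   simplification to 1 + pb*pt*b. */
double t_ave_full(double pb, double pt, double b)
{
    return (1.0 - pb) * 1.0        /* non-branch instructions     */
         + pb * pt * (1.0 + b)     /* branches that are taken     */
         + pb * (1.0 - pt) * 1.0;  /* branches that are not taken */
}

double t_ave_simple(double pb, double pt, double b)
{
    return 1.0 + pb * pt * b;
}
```

For example, with pb = 0.2, pt = 0.6, and b = 3 both forms give 1.36 cycles per instruction.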

Branch Prediction
If we can predict the outcome of the branch instruction before it is executed, we can
start filling the pipeline with instructions from the branch target address (assuming the
branch is going to be taken). For example, if the instruction is BRA N, the processor
can start fetching instructions at locations N, N + 1, N + 2 etc., as soon as the branch
instruction is fetched from memory. In this way, the pipeline is always filled with
useful instructions.

This prediction mechanism works well with an unconditional branch like BRA N.
Unfortunately, conditional branches pose a problem. Consider a conditional branch of
the form BCC N (branch to N on carry bit clear). Should the RISC processor make the
assumption that the branch will not be taken and fetch instructions in sequence, or
should it make the assumption that the branch will be taken and fetch instructions at
the branch target address N?

As we have already said, conditional branches are required to implement various
types of high-level language construct. Consider the following fragment of high-level
language code.

if (J < K) I = I + L;
for (T = 1; T <= I; T++)
{
.
.
}

The first conditional operation compares J with K. Only the nature of the problem will
tell us whether J is often less than K.

The second conditional in this fragment of code is provided by the FOR construct that
tests a counter at the end of the loop and then decides whether to jump back to the
body of the construct or to terminate the loop. In this case, you could bet that the loop
is more likely to be repeated than exited. Loops can be executed thousands of times
before they are exited. Some computers look at the type of conditional branch and
then either fill the pipeline from the branch target if they predict that the branch will
be taken, or fill the pipeline from the instruction after the branch if they predict that it
will not be taken.

If we attempt to predict the behavior of a system with two outcomes (branch taken or
branch not taken), there are four possibilities:

1. Predict branch taken and branch taken — successful outcome
2. Predict branch taken and branch not taken — unsuccessful outcome
3. Predict branch not taken and branch not taken — successful outcome
4. Predict branch not taken and branch taken — unsuccessful outcome

Suppose we apply a branch penalty to each of these four possible outcomes. The
penalty is the number of cycles taken by that particular outcome, as table 3
demonstrates. For example, if we think that a branch will not be taken and get
instructions following the branch and the branch is actually taken (forcing the pipeline
to be loaded with instructions at the target address), the branch penalty in table 3
is c cycles.

Table 3 The branch penalty

Prediction            Result                 Branch penalty
Branch taken          Branch taken           a
Branch taken          Branch not taken       b
Branch not taken      Branch taken           c
Branch not taken      Branch not taken       d

We can now calculate the average penalty for a particular system. To do this we need
more information about the system. The first thing we need to know is the probability
that an instruction will be a branch (as opposed to any other category of instruction).
Assume that the probability that an instruction is a branch is pb. The next thing we
need to know is the probability that the branch instruction will be taken, pt. Finally,
we need to know the accuracy of the prediction. Let pc be the probability that a branch
prediction is correct. These values can be obtained by observing the performance of
real programs. Figure 12 illustrates all the possible outcomes of an instruction. We
can immediately write:

(1 - pb) = probability that an instruction is not a branch.
(1 - pt) = probability that a branch will not be taken.
(1 - pc) = probability that a prediction is incorrect.

These equations are obtained by using the principle that if one event or another must
take place, their probabilities must add up to unity. The average branch penalty per
branch instruction is therefore

Cave = a × p(predicted taken and taken) + b × p(predicted taken but not taken)

+ c × p(predicted not taken but taken) + d × p(predicted not taken and not taken)

Cave = a × pt × pc + b × (1 - pt) × (1 - pc) + c × pt × (1 - pc) + d × (1 - pt) × pc

Figure 12 Branch prediction
The average number of cycles added due to a branch instruction is Cave × pb

= pb × (a × pt × pc + b × (1 - pt) × (1 - pc) + c × pt × (1 - pc) + d × (1 - pt) × pc).

We can make two assumptions to help us to simplify this general expression. The first
is that a = d = N (i.e., if the prediction is correct the number of cycles is N). The other
simplification is that b = c = B (i.e., if the prediction is wrong the number of cycles
is B). The average number of cycles per branch instruction is therefore:

pb × (N × pt × pc + B × pt × (1 - pc) + B × (1 - pt) × (1 - pc) + N × (1 - pt) × pc)
= pb × (N × pc + B × (1 - pc)).
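Again the simplification can be verified numerically; the function names and the sample probabilities below are illustrative.

```c
/* Average cycles per branch instruction: the four-outcome sum
   and its simplification to N*pc + B*(1 - pc). */
double branch_full(double pt, double pc, double N, double B)
{
    return N * pt * pc                  /* correct prediction, branch taken      */
         + B * pt * (1.0 - pc)          /* misprediction, branch taken           */
         + B * (1.0 - pt) * (1.0 - pc)  /* misprediction, branch not taken       */
         + N * (1.0 - pt) * pc;         /* correct prediction, branch not taken  */
}

double branch_simple(double pc, double N, double B)
{
    return N * pc + B * (1.0 - pc);
}
```

Note that pt cancels out of the sum: once the penalty depends only on whether the prediction was correct, the branch direction itself no longer matters.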

This formula can be used to investigate tradeoffs between branch penalties, branch
probabilities and pipeline length. There are several ways of implementing branch
prediction (i.e., increasing the value of pc). Two basic approaches are static branch
prediction and dynamic branch prediction. Static branch prediction makes the
assumption that branches are always taken or never taken. Since observations of real
code have demonstrated that branches have a greater than 50% chance of being taken,
the best static branch prediction mechanism would be to fetch the next instruction
from the branch target address as soon as the branch instruction is detected.

A better method of predicting the outcome of a branch is to observe its op-code,
because some branch instructions are taken more or less frequently than other branch
instructions. Using the branch op-code to predict that the branch will or will not be
taken results in 75% accuracy. An extension of this technique is to devote a bit of the
op-code to the static prediction of branches. This bit is set or cleared by the compiler
depending on whether the compiler estimates that the branch is most likely to be
taken. This technique provides branch prediction accuracy in the range 74 to 94%.

Dynamic branch prediction techniques operate at runtime and use the past behavior of
the program to predict its future behavior. Suppose the processor maintains a table of
branch instructions. This branch table contains information about the likely behavior
of each branch. Each time a branch is executed, its outcome (i.e., taken or not taken)
is used to update the entry in the table. The processor uses the table to determine
whether to take the next instruction from the branch target address (i.e., branch
predicted taken) or from the next address in sequence (branch predicted not taken).

Single-bit branch predictors provide an accuracy of over 80 percent and five-bit
predictors provide an accuracy of up to 98 percent. A typical branch prediction
algorithm uses the last two outcomes of a branch to predict its future. If the last two
outcomes are X, the next branch is assumed to lead to outcome X. If the prediction is
wrong it remains the same the next time the branch is executed (i.e., two failures are
needed to modify the prediction). After two consecutive failures, the prediction is
inverted and the other outcome assumed. This algorithm responds to trends and is not
affected by the occasional single different outcome.
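The two-failures-to-change rule described above is the classic two-bit saturating counter. The following is a minimal sketch (the type and function names are hypothetical, not from any particular processor).

```c
/* Two-bit saturating branch predictor: states 0,1 predict not taken;
   states 2,3 predict taken. Two consecutive mispredictions are needed
   to reverse the prediction, so one occasional odd outcome is ignored. */
typedef struct { int state; } Predictor;  /* state in 0..3 */

int predict_taken(const Predictor *p)
{
    return p->state >= 2;
}

void update(Predictor *p, int taken)
{
    if (taken && p->state < 3)
        p->state++;     /* move toward "strongly taken" */
    else if (!taken && p->state > 0)
        p->state--;     /* move toward "strongly not taken" */
}
```

Starting in the strongly-taken state, a single not-taken outcome leaves the prediction unchanged; only a second consecutive not-taken outcome flips it.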

Problems
1. What are the characteristics of a CISC processor?

2. The most frequently executed class of instruction is the data move instruction. Why
is this?

3. The Berkeley RISC has a 32-bit architecture and yet provides only a 13-bit literal.
Why is this and does it really matter?
4. What are the advantages and disadvantages of register windowing?

5. What is pipelining and how does it increase the performance of a computer?

6. A pipeline is defined by its length (i.e., the number of stages that can operate in
parallel). A pipeline can be short or long. What do you think are the relative
advantages of long and short pipelines?

7. What is data dependency in a pipelined system and how can its effects be
overcome?

8. RISC architectures don't permit operations on operands in memory other than load
and store operations. Why?

9. The average number of cycles required by a RISC to execute an instruction is given
by Tave = 1 + pb × pt × b,

where

The probability that a given instruction is a branch is pb
The probability that a branch instruction will be taken is pt
If a branch is taken, the additional penalty is b cycles
If a branch is not taken, there is no penalty
Draw a series of graphs of the average number of cycles per instruction as a function
of pb × pt for b = 1, 2, 3, and 4.

10. What is branch prediction and how can it be used to reduce the so-called branch
penalty in a pipelined system?

Más contenido relacionado

La actualidad más candente

Comparative Study of RISC AND CISC Architectures
Comparative Study of RISC AND CISC ArchitecturesComparative Study of RISC AND CISC Architectures
Comparative Study of RISC AND CISC ArchitecturesEditor IJCATR
 
Risc and cisc eugene clewlow
Risc and cisc   eugene clewlowRisc and cisc   eugene clewlow
Risc and cisc eugene clewlowChaudhary Manzoor
 
RISC - Reduced Instruction Set Computing
RISC - Reduced Instruction Set ComputingRISC - Reduced Instruction Set Computing
RISC - Reduced Instruction Set ComputingTushar Swami
 
Pipelining and ILP (Instruction Level Parallelism)
Pipelining and ILP (Instruction Level Parallelism) Pipelining and ILP (Instruction Level Parallelism)
Pipelining and ILP (Instruction Level Parallelism) A B Shinde
 
Introducing Embedded Systems and the Microcontrollers
Introducing Embedded Systems and the MicrocontrollersIntroducing Embedded Systems and the Microcontrollers
Introducing Embedded Systems and the MicrocontrollersRavikumar Tiwari
 
Instruction Level Parallelism (ILP) Limitations
Instruction Level Parallelism (ILP) LimitationsInstruction Level Parallelism (ILP) Limitations
Instruction Level Parallelism (ILP) LimitationsJose Pinilla
 
CISC vs RISC Processor Architecture
CISC vs RISC Processor ArchitectureCISC vs RISC Processor Architecture
CISC vs RISC Processor ArchitectureKaushik Patra
 
Translating GPU Binaries to Tiered SIMD Architectures with Ocelot
Translating GPU Binaries to Tiered SIMD Architectures with OcelotTranslating GPU Binaries to Tiered SIMD Architectures with Ocelot
Translating GPU Binaries to Tiered SIMD Architectures with OcelotSlide_N
 
Risc and cisc casestudy
Risc and cisc casestudyRisc and cisc casestudy
Risc and cisc casestudyjvs71294
 
Risc and cisc eugene clewlow
Risc and cisc   eugene clewlowRisc and cisc   eugene clewlow
Risc and cisc eugene clewlowManish Prajapati
 
Review paper on 32-BIT RISC processor with floating point arithmetic
Review paper on 32-BIT RISC processor with floating point arithmeticReview paper on 32-BIT RISC processor with floating point arithmetic
Review paper on 32-BIT RISC processor with floating point arithmeticIRJET Journal
 
Advanced Scalable Decomposition Method with MPICH Environment for HPC
Advanced Scalable Decomposition Method with MPICH Environment for HPCAdvanced Scalable Decomposition Method with MPICH Environment for HPC
Advanced Scalable Decomposition Method with MPICH Environment for HPCIJSRD
 
DESIGN AND ANALYSIS OF A 32-BIT PIPELINED MIPS RISC PROCESSOR
DESIGN AND ANALYSIS OF A 32-BIT PIPELINED MIPS RISC PROCESSORDESIGN AND ANALYSIS OF A 32-BIT PIPELINED MIPS RISC PROCESSOR
DESIGN AND ANALYSIS OF A 32-BIT PIPELINED MIPS RISC PROCESSORVLSICS Design
 
Dsdco IE: RISC and CISC architectures and design issues
Dsdco IE: RISC and CISC architectures and design issuesDsdco IE: RISC and CISC architectures and design issues
Dsdco IE: RISC and CISC architectures and design issuesHome
 
IRJET- Design of Low Power 32- Bit RISC Processor using Verilog HDL
IRJET-  	  Design of Low Power 32- Bit RISC Processor using Verilog HDLIRJET-  	  Design of Low Power 32- Bit RISC Processor using Verilog HDL
IRJET- Design of Low Power 32- Bit RISC Processor using Verilog HDLIRJET Journal
 

La actualidad más candente (20)

Tibor
TiborTibor
Tibor
 
Comparative Study of RISC AND CISC Architectures
Comparative Study of RISC AND CISC ArchitecturesComparative Study of RISC AND CISC Architectures
Comparative Study of RISC AND CISC Architectures
 
Risc and cisc eugene clewlow
Risc and cisc   eugene clewlowRisc and cisc   eugene clewlow
Risc and cisc eugene clewlow
 
RISC - Reduced Instruction Set Computing
RISC - Reduced Instruction Set ComputingRISC - Reduced Instruction Set Computing
RISC - Reduced Instruction Set Computing
 
Pipelining and ILP (Instruction Level Parallelism)
Pipelining and ILP (Instruction Level Parallelism) Pipelining and ILP (Instruction Level Parallelism)
Pipelining and ILP (Instruction Level Parallelism)
 
Risc & cisk
Risc & ciskRisc & cisk
Risc & cisk
 
Introducing Embedded Systems and the Microcontrollers
Introducing Embedded Systems and the MicrocontrollersIntroducing Embedded Systems and the Microcontrollers
Introducing Embedded Systems and the Microcontrollers
 
Instruction Level Parallelism (ILP) Limitations
Instruction Level Parallelism (ILP) LimitationsInstruction Level Parallelism (ILP) Limitations
Instruction Level Parallelism (ILP) Limitations
 
CISC vs RISC Processor Architecture
CISC vs RISC Processor ArchitectureCISC vs RISC Processor Architecture
CISC vs RISC Processor Architecture
 
Translating GPU Binaries to Tiered SIMD Architectures with Ocelot
Translating GPU Binaries to Tiered SIMD Architectures with OcelotTranslating GPU Binaries to Tiered SIMD Architectures with Ocelot
Translating GPU Binaries to Tiered SIMD Architectures with Ocelot
 
Risc and cisc casestudy
Risc and cisc casestudyRisc and cisc casestudy
Risc and cisc casestudy
 
Risc and cisc eugene clewlow
Risc and cisc   eugene clewlowRisc and cisc   eugene clewlow
Risc and cisc eugene clewlow
 
Machine Learning @NECST
Machine Learning @NECSTMachine Learning @NECST
Machine Learning @NECST
 
1.prallelism
1.prallelism1.prallelism
1.prallelism
 
Review paper on 32-BIT RISC processor with floating point arithmetic
Review paper on 32-BIT RISC processor with floating point arithmeticReview paper on 32-BIT RISC processor with floating point arithmetic
Review paper on 32-BIT RISC processor with floating point arithmetic
 
Advanced Scalable Decomposition Method with MPICH Environment for HPC
Advanced Scalable Decomposition Method with MPICH Environment for HPCAdvanced Scalable Decomposition Method with MPICH Environment for HPC
Advanced Scalable Decomposition Method with MPICH Environment for HPC
 
DESIGN AND ANALYSIS OF A 32-BIT PIPELINED MIPS RISC PROCESSOR
DESIGN AND ANALYSIS OF A 32-BIT PIPELINED MIPS RISC PROCESSORDESIGN AND ANALYSIS OF A 32-BIT PIPELINED MIPS RISC PROCESSOR
DESIGN AND ANALYSIS OF A 32-BIT PIPELINED MIPS RISC PROCESSOR
 
Dsdco IE: RISC and CISC architectures and design issues
Dsdco IE: RISC and CISC architectures and design issuesDsdco IE: RISC and CISC architectures and design issues
Dsdco IE: RISC and CISC architectures and design issues
 
ITFT_Risc
ITFT_RiscITFT_Risc
ITFT_Risc
 
IRJET- Design of Low Power 32- Bit RISC Processor using Verilog HDL
IRJET-  	  Design of Low Power 32- Bit RISC Processor using Verilog HDLIRJET-  	  Design of Low Power 32- Bit RISC Processor using Verilog HDL
IRJET- Design of Low Power 32- Bit RISC Processor using Verilog HDL
 

Similar a Risc processors all syllabus5

A 64-Bit RISC Processor Design and Implementation Using VHDL
A 64-Bit RISC Processor Design and Implementation Using VHDL A 64-Bit RISC Processor Design and Implementation Using VHDL
A 64-Bit RISC Processor Design and Implementation Using VHDL Andrew Yoila
 
Performance from Architecture: Comparing a RISC and a CISC with Similar Hardw...
Performance from Architecture: Comparing a RISC and a CISC with Similar Hardw...Performance from Architecture: Comparing a RISC and a CISC with Similar Hardw...
Performance from Architecture: Comparing a RISC and a CISC with Similar Hardw...Dileep Bhandarkar
 
Microcontroller pic 16f877 architecture and basics
Microcontroller pic 16f877 architecture and basicsMicrocontroller pic 16f877 architecture and basics
Microcontroller pic 16f877 architecture and basicsNilesh Bhaskarrao Bahadure
 
Architectures and operating systems
Architectures and operating systemsArchitectures and operating systems
Architectures and operating systemsHiran Kanishka
 
RISC Vs CISC, Harvard v/s Van Neumann
RISC Vs CISC, Harvard v/s Van NeumannRISC Vs CISC, Harvard v/s Van Neumann
RISC Vs CISC, Harvard v/s Van NeumannRavikumar Tiwari
 
Risc and cisc eugene clewlow
Risc and cisc   eugene clewlowRisc and cisc   eugene clewlow
Risc and cisc eugene clewlowkaran saini
 
CS304PC:Computer Organization and Architecture UNIT V_merged_merged.pdf
CS304PC:Computer Organization and Architecture UNIT V_merged_merged.pdfCS304PC:Computer Organization and Architecture UNIT V_merged_merged.pdf
CS304PC:Computer Organization and Architecture UNIT V_merged_merged.pdfAsst.prof M.Gokilavani
 
Computer Organization.pptx
Computer Organization.pptxComputer Organization.pptx
Computer Organization.pptxsaimagul310
 
A New Direction for Computer Architecture Research
A New Direction for Computer Architecture ResearchA New Direction for Computer Architecture Research
A New Direction for Computer Architecture Researchdbpublications
 
Crussoe proc
Crussoe procCrussoe proc
Crussoe proctyadi
 
Advanced computer architecture lesson 5 and 6
Advanced computer architecture lesson 5 and 6Advanced computer architecture lesson 5 and 6
Advanced computer architecture lesson 5 and 6Ismail Mukiibi
 
Co question bank LAKSHMAIAH
Co question bank LAKSHMAIAH Co question bank LAKSHMAIAH
Co question bank LAKSHMAIAH veena babu
 
risc_and_cisc.ppt
risc_and_cisc.pptrisc_and_cisc.ppt
risc_and_cisc.pptRuhul Amin
 

Similar a Risc processors all syllabus5 (20)

A 64-Bit RISC Processor Design and Implementation Using VHDL
A 64-Bit RISC Processor Design and Implementation Using VHDL A 64-Bit RISC Processor Design and Implementation Using VHDL
A 64-Bit RISC Processor Design and Implementation Using VHDL
 
Performance from Architecture: Comparing a RISC and a CISC with Similar Hardw...
Performance from Architecture: Comparing a RISC and a CISC with Similar Hardw...Performance from Architecture: Comparing a RISC and a CISC with Similar Hardw...
Performance from Architecture: Comparing a RISC and a CISC with Similar Hardw...
 
Microcontroller pic 16f877 architecture and basics
Microcontroller pic 16f877 architecture and basicsMicrocontroller pic 16f877 architecture and basics
Microcontroller pic 16f877 architecture and basics
 
Architectures and operating systems
Architectures and operating systemsArchitectures and operating systems
Architectures and operating systems
 
Ca alternative architecture
Ca alternative architectureCa alternative architecture
Ca alternative architecture
 
RISC Vs CISC, Harvard v/s Van Neumann
RISC Vs CISC, Harvard v/s Van NeumannRISC Vs CISC, Harvard v/s Van Neumann
RISC Vs CISC, Harvard v/s Van Neumann
 
Risc and cisc eugene clewlow
Risc and cisc   eugene clewlowRisc and cisc   eugene clewlow
Risc and cisc eugene clewlow
 
Hg3612911294
Hg3612911294Hg3612911294
Hg3612911294
 
Architectures
ArchitecturesArchitectures
Architectures
 
CS304PC:Computer Organization and Architecture UNIT V_merged_merged.pdf
CS304PC:Computer Organization and Architecture UNIT V_merged_merged.pdfCS304PC:Computer Organization and Architecture UNIT V_merged_merged.pdf
CS304PC:Computer Organization and Architecture UNIT V_merged_merged.pdf
 
Computer Organization.pptx
Computer Organization.pptxComputer Organization.pptx
Computer Organization.pptx
 
A New Direction for Computer Architecture Research
A New Direction for Computer Architecture ResearchA New Direction for Computer Architecture Research
A New Direction for Computer Architecture Research
 
Ef35745749
Ef35745749Ef35745749
Ef35745749
 
Crussoe proc
Crussoe procCrussoe proc
Crussoe proc
 
Advanced computer architecture lesson 5 and 6
Advanced computer architecture lesson 5 and 6Advanced computer architecture lesson 5 and 6
Advanced computer architecture lesson 5 and 6
 
R&amp;c
R&amp;cR&amp;c
R&amp;c
 
arm-cortex-a8
arm-cortex-a8arm-cortex-a8
arm-cortex-a8
 
Co question bank LAKSHMAIAH
Co question bank LAKSHMAIAH Co question bank LAKSHMAIAH
Co question bank LAKSHMAIAH
 
risc_and_cisc.ppt
risc_and_cisc.pptrisc_and_cisc.ppt
risc_and_cisc.ppt
 
11 2014
11 201411 2014
11 2014
 

Último

Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3JemimahLaneBuaron
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104misteraugie
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxSayali Powar
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docxPoojaSen20
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 

Último (20)

Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Sanyam Choudhary Chemistry practical.pdf

Risc processors all syllabus5

A reaction against the trend toward greater architectural complexity began at IBM with the 801 architecture, and continued at Berkeley, where Patterson and Ditzel coined the term RISC to describe a new class of architectures that reversed earlier trends in microcomputer design. According to popular wisdom, RISC architectures are streamlined versions of traditional complex instruction set computers. This notion is
both misleading and dangerous, because it implies that RISC processors are in some way cruder versions of existing architectures. In brief, RISC architectures re-deploy to better effect some of the silicon real estate used to implement complex instructions and elaborate addressing modes in conventional microprocessors of the 68000 and 8086 generation. The mnemonic "RISC" should really stand for regular instruction set computer.

Two factors influencing the architecture of first- and second-generation microprocessors were microprogramming and the desire to help compiler writers by providing ever more complex instruction sets. The latter is called closing the semantic gap (i.e., reducing the difference between high-level and low-level languages). By complex instructions we mean instructions like MOVE 12(A3,D0),D2 and ADD -(A6),D3 that carry out multi-step operations in a single machine-level instruction. The instruction MOVE 12(A3,D0),D2 generates an effective address by adding the contents of A3 to the contents of D0 plus the literal 12. The resulting address is used to access the source operand, which is loaded into register D2.

Microprogramming achieved its high point in the 1970s, when ferrite core memory had a long access time of 1 µs or more and high-speed semiconductor random access memory was very expensive. Quite naturally, computer designers used the slow main store to hold the complex instructions that made up the machine-level program. These machine-level instructions are interpreted by microcode in the much faster microprogram control store within the CPU. Today, main stores use semiconductor memory with an access time of 50 ns or less, and most of the advantages of microprogramming have evaporated. Indeed, the goal of a RISC architecture is to execute an instruction in a single machine cycle. A corollary of this statement is that complex instructions can't be executed by RISC architectures.
Before we look at RISC architectures, we have to describe some of the research that led to the search for better architectures.

Instruction Usage

Computer scientists carried out extensive research, over a decade or more beginning in the late 1970s, into the way in which computers execute programs. Their studies demonstrated that the relative frequency with which different classes of instructions are executed is not uniform and that some types of instruction are executed far more frequently than others. Fairclough divided machine-level instructions into eight groups according to type and compiled the statistics shown in Table 1. The "mean value of instruction use" gives the percentage of times that instructions in that group are executed, averaged over both program types and computer architecture. These figures relate to early 8-bit processors.
Table 1 Instruction usage as a function of instruction type

Instruction group:              1      2      3      4     5     6     7     8
Mean value of instruction use:  45.28  28.73  10.75  5.92  3.91  2.93  2.05  0.40

The eight instruction groups in Table 1 are:

1. Data movement
2. Program flow control (i.e., branch, call, return)
3. Arithmetic
4. Compare
5. Logical
6. Shift
7. Bit manipulation
8. Input/output and miscellaneous

Table 1 convincingly demonstrates that the most common instruction type is the data movement primitive of the form P := Q in a high-level language or MOVE P,Q in a low-level language. Similarly, the program flow control group, which includes both conditional and unconditional branches (together with subroutine calls and returns), forms the second most common group of instructions. Taken together, the data movement and program flow control groups account for 74% of all instructions. A corollary of this statement is that we can expect a large program to contain only 26% of instructions that are not data movement or program flow control primitives.

An inescapable inference from such results is that processor designers might be better employed devoting their time to optimizing the way in which machines handle instructions in groups one and two than in seeking new powerful instructions that are seldom used. In the early days of the microprocessor, chip manufacturers went out of their way to provide special instructions that were unique to their products. These instructions were then heavily promoted by the company's sales force. Today, we can see that their efforts should have been directed towards the goal of optimizing the most frequently used instructions. RISC architectures have been designed to exploit the programming environment in which most instructions are data movement or program control instructions.

Another aspect of computer architecture that was investigated was the optimum size of literal operands (i.e., constants).
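The figures quoted from Fairclough in Table 1 can be checked with a few lines of Python; the percentages below are simply the means from the table:

```python
# Mean usage (%) for Fairclough's eight instruction groups, as quoted in Table 1.
usage = {
    "data movement": 45.28,
    "program flow control": 28.73,
    "arithmetic": 10.75,
    "compare": 5.92,
    "logical": 3.91,
    "shift": 2.93,
    "bit manipulation": 2.05,
    "input/output and miscellaneous": 0.40,
}

# The two most frequent groups together dominate the instruction mix.
top_two = usage["data movement"] + usage["program flow control"]
print(f"data movement + flow control = {top_two:.2f}%")   # 74.01%
print(f"everything else              = {100 - top_two:.2f}%")
```

The 74% claim in the text is the rounded value of this 74.01% sum.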
Tanenbaum reported the remarkable result that 56% of all constant values lie in the range -15 to +15 and that 98% of all constants lie in the range -511 to +511. Consequently, the inclusion of a 5-bit constant field in an instruction would cover over half the occurrences of a literal. RISC architectures have
sufficiently long instructions to include a literal field that caters for the majority of literals.

Programs use subroutines heavily, and an effective architecture should optimize the way in which subroutines are called, parameters are passed to and from subroutines, and workspace is allocated to the local variables created by subroutines. Research showed that in 95% of cases twelve words of storage are sufficient for parameter passing and local storage. A computer with twelve registers should therefore be able to handle all the operands required by most subroutines without accessing main store. Such an arrangement would reduce the processor-memory bus traffic associated with subroutine calls.

Characteristics of RISC Architectures

Having described the ingredients that go into an efficient architecture, we now look at the attributes of first-generation RISCs before covering RISC architectures in more detail. The characteristics of an efficient RISC architecture are:

1. RISC processors have sufficient on-chip registers to overcome the worst effects of the processor-memory bottleneck. Registers can be accessed more rapidly than off-chip main store. Although today's processors rely heavily on fast on-chip cache memory to increase throughput, registers still offer the highest performance.

2. RISC processors have three-address, register-to-register architectures with instructions of the form OPERATION Ra,Rb,Rc, where Ra, Rb, and Rc are general-purpose registers.

3. Because subroutine calls are so frequently executed, (some) RISC architectures make provision for the efficient passing of parameters between subroutines.

4. Instructions that modify the flow of control (e.g., branch instructions) are implemented efficiently because they comprise about 20 to 30% of a typical program.

5. RISC processors aim to execute one instruction per clock cycle. This goal imposes a limit on the maximum complexity of instructions.
RISC processors don't attempt to implement infrequently used instructions. Complex instructions waste silicon real estate and conflict with the single-cycle execution goal. Moreover, the inclusion of complex instructions increases the time taken to design, fabricate, and test a processor.
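Tanenbaum's literal-size statistics quoted earlier map directly onto two's-complement field widths. A short sketch (the 13-bit width anticipates the Berkeley literal field described later; the helper function itself is just illustrative):

```python
def fits(value: int, bits: int) -> bool:
    """True if value is representable as a two's-complement integer of the given width."""
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1

# A 5-bit signed field spans -16..+15, covering Tanenbaum's -15..+15 range
# that holds 56% of all constants.
assert fits(-15, 5) and fits(15, 5)

# A 13-bit field spans -4096..+4095, comfortably covering the -511..+511
# range that holds 98% of all constants.
assert fits(-511, 13) and fits(511, 13)
print("ok")
```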
A corollary of the single-cycle goal (point 5) is that an efficient architecture should not be microprogrammed, because microprogramming interprets a machine-level instruction by executing microinstructions. In the limit, a RISC processor is close to a microprogrammed architecture in which the distinction between machine cycle and microcode has vanished.

An efficient processor should have a single instruction format (or at least very few formats). A typical CISC processor such as the 68000 has variable-length instructions (e.g., from 2 to 10 bytes). By providing a single instruction format, the decoding of a RISC instruction into its component fields can be performed with a minimum of decoding logic. It follows that a RISC's instruction length should be sufficient to accommodate the operation code field and one or more operand fields. Consequently, a RISC processor may not utilize memory space as efficiently as a conventional CISC microprocessor.

Two fundamental aspects of the RISC architecture that we cover later are its register set and the use of pipelining. Multiple overlapping register windows were implemented by the Berkeley RISC to reduce the overhead incurred by transferring parameters between subroutines. Pipelining is a mechanism that permits the overlapping of instruction execution (i.e., internal operations are carried out in parallel). Many of the features of RISC processors are not new; they had been employed long before the advent of the microprocessor. The RISC revolution happened when all these performance-enhancing techniques were brought together and applied to microprocessor design.

The Berkeley RISC

Although many CISC processors were designed by semiconductor manufacturers, one of the first RISC processors came from the University of California at Berkeley. The Berkeley RISC wasn't a commercial machine, although it had a tremendous impact on the development of later RISC architectures. Figure 1 describes the format of a Berkeley RISC instruction.
Each of the 5-bit register operand fields (Destination and Source 1, and Source 2 when it specifies a register) permits one of 32 internal registers to be accessed.

Figure 1 Format of the Berkeley RISC instruction
The single-bit set condition code field, Scc, determines whether the condition code bits are updated after the execution of an instruction. The 14-bit Source 2 field has two functions. If the IM (immediate) bit is 0, the Source 2 field specifies one of 32 registers. If the IM bit is 1, the Source 2 field provides a 13-bit literal operand.

Since five bits are allocated to each operand field, it follows that this RISC has 2^5 = 32 internal registers. This last statement is emphatically not true, since the Berkeley RISC has 138 user-accessible general-purpose internal registers. The discrepancy between the number of registers directly addressable and the actual number of registers is due to a mechanism called windowing, which gives the programmer a view of only a subset of all registers at any instant. Register R0 is hardwired to contain the constant zero; specifying R0 as an operand is the same as specifying the constant 0.

Register Windows

An important feature of the Berkeley RISC architecture is the way in which it allocates new registers to subroutines; that is, when you call a subroutine, you get some new registers. If you can create 12 registers out of thin air when you call a subroutine, each subroutine will have its own workspace for temporary variables, thereby avoiding relatively slow accesses to main store.
Although only 12 or so registers are required by each invocation of a subroutine, the successive nesting of subroutines rapidly increases the total number of on-chip registers assigned to subroutines. You might think that any attempt to dedicate a set of registers to each new procedure is impractical, because the repeated calling of nested subroutines would require an unlimited amount of storage. Subroutines can indeed be nested to any depth, but research has demonstrated that, on average, subroutines are not nested to any great depth over short periods. Consequently, it is feasible to provide a modest number of local register sets for a sequence of nested subroutines.

Figure 2 provides a graphical representation of the execution of a typical program in terms of the depth of nesting of subroutines as a function of time. The trace goes up each time a subroutine is called and down each time a return is made. If subroutines were never called, the trace would be a horizontal line. This figure demonstrates that, even though subroutines may be nested to considerable depths, there are periods or runs of subroutine calls and returns that do not require a nesting level greater than about five.

Figure 2 Depth of subroutine nesting as a function of time
A mechanism for implementing local variable workspace for subroutines, adopted by the designers of the Berkeley RISC, is to support up to eight nested subroutines by providing on-chip workspace for each subroutine. Any further nesting forces the CPU to dump registers to main memory, as we shall soon see. Memory space used by subroutines can be divided into four types:

Global space
Global space is directly accessible by all subroutines and holds constants and data that may be required from any point within the program. Most conventional microprocessors have only global registers.
Local space
Local space is private to the subroutine; no other subroutine can access the current subroutine's local address space from outside it. Local space is employed as working space by the current subroutine.

Imported parameter space
Imported parameter space holds the parameters imported by the current subroutine from the parent that called it. In Berkeley RISC terminology these are called the high registers.

Exported parameter space
Exported parameter space holds the parameters exported by the current subroutine to its child. In RISC terminology these are called the low registers.

Windows and Parameter Passing

One of the reasons for the high frequency of data movement operations is the need to pass parameters to subroutines and to receive them from subroutines. The Berkeley RISC architecture deals with parameter passing by means of multiple overlapped windows. A window is the set of registers visible to the current subroutine. Figure 3 illustrates the structure of the Berkeley RISC's overlapping windows. Only three consecutive windows (i-1, i, i+1) of the 8 windows are shown in Figure 3. The vertical columns represent the registers seen by the corresponding window. Each window sees 32 registers, but they aren't all the same 32 registers.

The Berkeley RISC has a special-purpose register called the window pointer, WP, that indicates the current active window. Suppose that the processor is currently using the ith window set. In this case the WP contains the value i. The registers in each of the 8 windows are divided into the four groups shown in Table 2.

Table 2 Berkeley RISC register types

R0 to R9    The global register set, which is always accessible.
R10 to R15  Six registers used by the subroutine to receive parameters from its parent (i.e., the subroutine that called it).
R16 to R25  Ten local registers accessed only by the current subroutine; they cannot be accessed by any other subroutine.
R26 to R31  Six registers used by the subroutine to pass parameters to and from its own child (i.e., a subroutine called by itself).
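The classification in Table 2 is easy to express as a lookup. A minimal sketch, with the boundaries taken directly from the table (the label strings are invented for illustration):

```python
def register_type(r: int) -> str:
    """Classify a Berkeley RISC register number R0..R31 according to Table 2."""
    if not 0 <= r <= 31:
        raise ValueError("register number must be 0..31")
    if r <= 9:
        return "global"                         # R0-R9: shared by all windows
    if r <= 15:
        return "parameter in (from parent)"     # R10-R15
    if r <= 25:
        return "local"                          # R16-R25: private workspace
    return "parameter out (to child)"           # R26-R31

assert register_type(0) == "global"
assert register_type(12) == "parameter in (from parent)"
assert register_type(20) == "local"
assert register_type(31) == "parameter out (to child)"
```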
All windows consist of 32 addressable registers, R0 to R31. A Berkeley RISC instruction of the form ADD R3,R12,R25 implements [R25] ← [R3] + [R12], where R3 lies within the window's global address space, R12 lies within its import-from (or export-to) parent subroutine space, and R25 lies within its local address space. RISC arithmetic and logical instructions always involve 32-bit values (there are no 8-bit or 16-bit operations).

The Berkeley RISC's subroutine call is CALLR Rd,<address> and is similar to a typical CISC instruction BSR <address>. Whenever a subroutine is invoked by CALLR Rd,<address>, the contents of the window pointer are incremented by 1 and the current value of the program counter is saved in register Rd of the new window. The Berkeley RISC doesn't employ a conventional stack in external main memory to save subroutine return addresses.

Figure 3 Berkeley windowed register sets
Once a new window has been invoked (in Figure 3 this is window i), the new subroutine sees a different set of registers from the previous window. Global registers R0 to R9 are an exception because they are common to all windows. Register R26 of the calling (i.e., parent) subroutine corresponds to (i.e., is the same as) register R10 of the child (i.e., called) subroutine. Suppose you wish to send a parameter to a subroutine. If you place the parameter in R26 and call the subroutine, register R10 in that subroutine will contain the parameter. There hasn't been a physical transfer of data, because register R10 in the new window is simply register R26 in the previous window.

Figure 4 Relationship between register number, window number, and register address
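The aliasing that makes parameter passing free can be modelled in a few lines. The physical layout below is an assumption for illustration (the exact numbering in Figure 4 may differ); the sketch only aims to capture the overlap property and the total register count:

```python
GLOBALS, LOCALS, PARAMS, WINDOWS = 10, 10, 6, 8

def physical(window: int, r: int) -> int:
    """Map (window, register number) to a physical register index.
    Assumed layout: 10 globals first, then each window contributes its
    6 incoming-parameter registers and 10 locals to a circular buffer;
    a window's outgoing registers alias the next window's incoming ones."""
    if r <= 9:                       # globals are shared by every window
        return r
    stride = PARAMS + LOCALS         # 16 registers per window frame
    if r <= 15:                      # R10-R15: parameters from parent
        return GLOBALS + (window * stride + (r - 10)) % (WINDOWS * stride)
    if r <= 25:                      # R16-R25: locals
        return GLOBALS + (window * stride + PARAMS + (r - 16)) % (WINDOWS * stride)
    # R26-R31: parameters to child are the next window's R10-R15
    return physical((window + 1) % WINDOWS, r - 16)

# The overlap: window w's R26..R31 are window w+1's R10..R15.
for w in range(WINDOWS):
    for k in range(6):
        assert physical(w, 26 + k) == physical((w + 1) % WINDOWS, 10 + k)

# Total distinct physical registers: 10 + 8*10 + 8*6 = 138.
assert len({physical(w, r) for w in range(8) for r in range(32)}) == 138
```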
The physical arrangement of the Berkeley RISC's window system is given in Figure 4. On the left-hand side of the diagram is the actual register array that holds all the on-chip general-purpose registers. The eight columns associated with windows 0 to 7 demonstrate how each window is mapped onto the physical register array on the chip and how the overlapping regions are organized. The windows are logically arranged in a circular fashion, so that window 0 follows window 7 and window 7 precedes window 0. For example, if the current window pointer is 3 and you access register R25, location 74 is accessed in the register file. However, if you access register R25 when the window pointer is 7, you access location 137. The total number of physical registers required to implement the Berkeley windowed register set is 10 global + 8 × 10 local + 8 × 6 parameter transfer registers = 138 registers.

Window Overflow

Unfortunately, the total quantity of on-chip resources of any processor is finite and, in the case of the Berkeley RISC, the registers are limited to 8 windows. If subroutines are nested to a depth greater than or equal to 7, window overflow is said to occur, as there is no longer a new window for the next subroutine invocation. When an overflow takes place, the only thing left to do is to employ external memory to hold the overflow data. In practice the oldest window is saved, rather than the new window created by the subroutine just called. If the number of subroutine returns minus the number of subroutine calls exceeds 8, window underflow takes place. Window underflow is the converse of window overflow: the youngest window saved in main store must be restored to an on-chip window.

A considerable amount of research was carried out into dealing with window overflow efficiently. However, the imaginative use of windowed register sets in the Berkeley RISC was not adopted by many of the later RISC architectures.
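The overflow and underflow bookkeeping can be sketched with a toy model. The spill policy (save the oldest window, restore the youngest saved one) is as described above; everything else, including the exact depth at which the first spill occurs, is a simplifying assumption for illustration:

```python
class WindowFile:
    """Toy model of an eight-window register file with spills to main memory."""
    def __init__(self, windows: int = 8):
        self.windows = windows
        self.depth = 0       # current subroutine nesting depth
        self.spilled = 0     # oldest windows currently saved in main memory
        self.overflows = 0
        self.underflows = 0

    def call(self):
        self.depth += 1
        if self.depth - self.spilled > self.windows:
            # window overflow: dump the oldest on-chip window to memory
            self.spilled += 1
            self.overflows += 1

    def ret(self):
        assert self.depth > 0, "return without a matching call"
        self.depth -= 1
        if self.spilled > self.depth:
            # window underflow: restore the youngest saved window
            self.spilled -= 1
            self.underflows += 1

wf = WindowFile()
for _ in range(10):      # nest ten subroutines deep, then unwind completely
    wf.call()
for _ in range(10):
    wf.ret()
print(wf.overflows, wf.underflows)   # 2 2
```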
Modern RISC processors generally have a single set of 32 general-purpose registers.

RISC Architecture and Pipelining

We now describe pipelining, one of the most important techniques for increasing the throughput of a digital system; it exploits the regular structure of a RISC to carry out internal operations in parallel.
Figure 5 illustrates the machine cycle of a hypothetical microprocessor executing an ADD P instruction (i.e., [A] ← [A] + [M(P)]), where A is an on-chip general-purpose register and P is a memory location. The instruction is executed in five phases:

Instruction fetch
Read the instruction from the system memory and increment the program counter.

Instruction decode
Decode the instruction read from memory during the previous phase. The nature of the instruction decode phase depends on the complexity of the instruction encoding. A regularly encoded instruction might be decoded in a few nanoseconds with two levels of gating, whereas a complex instruction format might require ROM-based look-up tables to implement the decoding.

Operand fetch
The operand specified by the instruction is read from the system memory or an on-chip register and loaded into the CPU.

Execute
The operation specified by the instruction is carried out.

Operand store
The result obtained during the execution phase is written into the operand destination. This may be an on-chip register or a location in external memory.

Figure 5 Instruction execution

Each of these five phases may take a specific time (although the time taken would normally be an integer multiple of the system's master clock period). Some instructions require fewer than five phases; for example, CMP R1,R2 compares R1 and R2 by subtracting R1 from R2 to set the condition codes and does not need an operand store phase.

The inefficiency in the arrangement of Figure 5 is immediately apparent. Consider the execution phase of instruction interpretation. This phase might take one fifth of an instruction cycle, leaving the instruction execution unit idle for the remaining 80% of the time. The same applies to the other functional units of the processor, which also lie idle for 80% of the time. A technique called instruction pipelining can be employed to increase the effective speed of the processor by overlapping in time the
various stages in the execution of an instruction. In the simplest terms, a pipelined processor executes instruction i while fetching instruction i + 1 at the same time.

The way in which a RISC processor implements pipelining is described in Figure 6. The RISC processor executes the instruction in four steps or phases: instruction fetch from external memory, operand fetch, execute, and operand store (we're using a 4-stage system because a separate instruction decode phase isn't normally necessary). The internal phases take approximately the same time as the instruction fetch, because these operations take place within the CPU itself and operands are fetched from and stored in the CPU's own register file. Instruction 1 in Figure 6 begins in time slot 1 and is completed at the end of time slot 4.

Figure 6 Pipelining and instruction overlap
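The timing of Figure 6 is easy to tabulate. Assuming an ideal pipeline (one-cycle stages, no bubbles), the first instruction completes after a number of cycles equal to the number of stages, and one instruction completes per cycle thereafter:

```python
def cycles(instructions: int, stages: int, pipelined: bool) -> int:
    """Total cycles needed to complete the given number of instructions."""
    if pipelined:
        # First instruction drains the whole pipeline; after that,
        # one instruction completes in every cycle.
        return stages + instructions - 1
    # Non-pipelined: each instruction occupies all stages in turn.
    return stages * instructions

n = 1000
serial = cycles(n, 4, pipelined=False)     # 4000 cycles
piped = cycles(n, 4, pipelined=True)       # 1003 cycles
print(f"speedup = {serial / piped:.2f}")   # approaches 4 for long runs
```

For long instruction streams the ratio tends to the number of stages, which is the "factor of n" speedup an n-stage pipeline can deliver at best.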
In a non-pipelined processor, the next instruction doesn't begin until the current instruction has been completed. In the pipelined system of Figure 6, the instruction fetch phase of instruction 2 begins in time slot 2, at the same time that the operand is being fetched for instruction 1. In time slot 3, different phases of instructions 1, 2, and 3 are being executed simultaneously. In time slot 4, all functional units of the system are operating in parallel, and an instruction is completed in every time slot thereafter. An n-stage pipeline can increase throughput by up to a factor of n.

Pipeline Bubbles

A pipeline is an ordered structure that thrives on regularity. At any stage in the execution of a program, a pipeline contains components of two or more instructions at varying stages in their execution. Consider Figure 7, in which a sequence of instructions is being executed in a 4-stage pipelined processor. When the processor encounters a branch instruction, the following instruction is no longer found at the next sequential address but at the target address of the branch instruction. The processor is forced to reload its program counter with the value provided by the branch instruction. This means that all the useful work performed by the pipeline must now be thrown away, since the instructions immediately following the branch are not going to be executed. When information in a pipeline is rejected, or the pipeline is held up by the introduction of idle states, we say that a bubble has been introduced.

Figure 7 The pipeline bubble caused by a branch
As we have already stated, program control instructions are very frequent. Consequently, any realistic processor using pipelining must do something to overcome the problem of bubbles caused by instructions that modify the flow of control (branch, subroutine call, and return). The Berkeley RISC reduces the effect of bubbles by refusing to throw away the instruction following a branch. This mechanism is called a delayed jump or branch-and-execute technique, because the instruction immediately after a branch is always executed. Consider the effect of the following sequence of instructions:

ADD R1,R2,R3    [R3] ← [R1] + [R2]
JMPX N          [PC] ← N             Goto address N
ADD R2,R4,R5    [R5] ← [R2] + [R4]   This instruction is executed
ADD R7,R8,R9    Not executed because the branch is taken

The processor calculates R5 := R2 + R4 before executing the branch. This sequence of instructions looks most strange to the eyes of a conventional assembly language programmer, who is not accustomed to seeing an instruction executed after a branch has been taken.
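The delay-slot behaviour can be modelled with a toy interpreter. The (op, arg) encoding here is invented purely for illustration; the point is that a taken branch only redirects the program counter after one more instruction has run:

```python
def run(program, steps=10):
    """Execute (op, arg) pairs with a one-instruction branch delay slot."""
    pc, executed, pending_target = 0, [], None
    while pc < len(program) and steps:
        op, arg = program[pc]
        executed.append((pc, op))
        next_pc = pc + 1
        if pending_target is not None:
            # a branch seen last cycle takes effect now, after its delay slot
            next_pc, pending_target = pending_target, None
        if op == "jmp":
            pending_target = arg   # delayed: one more instruction still runs
        pc = next_pc
        steps -= 1
    return executed

# 0: add, 1: jmp to 4, 2: add (delay slot, still executed),
# 3: add (skipped by the branch), 4: add
prog = [("add", None), ("jmp", 4), ("add", None), ("add", None), ("add", None)]
trace = [pc for pc, _ in run(prog)]
print(trace)   # [0, 1, 2, 4]: instruction 3 is never reached
```

This mirrors the listing above: the instruction after JMPX executes, and the one after that does not.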
Unfortunately, it's not always possible to arrange a program in such a way as to include a useful instruction immediately after a branch. Whenever this happens, the compiler must introduce a no-operation instruction, NOP, after the branch and accept the inevitability of a bubble. Figure 8 demonstrates how a RISC processor implements a delayed jump. The branch described in Figure 8 is a computed branch whose target address is calculated during the execute phase of the instruction cycle.

Figure 8 Delayed branch

Another problem caused by pipelining is data dependency, in which certain sequences of instructions run into trouble because the current operation requires a result from the previous operation and the previous operation has not yet left the pipeline. Figure 9 demonstrates how data dependency occurs.

Figure 9 Data dependency
Suppose a programmer wishes to carry out the apparently harmless calculation X := (A + B) AND (A + B - C). Assuming that A, B, C, X, and two temporary values, T1 and T2, are in registers in the current window, we can write:

ADD A,B,T1     [T1] ← [A] + [B]
SUB T1,C,T2    [T2] ← [T1] - [C]
AND T1,T2,X    [X] ← [T1] ∧ [T2]

Instruction i + 1 in Figure 9 begins execution during the operand fetch phase of the previous instruction. However, instruction i + 1 cannot continue on to its operand fetch phase, because the very operand it requires does not get written back to the register file for another two clock cycles. Consequently, a bubble must be introduced into the pipeline while instruction i + 1 waits for its data. In a similar fashion, the logical AND operation also introduces a bubble, as it too requires the result of a previous operation that is still in the pipeline.

Figure 10 demonstrates a technique called internal forwarding, designed to overcome the effects of data dependency. The following sequence of operations is to be executed:

1. ADD R1,R2,R3    [R3] ← [R1] + [R2]
2. ADD R4,R5,R6    [R6] ← [R4] + [R5]
3. ADD R3,R4,R7    [R7] ← [R3] + [R4]
4. ADD R7,R1,R8    [R8] ← [R7] + [R1]

Figure 10 Internal forwarding
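A forwarding unit (or a scheduling compiler) can spot which instructions need a forwarded result with a simple check. This sketch assumes the simplest hazard rule, distance of exactly one instruction, which is what the four-ADD sequence above exercises:

```python
def forwarding_needed(seq):
    """seq: list of (dest, src1, src2) register names.
    Return the indices of instructions whose source operand is produced
    by the instruction immediately before them (i.e., the result has not
    yet been written back to the register file)."""
    flagged = []
    for i in range(1, len(seq)):
        prev_dest = seq[i - 1][0]
        if prev_dest in seq[i][1:]:
            flagged.append(i)
    return flagged

# The four ADDs of the internal-forwarding example:
#   1. R3 <- R1+R2   2. R6 <- R4+R5   3. R7 <- R3+R4   4. R8 <- R7+R1
seq = [("R3", "R1", "R2"), ("R6", "R4", "R5"),
       ("R7", "R3", "R4"), ("R8", "R7", "R1")]
print(forwarding_needed(seq))   # [3]: only the fourth instruction needs it
```

Instruction 3's use of R3 is two instructions downstream of its producer, so the register file is already up to date; only R7, produced immediately before it is consumed, must be forwarded.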
In this example, instruction 3 (i.e., ADD R3,R4,R7) uses an operand generated by instruction 1 (i.e., the contents of register R3). Because of the intervening instruction 2, the destination operand generated by instruction 1 has time to be written into the register file before it is read as a source operand by instruction 3. Instruction 3, however, generates a destination operand, R7, that is required as a source operand by the next instruction. If the processor were to read the source operand requested by instruction 4 from the register file, it would see the old value of R7. By means of internal forwarding, the processor transfers R7 from instruction 3's execution unit directly to the execution unit of instruction 4 (see Figure 10).

Accessing External Memory in RISC Systems

Conventional CISC processors have a wealth of addressing modes that are used in conjunction with memory reference instructions. For example, the 68020 implements ADD D0,-(A5), which adds the contents of D0 to the top of the stack pointed at by A5 and then pushes the result onto this stack. In their ruthless pursuit of efficiency, the designers of the Berkeley RISC severely restricted the way in which the processor accesses external memory. The Berkeley RISC permits only two types of reference to external memory: a load and
a store. All arithmetic and logical operations carried out by the RISC apply only to source and destination operands in registers. Similarly, the Berkeley RISC provides only a limited number of addressing modes with which to access an operand in the main store. It's not hard to find the reason for these restrictions on external memory accesses: an external memory reference takes longer than an internal operation.

We now discuss some of the general principles of Berkeley RISC load and store instructions. Consider the load register operation of the form LDXW (Rx)S2,Rd, which has the effect [Rd] ← [M([Rx] + S2)]. The operand address is the contents of register Rx plus the offset S2. Figure 11 demonstrates the sequence of actions performed during the execution of this instruction. During the source fetch phase, register Rx is read from the register file and used to calculate the effective address of the operand in the execute phase. However, the processor can't progress beyond the execute phase to the store operand phase, because the operand hasn't yet been read from the main store. Therefore the main store must be accessed to read the operand, and a store operand phase executed to load the operand into the destination register Rd. Because memory accesses introduce bubbles into the pipeline, they are avoided wherever possible.

Figure 11 The load operation
The Berkeley RISC implements two basic addressing modes: indexed and program counter relative. All other addressing modes can (and must) be synthesized from these two primitives. The effective address in the indexed mode is given by:

EA = [Rx] + S2

where Rx is the index register (one of the 32 general-purpose registers accessible by the current subroutine) and S2 is an offset. The offset can be either a general-purpose register or a 13-bit constant. The effective address in the program counter relative mode is given by:

EA = [PC] + S2

where PC represents the contents of the program counter and S2 is an offset as above. These addressing modes provide quite a powerful toolbox: zero, one or two pointers and a constant offset. If you wonder how we can use an addressing mode without an index (i.e., pointer) register, remember that R0 in the global register set permanently contains the constant 0. For example, LDXW (R12)R0,R3 uses simple address register indirect addressing, whereas LDXW (R0)123,R3 uses absolute addressing (i.e., memory location 123).

There's a difference between the addressing modes permitted by load and store operations. A load instruction permits the second source, S2, to be either an immediate value or a second register, whereas a store instruction permits S2 to be a 13-bit immediate value only. This lack of symmetry between the load and store addressing modes arises because a "load base+index" instruction requires a register file with two ports, whereas a "store base+index" instruction requires a register file with three ports. Two-ported memory allows two simultaneous accesses; three-ported memory allows three simultaneous accesses and is harder to design.

Figure 1 defines just two basic Berkeley RISC instruction formats. The short immediate format provides a 5-bit destination, a 5-bit source 1 operand and a 14-bit short source 2 operand.
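The field layout of the short immediate format can be illustrated with a few bit manipulations. In the sketch below, the destination field position (bits 19 to 23) and the role of bit 13 come from the text; placing source 1 in bits 14 to 18 is an assumption made for this illustration only.

```python
# A sketch of decoding the Berkeley RISC short immediate format.
# Bit 13 selects whether source 2 is a 13-bit literal or a 5-bit
# register pointer. The source 1 position (bits 14-18) is assumed.

def decode_short_immediate(word):
    dest = (word >> 19) & 0x1F          # 5-bit destination register
    src1 = (word >> 14) & 0x1F          # 5-bit source 1 register
    if word & (1 << 13):                # bit 13 set: 13-bit literal
        src2 = ("literal", word & 0x1FFF)
    else:                               # bit 13 clear: register pointer
        src2 = ("register", word & 0x1F)
    return dest, src1, src2

# Destination R3, source 1 R12, source 2 = the literal 123
word = (3 << 19) | (12 << 14) | (1 << 13) | 123
print(decode_short_immediate(word))     # -> (3, 12, ('literal', 123))
```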
The short immediate format has two variations: one that specifies a 13-bit literal for source 2 and one that specifies a 5-bit source 2 register address. Bit 13 specifies whether the source 2 operand is a 13-bit literal or a 5-bit register pointer. The long immediate format provides a 19-bit source operand by concatenating the two source operand fields. Thirteen-bit and 19-bit immediate fields may sound a little strange at first sight. However, since 13 + 19 = 32, the Berkeley RISC permits a full
32-bit value to be loaded into a window register in two operations. In the next section we will discover that the ARM processor deals with literals in a different way. A typical CISC microprocessor might take the same number of instruction bits to perform the same action (i.e., a 32-bit operation code field followed by a 32-bit literal).

The following describes some of the addressing modes that can be synthesized from the RISC's basic addressing modes.

1. Absolute addressing: EA = 13-bit offset. Implemented by setting Rx = R0 = 0, S2 = 13-bit constant.
2. Register indirect: EA = [Rx]. Implemented by setting S2 = R0 = 0.
3. Indexed addressing: EA = [Rx] + offset. Implemented by setting S2 = 13-bit constant.
4. Two-dimensional byte addressing (i.e., byte array access): EA = [Rx] + [Ry]. Implemented by setting S2 = [Ry]. This mode is available only for load instructions.

Conditional instructions (i.e., branch operations) do not require a destination address, and therefore the five bits, 19 to 23, normally used to specify a destination register are used to specify the condition (one of 16, since bit 23 is not used by conditional instructions).

Reducing the Branch Penalty

If we're going to reduce the effect of branches on the performance of RISC processors, we need to determine the effect of branch instructions on the performance of the system. Because we cannot know how many branches a given program will contain, or how likely each branch is to be taken, we have to construct a probabilistic model to describe the system's performance. We will make the following assumptions:

1. Each non-branch instruction is executed in one cycle.
2. The probability that a given instruction is a branch is pb.
3. The probability that a branch instruction will be taken is pt.
4. If a branch is taken, the additional penalty is b cycles; if a branch is not taken, there is no penalty.

If pb is the probability that an instruction is a branch, 1 - pb is the probability that it is not a branch. The average number of cycles executed during the execution of a program is the sum of the cycles taken for non-branch instructions, plus the cycles taken by branch instructions that are taken, plus the cycles taken by branch instructions that are not taken. We can derive an expression for the average number of cycles per instruction as:

Tave = (1 - pb) × 1 + pb × pt × (1 + b) + pb × (1 - pt) × 1 = 1 + pb × pt × b.

This expression, 1 + pb × pt × b, tells us that the number of branch instructions, the probability that a branch is taken, and the overhead per branch instruction all contribute to the branch penalty. We are now going to examine some of the ways in which the value of pb × pt × b can be reduced.

Branch Prediction

If we can predict the outcome of the branch instruction before it is executed, we can start filling the pipeline with instructions from the branch target address (assuming the branch is going to be taken). For example, if the instruction is BRA N, the processor can start fetching instructions at locations N, N + 1, N + 2, etc., as soon as the branch instruction is fetched from memory. In this way, the pipeline is always filled with useful instructions. This prediction mechanism works well with an unconditional branch like BRA N. Unfortunately, conditional branches pose a problem. Consider a conditional branch of the form BCC N (branch to N on carry bit clear). Should the RISC processor assume that the branch will not be taken and fetch instructions in sequence, or should it assume that the branch will be taken and fetch instructions at the branch target address N?
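The cost model derived above is easy to check numerically. The sketch below evaluates both the full expression and its simplification; the probability and penalty values are assumptions chosen only for the example.

```python
# Average cycles per instruction with a branch penalty:
# Tave = (1 - pb)*1 + pb*pt*(1 + b) + pb*(1 - pt)*1 = 1 + pb*pt*b

def t_ave_full(pb, pt, b):
    return (1 - pb) * 1 + pb * pt * (1 + b) + pb * (1 - pt) * 1

def t_ave_simplified(pb, pt, b):
    return 1 + pb * pt * b

# Example: 20% of instructions are branches, 60% of branches are
# taken, and a taken branch costs 3 extra cycles.
print(t_ave_full(0.2, 0.6, 3))          # -> 1.36
print(t_ave_simplified(0.2, 0.6, 3))    # -> 1.36

# The two forms agree for any choice of pb, pt and b.
for pb in (0.1, 0.2, 0.3):
    for pt in (0.4, 0.6, 0.9):
        assert abs(t_ave_full(pb, pt, 2) - t_ave_simplified(pb, pt, 2)) < 1e-12
```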
As we have already said, conditional branches are required to implement various types of high-level language construct. Consider the following fragment of high-level language code.

if (J < K) I = I + L;
for (T = 1; T <= I; T++)
{
  .
  .
}

The first conditional operation compares J with K. Only the nature of the problem will tell us whether J is often less than K. The second conditional in this fragment of code is provided by the for construct that tests a counter at the end of the loop and then decides whether to jump back to the body of the construct or to terminate the loop. In this case, you could bet that the loop is more likely to be repeated than exited; loops can be executed thousands of times before they are exited. Some computers look at the type of conditional branch and then either fill the pipeline from the branch target if they predict that the branch will be taken, or fill the pipeline from the instruction after the branch if they predict that it will not be taken. If we attempt to predict the behavior of a system with two outcomes (branch taken or branch not taken), there are four possibilities:

1. Predict branch taken, branch taken: successful outcome.
2. Predict branch taken, branch not taken: unsuccessful outcome.
3. Predict branch not taken, branch not taken: successful outcome.
4. Predict branch not taken, branch taken: unsuccessful outcome.

Suppose we apply a branch penalty to each of these four possible outcomes. The penalty is the number of cycles taken by that particular outcome, as Table 3 demonstrates. For example, if we think that a branch will not be taken and fetch the instructions following the branch, and the branch is actually taken (forcing the pipeline to be loaded with instructions at the target address), the branch penalty in Table 3 is c cycles.

Table 3 The branch penalty

Prediction        | Result           | Branch penalty
Branch taken      | Branch taken     | a
Branch taken      | Branch not taken | b
Branch not taken  | Branch taken     | c
Branch not taken  | Branch not taken | d

We can now calculate the average penalty for a particular system. To do this we need more information about the system. The first thing we need to know is the probability
that an instruction will be a branch (as opposed to any other category of instruction). Assume that the probability that an instruction is a branch is pb. The next thing we need to know is the probability that the branch instruction will be taken, pt. Finally, we need to know the accuracy of the prediction. Let pc be the probability that a branch prediction is correct. These values can be obtained by observing the performance of real programs. Figure 12 illustrates all the possible outcomes of an instruction. We can immediately write:

(1 - pb) = probability that an instruction is not a branch
(1 - pt) = probability that a branch will not be taken
(1 - pc) = probability that a prediction is incorrect

These equations are obtained by using the principle that if one event or another must take place, their probabilities must add up to unity. The average branch penalty per branch instruction is therefore:

Cave = a × p(branch predicted taken and taken)
     + b × p(branch predicted taken but not taken)
     + c × p(branch predicted not taken but taken)
     + d × p(branch predicted not taken and not taken)

Cave = a × pt × pc + b × (1 - pt) × (1 - pc) + c × pt × (1 - pc) + d × (1 - pt) × pc

Figure 12 Branch prediction
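The general expression for Cave can be checked numerically. In the sketch below, the penalty and probability values are arbitrary examples rather than measurements; note that when a = d and b = c the result depends only on pc, which is the simplification derived next.

```python
# Average penalty per branch instruction (penalties a, b, c, d as in
# Table 3, pt = probability taken, pc = probability prediction correct):
# Cave = a*pt*pc + b*(1-pt)*(1-pc) + c*pt*(1-pc) + d*(1-pt)*pc

def c_ave(a, b, c, d, pt, pc):
    return (a * pt * pc
            + b * (1 - pt) * (1 - pc)
            + c * pt * (1 - pc)
            + d * (1 - pt) * pc)

# With a = d = N (correct prediction) and b = c = B (misprediction),
# the expression collapses to N*pc + B*(1 - pc), independent of pt.
N, B, pt, pc = 1, 4, 0.6, 0.9
print(c_ave(N, B, B, N, pt, pc))    # -> 1.3 (to rounding)
print(N * pc + B * (1 - pc))        # -> 1.3 (to rounding)
```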
The average number of cycles added due to a branch instruction is:

Cave × pb = pb × (a × pt × pc + b × (1 - pt) × (1 - pc) + c × pt × (1 - pc) + d × (1 - pt) × pc).

We can make two assumptions to help us to simplify this general expression. The first is that a = d = N (i.e., if the prediction is correct, the number of cycles is N). The other simplification is that b = c = B (i.e., if the prediction is wrong, the number of cycles is B). The average number of cycles per branch instruction is therefore:

pb × (N × pt × pc + B × pt × (1 - pc) + B × (1 - pt) × (1 - pc) + N × (1 - pt) × pc) = pb × (N × pc + B × (1 - pc)).

This formula can be used to investigate tradeoffs between branch penalties, branch probabilities and pipeline length. There are several ways of implementing branch prediction (i.e., increasing the value of pc). Two basic approaches are static branch
prediction and dynamic branch prediction. Static branch prediction makes the assumption that branches are always taken or never taken. Since observations of real code have demonstrated that branches have a greater than 50% chance of being taken, the best static branch prediction mechanism is to fetch the next instruction from the branch target address as soon as the branch instruction is detected. A better method of predicting the outcome of a branch is to observe its op-code, because some branch instructions are taken more or less frequently than other branch instructions. Using the branch op-code to predict that the branch will or will not be taken results in 75% accuracy. An extension of this technique is to devote a bit of the op-code to the static prediction of branches. This bit is set or cleared by the compiler, depending on whether the compiler estimates that the branch is most likely to be taken. This technique provides branch prediction accuracy in the range 74 to 94%.

Dynamic branch prediction techniques operate at runtime and use the past behavior of the program to predict its future behavior. Suppose the processor maintains a table of branch instructions. This branch table contains information about the likely behavior of each branch. Each time a branch is executed, its outcome (i.e., taken or not taken) is used to update the entry in the table. The processor uses the table to determine whether to take the next instruction from the branch target address (i.e., branch predicted taken) or from the next address in sequence (branch predicted not taken). Single-bit branch predictors provide an accuracy of over 80 percent, and five-bit predictors provide an accuracy of up to 98 percent. A typical branch prediction algorithm uses the last two outcomes of a branch to predict its future. If the last two outcomes are X, the next branch is assumed to lead to outcome X.
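One common realization of this two-outcome scheme is a two-bit saturating counter held in each branch-table entry: a single misprediction nudges the counter, but only two consecutive mispredictions flip the prediction. A minimal sketch follows; the class name and starting state are illustrative choices, not taken from any particular processor.

```python
# A sketch of a two-bit dynamic branch predictor: states 0-1 predict
# "not taken", states 2-3 predict "taken", and the prediction only
# changes after two consecutive mispredictions.

class TwoBitPredictor:
    def __init__(self, state=3):       # start strongly "taken"
        self.state = state

    def predict(self):
        return self.state >= 2         # True means "branch taken"

    def update(self, taken):           # move toward the actual outcome
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor()
outcomes = [True, True, False, True, True, True]   # one odd outcome
hits = 0
for taken in outcomes:
    hits += (p.predict() == taken)
    p.update(taken)
print(hits)   # -> 5: the single "not taken" blip costs one miss
              #       but does not flip the prediction
```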
If the prediction is wrong, it remains the same the next time the branch is executed (i.e., two failures are needed to modify the prediction). After two consecutive failures, the prediction is inverted and the other outcome assumed. This algorithm responds to trends and is not affected by the occasional single different outcome.

Problems

1. What are the characteristics of a CISC processor?
2. The most frequently executed class of instruction is the data move instruction. Why is this?
3. The Berkeley RISC has a 32-bit architecture and yet provides only a 13-bit literal. Why is this, and does it really matter?
4. What are the advantages and disadvantages of register windowing?
5. What is pipelining and how does it increase the performance of a computer?
6. A pipeline is defined by its length (i.e., the number of stages that can operate in parallel). A pipeline can be short or long. What do you think are the relative advantages of long and short pipelines?
7. What is data dependency in a pipelined system and how can its effects be overcome?
8. RISC architectures don't permit operations on operands in memory other than load and store operations. Why?
9. The average number of cycles required by a RISC to execute an instruction is given by Tave = 1 + pb × pt × b, where:
   the probability that a given instruction is a branch is pb;
   the probability that a branch instruction will be taken is pt;
   if a branch is taken, the additional penalty is b cycles;
   if a branch is not taken, there is no penalty.
   Draw a series of graphs of the average number of cycles per instruction as a function of pb × pt for b = 1, 2, 3, and 4.
10. What is branch prediction and how can it be used to reduce the so-called branch penalty in a pipelined system?