An Introduction to RISC
Processors
We are going to describe how microprocessor manufacturers took a new look at
processor architectures in the 1980s and started designing simpler but faster
processors. We begin by explaining why chip designers turned their backs on
conventional complex instruction set computers (CISCs) such as the 68K and the Intel
80x86 families and started producing reduced instruction set computers (RISCs) such as
MIPS and the PowerPC. RISC processors have simpler instruction sets than CISC
processors (although this is a rather crude distinction between these families, as we
shall soon see).

By the mid-1990s many of these so-called RISC processors were considerably more
complex than some of the CISCs they replaced. That isn't a paradox. The RISC
processor isn't really a cut-down computer architecture—it represents a new approach
to architecture design. In fact, the distinction between CISC and RISC is now so
blurred that virtually all processors now have both RISC and CISC features.

The RISC Revolution
Before we look at the ARM, we describe the history and characteristics of RISC
architecture. From the introduction of the microprocessor in the 1970s to the
mid-1980s there was an almost unbroken trend towards more and more
complex (you might even say Baroque) architectures. Some of these architectures
developed rather like a snowball rolling downhill. Each advance in chip fabrication
technology allowed designers to add more and more layers to the microprocessor's
central core. Intel's 8086 family illustrates this trend particularly well, because Intel
took their original 16-bit processor and added more features in each successive
generation. This approach to chip design leads to cumbersome architectures and
inefficient instruction sets, but it has the tremendous commercial advantage that end
users don't have to pay for new software when they buy the latest reincarnation of a
microprocessor.

A reaction against the trend toward greater architectural complexity began at IBM
with their 801 architecture and continued at Berkeley where Patterson and Ditzel
coined the term RISC to describe a new class of architectures that reversed earlier
trends in microcomputer design. According to popular wisdom RISC architectures are
streamlined versions of traditional complex instruction set computers. This notion is
both misleading and dangerous, because it implies that RISC processors are in some
way cruder versions of existing architectures. In brief, RISC architectures re-deploy to
better effect some of the silicon real estate used to implement complex instructions
and elaborate addressing modes in conventional microprocessors of the 68000 and
8086 generation. The mnemonic "RISC" should really stand for regular instruction set
computer.

Two factors influencing the architecture of first- and second-generation
microprocessors were microprogramming and the desire to help compiler writers by
providing ever more complex instruction sets. The latter is called closing the semantic
gap (i.e., reducing the difference between high-level and low-level languages). By
complex instructions we mean instructions like MOVE 12(A3,D0),D2 and ADD -(A6),D3
that carry out multi-step operations in a single machine-level instruction. The
instruction MOVE 12(A3,D0),D2 generates an effective address by adding the contents
of A3 to the contents of D0 plus the literal 12. The resulting address is used to
access the source operand that is loaded into register D2.
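As a sketch of the address arithmetic involved (this is not a 68000 emulator, and the register values below are invented for illustration):

```python
# Effective-address calculation for MOVE 12(A3,D0),D2:
# EA = [A3] + [D0] + 12 (address register indirect with index
# and displacement).

def effective_address(a3: int, d0: int, displacement: int = 12) -> int:
    """Return the source operand address for 12(A3,D0)."""
    return a3 + d0 + displacement

# With the assumed values A3 = 0x1000 and D0 = 0x20, the source
# operand is read from 0x1000 + 0x20 + 12 = 0x102C.
ea = effective_address(0x1000, 0x20)
```

The operand at that address is then loaded into D2; the CISC processor performs this whole sequence as a single instruction.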

Microprogramming achieved its highpoint in the 1970s when ferrite core memory had
a long access time of 1 µs or more and semiconductor high-speed random access
memory was very expensive. Quite naturally, computer designers used the slow main
store to hold the complex instructions that made up the machine-level program. These
machine-level instructions are interpreted by microcode in the much faster
microprogram control store within the CPU. Today, main stores use semiconductor
memory with an access time of 50 ns or less, and most of the advantages of
microprogramming have evaporated. Indeed, the goal of a RISC architecture is to
execute an instruction in a single machine cycle. A corollary of this statement is that
complex instructions can't be executed by RISC architectures. Before we look at RISC
architectures, we have to describe some of the research that led to the search for better
architectures.

Instruction Usage
Computer scientists carried out extensive research over a decade or more in the late
1970s into the way in which computers execute programs. Their studies demonstrated
that the relative frequency with which different classes of instructions are executed is
not uniform and that some types of instruction are executed far more frequently than
other types. Fairclough divided machine-level instructions into eight groups according
to type and compiled the statistics shown in Table 1. The "mean value of instruction
use" gives the percentage of times that instructions in that group are executed
averaged over both program types and computer architecture. These figures relate to
early 8-bit processors.
Table 1 Instruction usage as a function of instruction type

Instruction Group             1           2        3          4        5        6          7      8
Mean value of instruction use 45.28       28.73    10.75      5.92     3.91     2.93       2.05   0.4

These eight instruction groups in Table 1 are:

1. Data movement
2. Program flow control (i.e., branch, call, return)
3. Arithmetic
4. Compare
5. Logical
6. Shift
7. Bit manipulation
8. Input/output and miscellaneous

Table 1 convincingly demonstrates that the most common instruction type is the data
movement primitive of the form P := Q in a high-level language or MOVE P,Q in a
low-level language. Similarly, the program flow control group that includes both
conditional and unconditional branches (together with subroutine calls and returns)
forms the second most common group of instructions. Taken together, the data
movement and program flow control groups account for 74% of all instructions. A
corollary of this statement is that we can expect a large program to contain only 26%
of instructions that are not data movement or program flow control primitives.

An inescapable inference from such results is that processor designers might be better
employed devoting their time to optimizing the way in which machines handle
instructions in groups one and two, than in seeking new powerful instructions that are
seldom used. In the early days of the microprocessor, chip manufacturers went out of
their way to provide special instructions that were unique to their products. These
instructions were then heavily promoted by the company's sales force. Today, we can
see that their efforts should have been directed towards the goal of optimizing the
most frequently used instructions. RISC architectures have been designed to exploit
the programming environment in which most instructions are data movement or
program control instructions.

Another aspect of computer architecture that was investigated was the optimum size
of literal operands (i.e., constants). Tanenbaum reported the remarkable result that
56% of all constant values lie in the range -15 to +15 and that 98% of all constants lie
in the range -511 to +511. Consequently, the inclusion of a 5-bit constant field in an
instruction would cover over half the occurrences of a literal. RISC architectures have
sufficiently long instruction lengths to include a literal field as part of the instruction
that caters for the majority of literals.
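To see why a short literal field covers most cases, a signed n-bit field can be tested as follows (a minimal sketch; the helper name is ours, and the field widths come from the figures above):

```python
def fits_signed_field(value: int, bits: int) -> bool:
    """True if value fits an n-bit two's-complement field,
    i.e. -2**(n-1) <= value <= 2**(n-1) - 1."""
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1

# A 5-bit field covers -16..+15, enough for the range -15..+15 that
# accounts for 56% of constants; the Berkeley RISC's 13-bit field
# covers -4096..+4095, far more than the -511..+511 range quoted.
```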

Programs use subroutines heavily, and an effective architecture should optimize the
way in which subroutines are called, parameters passed to and from subroutines, and
workspace allocated to local variables created by subroutines. Research showed that
in 95% of cases twelve words of storage are sufficient for parameter passing and local
storage. A computer with twelve registers should be able to handle all the operands
required by most subroutines without accessing main store. Such an arrangement
would reduce the processor-memory bus traffic associated with subroutine calls.

Characteristics of RISC Architectures
Having described the ingredients that go into an efficient architecture, we now look at
the attributes of first generation RISCs before covering RISC architectures in more
detail. The characteristics of an efficient RISC architecture are:

1. RISC processors have sufficient on-chip registers to overcome the worst effects of
the processor-memory bottleneck. Registers can be accessed more rapidly than off-chip
main store. Although today's processors rely heavily on fast on-chip cache memory to
increase throughput, registers still offer the highest performance.

2. RISC processors have three-address, register-to-register architectures with
instructions in the form OPERATION Ra,Rb,Rc, where Ra, Rb, and Rc are
general-purpose registers.

3. Because subroutine calls are so frequently executed, (some) RISC architectures
make provision for the efficient passing of parameters between subroutines.

4. Instructions that modify the flow of control (e.g., branch instructions) are
implemented efficiently because they comprise about 20 to 30% of a typical program.

5. RISC processors aim to execute one instruction per clock cycle. This goal imposes
a limit on the maximum complexity of instructions.

6. RISC processors don't attempt to implement infrequently used instructions. Complex
instructions waste silicon real-estate and conflict with the requirements of point 8.
Moreover, the inclusion of complex instructions increases the time taken to design,
fabricate, and test a processor.

7. A corollary of point 5 is that an efficient architecture should not be
microprogrammed, because microprogramming interprets a machine-level instruction
by executing microinstructions. In the limit, a RISC processor is close to a
microprogrammed architecture in which the distinction between machine code and
microcode has vanished.

8. An efficient processor should have a single instruction format (or at least very
few formats). A typical CISC processor such as the 68000 has variable-length
instructions (e.g., from 2 to 10 bytes). By providing a single instruction format,
the decoding of a RISC instruction into its component fields can be performed by a
minimum level of decoding logic. It follows that a RISC's instruction length should
be sufficient to accommodate the operation code field and one or more operand fields.
Consequently, a RISC processor may not utilize memory space as efficiently as does a
conventional CISC microprocessor.

Two fundamental aspects of the RISC architecture that we cover later are its register
set and the use of pipelining. Multiple overlapping register windows were
implemented by the Berkeley RISC to reduce the overhead incurred by transferring
parameters between subroutines. Pipelining is a mechanism that permits the
overlapping of instruction execution (i.e., internal operations are carried out in
parallel). Many of the features of RISC processors are not new, and have been
employed long before the advent of the microprocessor. The RISC revolution
happened when all these performance-enhancing techniques were brought together
and applied to microprocessor design.

The Berkeley RISC
Although many CISC processors were designed by semiconductor manufacturers, one
of the first RISC processors came from the University of California at Berkeley. The
Berkeley RISC wasn't a commercial machine, although it had a tremendous impact on
the development of later RISC architectures. Figure 1 describes the format of a
Berkeley RISC instruction. Each of the 5-bit operand fields (Destination, Source 1,
Source 2) permits one of 32 internal registers to be accessed.

Figure 1 Format of the Berkeley RISC instruction
The single-bit set condition code field, Scc, determines whether the condition code
bits are updated after the execution of an instruction. The 14-bit Source 2 field has
two functions. If the IM bit (immediate) is 0, the Source 2 field specifies one of 32
registers. If the IM bit is 1, the Source 2 field provides a 13-bit literal operand.
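The IM-bit selection can be sketched as a decoder for the 14-bit Source 2 field. The exact bit positions inside the instruction word are an assumption here; only the IM-bit behaviour described above is taken from the text:

```python
def decode_source2(field14: int):
    """Decode a 14-bit Source 2 field: bit 13 is IM, bits 0-12 the rest."""
    im = (field14 >> 13) & 1
    low13 = field14 & 0x1FFF
    if im == 0:
        return ("register", low13 & 0x1F)          # one of 32 registers
    # IM = 1: sign-extend the 13-bit literal
    literal = low13 - 0x2000 if low13 & 0x1000 else low13
    return ("literal", literal)
```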

Since five bits are allocated to each operand field, you might conclude that this
RISC has 2^5 = 32 internal registers. In fact, the Berkeley RISC has 138
user-accessible general-purpose internal registers. The discrepancy between the
number of registers directly addressable and the actual number of registers is
explained by a mechanism called windowing that gives the programmer a view of only
a subset of all registers at any instant. Register R0 is
hardwired to contain the constant zero. Specifying R0 as an operand is the same as
specifying the constant 0.

Register Windows
An important feature of the Berkeley RISC architecture is the way in which it
allocates new registers to subroutines; that is, when you call a subroutine, you get
some new registers. If you can create 12 registers out of thin air when you call a
subroutine, each subroutine will have its own workspace for temporary variables,
thereby avoiding relatively slow accesses to main store.
Although only 12 or so registers are required by each invocation of a subroutine, the
successive nesting of subroutines rapidly increases the total number of on-chip
registers assigned to subroutines. You might think that any attempt to dedicate a set of
registers to each new procedure is impractical, because the repeated calling of nested
subroutines will require an unlimited amount of storage. Subroutines can indeed be
nested to any depth, but research has demonstrated that on average subroutines are not
nested to any great depth over short periods. Consequently, it is feasible to adopt a
modest number of local register sets for a sequence of nested subroutines.

Figure 2 provides a graphical representation of the execution of a typical program in
terms of the depth of nesting of subroutines as a function of time. The trace goes up
each time a subroutine is called and down each time a return is made. If subroutines
were never called, the trace would be a horizontal line. This figure demonstrates
that even though subroutines may be nested to considerable depths, there are periods
or runs of subroutine calls and returns that do not require a nesting level of greater
than about five.

Figure 2 Depth of subroutine nesting as a function of time
A mechanism for implementing local variable work space for subroutines adopted by
the designers of the Berkeley RISC is to support up to eight nested subroutines by
providing on-chip work space for each subroutine. Any further nesting forces the CPU
to dump registers to main memory, as we shall soon see.

Memory space used by subroutines can be divided into four types:

Global space Global space is directly accessible by all subroutines and holds
constants and data that may be required from any point within the program. Most
conventional microprocessors have only global registers.
Local space Local space is private to the subroutine. That is, no other subroutine can
access the current subroutine's local address space from outside the subroutine. Local
space is employed as working space by the current subroutine.

Imported parameter space Imported parameter space holds the parameters imported
by the current subroutine from its parent that called it. In Berkeley RISC terminology
these are called the high registers.

Exported parameter space Exported parameter space holds the parameters exported
by the current subroutine to its child. In RISC terminology these are called the low
registers.




Windows and Parameter Passing
One of the reasons for the high frequency of data movement operations is the need to
pass parameters to subroutines and to receive them from subroutines.

The Berkeley RISC architecture deals with parameter passing by means of multiple
overlapped windows. A window is the set of registers visible to the current subroutine.
Figure 3 illustrates the structure of the Berkeley RISC's overlapping windows. Only
three consecutive windows (i-1, i, i+1) of the 8 windows are shown in Figure 3. The
vertical columns represent the registers seen by the corresponding window. Each
window sees 32 registers, but they aren't all the same 32 registers.

The Berkeley RISC has a special-purpose register called the window pointer, WP, that
indicates the current active window. Suppose that the processor is currently using
the ith window set. In this case the WP contains the value i. The registers in each of
the 8 windows are divided into four groups shown in Table 2.

Table 2 Berkeley RISC register types

Register name           Register type
R0 to R9                The global register set is always accessible.
R10 to R15              Six registers used by the subroutine to receive parameters from its parent and to return results to that parent.
R16 to R25              Ten local registers accessed only by the current subroutine; they cannot be accessed by any other subroutine.
R26 to R31              Six registers used by the subroutine to pass parameters to and from its own child (i.e., any subroutine called by itself).
All windows consist of 32 addressable registers, R0 to R31. A Berkeley RISC
instruction of the form ADD R3,R12,R25 implements [R25] ← [R3] + [R12], where
R3 lies within the window's global address space, R12 lies within its import from (or
export to) parent subroutine space, and R25 lies within its local address space. RISC
arithmetic and logical instructions always involve 32-bit values (there are no 8-bit or
16-bit operations).

The Berkeley RISC's subroutine call is CALLR Rd,<address> and is similar to a
typical CISC instruction BSR <address>. Whenever a subroutine is invoked
by CALLR Rd,<address>, the contents of the window pointer are incremented by
1 and the current value of the program counter is saved in register Rd of the new
window. The Berkeley RISC doesn't employ a conventional stack in external main
memory to save subroutine return addresses.

Figure 3 Berkeley windowed register sets
Once a new window has been invoked (in Figure 3 this is window i), the new
subroutine sees a different set of registers to the previous window. Global registers R0
to R9 are an exception because they are common to all windows. Register R10 of the
child (i.e., called) subroutine corresponds to (i.e., is the same as) register R26 of the
calling (i.e., parent) subroutine. Suppose you wish to send a parameter to a subroutine.
If the parameter is in R26 and you call a subroutine, register R10 in that subroutine
will contain the parameter. There hasn't been a physical transfer of data because
register R10 in the current window is simply register R26 in the previous window.

Figure 4 Relationship between register number, window number, and register address
The physical arrangement of the Berkeley RISC's window system is given in Figure 4.
On the left hand side of the diagram is the actual register array that holds all the on-
chip general-purpose registers. The eight columns associated with windows 0 to 7
demonstrate how each window is mapped onto the physical memory array on the chip
and how the overlapping regions are organized. The windows are logically arranged
in a circular fashion so that window 0 follows window 7 and window 7 precedes
window 0. For example, if the current window pointer is 3 and you access register
R25, location 74 is accessed in the register file. However, if you access register R25
when the window pointer is 7, you access location 137.

The total number of physical registers required to implement the Berkeley windowed
register set is:

10 global + 8 x 10 local + 8 x 6 parameter transfer registers = 138 registers.
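This arithmetic can be checked with a sketch of the logical-to-physical register mapping. The stride of 16 physical registers per window and the circular wrap are reconstructed from the description above, so the layout details are assumptions rather than the documented Berkeley numbering:

```python
N_GLOBALS, N_WINDOWS, STRIDE = 10, 8, 16
N_PHYSICAL = N_GLOBALS + N_WINDOWS * STRIDE        # 10 + 128 = 138

def physical_register(window: int, reg: int) -> int:
    """Map register R<reg>, as seen from <window>, to a physical register."""
    if reg < N_GLOBALS:                            # R0-R9: shared globals
        return reg
    offset = (window * STRIDE + (reg - N_GLOBALS)) % (N_WINDOWS * STRIDE)
    return N_GLOBALS + offset

# The overlap falls out of the mapping: the caller's low registers
# (R26-R31) are physically the callee's high registers (R10-R15).
assert physical_register(3, 26) == physical_register(4, 10)
```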

Window Overflow

Unfortunately, the total quantity of on-chip resources of any processor is finite and, in
the case of the Berkeley RISC, the registers are limited to 8 windows. If subroutines
are nested to a depth greater than or equal to 7, window overflow is said to occur, as
there is no longer a new window for the next subroutine invocation. When an
overflow takes place, the only thing left to do is to employ external memory to hold
the overflow data. In practice the oldest window is saved rather than the new window
created by the subroutine just called.

If, after a window overflow, subroutine returns reduce the nesting depth to the point
at which the current window is no longer on the chip, window underflow takes place.
Window underflow is the converse of window overflow, and the youngest window saved in
main store must be restored to an on-chip window.
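A toy model of the overflow/underflow bookkeeping, under the assumption that one of the eight windows is always kept free (the real processor uses traps and a handler to do the saving and restoring):

```python
N_WINDOWS = 8
ON_CHIP_CAPACITY = N_WINDOWS - 1   # assumption: one window kept free

class WindowFile:
    """Toy model of window overflow/underflow bookkeeping."""
    def __init__(self):
        self.depth = 0        # current subroutine nesting depth
        self.saved = 0        # windows currently spilled to main memory
        self.overflows = 0
        self.underflows = 0

    def call(self):
        self.depth += 1
        if self.depth - self.saved > ON_CHIP_CAPACITY:
            self.saved += 1          # spill the oldest window to memory
            self.overflows += 1

    def ret(self):
        self.depth -= 1
        if self.depth > 0 and self.depth == self.saved:
            self.saved -= 1          # restore the youngest spilled window
            self.underflows += 1
```

Nesting nine calls deep spills two windows; unwinding all the way back restores both.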

A considerable amount of research was carried out into dealing with window overflow
efficiently. However, the imaginative use of windowed register sets in the Berkeley
RISC was not adopted by many of the later RISC architectures. Modern RISC
processors generally have a single set of 32 general-purpose registers.

RISC Architecture and Pipelining
We now describe pipelining, one of the most important techniques for increasing the
throughput of a digital system that uses the regular structure of a RISC to carry out
internal operations in parallel.
Figure 5 illustrates the machine cycle of a hypothetical microprocessor executing an
ADD P instruction (i.e., [A] ← [A] + [M(P)]), where A is an on-chip general-purpose
register and P is a memory location. The instruction is executed in five phases:

Instruction fetch Read the instruction from the system memory and increment the
program counter.

Instruction decode Decode the instruction read from memory during the previous
phase. The nature of the instruction decode phase is dependent on the complexity of
the instruction encoding. A regularly encoded instruction might be decoded in a few
nanoseconds with two levels of gating whereas a complex instruction format might
require ROM-based look-up tables to implement the decoding.

Operand fetch The operand specified by the instruction is read from the system
memory or an on-chip register and loaded into the CPU.

Execute The operation specified by the instruction is carried out.

Operand store The result obtained during the execution phase is written into the
operand destination. This may be an on-chip register or a location in external memory.

Figure 5 Instruction Execution




Each of these five phases may take a specific time (although the time taken would
normally be an integer multiple of the system's master clock period). Some
instructions require less than five phases; for example, CMP R1,R2 compares R1 and
R2 by subtracting R1 from R2 to set the condition codes and does not need an operand
store phase.

The inefficiency in the arrangement of Figure 5 is immediately apparent. Consider
the execution phase of instruction interpretation. This phase might take one fifth of an
instruction cycle leaving the instruction execution unit idle for the remaining 80% of
the time. The same rule applies to the other functional units of the processor, which
also lie idle for 80% of the time. A technique called instruction pipelining can be
employed to increase the effective speed of the processor by overlapping in time the
various stages in the execution of an instruction. In the simplest of terms, a pipelined
processor executes instruction i while fetching instruction i + 1 at the same time.

The way in which a RISC processor implements pipelining is described in Figure 6.
The RISC processor executes the instruction in four steps or phases: instruction fetch
from external memory, operand fetch, execute, and operand store (we're using a 4-
stage system because a separate "instruction decode" phase isn't normally necessary).
The internal phases take approximately the same time as the instruction fetch, because
these operations take place within the CPU itself and operands are fetched from and
stored in the CPU's own register file. Instruction 1 in Figure 6 begins in time slot 1
and is completed at the end of time slot 4.

Figure 6 Pipelining and instruction overlap
In a non-pipelined processor, the next instruction doesn't begin until the current
instruction has been completed. In the pipelined system of Figure 6, the instruction
fetch phase of instruction 2 begins in time slot 2, at the same time that the operand is
being fetched for instruction 1. In time slot 3, different phases of instructions 1, 2, and
3 are being executed simultaneously. In time slot 4, all functional units of the system
are operating in parallel and an instruction is completed in every time slot thereafter.
An n-stage pipeline can increase throughput by up to a factor of n.
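The cycle counts behind this claim can be sketched directly, assuming a perfectly regular pipeline with no stalls or bubbles:

```python
def cycles(n_instructions: int, k_stages: int, pipelined: bool) -> int:
    """Total cycles to run a straight-line program with no stalls."""
    if pipelined:
        # k cycles to fill the pipeline, then one completion per cycle
        return k_stages + (n_instructions - 1)
    return k_stages * n_instructions

# With k = 4 stages and 1000 instructions: 4000 cycles unpipelined
# versus 1003 pipelined, a speedup approaching the factor of 4.
```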

Pipeline Bubbles
A pipeline is an ordered structure that thrives on regularity. At any stage in the
execution of a program, a pipeline contains components of two or more instructions at
varying stages in their execution. Consider Figure 7 in which a sequence of
instructions is being executed in a 4-stage pipelined processor. When the processor
encounters a branch instruction, the following instruction is no longer found at the
next sequential address but at the target address in the branch instruction. The
processor is forced to reload its program counter with the value provided by the
branch instruction. This means that all the useful work performed by the pipeline must
now be thrown away, since the instructions immediately following the branch are not
going to be executed.

When information in a pipeline is rejected or the pipeline is held up by the
introduction of idle states, we say that a bubble has been introduced.

Figure 7 The pipeline bubble caused by a branch
As we have already stated, program control instructions are very frequent.
Consequently, any realistic processor using pipelining must do something to
overcome the problem of bubbles caused by instructions that modify the flow of
control (branch, subroutine call and return). The Berkeley RISC reduces the effect of
bubbles by refusing to throw away the instruction following a branch. This
mechanism is called a delayed jump or a branch-and-execute technique because the
instruction immediately after a branch is always executed. Consider the effect of the
following sequence of instructions:

ADD   R1,R2,R3       [R3] ← [R1] + [R2]
JMPX  N              [PC] ← N            Goto address N
ADD   R2,R4,R5       [R5] ← [R2] + [R4]  This is executed
ADD   R7,R8,R9       Not executed because the branch is taken

The processor calculates R5 := R2 + R4 before executing the branch. This sequence of
instructions is most strange to the eyes of a conventional assembly language
programmer, who is not accustomed to seeing an instruction executed after a branch
has been taken.
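The sequence above can be mimicked with a toy interpreter for a one-instruction delay slot; the tuple encoding and branch target are invented for illustration and carry none of the real RISC semantics:

```python
def run(program):
    """Execute a list of (label, is_branch, target) tuples, always
    running the instruction in the slot after a taken branch."""
    executed, pc = [], 0
    while 0 <= pc < len(program):
        label, is_branch, target = program[pc]
        executed.append(label)
        if is_branch:
            delay_label, _, _ = program[pc + 1]   # delay slot: always runs
            executed.append(delay_label)
            pc = target
        else:
            pc += 1
    return executed

prog = [
    ("ADD R1,R2,R3", False, None),
    ("JMPX N",       True,  4),      # branch to index 4 (label N)
    ("ADD R2,R4,R5", False, None),   # delay slot: executed anyway
    ("ADD R7,R8,R9", False, None),   # skipped: the branch is taken
    ("N: ...",       False, None),
]
```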
Unfortunately, it's not always possible to arrange a program in such a way as to
include a useful instruction immediately after a branch. Whenever this happens, the
compiler must introduce a no operation instruction, NOP, after the branch and accept
the inevitability of a bubble. Figure 8 demonstrates how a RISC processor implements
a delayed jump. The branch described in Figure 8 is a computed branch whose target
address is calculated during the execute phase of the instruction cycle.

Figure 8 Delayed branch




Another problem caused by pipelining is data dependency in which certain sequences
of instructions run into trouble because the current operation requires a result from the
previous operation and the previous operation has not yet left the pipeline. Figure 9
demonstrates how data dependency occurs.

Figure 9 Data dependency
Suppose a programmer wishes to carry out the apparently harmless calculation

X := (A + B) AND (A + B - C).

Assuming that A, B, C, X, and two temporary values, T1 and T2, are in registers in
the current window, we can write:

ADD A,B,T1        [T1] ← [A] + [B]
SUB T1,C,T2       [T2] ← [T1] - [C]
AND T1,T2,X       [X]  ← [T1] AND [T2]

Instruction i + 1 in Figure 9 begins execution during the operand fetch phase of the
previous instruction. However, instruction i + 1 cannot continue on to its operand
fetch phase, because the very operand it requires does not get written back to the
register file for another two clock cycles. Consequently a bubble must be introduced
in the pipeline while instruction i + 1 waits for its data. In a similar fashion, the logical
AND operation also introduces a bubble as it too requires the result of a previous
operation which is in the pipeline.

Figure 10 demonstrates a technique called internal forwarding designed to overcome
the effects of data dependency. The following sequence of operations is to be
executed.

1. ADD R1,R2,R3    [R3] ← [R1] + [R2]
2. ADD R4,R5,R6    [R6] ← [R4] + [R5]
3. ADD R3,R4,R7    [R7] ← [R3] + [R4]
4. ADD R7,R1,R8    [R8] ← [R7] + [R1]

Figure 10 Internal forwarding
In this example, instruction 3 (i.e., ADD R3,R4,R7) uses an operand generated by
instruction 1 (i.e., the contents of register R3). Because of the intervening instruction
2, the destination operand generated by instruction 1 has time to be written into the
register file before it is read as a source operand by instruction 3.

Instruction 3 generates a destination operand R7 that is required as a source operand
by the next instruction. If the processor were to read the source operand requested by
instruction 4 from the register file, it would see the old value of R7. By means of
internal forwarding the processor transfers R7 from instruction 3's execution unit
directly to the execution unit of instruction 4 (see Figure 10).
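The hazard check that triggers internal forwarding can be sketched over the four instructions above; the (dest, src1, src2) tuple form is our own representation:

```python
def needs_forwarding(prev, curr) -> bool:
    """True if curr reads the register its immediate predecessor writes,
    before writeback has reached the register file."""
    dest, _, _ = prev
    _, src1, src2 = curr
    return dest in (src1, src2)

seq = [
    ("R3", "R1", "R2"),   # 1. ADD R1,R2,R3
    ("R6", "R4", "R5"),   # 2. ADD R4,R5,R6
    ("R7", "R3", "R4"),   # 3. ADD R3,R4,R7  (R3 written back in time)
    ("R8", "R7", "R1"),   # 4. ADD R7,R1,R8  (R7 must be forwarded)
]
hazards = [needs_forwarding(a, b) for a, b in zip(seq, seq[1:])]
```

Only the pair (3, 4) needs forwarding; instruction 3's read of R3 is safe because of the intervening instruction 2.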

Accessing External Memory in RISC
Systems
Conventional CISC processors have a wealth of addressing modes that are used in
conjunction with memory reference instructions. For example, the 68020 implements
ADD D0,-(A5) which adds the contents of D0 to the top of the stack pointed at by A5
and then pushes the result on to this stack.

In their ruthless pursuit of efficiency, the designers of the Berkeley RISC severely
restricted the way in which it accesses external memory. The Berkeley RISC permits
only two types of reference to external memory: a load and a store. All arithmetic and
logical operations carried out by the RISC apply only to source and destination
operands in registers. Similarly, the Berkeley RISC provides a limited number of
addressing modes with which to access an operand in the main store. It's not hard to
find the reason for these restrictions on external memory accesses—an external
memory reference takes longer than an internal operation. We now discuss some of
the general principles of Berkeley RISC load and store instructions.

Consider the load register operation of the form LDXW (Rx)S2,Rd that has the effect
[Rd] ← [M([Rx] + S2)]. The operand address is the contents of register Rx plus the
offset S2. Figure 11 demonstrates the sequence of
actions performed during the execution of this instruction. During the source fetch
phase, register Rx is read from the register file and used to calculate the effective
address of the operand in the execute phase. However, the processor can't progress
beyond the execute phase to the store operand phase, because the operand hasn't been
read from the main store. Therefore the main store must be accessed to read the
operand and a store operand phase executed to load the operand into destination
register Rd. Because memory accesses introduce bubbles into the pipeline, they are
avoided wherever possible.

Figure 11 The load operation
The Berkeley RISC implements two basic addressing modes: indexed and program
counter relative. All other addressing modes can (and must) be synthesized from these
two primitives. The effective address in the indexed mode is given by:

EA = [Rx] + S2

where Rx is the index register (one of the 32 general purpose registers accessible by
the current subroutine) and S2 is an offset. The offset can be either a general-purpose
register or a 13-bit constant.

The effective address in the program counter relative mode is given by:

EA = [PC] + S2

where PC represents the contents of the program counter and S2 is an offset as above.

These addressing modes provide quite a powerful toolbox: zero, one, or two pointers
and a constant offset. If you wonder how we can use an addressing mode without an
index (i.e., pointer) register, remember that R0 in the global register set permanently
contains the constant 0. For example, LDXW (R12)R0,R3 uses simple address
register indirect addressing, whereas LDXW (R0)123,R3 uses absolute addressing
(i.e., memory location 123).
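The R0 trick can be sketched as a tiny software model of effective-address calculation. This is an illustrative sketch, not Berkeley RISC hardware; the register array `reg` and the function `effective_address` are hypothetical names, with `reg[0]` simply never written so that it behaves like the hard-wired zero the text describes.

```c
#include <stdint.h>

/* Hypothetical model of Berkeley RISC effective-address calculation.
   reg[0] is never written, so it acts as the hard-wired constant 0. */
static uint32_t reg[32];

uint32_t effective_address(int rx, uint32_t s2)
{
    return reg[rx] + s2;  /* EA = [Rx] + S2 */
}
```

With rx = 12 and s2 = 0 (i.e., S2 = R0) this gives register indirect addressing via R12; with rx = 0 and s2 = 123 it gives absolute addressing of location 123.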

There's a difference between addressing modes permitted by load and store
operations. A load instruction permits the second source, S2, to be either an
immediate value or a second register, whereas a store instruction permits S2 to be a
13-bit immediate value only. This lack of symmetry between the load and store
addressing modes is because a "load base+index" instruction requires a register file
with two ports, whereas a "store base+index" instruction requires a register file
with three ports. Two-ported memory allows two simultaneous accesses. Three-ported
memory allows three simultaneous accesses and is harder to design.

Figure 1 defines just two basic Berkeley RISC instruction formats. The short
immediate format provides a 5-bit destination, a 5-bit source 1 operand and a 14-bit
short source 2 operand. The short immediate format has two variations: one that
specifies a 13-bit literal for source 2 and one that specifies a 5-bit source 2 register
address. Bit 13 specifies whether the source 2 operand is a 13-bit literal or a 5-bit
register pointer.

The long immediate format provides a 19-bit source operand by concatenating the two
source operand fields. Thirteen-bit and 19-bit immediate fields may sound a little
strange at first sight. However, since 13 + 19 = 32, the Berkeley RISC permits a full
32-bit value to be loaded into a window register in two operations. In the next section
we will discover that the ARM processor deals with literals in a different way. A
typical CISC microprocessor might take the same number of instruction bits to
perform the same action (i.e., a 32-bit operation code field followed by a 32-bit
literal).
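The two-operation literal load can be mimicked in C. The sketch below assumes the field widths stated above (a 19-bit upper part followed by a 13-bit lower part); the function name is hypothetical.

```c
#include <stdint.h>

/* Assemble a full 32-bit constant in two steps, mirroring a 19-bit
   long-immediate load followed by OR-ing in a 13-bit short immediate. */
uint32_t load_literal(uint32_t value)
{
    uint32_t r;
    r = (value >> 13) << 13;  /* step 1: load the upper 19 bits */
    r |= value & 0x1FFF;      /* step 2: OR in the lower 13 bits */
    return r;
}
```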

The following describes some of the addressing modes that can be synthesized from
the RISC's basic addressing modes.

1. Absolute addressing
EA = 13-bit offset
Implemented by setting Rx = R0 = 0, S2 = 13-bit constant.

2. Register indirect
EA = [Rx]
Implemented by setting S2 = R0 = 0.

3. Indexed addressing
EA = [Rx] + Offset
Implemented by setting S2 = 13-bit constant.

4. Two-dimensional byte addressing (i.e., byte array access)
EA = [Rx] + [Ry]
Implemented by setting S2 = [Ry].
This mode is available only for load instructions.

Conditional instructions (i.e., branch operations) do not require a destination address
and therefore the five bits, 19 to 23, normally used to specify a destination register are
used to specify the condition (one of 16 since bit 23 is not used by conditional
instructions).

Reducing the Branch Penalty
If we're going to reduce the effect of branches on the performance of RISC
processors, we need to determine the effect of branch instructions on the performance
of the system. Because we cannot know how many branches a given program will
contain, or how likely each branch is to be taken, we have to construct a probabilistic
model to describe the system's performance. We will make the following assumptions:

1. Each non-branch instruction is executed in one cycle.
2. The probability that a given instruction is a branch is pb.
3. The probability that a branch instruction will be taken is pt.
4. If a branch is taken, the additional penalty is b cycles; if a branch is not taken,
there is no penalty.

If pb is the probability that an instruction is a branch, then 1 - pb is the probability
that it is not a branch.

The average number of cycles executed during the execution of a program is the sum
of the cycles taken for non-branch instructions, plus the cycles taken by branch
instructions that are taken, plus the cycles taken by branch instructions that are not
taken. We can derive an expression for the average number of cycles per instruction
as:

Tave = (1 - pb) × 1 + pb × pt × (1 + b) + pb × (1 - pt) × 1 = 1 + pb × pt × b.

This expression, 1 + pb × pt × b, tells us that the number of branch instructions, the
probability that a branch is taken, and the overhead per branch instruction all
contribute to the branch penalty. We are now going to examine some of the ways in
which the value of pb × pt × b can be reduced.
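The algebra can be checked numerically. The sketch below evaluates both the full three-term sum and the simplified form; the function names and the probability values in the usage are arbitrary illustrative choices.

```c
/* Average cycles per instruction: the full expression and its
   simplification to 1 + pb*pt*b. */
double t_ave_full(double pb, double pt, double b)
{
    return (1.0 - pb) * 1.0        /* non-branch instructions     */
         + pb * pt * (1.0 + b)     /* branches that are taken     */
         + pb * (1.0 - pt) * 1.0;  /* branches that are not taken */
}

double t_ave_simple(double pb, double pt, double b)
{
    return 1.0 + pb * pt * b;
}
```

For example, with pb = 0.2, pt = 0.6, and b = 3 both forms give 1.36 cycles per instruction.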

Branch Prediction
If we can predict the outcome of the branch instruction before it is executed, we can
start filling the pipeline with instructions from the branch target address (assuming the
branch is going to be taken). For example, if the instruction is BRA N, the processor
can start fetching instructions at locations N, N + 1, N + 2 etc., as soon as the branch
instruction is fetched from memory. In this way, the pipeline is always filled with
useful instructions.

This prediction mechanism works well with an unconditional branch like BRA N.
Unfortunately, conditional branches pose a problem. Consider a conditional branch of
the form BCC N (branch to N on carry bit clear). Should the RISC processor make the
assumption that the branch will not be taken and fetch instructions in sequence, or
should it make the assumption that the branch will be taken and fetch instructions at
the branch target address N?

As we have already said, conditional branches are required to implement various
types of high-level language construct. Consider the following fragment of high-level
language code.

if (J < K) I = I + L;
for (T = 1; T <= I; T++)
{
.
.
}

The first conditional operation compares J with K. Only the nature of the problem will
tell us whether J is often less than K.

The second conditional in this fragment of code is provided by the FOR construct that
tests a counter at the end of the loop and then decides whether to jump back to the
body of the construct or to terminate the loop. In this case, you could bet that the loop
is more likely to be repeated than exited. Loops can be executed thousands of times
before they are exited. Some computers look at the type of conditional branch and
then either fill the pipeline from the branch target if they predict that the branch will
be taken, or fill the pipeline from the instruction after the branch if they predict that it
will not be taken.

If we attempt to predict the behavior of a system with two outcomes (branch taken or
branch not taken), there are four possibilities:

1. Predict branch taken and branch taken — successful outcome
2. Predict branch taken and branch not taken — unsuccessful outcome
3. Predict branch not taken and branch not taken — successful outcome
4. Predict branch not taken and branch taken — unsuccessful outcome

Suppose we apply a branch penalty to each of these four possible outcomes. The
penalty is the number of cycles taken by that particular outcome, as table 3
demonstrates. For example, if we think that a branch will not be taken and get
instructions following the branch and the branch is actually taken (forcing the pipeline
to be loaded with instructions at the target address), the branch penalty in table 3
is c cycles.

Table 3 The branch penalty

Prediction            Result                 Branch penalty
Branch taken          Branch taken           a
Branch taken          Branch not taken       b
Branch not taken      Branch taken           c
Branch not taken      Branch not taken       d

We can now calculate the average penalty for a particular system. To do this we need
more information about the system. The first thing we need to know is the probability
that an instruction will be a branch (as opposed to any other category of instruction).
Assume that the probability that an instruction is a branch is pb. The next thing we
need to know is the probability that the branch instruction will be taken, pt. Finally,
we need to know the accuracy of the prediction. Let pc be the probability that a branch
prediction is correct. These values can be obtained by observing the performance of
real programs. Figure 12 illustrates all the possible outcomes of an instruction. We
can immediately write:

(1 - pb) = probability that an instruction is not a branch.
(1 - pt) = probability that a branch will not be taken.
(1 - pc) = probability that a prediction is incorrect.

These equations are obtained by using the principle that if one event or another must
take place, their probabilities must add up to unity. The average branch penalty per
branch instruction is therefore

Cave = a × p(predicted taken and taken) + b × p(predicted taken but not taken)

+ c × p(predicted not taken but taken) + d × p(predicted not taken and not taken)

Cave = a × pt × pc + b × (1 - pt) × (1 - pc) + c × pt × (1 - pc) + d × (1 - pt) × pc

Figure 12 Branch prediction
The average number of cycles added due to a branch instruction is Cave × pb

= pb × (a × pt × pc + b × (1 - pt) × (1 - pc) + c × pt × (1 - pc) + d × (1 - pt) × pc).

We can make two assumptions to help us to simplify this general expression. The first
is that a = d = N (i.e., if the prediction is correct the number of cycles is N). The other
simplification is that b = c = B (i.e., if the prediction is wrong the number of cycles
is B). The average number of cycles per branch instruction is therefore:

pb × (N × pt × pc + B × pt × (1 - pc) + B × (1 - pt) × (1 - pc) + N × (1 - pt) × pc)
= pb × (N × pc + B × (1 - pc)).
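Again the simplification can be verified numerically; the function names and the sample probabilities below are illustrative.

```c
/* Average cycles per branch instruction: the four-outcome sum
   and its simplification to N*pc + B*(1 - pc). */
double branch_full(double pt, double pc, double N, double B)
{
    return N * pt * pc                  /* correct prediction, branch taken      */
         + B * pt * (1.0 - pc)          /* misprediction, branch taken           */
         + B * (1.0 - pt) * (1.0 - pc)  /* misprediction, branch not taken       */
         + N * (1.0 - pt) * pc;         /* correct prediction, branch not taken  */
}

double branch_simple(double pc, double N, double B)
{
    return N * pc + B * (1.0 - pc);
}
```

Note that pt cancels out of the sum: once the penalty depends only on whether the prediction was correct, the branch direction itself no longer matters.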

This formula can be used to investigate tradeoffs between branch penalties, branch
probabilities and pipeline length. There are several ways of implementing branch
prediction (i.e., increasing the value of pc). Two basic approaches are static branch
prediction and dynamic branch prediction. Static branch prediction makes the
assumption that branches are always taken or never taken. Since observations of real
code have demonstrated that branches have a greater than 50% chance of being taken,
the best static branch prediction mechanism would be to fetch the next instruction
from the branch target address as soon as the branch instruction is detected.

A better method of predicting the outcome of a branch is to observe its op-code,
because some branch instructions are taken more or less frequently than other branch
instructions. Using the branch op-code to predict that the branch will or will not be
taken results in 75% accuracy. An extension of this technique is to devote a bit of the
op-code to the static prediction of branches. This bit is set or cleared by the compiler
depending on whether the compiler estimates that the branch is most likely to be
taken. This technique provides branch prediction accuracy in the range 74 to 94%.

Dynamic branch prediction techniques operate at runtime and use the past behavior of
the program to predict its future behavior. Suppose the processor maintains a table of
branch instructions. This branch table contains information about the likely behavior
of each branch. Each time a branch is executed, its outcome (i.e., taken or not taken)
is used to update the entry in the table. The processor uses the table to determine
whether to take the next instruction from the branch target address (i.e., branch
predicted taken) or from the next address in sequence (branch predicted not taken).

Single-bit branch predictors provide an accuracy of over 80 percent and five-bit
predictors provide an accuracy of up to 98 percent. A typical branch prediction
algorithm uses the last two outcomes of a branch to predict its future. If the last two
outcomes are X, the next branch is assumed to lead to outcome X. If the prediction is
wrong it remains the same the next time the branch is executed (i.e., two failures are
needed to modify the prediction). After two consecutive failures, the prediction is
inverted and the other outcome assumed. This algorithm responds to trends and is not
affected by the occasional single different outcome.
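The two-failures-to-change rule described above is the classic two-bit saturating counter. The following is a minimal sketch (the type and function names are hypothetical, not from any particular processor).

```c
/* Two-bit saturating branch predictor: states 0,1 predict not taken;
   states 2,3 predict taken. Two consecutive mispredictions are needed
   to reverse the prediction, so one occasional odd outcome is ignored. */
typedef struct { int state; } Predictor;  /* state in 0..3 */

int predict_taken(const Predictor *p)
{
    return p->state >= 2;
}

void update(Predictor *p, int taken)
{
    if (taken && p->state < 3)
        p->state++;     /* move toward "strongly taken" */
    else if (!taken && p->state > 0)
        p->state--;     /* move toward "strongly not taken" */
}
```

Starting in the strongly-taken state, a single not-taken outcome leaves the prediction unchanged; only a second consecutive not-taken outcome flips it.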

Problems
1. What are the characteristics of a CISC processor?

2. The most frequently executed class of instruction is the data move instruction. Why
is this?

3. The Berkeley RISC has a 32-bit architecture and yet provides only a 13-bit literal.
Why is this and does it really matter?
4. What are the advantages and disadvantages of register windowing?

5. What is pipelining and how does it increase the performance of a computer?

6. A pipeline is defined by its length (i.e., the number of stages that can operate in
parallel). A pipeline can be short or long. What do you think are the relative
advantages of long and short pipelines?

7. What is data dependency in a pipelined system and how can its effects be
overcome?

8. RISC architectures don't permit operations on operands in memory other than load
and store operations. Why?

9. The average number of cycles required by a RISC to execute an instruction is given
by Tave = 1 + pb × pt × b,

where

The probability that a given instruction is a branch is pb
The probability that a branch instruction will be taken is pt
If a branch is taken, the additional penalty is b cycles
If a branch is not taken, there is no penalty
Draw a series of graphs of the average number of cycles per instruction as a function
of pb × pt for b = 1, 2, 3, and 4.

10. What is branch prediction and how can it be used to reduce the so-called branch
penalty in a pipelined system?

Más contenido relacionado

La actualidad más candente

Comparative Study of RISC AND CISC Architectures
Comparative Study of RISC AND CISC ArchitecturesComparative Study of RISC AND CISC Architectures
Comparative Study of RISC AND CISC ArchitecturesEditor IJCATR
 
Risc and cisc eugene clewlow
Risc and cisc   eugene clewlowRisc and cisc   eugene clewlow
Risc and cisc eugene clewlowChaudhary Manzoor
 
RISC - Reduced Instruction Set Computing
RISC - Reduced Instruction Set ComputingRISC - Reduced Instruction Set Computing
RISC - Reduced Instruction Set ComputingTushar Swami
 
Pipelining and ILP (Instruction Level Parallelism)
Pipelining and ILP (Instruction Level Parallelism) Pipelining and ILP (Instruction Level Parallelism)
Pipelining and ILP (Instruction Level Parallelism) A B Shinde
 
Introducing Embedded Systems and the Microcontrollers
Introducing Embedded Systems and the MicrocontrollersIntroducing Embedded Systems and the Microcontrollers
Introducing Embedded Systems and the MicrocontrollersRavikumar Tiwari
 
Instruction Level Parallelism (ILP) Limitations
Instruction Level Parallelism (ILP) LimitationsInstruction Level Parallelism (ILP) Limitations
Instruction Level Parallelism (ILP) LimitationsJose Pinilla
 
CISC vs RISC Processor Architecture
CISC vs RISC Processor ArchitectureCISC vs RISC Processor Architecture
CISC vs RISC Processor ArchitectureKaushik Patra
 
Translating GPU Binaries to Tiered SIMD Architectures with Ocelot
Translating GPU Binaries to Tiered SIMD Architectures with OcelotTranslating GPU Binaries to Tiered SIMD Architectures with Ocelot
Translating GPU Binaries to Tiered SIMD Architectures with OcelotSlide_N
 
Risc and cisc casestudy
Risc and cisc casestudyRisc and cisc casestudy
Risc and cisc casestudyjvs71294
 
Risc and cisc eugene clewlow
Risc and cisc   eugene clewlowRisc and cisc   eugene clewlow
Risc and cisc eugene clewlowManish Prajapati
 
Review paper on 32-BIT RISC processor with floating point arithmetic
Review paper on 32-BIT RISC processor with floating point arithmeticReview paper on 32-BIT RISC processor with floating point arithmetic
Review paper on 32-BIT RISC processor with floating point arithmeticIRJET Journal
 
Advanced Scalable Decomposition Method with MPICH Environment for HPC
Advanced Scalable Decomposition Method with MPICH Environment for HPCAdvanced Scalable Decomposition Method with MPICH Environment for HPC
Advanced Scalable Decomposition Method with MPICH Environment for HPCIJSRD
 
DESIGN AND ANALYSIS OF A 32-BIT PIPELINED MIPS RISC PROCESSOR
DESIGN AND ANALYSIS OF A 32-BIT PIPELINED MIPS RISC PROCESSORDESIGN AND ANALYSIS OF A 32-BIT PIPELINED MIPS RISC PROCESSOR
DESIGN AND ANALYSIS OF A 32-BIT PIPELINED MIPS RISC PROCESSORVLSICS Design
 
Dsdco IE: RISC and CISC architectures and design issues
Dsdco IE: RISC and CISC architectures and design issuesDsdco IE: RISC and CISC architectures and design issues
Dsdco IE: RISC and CISC architectures and design issuesHome
 
IRJET- Design of Low Power 32- Bit RISC Processor using Verilog HDL
IRJET-  	  Design of Low Power 32- Bit RISC Processor using Verilog HDLIRJET-  	  Design of Low Power 32- Bit RISC Processor using Verilog HDL
IRJET- Design of Low Power 32- Bit RISC Processor using Verilog HDLIRJET Journal
 

La actualidad más candente (20)

Tibor
TiborTibor
Tibor
 
Comparative Study of RISC AND CISC Architectures
Comparative Study of RISC AND CISC ArchitecturesComparative Study of RISC AND CISC Architectures
Comparative Study of RISC AND CISC Architectures
 
Risc and cisc eugene clewlow
Risc and cisc   eugene clewlowRisc and cisc   eugene clewlow
Risc and cisc eugene clewlow
 
RISC - Reduced Instruction Set Computing
RISC - Reduced Instruction Set ComputingRISC - Reduced Instruction Set Computing
RISC - Reduced Instruction Set Computing
 
Pipelining and ILP (Instruction Level Parallelism)
Pipelining and ILP (Instruction Level Parallelism) Pipelining and ILP (Instruction Level Parallelism)
Pipelining and ILP (Instruction Level Parallelism)
 
Risc & cisk
Risc & ciskRisc & cisk
Risc & cisk
 
Introducing Embedded Systems and the Microcontrollers
Introducing Embedded Systems and the MicrocontrollersIntroducing Embedded Systems and the Microcontrollers
Introducing Embedded Systems and the Microcontrollers
 
Instruction Level Parallelism (ILP) Limitations
Instruction Level Parallelism (ILP) LimitationsInstruction Level Parallelism (ILP) Limitations
Instruction Level Parallelism (ILP) Limitations
 
CISC vs RISC Processor Architecture
CISC vs RISC Processor ArchitectureCISC vs RISC Processor Architecture
CISC vs RISC Processor Architecture
 
Translating GPU Binaries to Tiered SIMD Architectures with Ocelot
Translating GPU Binaries to Tiered SIMD Architectures with OcelotTranslating GPU Binaries to Tiered SIMD Architectures with Ocelot
Translating GPU Binaries to Tiered SIMD Architectures with Ocelot
 
Risc and cisc casestudy
Risc and cisc casestudyRisc and cisc casestudy
Risc and cisc casestudy
 
Risc and cisc eugene clewlow
Risc and cisc   eugene clewlowRisc and cisc   eugene clewlow
Risc and cisc eugene clewlow
 
Machine Learning @NECST
Machine Learning @NECSTMachine Learning @NECST
Machine Learning @NECST
 
1.prallelism
1.prallelism1.prallelism
1.prallelism
 
Review paper on 32-BIT RISC processor with floating point arithmetic
Review paper on 32-BIT RISC processor with floating point arithmeticReview paper on 32-BIT RISC processor with floating point arithmetic
Review paper on 32-BIT RISC processor with floating point arithmetic
 
Advanced Scalable Decomposition Method with MPICH Environment for HPC
Advanced Scalable Decomposition Method with MPICH Environment for HPCAdvanced Scalable Decomposition Method with MPICH Environment for HPC
Advanced Scalable Decomposition Method with MPICH Environment for HPC
 
DESIGN AND ANALYSIS OF A 32-BIT PIPELINED MIPS RISC PROCESSOR
DESIGN AND ANALYSIS OF A 32-BIT PIPELINED MIPS RISC PROCESSORDESIGN AND ANALYSIS OF A 32-BIT PIPELINED MIPS RISC PROCESSOR
DESIGN AND ANALYSIS OF A 32-BIT PIPELINED MIPS RISC PROCESSOR
 
Dsdco IE: RISC and CISC architectures and design issues
Dsdco IE: RISC and CISC architectures and design issuesDsdco IE: RISC and CISC architectures and design issues
Dsdco IE: RISC and CISC architectures and design issues
 
ITFT_Risc
ITFT_RiscITFT_Risc
ITFT_Risc
 
IRJET- Design of Low Power 32- Bit RISC Processor using Verilog HDL
IRJET-  	  Design of Low Power 32- Bit RISC Processor using Verilog HDLIRJET-  	  Design of Low Power 32- Bit RISC Processor using Verilog HDL
IRJET- Design of Low Power 32- Bit RISC Processor using Verilog HDL
 

Similar a Risc processors all syllabus5

A 64-Bit RISC Processor Design and Implementation Using VHDL
A 64-Bit RISC Processor Design and Implementation Using VHDL A 64-Bit RISC Processor Design and Implementation Using VHDL
A 64-Bit RISC Processor Design and Implementation Using VHDL Andrew Yoila
 
Performance from Architecture: Comparing a RISC and a CISC with Similar Hardw...
Performance from Architecture: Comparing a RISC and a CISC with Similar Hardw...Performance from Architecture: Comparing a RISC and a CISC with Similar Hardw...
Performance from Architecture: Comparing a RISC and a CISC with Similar Hardw...Dileep Bhandarkar
 
Microcontroller pic 16f877 architecture and basics
Microcontroller pic 16f877 architecture and basicsMicrocontroller pic 16f877 architecture and basics
Microcontroller pic 16f877 architecture and basicsNilesh Bhaskarrao Bahadure
 
Architectures and operating systems
Architectures and operating systemsArchitectures and operating systems
Architectures and operating systemsHiran Kanishka
 
RISC Vs CISC, Harvard v/s Van Neumann
RISC Vs CISC, Harvard v/s Van NeumannRISC Vs CISC, Harvard v/s Van Neumann
RISC Vs CISC, Harvard v/s Van NeumannRavikumar Tiwari
 
Risc and cisc eugene clewlow
Risc and cisc   eugene clewlowRisc and cisc   eugene clewlow
Risc and cisc eugene clewlowkaran saini
 
CS304PC:Computer Organization and Architecture UNIT V_merged_merged.pdf
CS304PC:Computer Organization and Architecture UNIT V_merged_merged.pdfCS304PC:Computer Organization and Architecture UNIT V_merged_merged.pdf
CS304PC:Computer Organization and Architecture UNIT V_merged_merged.pdfAsst.prof M.Gokilavani
 
Computer Organization.pptx
Computer Organization.pptxComputer Organization.pptx
Computer Organization.pptxsaimagul310
 
A New Direction for Computer Architecture Research
A New Direction for Computer Architecture ResearchA New Direction for Computer Architecture Research
A New Direction for Computer Architecture Researchdbpublications
 
Crussoe proc
Crussoe procCrussoe proc
Crussoe proctyadi
 
Advanced computer architecture lesson 5 and 6
Advanced computer architecture lesson 5 and 6Advanced computer architecture lesson 5 and 6
Advanced computer architecture lesson 5 and 6Ismail Mukiibi
 
Co question bank LAKSHMAIAH
Co question bank LAKSHMAIAH Co question bank LAKSHMAIAH
Co question bank LAKSHMAIAH veena babu
 
risc_and_cisc.ppt
risc_and_cisc.pptrisc_and_cisc.ppt
risc_and_cisc.pptRuhul Amin
 

Similar a Risc processors all syllabus5 (20)

A 64-Bit RISC Processor Design and Implementation Using VHDL
A 64-Bit RISC Processor Design and Implementation Using VHDL A 64-Bit RISC Processor Design and Implementation Using VHDL
A 64-Bit RISC Processor Design and Implementation Using VHDL
 
Performance from Architecture: Comparing a RISC and a CISC with Similar Hardw...
Performance from Architecture: Comparing a RISC and a CISC with Similar Hardw...Performance from Architecture: Comparing a RISC and a CISC with Similar Hardw...
Performance from Architecture: Comparing a RISC and a CISC with Similar Hardw...
 
Microcontroller pic 16f877 architecture and basics
Microcontroller pic 16f877 architecture and basicsMicrocontroller pic 16f877 architecture and basics
Microcontroller pic 16f877 architecture and basics
 
Architectures and operating systems
Architectures and operating systemsArchitectures and operating systems
Architectures and operating systems
 
Ca alternative architecture
Ca alternative architectureCa alternative architecture
Ca alternative architecture
 
RISC Vs CISC, Harvard v/s Van Neumann
RISC Vs CISC, Harvard v/s Van NeumannRISC Vs CISC, Harvard v/s Van Neumann
RISC Vs CISC, Harvard v/s Van Neumann
 
Risc and cisc eugene clewlow
Risc and cisc   eugene clewlowRisc and cisc   eugene clewlow
Risc and cisc eugene clewlow
 
Hg3612911294
Hg3612911294Hg3612911294
Hg3612911294
 
Architectures
ArchitecturesArchitectures
Architectures
 
CS304PC:Computer Organization and Architecture UNIT V_merged_merged.pdf
CS304PC:Computer Organization and Architecture UNIT V_merged_merged.pdfCS304PC:Computer Organization and Architecture UNIT V_merged_merged.pdf
CS304PC:Computer Organization and Architecture UNIT V_merged_merged.pdf
 
Computer Organization.pptx
Computer Organization.pptxComputer Organization.pptx
Computer Organization.pptx
 
A New Direction for Computer Architecture Research
A New Direction for Computer Architecture ResearchA New Direction for Computer Architecture Research
A New Direction for Computer Architecture Research
 
Ef35745749
Ef35745749Ef35745749
Ef35745749
 
Crussoe proc
Crussoe procCrussoe proc
Crussoe proc
 
Advanced computer architecture lesson 5 and 6
Advanced computer architecture lesson 5 and 6Advanced computer architecture lesson 5 and 6
Advanced computer architecture lesson 5 and 6
 
R&amp;c
R&amp;cR&amp;c
R&amp;c
 
arm-cortex-a8
arm-cortex-a8arm-cortex-a8
arm-cortex-a8
 
Co question bank LAKSHMAIAH
Co question bank LAKSHMAIAH Co question bank LAKSHMAIAH
Co question bank LAKSHMAIAH
 
risc_and_cisc.ppt
risc_and_cisc.pptrisc_and_cisc.ppt
risc_and_cisc.ppt
 
11 2014
11 201411 2014
11 2014
 

Último

Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3JemimahLaneBuaron
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104misteraugie
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxSayali Powar
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docxPoojaSen20
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 

Último (20)

Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Sanyam Choudhary Chemistry practical.pdf

Risc processors all syllabus5

A reaction against the trend toward greater architectural complexity began at IBM with the 801 architecture, and continued at Berkeley, where Patterson and Ditzel coined the term RISC to describe a new class of architectures that reversed earlier trends in microcomputer design. According to popular wisdom, RISC architectures are streamlined versions of traditional complex instruction set computers. This notion is
both misleading and dangerous, because it implies that RISC processors are in some way cruder versions of existing architectures. In brief, RISC architectures re-deploy to better effect some of the silicon real estate used to implement complex instructions and elaborate addressing modes in conventional microprocessors of the 68000 and 8086 generation. The mnemonic "RISC" should really stand for regular instruction set computer.

Two factors influencing the architecture of first- and second-generation microprocessors were microprogramming and the desire to help compiler writers by providing ever more complex instruction sets. The latter is called closing the semantic gap (i.e., reducing the difference between high-level and low-level languages). By complex instructions we mean instructions like MOVE 12(A3,D0),D2 and ADD -(A6),D3 that carry out multi-step operations in a single machine-level instruction. The instruction MOVE 12(A3,D0),D2 generates an effective address by adding the contents of A3 to the contents of D0 plus the literal 12. The resulting address is used to access the source operand, which is loaded into register D2.

Microprogramming achieved its high point in the 1970s, when ferrite core memory had a long access time of 1 µs or more and high-speed semiconductor random access memory was very expensive. Quite naturally, computer designers used the slow main store to hold the complex instructions that made up the machine-level program. These machine-level instructions are interpreted by microcode in the much faster microprogram control store within the CPU. Today, main stores use semiconductor memory with an access time of 50 ns or less, and most of the advantages of microprogramming have evaporated. Indeed, the goal of a RISC architecture is to execute an instruction in a single machine cycle. A corollary of this statement is that complex instructions can't be executed by RISC architectures.
Before we look at RISC architectures, we have to describe some of the research that led to the search for better architectures.

Instruction Usage

Computer scientists carried out extensive research, over a decade or more beginning in the late 1970s, into the way in which computers execute programs. Their studies demonstrated that the relative frequency with which different classes of instructions are executed is not uniform and that some types of instruction are executed far more frequently than others. Fairclough divided machine-level instructions into eight groups according to type and compiled the statistics shown in Table 1. The "mean value of instruction use" gives the percentage of times that instructions in that group are executed, averaged over both program types and computer architecture. These figures relate to early 8-bit processors.
Table 1 Instruction usage as a function of instruction type

Instruction group:              1      2      3      4     5     6     7     8
Mean value of instruction use:  45.28  28.73  10.75  5.92  3.91  2.93  2.05  0.40

The eight instruction groups in Table 1 are:

1. Data movement
2. Program flow control (i.e., branch, call, return)
3. Arithmetic
4. Compare
5. Logical
6. Shift
7. Bit manipulation
8. Input/output and miscellaneous

Table 1 convincingly demonstrates that the most common instruction type is the data movement primitive of the form P := Q in a high-level language or MOVE P,Q in a low-level language. Similarly, the program flow control group, which includes both conditional and unconditional branches (together with subroutine calls and returns), forms the second most common group of instructions. Taken together, the data movement and program flow control groups account for 74% of all instructions. A corollary of this statement is that we can expect a large program to contain only 26% of instructions that are not data movement or program flow control primitives.

An inescapable inference from such results is that processor designers might be better employed devoting their time to optimizing the way in which machines handle instructions in groups one and two than in seeking new powerful instructions that are seldom used. In the early days of the microprocessor, chip manufacturers went out of their way to provide special instructions that were unique to their products. These instructions were then heavily promoted by the company's sales force. Today, we can see that their efforts should have been directed towards the goal of optimizing the most frequently used instructions. RISC architectures have been designed to exploit the programming environment in which most instructions are data movement or program control instructions.

Another aspect of computer architecture that was investigated was the optimum size of literal operands (i.e., constants).
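The figures quoted from Fairclough in Table 1 can be checked with a few lines of Python; the percentages below are simply the means from the table:

```python
# Mean usage (%) for Fairclough's eight instruction groups, as quoted in Table 1.
usage = {
    "data movement": 45.28,
    "program flow control": 28.73,
    "arithmetic": 10.75,
    "compare": 5.92,
    "logical": 3.91,
    "shift": 2.93,
    "bit manipulation": 2.05,
    "input/output and miscellaneous": 0.40,
}

# The two most frequent groups together dominate the instruction mix.
top_two = usage["data movement"] + usage["program flow control"]
print(f"data movement + flow control = {top_two:.2f}%")   # 74.01%
print(f"everything else              = {100 - top_two:.2f}%")
```

The 74% claim in the text is the rounded value of this 74.01% sum.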
Tanenbaum reported the remarkable result that 56% of all constant values lie in the range -15 to +15 and that 98% of all constants lie in the range -511 to +511. Consequently, the inclusion of a 5-bit constant field in an instruction would cover over half the occurrences of a literal. RISC architectures have
sufficiently long instructions to include a literal field that caters for the majority of literals.

Programs use subroutines heavily, and an effective architecture should optimize the way in which subroutines are called, parameters are passed to and from subroutines, and workspace is allocated to the local variables created by subroutines. Research showed that in 95% of cases twelve words of storage are sufficient for parameter passing and local storage. A computer with twelve registers should therefore be able to handle all the operands required by most subroutines without accessing main store. Such an arrangement would reduce the processor-memory bus traffic associated with subroutine calls.

Characteristics of RISC Architectures

Having described the ingredients that go into an efficient architecture, we now look at the attributes of first-generation RISCs before covering RISC architectures in more detail. The characteristics of an efficient RISC architecture are:

1. RISC processors have sufficient on-chip registers to overcome the worst effects of the processor-memory bottleneck. Registers can be accessed more rapidly than off-chip main store. Although today's processors rely heavily on fast on-chip cache memory to increase throughput, registers still offer the highest performance.

2. RISC processors have three-address, register-to-register architectures with instructions of the form OPERATION Ra,Rb,Rc, where Ra, Rb, and Rc are general-purpose registers.

3. Because subroutine calls are so frequently executed, (some) RISC architectures make provision for the efficient passing of parameters between subroutines.

4. Instructions that modify the flow of control (e.g., branch instructions) are implemented efficiently because they comprise about 20 to 30% of a typical program.

5. RISC processors aim to execute one instruction per clock cycle. This goal imposes a limit on the maximum complexity of instructions.
RISC processors don't attempt to implement infrequently used instructions. Complex instructions waste silicon real estate and conflict with the single-cycle execution goal. Moreover, the inclusion of complex instructions increases the time taken to design, fabricate, and test a processor.
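Tanenbaum's literal-size statistics quoted earlier map directly onto two's-complement field widths. A short sketch (the 13-bit width anticipates the Berkeley literal field described later; the helper function itself is just illustrative):

```python
def fits(value: int, bits: int) -> bool:
    """True if value is representable as a two's-complement integer of the given width."""
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1

# A 5-bit signed field spans -16..+15, covering Tanenbaum's -15..+15 range
# that holds 56% of all constants.
assert fits(-15, 5) and fits(15, 5)

# A 13-bit field spans -4096..+4095, comfortably covering the -511..+511
# range that holds 98% of all constants.
assert fits(-511, 13) and fits(511, 13)
print("ok")
```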
A corollary of the single-cycle goal (point 5) is that an efficient architecture should not be microprogrammed, because microprogramming interprets a machine-level instruction by executing microinstructions. In the limit, a RISC processor is close to a microprogrammed architecture in which the distinction between machine cycle and microcode has vanished.

An efficient processor should have a single instruction format (or at least very few formats). A typical CISC processor such as the 68000 has variable-length instructions (e.g., from 2 to 10 bytes). By providing a single instruction format, the decoding of a RISC instruction into its component fields can be performed with a minimum of decoding logic. It follows that a RISC's instruction length should be sufficient to accommodate the operation code field and one or more operand fields. Consequently, a RISC processor may not utilize memory space as efficiently as a conventional CISC microprocessor.

Two fundamental aspects of the RISC architecture that we cover later are its register set and the use of pipelining. Multiple overlapping register windows were implemented by the Berkeley RISC to reduce the overhead incurred by transferring parameters between subroutines. Pipelining is a mechanism that permits the overlapping of instruction execution (i.e., internal operations are carried out in parallel). Many of the features of RISC processors are not new; they had been employed long before the advent of the microprocessor. The RISC revolution happened when all these performance-enhancing techniques were brought together and applied to microprocessor design.

The Berkeley RISC

Although many CISC processors were designed by semiconductor manufacturers, one of the first RISC processors came from the University of California at Berkeley. The Berkeley RISC wasn't a commercial machine, although it had a tremendous impact on the development of later RISC architectures. Figure 1 describes the format of a Berkeley RISC instruction.
Each of the 5-bit register operand fields (Destination and Source 1, and Source 2 when it specifies a register) permits one of 32 internal registers to be accessed.

Figure 1 Format of the Berkeley RISC instruction
The single-bit set condition code field, Scc, determines whether the condition code bits are updated after the execution of an instruction. The 14-bit Source 2 field has two functions. If the IM (immediate) bit is 0, the Source 2 field specifies one of 32 registers. If the IM bit is 1, the Source 2 field provides a 13-bit literal operand.

Since five bits are allocated to each operand field, it follows that this RISC has 2^5 = 32 internal registers. This last statement is emphatically not true, since the Berkeley RISC has 138 user-accessible general-purpose internal registers. The discrepancy between the number of registers directly addressable and the actual number of registers is due to a mechanism called windowing, which gives the programmer a view of only a subset of all registers at any instant. Register R0 is hardwired to contain the constant zero; specifying R0 as an operand is the same as specifying the constant 0.

Register Windows

An important feature of the Berkeley RISC architecture is the way in which it allocates new registers to subroutines; that is, when you call a subroutine, you get some new registers. If you can create 12 registers out of thin air when you call a subroutine, each subroutine will have its own workspace for temporary variables, thereby avoiding relatively slow accesses to main store.
Although only 12 or so registers are required by each invocation of a subroutine, the successive nesting of subroutines rapidly increases the total number of on-chip registers assigned to subroutines. You might think that any attempt to dedicate a set of registers to each new procedure is impractical, because the repeated calling of nested subroutines would require an unlimited amount of storage. Subroutines can indeed be nested to any depth, but research has demonstrated that, on average, subroutines are not nested to any great depth over short periods. Consequently, it is feasible to provide a modest number of local register sets for a sequence of nested subroutines.

Figure 2 provides a graphical representation of the execution of a typical program in terms of the depth of nesting of subroutines as a function of time. The trace goes up each time a subroutine is called and down each time a return is made. If subroutines were never called, the trace would be a horizontal line. This figure demonstrates that, even though subroutines may be nested to considerable depths, there are periods or runs of subroutine calls and returns that do not require a nesting level greater than about five.

Figure 2 Depth of subroutine nesting as a function of time
A mechanism for implementing local variable workspace for subroutines, adopted by the designers of the Berkeley RISC, is to support up to eight nested subroutines by providing on-chip workspace for each subroutine. Any further nesting forces the CPU to dump registers to main memory, as we shall soon see. Memory space used by subroutines can be divided into four types:

Global space
Global space is directly accessible by all subroutines and holds constants and data that may be required from any point within the program. Most conventional microprocessors have only global registers.
Local space
Local space is private to the subroutine; no other subroutine can access the current subroutine's local address space from outside it. Local space is employed as working space by the current subroutine.

Imported parameter space
Imported parameter space holds the parameters imported by the current subroutine from the parent that called it. In Berkeley RISC terminology these are called the high registers.

Exported parameter space
Exported parameter space holds the parameters exported by the current subroutine to its child. In RISC terminology these are called the low registers.

Windows and Parameter Passing

One of the reasons for the high frequency of data movement operations is the need to pass parameters to subroutines and to receive them from subroutines. The Berkeley RISC architecture deals with parameter passing by means of multiple overlapped windows. A window is the set of registers visible to the current subroutine. Figure 3 illustrates the structure of the Berkeley RISC's overlapping windows. Only three consecutive windows (i-1, i, i+1) of the 8 windows are shown in Figure 3. The vertical columns represent the registers seen by the corresponding window. Each window sees 32 registers, but they aren't all the same 32 registers.

The Berkeley RISC has a special-purpose register called the window pointer, WP, that indicates the current active window. Suppose that the processor is currently using the ith window set. In this case the WP contains the value i. The registers in each of the 8 windows are divided into the four groups shown in Table 2.

Table 2 Berkeley RISC register types

R0 to R9    The global register set, which is always accessible.
R10 to R15  Six registers used by the subroutine to receive parameters from its parent (i.e., the subroutine that called it).
R16 to R25  Ten local registers accessed only by the current subroutine; they cannot be accessed by any other subroutine.
R26 to R31  Six registers used by the subroutine to pass parameters to and from its own child (i.e., a subroutine called by itself).
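The classification in Table 2 is easy to express as a lookup. A minimal sketch, with the boundaries taken directly from the table (the label strings are invented for illustration):

```python
def register_type(r: int) -> str:
    """Classify a Berkeley RISC register number R0..R31 according to Table 2."""
    if not 0 <= r <= 31:
        raise ValueError("register number must be 0..31")
    if r <= 9:
        return "global"                         # R0-R9: shared by all windows
    if r <= 15:
        return "parameter in (from parent)"     # R10-R15
    if r <= 25:
        return "local"                          # R16-R25: private workspace
    return "parameter out (to child)"           # R26-R31

assert register_type(0) == "global"
assert register_type(12) == "parameter in (from parent)"
assert register_type(20) == "local"
assert register_type(31) == "parameter out (to child)"
```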
All windows consist of 32 addressable registers, R0 to R31. A Berkeley RISC instruction of the form ADD R3,R12,R25 implements [R25] ← [R3] + [R12], where R3 lies within the window's global address space, R12 lies within its import-from (or export-to) parent subroutine space, and R25 lies within its local address space. RISC arithmetic and logical instructions always involve 32-bit values (there are no 8-bit or 16-bit operations).

The Berkeley RISC's subroutine call is CALLR Rd,<address> and is similar to a typical CISC instruction BSR <address>. Whenever a subroutine is invoked by CALLR Rd,<address>, the contents of the window pointer are incremented by 1 and the current value of the program counter is saved in register Rd of the new window. The Berkeley RISC doesn't employ a conventional stack in external main memory to save subroutine return addresses.

Figure 3 Berkeley windowed register sets
Once a new window has been invoked (in Figure 3 this is window i), the new subroutine sees a different set of registers from the previous window. Global registers R0 to R9 are an exception because they are common to all windows. Register R26 of the calling (i.e., parent) subroutine corresponds to (i.e., is the same as) register R10 of the child (i.e., called) subroutine. Suppose you wish to send a parameter to a subroutine. If you place the parameter in R26 and call the subroutine, register R10 in that subroutine will contain the parameter. There hasn't been a physical transfer of data, because register R10 in the new window is simply register R26 in the previous window.

Figure 4 Relationship between register number, window number, and register address
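The aliasing that makes parameter passing free can be modelled in a few lines. The physical layout below is an assumption for illustration (the exact numbering in Figure 4 may differ); the sketch only aims to capture the overlap property and the total register count:

```python
GLOBALS, LOCALS, PARAMS, WINDOWS = 10, 10, 6, 8

def physical(window: int, r: int) -> int:
    """Map (window, register number) to a physical register index.
    Assumed layout: 10 globals first, then each window contributes its
    6 incoming-parameter registers and 10 locals to a circular buffer;
    a window's outgoing registers alias the next window's incoming ones."""
    if r <= 9:                       # globals are shared by every window
        return r
    stride = PARAMS + LOCALS         # 16 registers per window frame
    if r <= 15:                      # R10-R15: parameters from parent
        return GLOBALS + (window * stride + (r - 10)) % (WINDOWS * stride)
    if r <= 25:                      # R16-R25: locals
        return GLOBALS + (window * stride + PARAMS + (r - 16)) % (WINDOWS * stride)
    # R26-R31: parameters to child are the next window's R10-R15
    return physical((window + 1) % WINDOWS, r - 16)

# The overlap: window w's R26..R31 are window w+1's R10..R15.
for w in range(WINDOWS):
    for k in range(6):
        assert physical(w, 26 + k) == physical((w + 1) % WINDOWS, 10 + k)

# Total distinct physical registers: 10 + 8*10 + 8*6 = 138.
assert len({physical(w, r) for w in range(8) for r in range(32)}) == 138
```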
The physical arrangement of the Berkeley RISC's window system is given in Figure 4. On the left-hand side of the diagram is the actual register array that holds all the on-chip general-purpose registers. The eight columns associated with windows 0 to 7 demonstrate how each window is mapped onto the physical register array on the chip and how the overlapping regions are organized. The windows are logically arranged in a circular fashion, so that window 0 follows window 7 and window 7 precedes window 0. For example, if the current window pointer is 3 and you access register R25, location 74 is accessed in the register file. However, if you access register R25 when the window pointer is 7, you access location 137. The total number of physical registers required to implement the Berkeley windowed register set is 10 global + 8 × 10 local + 8 × 6 parameter transfer registers = 138 registers.

Window Overflow

Unfortunately, the total quantity of on-chip resources of any processor is finite and, in the case of the Berkeley RISC, the registers are limited to 8 windows. If subroutines are nested to a depth greater than or equal to 7, window overflow is said to occur, as there is no longer a new window for the next subroutine invocation. When an overflow takes place, the only thing left to do is to employ external memory to hold the overflow data. In practice the oldest window is saved, rather than the new window created by the subroutine just called. If the number of subroutine returns minus the number of subroutine calls exceeds 8, window underflow takes place. Window underflow is the converse of window overflow: the youngest window saved in main store must be restored to an on-chip window.

A considerable amount of research was carried out into dealing with window overflow efficiently. However, the imaginative use of windowed register sets in the Berkeley RISC was not adopted by many of the later RISC architectures.
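The overflow and underflow bookkeeping can be sketched with a toy model. The spill policy (save the oldest window, restore the youngest saved one) is as described above; everything else, including the exact depth at which the first spill occurs, is a simplifying assumption for illustration:

```python
class WindowFile:
    """Toy model of an eight-window register file with spills to main memory."""
    def __init__(self, windows: int = 8):
        self.windows = windows
        self.depth = 0       # current subroutine nesting depth
        self.spilled = 0     # oldest windows currently saved in main memory
        self.overflows = 0
        self.underflows = 0

    def call(self):
        self.depth += 1
        if self.depth - self.spilled > self.windows:
            # window overflow: dump the oldest on-chip window to memory
            self.spilled += 1
            self.overflows += 1

    def ret(self):
        assert self.depth > 0, "return without a matching call"
        self.depth -= 1
        if self.spilled > self.depth:
            # window underflow: restore the youngest saved window
            self.spilled -= 1
            self.underflows += 1

wf = WindowFile()
for _ in range(10):      # nest ten subroutines deep, then unwind completely
    wf.call()
for _ in range(10):
    wf.ret()
print(wf.overflows, wf.underflows)   # 2 2
```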
Modern RISC processors generally have a single set of 32 general-purpose registers.

RISC Architecture and Pipelining

We now describe pipelining, one of the most important techniques for increasing the throughput of a digital system; it exploits the regular structure of a RISC to carry out internal operations in parallel.
Figure 5 illustrates the machine cycle of a hypothetical microprocessor executing an ADD P instruction (i.e., [A] ← [A] + [M(P)]), where A is an on-chip general-purpose register and P is a memory location. The instruction is executed in five phases:

Instruction fetch
Read the instruction from the system memory and increment the program counter.

Instruction decode
Decode the instruction read from memory during the previous phase. The nature of the instruction decode phase depends on the complexity of the instruction encoding. A regularly encoded instruction might be decoded in a few nanoseconds with two levels of gating, whereas a complex instruction format might require ROM-based look-up tables to implement the decoding.

Operand fetch
The operand specified by the instruction is read from the system memory or an on-chip register and loaded into the CPU.

Execute
The operation specified by the instruction is carried out.

Operand store
The result obtained during the execution phase is written into the operand destination. This may be an on-chip register or a location in external memory.

Figure 5 Instruction execution

Each of these five phases may take a specific time (although the time taken would normally be an integer multiple of the system's master clock period). Some instructions require fewer than five phases; for example, CMP R1,R2 compares R1 and R2 by subtracting R1 from R2 to set the condition codes and does not need an operand store phase.

The inefficiency in the arrangement of Figure 5 is immediately apparent. Consider the execution phase of instruction interpretation. This phase might take one fifth of an instruction cycle, leaving the instruction execution unit idle for the remaining 80% of the time. The same applies to the other functional units of the processor, which also lie idle for 80% of the time. A technique called instruction pipelining can be employed to increase the effective speed of the processor by overlapping in time the
various stages in the execution of an instruction. In the simplest terms, a pipelined processor executes instruction i while fetching instruction i + 1 at the same time.

The way in which a RISC processor implements pipelining is described in Figure 6. The RISC processor executes the instruction in four steps or phases: instruction fetch from external memory, operand fetch, execute, and operand store (we're using a 4-stage system because a separate instruction decode phase isn't normally necessary). The internal phases take approximately the same time as the instruction fetch, because these operations take place within the CPU itself and operands are fetched from and stored in the CPU's own register file. Instruction 1 in Figure 6 begins in time slot 1 and is completed at the end of time slot 4.

Figure 6 Pipelining and instruction overlap
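The timing of Figure 6 is easy to tabulate. Assuming an ideal pipeline (one-cycle stages, no bubbles), the first instruction completes after a number of cycles equal to the number of stages, and one instruction completes per cycle thereafter:

```python
def cycles(instructions: int, stages: int, pipelined: bool) -> int:
    """Total cycles needed to complete the given number of instructions."""
    if pipelined:
        # First instruction drains the whole pipeline; after that,
        # one instruction completes in every cycle.
        return stages + instructions - 1
    # Non-pipelined: each instruction occupies all stages in turn.
    return stages * instructions

n = 1000
serial = cycles(n, 4, pipelined=False)     # 4000 cycles
piped = cycles(n, 4, pipelined=True)       # 1003 cycles
print(f"speedup = {serial / piped:.2f}")   # approaches 4 for long runs
```

For long instruction streams the ratio tends to the number of stages, which is the "factor of n" speedup an n-stage pipeline can deliver at best.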
In a non-pipelined processor, the next instruction doesn't begin until the current instruction has been completed. In the pipelined system of Figure 6, the instruction fetch phase of instruction 2 begins in time slot 2, at the same time that the operand is being fetched for instruction 1. In time slot 3, different phases of instructions 1, 2, and 3 are being executed simultaneously. In time slot 4, all functional units of the system are operating in parallel, and an instruction is completed in every time slot thereafter. An n-stage pipeline can increase throughput by up to a factor of n.

Pipeline Bubbles

A pipeline is an ordered structure that thrives on regularity. At any stage in the execution of a program, a pipeline contains components of two or more instructions at varying stages in their execution. Consider Figure 7, in which a sequence of instructions is being executed in a 4-stage pipelined processor. When the processor encounters a branch instruction, the following instruction is no longer found at the next sequential address but at the target address of the branch instruction. The processor is forced to reload its program counter with the value provided by the branch instruction. This means that all the useful work performed by the pipeline must now be thrown away, since the instructions immediately following the branch are not going to be executed. When information in a pipeline is rejected, or the pipeline is held up by the introduction of idle states, we say that a bubble has been introduced.

Figure 7 The pipeline bubble caused by a branch
As we have already stated, program control instructions are very frequent. Consequently, any realistic processor using pipelining must do something to overcome the problem of bubbles caused by instructions that modify the flow of control (branch, subroutine call, and return). The Berkeley RISC reduces the effect of bubbles by refusing to throw away the instruction following a branch. This mechanism is called a delayed jump or branch-and-execute technique, because the instruction immediately after a branch is always executed. Consider the effect of the following sequence of instructions:

ADD R1,R2,R3    [R3] ← [R1] + [R2]
JMPX N          [PC] ← N             Goto address N
ADD R2,R4,R5    [R5] ← [R2] + [R4]   This instruction is executed
ADD R7,R8,R9    Not executed because the branch is taken

The processor calculates R5 := R2 + R4 before executing the branch. This sequence of instructions looks most strange to the eyes of a conventional assembly language programmer, who is not accustomed to seeing an instruction executed after a branch has been taken.
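The delay-slot behaviour can be modelled with a toy interpreter. The (op, arg) encoding here is invented purely for illustration; the point is that a taken branch only redirects the program counter after one more instruction has run:

```python
def run(program, steps=10):
    """Execute (op, arg) pairs with a one-instruction branch delay slot."""
    pc, executed, pending_target = 0, [], None
    while pc < len(program) and steps:
        op, arg = program[pc]
        executed.append((pc, op))
        next_pc = pc + 1
        if pending_target is not None:
            # a branch seen last cycle takes effect now, after its delay slot
            next_pc, pending_target = pending_target, None
        if op == "jmp":
            pending_target = arg   # delayed: one more instruction still runs
        pc = next_pc
        steps -= 1
    return executed

# 0: add, 1: jmp to 4, 2: add (delay slot, still executed),
# 3: add (skipped by the branch), 4: add
prog = [("add", None), ("jmp", 4), ("add", None), ("add", None), ("add", None)]
trace = [pc for pc, _ in run(prog)]
print(trace)   # [0, 1, 2, 4]: instruction 3 is never reached
```

This mirrors the listing above: the instruction after JMPX executes, and the one after that does not.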
Unfortunately, it's not always possible to arrange a program in such a way as to include a useful instruction immediately after a branch. Whenever this happens, the compiler must introduce a no-operation instruction, NOP, after the branch and accept the inevitability of a bubble. Figure 8 demonstrates how a RISC processor implements a delayed jump. The branch described in Figure 8 is a computed branch whose target address is calculated during the execute phase of the instruction cycle.

Figure 8 Delayed branch

Another problem caused by pipelining is data dependency, in which certain sequences of instructions run into trouble because the current operation requires a result from the previous operation and the previous operation has not yet left the pipeline. Figure 9 demonstrates how data dependency occurs.

Figure 9 Data dependency
Suppose a programmer wishes to carry out the apparently harmless calculation X := (A + B) AND (A + B - C). Assuming that A, B, C, X, and two temporary values, T1 and T2, are in registers in the current window, we can write:

ADD A,B,T1     [T1] ← [A] + [B]
SUB T1,C,T2    [T2] ← [T1] - [C]
AND T1,T2,X    [X] ← [T1] ∧ [T2]

Instruction i + 1 in Figure 9 begins execution during the operand fetch phase of the previous instruction. However, instruction i + 1 cannot continue on to its operand fetch phase, because the very operand it requires does not get written back to the register file for another two clock cycles. Consequently, a bubble must be introduced into the pipeline while instruction i + 1 waits for its data. In a similar fashion, the logical AND operation also introduces a bubble, as it too requires the result of a previous operation that is still in the pipeline.

Figure 10 demonstrates a technique called internal forwarding, designed to overcome the effects of data dependency. The following sequence of operations is to be executed:

1. ADD R1,R2,R3    [R3] ← [R1] + [R2]
2. ADD R4,R5,R6    [R6] ← [R4] + [R5]
3. ADD R3,R4,R7    [R7] ← [R3] + [R4]
4. ADD R7,R1,R8    [R8] ← [R7] + [R1]

Figure 10 Internal forwarding
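A forwarding unit (or a scheduling compiler) can spot which instructions need a forwarded result with a simple check. This sketch assumes the simplest hazard rule, distance of exactly one instruction, which is what the four-ADD sequence above exercises:

```python
def forwarding_needed(seq):
    """seq: list of (dest, src1, src2) register names.
    Return the indices of instructions whose source operand is produced
    by the instruction immediately before them (i.e., the result has not
    yet been written back to the register file)."""
    flagged = []
    for i in range(1, len(seq)):
        prev_dest = seq[i - 1][0]
        if prev_dest in seq[i][1:]:
            flagged.append(i)
    return flagged

# The four ADDs of the internal-forwarding example:
#   1. R3 <- R1+R2   2. R6 <- R4+R5   3. R7 <- R3+R4   4. R8 <- R7+R1
seq = [("R3", "R1", "R2"), ("R6", "R4", "R5"),
       ("R7", "R3", "R4"), ("R8", "R7", "R1")]
print(forwarding_needed(seq))   # [3]: only the fourth instruction needs it
```

Instruction 3's use of R3 is two instructions downstream of its producer, so the register file is already up to date; only R7, produced immediately before it is consumed, must be forwarded.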
In this example, instruction 3 (i.e., ADD R3,R4,R7) uses an operand generated by instruction 1 (i.e., the contents of register R3). Because of the intervening instruction 2, the destination operand generated by instruction 1 has time to be written into the register file before it is read as a source operand by instruction 3. Instruction 3, however, generates a destination operand, R7, that is required as a source operand by the next instruction. If the processor were to read the source operand requested by instruction 4 from the register file, it would see the old value of R7. By means of internal forwarding, the processor transfers R7 from instruction 3's execution unit directly to the execution unit of instruction 4 (see Figure 10).

Accessing External Memory in RISC Systems

Conventional CISC processors have a wealth of addressing modes that are used in conjunction with memory reference instructions. For example, the 68020 implements ADD D0,-(A5), which adds the contents of D0 to the top of the stack pointed at by A5 and then pushes the result onto this stack. In their ruthless pursuit of efficiency, the designers of the Berkeley RISC severely restricted the way in which the processor accesses external memory. The Berkeley RISC permits only two types of reference to external memory: a load and
a store. All arithmetic and logical operations carried out by the RISC apply only to source and destination operands in registers. Similarly, the Berkeley RISC provides only a limited number of addressing modes with which to access an operand in the main store. It's not hard to find the reason for these restrictions on external memory accesses: an external memory reference takes longer than an internal operation.

We now discuss some of the general principles of Berkeley RISC load and store instructions. Consider the load register operation of the form LDXW (Rx)S2,Rd, which has the effect [Rd] ← [M([Rx] + S2)]. The operand address is the contents of register Rx plus the offset S2. Figure 11 demonstrates the sequence of actions performed during the execution of this instruction. During the source fetch phase, register Rx is read from the register file and used to calculate the effective address of the operand in the execute phase. However, the processor can't progress beyond the execute phase to the store operand phase, because the operand hasn't yet been read from the main store. Therefore the main store must be accessed to read the operand, and a store operand phase executed to load the operand into the destination register Rd. Because memory accesses introduce bubbles into the pipeline, they are avoided wherever possible.

Figure 11 The load operation
The Berkeley RISC implements two basic addressing modes: indexed and program counter relative. All other addressing modes can (and must) be synthesized from these two primitives. The effective address in the indexed mode is given by:

EA = [Rx] + S2

where Rx is the index register (one of the 32 general-purpose registers accessible by the current subroutine) and S2 is an offset. The offset can be either a general-purpose register or a 13-bit constant. The effective address in the program counter relative mode is given by:

EA = [PC] + S2

where PC represents the contents of the program counter and S2 is an offset as above. These addressing modes provide quite a powerful toolbox: zero, one or two pointers and a constant offset. If you wonder how we can use an addressing mode without an index (i.e., pointer) register, remember that R0 in the global register set permanently contains the constant 0. For example, LDXW (R12)R0,R3 uses simple address register indirect addressing, whereas LDXW (R0)123,R3 uses absolute addressing (i.e., memory location 123).

There's a difference between the addressing modes permitted by load and store operations. A load instruction permits the second source, S2, to be either an immediate value or a second register, whereas a store instruction permits S2 to be a 13-bit immediate value only. This lack of symmetry between the load and store addressing modes arises because a "load base+index" instruction requires a register file with two ports, whereas a "store base+index" instruction requires a register file with three ports. Two-ported memory allows two simultaneous accesses; three-ported memory allows three simultaneous accesses and is harder to design.

Figure 1 defines just two basic Berkeley RISC instruction formats. The short immediate format provides a 5-bit destination, a 5-bit source 1 operand and a 14-bit short source 2 operand.
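The field layout of the short immediate format can be illustrated with a few bit manipulations. In the sketch below, the destination field position (bits 19 to 23) and the role of bit 13 come from the text; placing source 1 in bits 14 to 18 is an assumption made for this illustration only.

```python
# A sketch of decoding the Berkeley RISC short immediate format.
# Bit 13 selects whether source 2 is a 13-bit literal or a 5-bit
# register pointer. The source 1 position (bits 14-18) is assumed.

def decode_short_immediate(word):
    dest = (word >> 19) & 0x1F          # 5-bit destination register
    src1 = (word >> 14) & 0x1F          # 5-bit source 1 register
    if word & (1 << 13):                # bit 13 set: 13-bit literal
        src2 = ("literal", word & 0x1FFF)
    else:                               # bit 13 clear: register pointer
        src2 = ("register", word & 0x1F)
    return dest, src1, src2

# Destination R3, source 1 R12, source 2 = the literal 123
word = (3 << 19) | (12 << 14) | (1 << 13) | 123
print(decode_short_immediate(word))     # -> (3, 12, ('literal', 123))
```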
The short immediate format has two variations: one that specifies a 13-bit literal for source 2 and one that specifies a 5-bit source 2 register address. Bit 13 specifies whether the source 2 operand is a 13-bit literal or a 5-bit register pointer. The long immediate format provides a 19-bit source operand by concatenating the two source operand fields. Thirteen-bit and 19-bit immediate fields may sound a little strange at first sight. However, since 13 + 19 = 32, the Berkeley RISC permits a full
32-bit value to be loaded into a window register in two operations. In the next section we will discover that the ARM processor deals with literals in a different way. A typical CISC microprocessor might take the same number of instruction bits to perform the same action (i.e., a 32-bit operation code field followed by a 32-bit literal).

The following describes some of the addressing modes that can be synthesized from the RISC's basic addressing modes.

1. Absolute addressing: EA = 13-bit offset. Implemented by setting Rx = R0 = 0, S2 = 13-bit constant.
2. Register indirect: EA = [Rx]. Implemented by setting S2 = R0 = 0.
3. Indexed addressing: EA = [Rx] + offset. Implemented by setting S2 = 13-bit constant.
4. Two-dimensional byte addressing (i.e., byte array access): EA = [Rx] + [Ry]. Implemented by setting S2 = [Ry]. This mode is available only for load instructions.

Conditional instructions (i.e., branch operations) do not require a destination address, and therefore the five bits, 19 to 23, normally used to specify a destination register are used to specify the condition (one of 16, since bit 23 is not used by conditional instructions).

Reducing the Branch Penalty

If we're going to reduce the effect of branches on the performance of RISC processors, we need to determine the effect of branch instructions on the performance of the system. Because we cannot know how many branches a given program will contain, or how likely each branch is to be taken, we have to construct a probabilistic model to describe the system's performance. We will make the following assumptions:

1. Each non-branch instruction is executed in one cycle.
2. The probability that a given instruction is a branch is pb.
3. The probability that a branch instruction will be taken is pt.
4. If a branch is taken, the additional penalty is b cycles; if a branch is not taken, there is no penalty.

If pb is the probability that an instruction is a branch, 1 - pb is the probability that it is not a branch. The average number of cycles executed during the execution of a program is the sum of the cycles taken for non-branch instructions, plus the cycles taken by branch instructions that are taken, plus the cycles taken by branch instructions that are not taken. We can derive an expression for the average number of cycles per instruction as:

Tave = (1 - pb) × 1 + pb × pt × (1 + b) + pb × (1 - pt) × 1 = 1 + pb × pt × b.

This expression, 1 + pb × pt × b, tells us that the number of branch instructions, the probability that a branch is taken, and the overhead per branch instruction all contribute to the branch penalty. We are now going to examine some of the ways in which the value of pb × pt × b can be reduced.

Branch Prediction

If we can predict the outcome of the branch instruction before it is executed, we can start filling the pipeline with instructions from the branch target address (assuming the branch is going to be taken). For example, if the instruction is BRA N, the processor can start fetching instructions at locations N, N + 1, N + 2, etc., as soon as the branch instruction is fetched from memory. In this way, the pipeline is always filled with useful instructions. This prediction mechanism works well with an unconditional branch like BRA N. Unfortunately, conditional branches pose a problem. Consider a conditional branch of the form BCC N (branch to N on carry bit clear). Should the RISC processor assume that the branch will not be taken and fetch instructions in sequence, or should it assume that the branch will be taken and fetch instructions at the branch target address N?
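The cost model derived above is easy to check numerically. The sketch below evaluates both the full expression and its simplification; the probability and penalty values are assumptions chosen only for the example.

```python
# Average cycles per instruction with a branch penalty:
# Tave = (1 - pb)*1 + pb*pt*(1 + b) + pb*(1 - pt)*1 = 1 + pb*pt*b

def t_ave_full(pb, pt, b):
    return (1 - pb) * 1 + pb * pt * (1 + b) + pb * (1 - pt) * 1

def t_ave_simplified(pb, pt, b):
    return 1 + pb * pt * b

# Example: 20% of instructions are branches, 60% of branches are
# taken, and a taken branch costs 3 extra cycles.
print(t_ave_full(0.2, 0.6, 3))          # -> 1.36
print(t_ave_simplified(0.2, 0.6, 3))    # -> 1.36

# The two forms agree for any choice of pb, pt and b.
for pb in (0.1, 0.2, 0.3):
    for pt in (0.4, 0.6, 0.9):
        assert abs(t_ave_full(pb, pt, 2) - t_ave_simplified(pb, pt, 2)) < 1e-12
```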
As we have already said, conditional branches are required to implement various types of high-level language construct. Consider the following fragment of high-level language code.

if (J < K) I = I + L;
for (T = 1; T <= I; T++)
{
  .
  .
}

The first conditional operation compares J with K. Only the nature of the problem will tell us whether J is often less than K. The second conditional in this fragment of code is provided by the for construct that tests a counter at the end of the loop and then decides whether to jump back to the body of the construct or to terminate the loop. In this case, you could bet that the loop is more likely to be repeated than exited; loops can be executed thousands of times before they are exited. Some computers look at the type of conditional branch and then either fill the pipeline from the branch target if they predict that the branch will be taken, or fill the pipeline from the instruction after the branch if they predict that it will not be taken. If we attempt to predict the behavior of a system with two outcomes (branch taken or branch not taken), there are four possibilities:

1. Predict branch taken, branch taken: successful outcome.
2. Predict branch taken, branch not taken: unsuccessful outcome.
3. Predict branch not taken, branch not taken: successful outcome.
4. Predict branch not taken, branch taken: unsuccessful outcome.

Suppose we apply a branch penalty to each of these four possible outcomes. The penalty is the number of cycles taken by that particular outcome, as Table 3 demonstrates. For example, if we think that a branch will not be taken and fetch the instructions following the branch, and the branch is actually taken (forcing the pipeline to be loaded with instructions at the target address), the branch penalty in Table 3 is c cycles.

Table 3 The branch penalty

Prediction        | Result           | Branch penalty
Branch taken      | Branch taken     | a
Branch taken      | Branch not taken | b
Branch not taken  | Branch taken     | c
Branch not taken  | Branch not taken | d

We can now calculate the average penalty for a particular system. To do this we need more information about the system. The first thing we need to know is the probability
that an instruction will be a branch (as opposed to any other category of instruction). Assume that the probability that an instruction is a branch is pb. The next thing we need to know is the probability that the branch instruction will be taken, pt. Finally, we need to know the accuracy of the prediction. Let pc be the probability that a branch prediction is correct. These values can be obtained by observing the performance of real programs. Figure 12 illustrates all the possible outcomes of an instruction. We can immediately write:

(1 - pb) = probability that an instruction is not a branch
(1 - pt) = probability that a branch will not be taken
(1 - pc) = probability that a prediction is incorrect

These equations are obtained by using the principle that if one event or another must take place, their probabilities must add up to unity. The average branch penalty per branch instruction is therefore:

Cave = a × p(branch predicted taken and taken)
     + b × p(branch predicted taken but not taken)
     + c × p(branch predicted not taken but taken)
     + d × p(branch predicted not taken and not taken)

Cave = a × pt × pc + b × (1 - pt) × (1 - pc) + c × pt × (1 - pc) + d × (1 - pt) × pc

Figure 12 Branch prediction
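The general expression for Cave can be checked numerically. In the sketch below, the penalty and probability values are arbitrary examples rather than measurements; note that when a = d and b = c the result depends only on pc, which is the simplification derived next.

```python
# Average penalty per branch instruction (penalties a, b, c, d as in
# Table 3, pt = probability taken, pc = probability prediction correct):
# Cave = a*pt*pc + b*(1-pt)*(1-pc) + c*pt*(1-pc) + d*(1-pt)*pc

def c_ave(a, b, c, d, pt, pc):
    return (a * pt * pc
            + b * (1 - pt) * (1 - pc)
            + c * pt * (1 - pc)
            + d * (1 - pt) * pc)

# With a = d = N (correct prediction) and b = c = B (misprediction),
# the expression collapses to N*pc + B*(1 - pc), independent of pt.
N, B, pt, pc = 1, 4, 0.6, 0.9
print(c_ave(N, B, B, N, pt, pc))    # -> 1.3 (to rounding)
print(N * pc + B * (1 - pc))        # -> 1.3 (to rounding)
```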
The average number of cycles added due to a branch instruction is:

Cave × pb = pb × (a × pt × pc + b × (1 - pt) × (1 - pc) + c × pt × (1 - pc) + d × (1 - pt) × pc).

We can make two assumptions to help us to simplify this general expression. The first is that a = d = N (i.e., if the prediction is correct, the number of cycles is N). The other simplification is that b = c = B (i.e., if the prediction is wrong, the number of cycles is B). The average number of cycles per branch instruction is therefore:

pb × (N × pt × pc + B × pt × (1 - pc) + B × (1 - pt) × (1 - pc) + N × (1 - pt) × pc) = pb × (N × pc + B × (1 - pc)).

This formula can be used to investigate tradeoffs between branch penalties, branch probabilities and pipeline length. There are several ways of implementing branch prediction (i.e., increasing the value of pc). Two basic approaches are static branch
prediction and dynamic branch prediction. Static branch prediction makes the assumption that branches are always taken or never taken. Since observations of real code have demonstrated that branches have a greater than 50% chance of being taken, the best static branch prediction mechanism is to fetch the next instruction from the branch target address as soon as the branch instruction is detected. A better method of predicting the outcome of a branch is to observe its op-code, because some branch instructions are taken more or less frequently than other branch instructions. Using the branch op-code to predict that the branch will or will not be taken results in 75% accuracy. An extension of this technique is to devote a bit of the op-code to the static prediction of branches. This bit is set or cleared by the compiler, depending on whether the compiler estimates that the branch is most likely to be taken. This technique provides branch prediction accuracy in the range 74 to 94%.

Dynamic branch prediction techniques operate at runtime and use the past behavior of the program to predict its future behavior. Suppose the processor maintains a table of branch instructions. This branch table contains information about the likely behavior of each branch. Each time a branch is executed, its outcome (i.e., taken or not taken) is used to update the entry in the table. The processor uses the table to determine whether to take the next instruction from the branch target address (i.e., branch predicted taken) or from the next address in sequence (branch predicted not taken). Single-bit branch predictors provide an accuracy of over 80 percent, and five-bit predictors provide an accuracy of up to 98 percent. A typical branch prediction algorithm uses the last two outcomes of a branch to predict its future. If the last two outcomes are X, the next branch is assumed to lead to outcome X.
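One common realization of this two-outcome scheme is a two-bit saturating counter held in each branch-table entry: a single misprediction nudges the counter, but only two consecutive mispredictions flip the prediction. A minimal sketch follows; the class name and starting state are illustrative choices, not taken from any particular processor.

```python
# A sketch of a two-bit dynamic branch predictor: states 0-1 predict
# "not taken", states 2-3 predict "taken", and the prediction only
# changes after two consecutive mispredictions.

class TwoBitPredictor:
    def __init__(self, state=3):       # start strongly "taken"
        self.state = state

    def predict(self):
        return self.state >= 2         # True means "branch taken"

    def update(self, taken):           # move toward the actual outcome
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor()
outcomes = [True, True, False, True, True, True]   # one odd outcome
hits = 0
for taken in outcomes:
    hits += (p.predict() == taken)
    p.update(taken)
print(hits)   # -> 5: the single "not taken" blip costs one miss
              #       but does not flip the prediction
```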
If the prediction is wrong, it remains the same the next time the branch is executed (i.e., two failures are needed to modify the prediction). After two consecutive failures, the prediction is inverted and the other outcome assumed. This algorithm responds to trends and is not affected by the occasional single different outcome.

Problems

1. What are the characteristics of a CISC processor?
2. The most frequently executed class of instruction is the data move instruction. Why is this?
3. The Berkeley RISC has a 32-bit architecture and yet provides only a 13-bit literal. Why is this, and does it really matter?
4. What are the advantages and disadvantages of register windowing?
5. What is pipelining and how does it increase the performance of a computer?
6. A pipeline is defined by its length (i.e., the number of stages that can operate in parallel). A pipeline can be short or long. What do you think are the relative advantages of long and short pipelines?
7. What is data dependency in a pipelined system and how can its effects be overcome?
8. RISC architectures don't permit operations on operands in memory other than load and store operations. Why?
9. The average number of cycles required by a RISC to execute an instruction is given by Tave = 1 + pb × pt × b, where:
   the probability that a given instruction is a branch is pb;
   the probability that a branch instruction will be taken is pt;
   if a branch is taken, the additional penalty is b cycles;
   if a branch is not taken, there is no penalty.
   Draw a series of graphs of the average number of cycles per instruction as a function of pb × pt for b = 1, 2, 3, and 4.
10. What is branch prediction and how can it be used to reduce the so-called branch penalty in a pipelined system?