Unit - 5 Pipelining.pptx
1. MEDICAPS UNIVERSITY
UNIT - 5
Course Code: IT3CO20
Course Name: Computer System Architecture
Hours Per Week (L T P): 3 1 2
Total Credits: 5
FACULTY OF ENGINEERING
Department of Information Technology
2. Syllabus
•Parallel Processing, Pipeline Processing, Instruction
and Arithmetic Pipeline, Pipeline hazards and their
resolution,
•Vector Processing – vector operations, memory
interleaving, matrix multiplication, Supercomputers,
•Array Processors – attached and SIMD array
processors
3. Parallel Processing
• Parallel processing can be described as a class of techniques
which enables the system to achieve simultaneous data-
processing tasks to increase the computational speed of a
computer system.
• A parallel processing system can carry out simultaneous data-
processing to achieve faster execution time. For instance, while
an instruction is being processed in the ALU component of the
CPU, the next instruction can be read from memory.
• The primary purpose of parallel processing is to enhance the
computer processing capability and increase its throughput.
4. Parallel Processing
• The term parallel processing indicates that the system is able to
perform several operations at the same time.
• To elaborate with a scenario: a CPU has only one Accumulator,
which stores the result obtained from the current operation.
• If we issue a single command such as "a+b", the CPU performs
the operation and stores the result in the accumulator.
• In parallel processing, however, we issue two instructions "a+b"
and "c-d" at the same time,
5. Parallel Processing
• now if the result of the "a+b" operation is stored in the accumulator,
then the "c-d" result cannot be stored in the accumulator at the same time.
• Therefore parallel processing is not only a matter of the
arithmetic, logic, or shift operations themselves.
• The above problem can be solved in the following manner:
• Consider registers R1 and R2, which store the operands before the
operation, and register R3, which stores the result after the
operation.
• The two instructions "a+b" and "c-d" are then executed in
parallel as follows.
6. Parallel Processing
• Values of "a" and "b" are fetched into the registers R1 and R2
• The values of R1 and R2 will be sent into the ALU unit to
perform the addition.
• The result will be stored in the Accumulator.
• When the ALU unit is performing the calculation, the next data
“c” and “d” are brought into R1 and R2.
• Finally the value of Accumulator obtained from “a+b” will be
transferred into the R3.
7. Parallel Processing
• Next the values of "c" and "d" from R1 and R2 are brought into
the ALU to perform the "c-d" operation.
• Since the accumulator value of the previous operation is present
in R3, the result of "c-d" can be safely stored in the Accumulator.
• This is the process of parallel processing within a single CPU.
• Now consider several such CPUs performing calculations
separately.
• This is the concept of parallel processing.
9. Parallel Processing
• In the above figure we can see that the data stored in the processor registers is
sent to separate units based on the operation needed on the data.
• If the data in the processor registers requires an arithmetic operation, it is sent
to the arithmetic unit; if at the same time other data is needed by the logic unit,
that data is sent to the logic unit for logical operations.
• Arithmetic operations and logical operations then execute in parallel at the
same time. This is called parallel processing.
• Instruction Stream: The sequence of instructions read from memory is called
an Instruction Stream.
• Data Stream: The operations performed on the data in the processor are called
a Data Stream.
10. Parallel Processing
• A system may have two or more processors operating
concurrently.
• For example, the arithmetic, logic, and shift operations can be
separated into three units and the operands diverted to each
unit under the supervision of a control unit.
• Parallel processing is established by distributing the data among
the multiple functional units.
• In this chapter we consider parallel processing under the
following main topics:
• 1. Pipeline processing 2. Vector processing 3. Array processors
11. [figure]
12. Pipelining
• A program consists of a number of instructions.
• These instructions may be executed in the following two ways-
• 1) Non-Pipelined Execution 2) Pipelined Execution
• 1) Non-Pipelined Execution
• All the instructions of a program are executed sequentially one
after the other.
• A new instruction executes only after the previous instruction
has executed completely.
• This style of executing the instructions is highly inefficient.
13. 1) Non-Pipelined Execution
• Consider a program consisting of three instructions.
• In a non-pipelined architecture, these instructions execute one
after the other as-
• If time taken for executing one instruction = t, then-
• Time taken for executing ‘n’ instructions = n x t
15. 2) Pipelined Execution
• In a pipelined architecture, multiple instructions are executed
in parallel.
• This style of executing the instructions is highly efficient.
• A pipelined processor does not wait until the previous
instruction has executed completely.
• Rather, it fetches the next instruction and begins its execution.
• The hardware of the CPU is split up into several functional units.
• Each functional unit performs a dedicated task.
17. 2) Pipelined Execution
• The number of functional units may vary from processor to processor.
• These functional units are called as stages of the pipeline.
• Control unit manages all the stages using control signals.
• There is a register associated with each stage that holds the data.
• There is a global clock that synchronizes the working of all the stages.
• At the beginning of each clock cycle, each stage takes the input from
its register.
• Each stage then processes the data and feeds its output to the
register of the next stage.
18. [figure]
19. Pipelining
• Pipelining is a technique of decomposing a sequential process
into sub-operations, with each subprocess being executed in a
special dedicated segment that operates concurrently with all
other segments.
• A pipeline can be visualized as a collection of processing
segments through which binary information flows.
• Each segment performs partial processing dictated by the way
the task is partitioned.
• The result obtained from the computation in each segment is
transferred to the next segment in the pipeline.
20. Pipelining
• The final result is obtained after the data have passed
through all segments.
• It is characteristic of pipelines that several computations can
be in progress in distinct segments at the same time.
• The overlapping of computation is made possible by
associating a register with each segment in the pipeline.
• The registers provide isolation between each segment so
that each can operate on distinct data simultaneously.
21. Pipelining
• A clock is applied to all registers after enough time has
elapsed to perform all segment activity.
• Suppose that we want to perform the combined multiply
and add operations with a stream of numbers. Ai * Bi + Ci
for i = 1,2,3,4,5,6,7
• Each sub-operation is to be implemented in a segment
within a pipeline.
• Each segment has one or two registers and a combinational
circuit
22. Pipelining
• R1 through R5 are registers that receive new data with
every clock pulse.
• The multiplier and adder are combinational circuits. The
sub-operations performed in each segment of the
pipeline are as follows:
• R1 <-- Ai, R2 <-- Bi          Input Ai and Bi
• R3 <-- R1 * R2, R4 <-- Ci    Multiply and input Ci
• R5 <-- R3 + R4               Add Ci to product
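The register transfers above can be simulated clock by clock. This is an illustrative sketch only (the sample values for Ai, Bi, Ci are made up); registers are updated from the last segment backwards so that each segment reads the values latched on the previous clock pulse.

```python
# Sketch of the three-segment multiply-add pipeline: R5 = Ai * Bi + Ci.
# Sample operand streams (made-up values, i = 1..7 as in the slides).
A = [1, 2, 3, 4, 5, 6, 7]
B = [7, 6, 5, 4, 3, 2, 1]
C = [10, 20, 30, 40, 50, 60, 70]

R1 = R2 = R3 = R4 = R5 = None
results = []

for clock in range(len(A) + 2):          # 2 extra cycles to drain the pipe
    # Segment 3: R5 <- R3 + R4 (add Ci to the product)
    if R3 is not None:
        R5 = R3 + R4
        results.append(R5)
    else:
        R5 = None
    # Segment 2: R3 <- R1 * R2, R4 <- Ci (multiply and input Ci)
    if R1 is not None:
        R3, R4 = R1 * R2, C[clock - 1]
    else:
        R3 = R4 = None
    # Segment 1: R1 <- Ai, R2 <- Bi (input the next pair of operands)
    if clock < len(A):
        R1, R2 = A[clock], B[clock]
    else:
        R1 = R2 = None

print(results)   # [17, 32, 45, 56, 65, 72, 77]
```

The first result appears only after the pipe fills (two extra clocks); after that one result emerges per clock pulse, which is exactly the overlap the slides describe.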
23. [figure]
24. [figure]
25. Pipelining
• The five registers are loaded with new data at every clock
pulse.
• Any operation that can be decomposed into a sequence of
sub-operations of about the same complexity can be
implemented by a pipeline processor.
• The technique is efficient for those applications that need to
repeat the same task many times with different sets of data.
• The operands pass through all four segments in a fixed
sequence
26. Pipelining
• Each segment consists of a combinational circuit Si that
performs a sub-operation over the data stream flowing
through the pipe.
• The segments are separated by registers Ri that hold the
intermediate results between the stages.
• Information flows between adjacent stages under the
control of a common clock applied to all the registers
simultaneously.
• The behavior of a pipeline can be illustrated with a space-
time diagram.
27. Four-Stage Pipeline-
• In a four-stage pipelined architecture, the execution of each
instruction is completed in the following 4 stages-
1. Instruction fetch (IF)
2. Instruction decode (ID)
3. Instruction Execute (IE)
4. Write back (WB)
• The hardware of the CPU is divided into four functional
units.
• Each functional unit performs a dedicated task.
30. Pipelining
• Now consider the case where a k-segment pipeline with a
clock cycle time tp is used to execute n tasks.
• The first task T1 requires a time equal to k*tp to complete its
operation since there are k segments in the pipe.
• The remaining n - 1 tasks emerge from the pipe at the rate
of one task per clock cycle and they will be completed after
a time equal to (n - 1)*tp.
• Therefore, to complete n tasks using a k-segment pipeline
requires k + (n - 1) clock cycles.
31. Pipelining
• For example, the diagram shows four segments and six
tasks.
• The time required to complete all the operations is 4 + (6 - 1)
= 9 clock cycles, as indicated in the diagram.
•Next consider a non-pipeline unit that performs the
same operation and takes a time equal to tn to complete
each task.
•The total time required for n tasks is n*tn.
32. Pipelining
• The speedup of pipeline processing over an equivalent non-
pipeline processing is defined by the ratio:
S = (n * tn) / ((k + n - 1) * tp)
• For a large number of tasks, clocks per instruction (CPI) is
nearly equal to 1.
• Efficiency (Utilization) = Used blocks / Total blocks
• Speedup = Non-pipelined time / Pipelined time
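The k + (n - 1) cycle count and the speedup ratio can be wrapped in two small helpers (function names are assumed, not from the slides):

```python
# Pipeline timing helpers for a k-segment pipeline with clock period tp,
# versus a non-pipelined unit that takes tn per task.
def pipeline_time(n, k, tp):
    # First task needs k cycles to fill the pipe; the remaining n - 1
    # tasks emerge at one per clock cycle.
    return (k + (n - 1)) * tp

def speedup(n, k, tp, tn):
    # Ratio of non-pipelined time to pipelined time.
    return (n * tn) / pipeline_time(n, k, tp)

# Six tasks through a four-segment pipeline, as in the slide example:
print(pipeline_time(6, 4, 1))            # 9 clock cycles
# 1000 tasks, 4 stages, tp = 100 ns, tn = 280 ns (the Q.1 numbers):
print(round(speedup(1000, 4, 100, 280), 2))
```

Note that for small n the speedup is well below tn/tp, because the fill time k*tp is not yet amortized.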
33. Arithmetic Pipelining
• Arithmetic Pipelines are mostly used in high-speed
computers.
• They are used to implement floating-point operations,
multiplication of fixed-point numbers, and similar
computations encountered in scientific problems.
• Floating-point operations are easily decomposed into sub-
operations.
• We will now show an example of a pipeline unit for floating-
point addition and subtraction.
34. Arithmetic Pipelining
• The inputs to the floating-point adder pipeline are two
normalized floating-point binary numbers.
X = A x 2^a
Y = B x 2^b
• A and B are two fractions that represent the mantissas and a
and b are the exponents.
• The floating-point addition and subtraction can be
performed in four segments.
• The registers labeled R are placed between the segments to
store intermediate results.
35. Arithmetic Pipelining
• The sub-operations that are performed in the four segments
are:
• 1. Compare the exponents.
• 2. Align the mantissas.
• 3. Add or subtract the mantissas.
• 4. Normalize the result.
• The exponents are compared by subtracting them to
determine their difference.
• The larger exponent is chosen as the exponent of the result.
36. Arithmetic Pipelining
• The exponent difference determines how many times the
mantissa associated with the smaller exponent must be shifted
to the right.
• This produces an alignment of the two mantissas.
• It should be noted that the shift must be designed as a
combinational circuit to reduce the shift time.
• The two mantissas are added or subtracted in segment 3.
• The result is normalized in segment 4.
• When an overflow occurs, the mantissa of the sum or difference
is shifted right and the exponent incremented by one.
37. Arithmetic Pipelining
• If an underflow occurs, the number of leading zeros in the
mantissa determines the number of left shifts in the
mantissa and the number that must be subtracted from the
exponent.
• X = 0.9504 x 10^3
• Y = 0.8200 x 10^2
• The two exponents are subtracted in the first segment to
obtain 3 - 2 = 1.
• The larger exponent 3 is chosen as the exponent of the
result.
38. Arithmetic Pipelining
• The next segment shifts the mantissa of Y to the right to
obtain
• X = 0.9504 x 10^3
• Y = 0.0820 x 10^3
• This aligns the two mantissas under the same exponent.
• The addition of the two mantissas in segment 3 produces
the sum
• Z = 1.0324 x 10^3
39. Arithmetic Pipelining
• The sum is adjusted by normalizing the result so that it has a
fraction with a nonzero first digit.
• This is done by shifting the mantissa once to the right and
incrementing the exponent by one to obtain the normalized
sum.
• Z = 0.10324 x 10^4
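The four segments can be sketched as one function (a decimal toy model, not real hardware; the while-loops stand in for the combinational shifter):

```python
# Toy model of the four-segment floating-point adder:
# numbers are (mantissa, exponent) pairs meaning m x 10^e.
def fp_add(ma, ea, mb, eb):
    # Segment 1: compare the exponents; the larger one becomes the
    # exponent of the result.
    if ea < eb:
        ma, ea, mb, eb = mb, eb, ma, ea
    diff = ea - eb
    # Segment 2: align the mantissa of the smaller exponent by
    # shifting it right `diff` digit positions.
    mb = mb / (10 ** diff)
    # Segment 3: add the mantissas.
    m = ma + mb
    # Segment 4: normalize so the fraction has a nonzero first digit;
    # on overflow shift right and increment the exponent.
    while m >= 1.0:
        m /= 10
        ea += 1
    while 0 < m < 0.1:
        m *= 10
        ea -= 1
    return m, ea

# The slide example: X = 0.9504 x 10^3, Y = 0.8200 x 10^2
m, e = fp_add(0.9504, 3, 0.8200, 2)
print(m, e)   # approximately 0.10324 x 10^4
```

Binary floats introduce tiny rounding noise here, which is why the result is "approximately" 0.10324; the hardware in the slides works on fixed-width mantissas instead.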
40. [figure]
41. [figure]
42. [figure]
43. Instruction Pipelining
• Pipeline processing can occur not only in the data stream
but in the instruction stream as well.
• Most of the digital computers with complex instructions
require instruction pipeline to carry out operations like
fetch, decode and execute instructions.
• Computers with complex instructions require other phases
in addition to the fetch and execute to process an
instruction completely.
• In the most general case, the computer needs to process
each instruction with the following sequence of steps-
44. Instruction Pipelining
1. Fetch the instruction from memory.
2. Decode the instruction.
3. Calculate the effective address.
4. Fetch the operands from memory.
5. Execute the instruction.
6. Store the result in the proper place.
45. 4 – Segment Instruction Pipelining
• While an instruction is being executed in segment 4, the next
instruction in sequence is busy fetching an operand from
memory in segment 3.
• The effective address may be calculated in a separate
arithmetic circuit for the third instruction, and whenever the
memory is available, the fourth and all subsequent
instructions can be fetched and placed in an instruction FIFO.
• Thus up to four sub-operations in the instruction cycle can
overlap and up to four different instructions can be in progress
of being processed at the same time.
46. [figure]
47. 4 – Segment Instruction Pipelining
• The time in the horizontal axis is divided into steps of equal
duration.
• The four segments are represented in the diagram with an
abbreviated symbol.
• 1. FI is the segment that fetches an instruction.
• 2. DA is the segment that decodes the instruction and
calculates the effective address.
• 3. FO is the segment that fetches the operand.
• 4. EX is the segment that executes the instruction.
48. 4 – Segment Instruction Pipelining
• It is assumed that the processor has separate instruction and
data memories so that the operations in FI and FO can
proceed at the same time.
• In the absence of a branch instruction, each segment operates
on different instructions.
• Thus, in step 4, instruction 1 is being executed in segment
EX; the operand for instruction 2 is being fetched in
segment FO; instruction 3 is being decoded in segment DA;
and instruction 4 is being fetched from memory in segment
FI.
49. 4 – Segment Instruction Pipelining
• Assume now that instruction 3 is a branch instruction.
• As soon as this instruction is decoded in segment DA in step
4, the transfer from FI to DA of the other instructions is
halted until the branch instruction is executed in step 6.
• If the branch is taken, a new instruction is fetched in step 7.
• If the branch is not taken, the instruction fetched previously
in step 4 can be used.
• The pipeline then continues until a new branch instruction is
encountered.
50. 4 – Segment Instruction Pipelining
• Another delay may occur in the pipeline if the EX segment
needs to store the result of the operation in the data
memory while the FO segment needs to fetch an operand.
• In that case, segment FO must wait until segment EX has
finished its operation.
• In general, there are three major difficulties that cause the
instruction pipeline to deviate from its normal operation.
1. Resource conflicts caused by access to memory by two
segments at the same time. Most of these conflicts can be
resolved by using separate instruction and data memories.
51. 4 – Segment Instruction Pipelining
2. Data dependency conflicts arise when an instruction
depends on the result of a previous instruction, but this result
is not yet available.
3. Branch difficulties arise from branch and other instructions
that change the value of PC.
53. Pipelining Numerical
Q.1 Consider a pipeline having 4 phases with duration 60, 50,
90 and 80 ns. Given latch delay is 10 ns. Calculate-
1. Pipeline cycle time
2. Non-pipeline execution time
3. Speed up ratio
4. Pipeline time for 1000 tasks
5. Sequential time for 1000 tasks
54. 1: Pipeline Cycle Time-
Cycle time = Maximum delay due to any stage + Delay due to
its register
= Max { 60, 50, 90, 80 } + 10 ns
= 90 ns + 10 ns
= 100 ns
2: Non-Pipeline Execution Time-
Non-pipeline execution time for one instruction
= 60 ns + 50 ns + 90 ns + 80 ns
= 280 ns
55. 3: Speed Up Ratio-
Speed up = Non-pipeline execution time / Pipeline execution
time
= 280 ns / Cycle time
= 280 ns / 100 ns
= 2.8
4: Pipeline Time For 1000 Tasks-
Pipeline time for 1000 tasks
= Time taken for 1st task + Time taken for remaining 999
tasks
= 1 x 4 clock cycles + 999 x 1 clock cycle
56. = 4 x cycle time + 999 x cycle time
= 4 x 100 ns + 999 x 100 ns
= 400 ns + 99900 ns
= 100300 ns
5: Sequential Time For 1000 Tasks-
Non-pipeline time for 1000 tasks
= 1000 x Time taken for one task
= 1000 x 280 ns
= 280000 ns
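The Q.1 arithmetic can be double-checked with a short script (variable names are illustrative):

```python
# Quick check of the Q.1 pipeline numbers.
stages = [60, 50, 90, 80]        # stage delays in ns
latch = 10                       # latch (register) delay in ns
n = 1000                         # number of tasks
k = len(stages)                  # number of pipeline stages

cycle = max(stages) + latch      # pipeline cycle time
non_pipe = sum(stages)           # non-pipelined time per task

print(cycle)                     # 100 ns
print(non_pipe)                  # 280 ns
print(non_pipe / cycle)          # 2.8  (speedup ratio)
print((k + n - 1) * cycle)       # 100300 ns for 1000 pipelined tasks
print(n * non_pipe)              # 280000 ns sequentially
```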
57. Q.2 Consider a 4-stage pipeline processor. The number of
cycles needed by the four instructions I1, I2, I3 and I4 in
stages IF, ID, EX and WB is shown below. How many clock
cycles will be required to complete these 4 instructions in the
given pipeline system?
      IF   ID   EX   WB
I1    1    3    2    1
I2    2    2    3    1
I3    1    1    1    1
I4    2    1    1    2
59. Q.3 Consider a 4-stage pipeline processor. The number of
cycles needed by the four instructions I1, I2, I3 and I4 in
stages S1, S2, S3 and S4 is shown below. What is the number
of cycles needed to execute the following loop?
for (i=1 to 2) { I1; I2; I3; I4; }
      S1   S2   S3   S4
I1    2    1    1    1
I2    1    3    2    2
I3    2    1    1    3
I4    1    2    2    2
60. From here, number of clock cycles required to execute the
loop = 23 clock cycles.
61. Pipeline hazards and their resolution
There are mainly three types of Hazards possible in a
pipelined processor.
1) Structural Hazards
2) Control Hazards
3) Data Hazards
These dependencies (Hazards) may introduce stalls in the
pipeline.
Stall : A stall is a cycle in the pipeline without new input.
62. Structural Hazards
This dependency arises due to a resource conflict in the
pipeline. A resource conflict is a situation when more than
one instruction tries to access the same resource in the same
cycle. A resource can be a register, memory, or the ALU.
Instruction/Cycle   1        2        3        4        5
I1                  IF(Mem)  ID       EX       Mem
I2                           IF(Mem)  ID       EX       Mem
I3                                    IF(Mem)  ID       EX
I4                                             IF(Mem)  ID
63. Structural Hazards
• In the previous scenario, in cycle 4, instructions I1 and I4 are trying
to access the same resource (memory), which introduces a resource
conflict.
• To avoid this problem, we have to keep the instruction on wait
until the required resource (memory in our case) becomes
available.
Cycle   1        2        3        4        5        6        7        8
I1      IF(Mem)  ID       EX       Mem
I2               IF(Mem)  ID       EX       Mem
I3                        IF(Mem)  ID       EX       Mem
I4                                 -        -        -        IF(Mem)  ID
64. Structural Hazards
• Solution for structural dependency
To minimize structural dependency stalls in the pipeline, we
use a hardware mechanism called Renaming.
• Renaming : According to renaming, we divide the memory
into two independent modules used to store the instruction
and data separately called Code memory(CM) and Data
memory(DM) respectively.
• CM will contain all the instructions and DM will contain all
the operands that are required for the instructions.
65. Structural Hazards
Inst./Cycle   1       2       3       4       5       6       7
I1            IF(CM)  ID      EX      DM      WB
I2                    IF(CM)  ID      EX      DM      WB
I3                            IF(CM)  ID      EX      DM      WB
I4                                    IF(CM)  ID      EX      DM
I5                                            IF(CM)  ID      EX
I6                                                    IF(CM)  ID
I7                                                            IF(CM)
66. Control Dependency (Branch Hazards)
• This type of dependency occurs during the transfer of
control instructions such as BRANCH, CALL, JMP, etc.
• On many instruction architectures, the processor will not
know the target address of these instructions when it needs
to insert the new instruction into the pipeline.
• Due to this, unwanted instructions are fed to the pipeline.
• NOTE: Generally, the target address of the JMP instruction
is known only after the ID stage.
67. Control Dependency (Branch Hazards)
• Consider the following sequence of instructions in the
program:
100: I1
101: I2 (JMP 250)
102: I3
.
.
250: BI1
• Expected output: I1 -> I2 -> BI1
• Output Sequence: I1 -> I2 -> I3 -> BI1
68. Control Dependency (Branch Hazards)
Inst./Cycle   1        2        3        4        5        6
I1            IF(Mem)  ID       EX       Mem
I2                     IF(Mem)  ID       EX       Mem
I3                              IF(Mem)  ID       EX       Mem
BI1                                      IF(Mem)  ID       EX
So, the output sequence is not equal to the expected output,
that means the pipeline is not implemented correctly.
69. Control Dependency (Branch Hazards)
Inst./Cycle   1        2        3        4        5        6
I1            IF(Mem)  ID       EX       Mem
I2                     IF(Mem)  ID       EX       Mem
DELAY                           -
BI1                                      IF(Mem)  ID       EX
To correct the above problem we need to stop the instruction
fetch until we get the target address of the branch instruction.
This can be implemented by introducing a delay slot until we
get the target address.
70. Control Dependency (Branch Hazards)
• Output Sequence: I1 -> I2 -> Delay (Stall) -> BI1. As the delay
slot performs no operation, this output sequence is equal to
the expected output sequence. But this slot introduces a stall
in the pipeline.
• Solution for control dependency: Branch Prediction is the
method through which stalls due to control dependency can
be eliminated. In this method, a prediction about which way
the branch will go is made at the first stage. When the
prediction is correct, the branch penalty is zero.
• Branch penalty: The number of stalls introduced during
branch operations in the pipelined processor is known as the
branch penalty.
71. Control Dependency (Branch Hazards)
• NOTE : As we see that the target address is available after
the ID stage, so the number of stalls introduced in the
pipeline is 1.
• Suppose, the branch target address would have been
present after the ALU stage, there would have been 2 stalls.
Generally, if the target address is present after the kth stage,
then there will be (k – 1) stalls in the pipeline.
• Total number of stalls introduced in the pipeline due to
branch instructions = Branch frequency * Branch Penalty
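As a worked illustration of the formula above (the 20% branch frequency is an assumed figure, not from the slides):

```python
# Average stall cycles contributed per instruction by branches:
# stalls = branch frequency * branch penalty.
branch_frequency = 0.20   # assumption: 20% of instructions are branches
branch_penalty = 1        # target known after ID -> 1 stall per branch

stalls_per_instruction = branch_frequency * branch_penalty
print(stalls_per_instruction)   # 0.2 stall cycles per instruction on average
```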
72. Data Hazards
• Data hazards occur when instructions that exhibit data
dependence modify data in different stages of a pipeline.
Hazards cause delays in the pipeline. There are mainly three
types of data hazards:
• 1) RAW (Read after Write) [Flow/True data dependency]
2) WAR (Write after Read) [Anti-data dependency]
3) WAW (Write after Write) [Output data dependency]
• WAR and WAW hazards occur during the out-of-order
execution of the instructions.
73. Data Hazards - RAW
• RAW hazard occurs when instruction I2 tries to read data
before instruction I1 writes it. Eg:
I1: R2 <- R1 + R3
I2: R4 <- R2 + R3
• Domain (I1)= R1,R3
• Domain (I2)= R2,R3
• Range (I1)= R2
• Range (I2)= R4
• Range (I1) ∩ Domain (I2) ≠ ɸ
74. Data Hazards - RAW
Cycle/Inst.   1     2     3     4     5
I1            IF/D  OF    EX    WB
I2                  IF/D  OF    EX    WB

With stalls inserted so that I2 fetches its operand only after I1
has written R2:
Cycle/Inst.   1     2     3     4     5     6     7     8
I1            IF/D  OF    EX    WB
I2            -     -     -     IF/D  OF    EX    WB
75. Data Hazards - WAR
• WAR hazard occurs when instruction I2 tries to write data
before instruction I1 reads it. Eg:
I1: R2 <- R1 + R3
I2: R3 <- R4 + R5
• Domain (I1)= R1,R3
• Domain (I2)= R4,R5
• Range (I1)= R2
• Range (I2)= R3
• Domain (I1) ∩ Range (I2) ≠ ɸ
76. Data Hazards - WAW
• WAW hazard occurs when instruction I2 tries to write output
before instruction I1 writes it. Eg:
I1: R2 <- R1 + R3
I2: R2 <- R4 + R5
• Domain (I1)= R1,R3
• Domain (I2)= R4,R5
• Range (I1)= R2
• Range (I2)= R2
• Range(I1) ∩ Range (I2) ≠ ɸ
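The domain/range tests used in the three slides above amount to set intersections, which can be sketched as follows (function name hypothetical):

```python
# Classify the hazard between two instructions from their
# source (domain) and destination (range) register sets.
def hazards(domain1, range1, domain2, range2):
    found = []
    if range1 & domain2:
        found.append("RAW")   # I2 reads what I1 writes
    if domain1 & range2:
        found.append("WAR")   # I2 writes what I1 reads
    if range1 & range2:
        found.append("WAW")   # both instructions write the same place
    return found

# I1: R2 <- R1 + R3,  I2: R4 <- R2 + R3  (the RAW example above)
print(hazards({"R1", "R3"}, {"R2"}, {"R2", "R3"}, {"R4"}))   # ['RAW']
# I1: R2 <- R1 + R3,  I2: R3 <- R4 + R5  (the WAR example)
print(hazards({"R1", "R3"}, {"R2"}, {"R4", "R5"}, {"R3"}))   # ['WAR']
# I1: R2 <- R1 + R3,  I2: R2 <- R4 + R5  (the WAW example)
print(hazards({"R1", "R3"}, {"R2"}, {"R4", "R5"}, {"R2"}))   # ['WAW']
```

Each non-empty intersection corresponds exactly to one of the "≠ ɸ" conditions in the slides.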
77. Array Processor
• An array processor is a processor that performs computations
on large arrays of data.
• The term is used to refer to two different types of processors.
• An attached array processor is an auxiliary processor
attached to a general-purpose computer.
• It is intended to improve the performance of the host
computer in specific numerical computation tasks.
• An SIMD array processor is a processor that has a single-
instruction multiple-data organization.
78. Array Processor
• It manipulates vector instructions by means of multiple
functional units responding to a common instruction.
• Although both types of array processors manipulate vectors,
their internal organization is different.
• The system with the attached processor satisfies the needs
for complex arithmetic applications.
• The objective of the attached array processor is to provide
vector manipulation capabilities to a conventional computer
at a fraction of the cost of supercomputers.
79. Attached Array Processor
• An attached array processor is designed as a peripheral for a
conventional host computer, and its purpose is to enhance
the performance of the computer by providing vector
processing for complex scientific applications.
• It achieves high performance by means of parallel processing
with multiple functional units.
• It includes an arithmetic unit containing one or more
pipelined floating-point adders and multipliers.
• The array processor can be programmed by the user to
accommodate a variety of complex arithmetic problems.
80. Attached Array Processor
• The host computer is a general-purpose commercial
computer and the attached processor is a back-end machine
driven by the host computer.
• The array processor is connected through an input-output
controller to the computer and the computer treats it like an
external interface.
• The data for the attached processor are transferred from
main memory to a local memory through a high-speed bus.
• The general-purpose computer without the attached
processor serves the users that need conventional data
processing.
82. SIMD Array Processor
• An SIMD array processor is a computer with multiple
processing units operating in parallel.
• The processing units are synchronized to perform the same
operation under the control of a common control unit, thus
providing a single instruction stream, multiple data stream
(SIMD) organization.
• It contains a set of identical processing elements (PEs), each
having a local memory M.
• Each processor element includes an ALU, a floating-point
arithmetic unit, and working registers.
83. SIMD Array Processor
• The master control unit controls the operations in the
processor elements.
• The main memory is used for storage of the program.
• The function of the master control unit is to decode the
instructions and determine how the instruction is to be
executed.
• Scalar and program control instructions are directly executed
within the master control unit. Vector instructions are
broadcast to all PEs simultaneously.
• Each PE uses operands stored in its local memory.
84. [figure]
85. SIMD Array Processor
• Consider, for example, the vector addition C = A + B.
• The master control unit first stores the ith components ai and
bi of A and B in local memory M, for i = 1, 2, 3, . . . , n.
• It then broadcasts floating-point add instruction ci = ai + bi to
all PEs, causing the addition to take place simultaneously.
• The components of ci are stored in fixed locations in each
local memory.
• This produces the desired vector sum in one add cycle.
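A minimal sketch of this SIMD add, with Python dictionaries standing in for each PE's local memory (illustrative only; real PEs execute the broadcast instruction in the same cycle rather than in a loop):

```python
# SIMD-style vector add C = A + B with one "processing element" per
# component. Sample values are made up.
A = [1.0, 2.0, 3.0, 4.0]
B = [0.5, 0.5, 0.5, 0.5]

# Master control unit stores a_i and b_i in PE i's local memory M.
local = [{"a": a, "b": b} for a, b in zip(A, B)]

# One broadcast "add" instruction: every PE performs c_i = a_i + b_i
# on the operands in its own local memory.
for mem in local:
    mem["c"] = mem["a"] + mem["b"]

# The components of C sit in a fixed location of each local memory.
C = [mem["c"] for mem in local]
print(C)   # [1.5, 2.5, 3.5, 4.5]
```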
86. SIMD Array Processor
• Masking schemes are used to control the status of each PE
during the execution of vector instructions.
• Each PE has a flag that is set when the PE is active and reset
when the PE is inactive.
• This ensures that only those PEs that need to participate are
active during the execution of the instruction.
• For example, suppose that the array processor contains a set
of 64 PEs. If a vector length of less than 64 data items is to be
processed the control unit selects the proper number of PEs
to be active.
87. Vector Processing
• There is a class of computational problems that are beyond
the capabilities of a conventional computer.
• These problems are characterized by the fact that they
require a vast number of computations that will take a
conventional computer days or even weeks to complete.
• Computers with vector processing capabilities are in demand
in specialized applications.
• The following are representative application areas where
vector processing is of the utmost importance.
88. Vector Processing - Applications
• Long-range weather forecasting
• Petroleum explorations
• Medical diagnosis
• Aerodynamics and space flight simulations
• Artificial intelligence and expert systems
• Image processing
89. Vector Operations
• Many scientific problems require arithmetic operations on
large arrays of numbers.
• These numbers are usually formulated as vectors and
matrices of floating-point numbers.
• A vector is an ordered set of a one-dimensional array of data
items.
• A vector V of length n is represented as a row vector by V =
[V1 V2 V3 · · · Vn]. It may be represented as a column vector if
the data items are listed in a column.
• A conventional sequential computer is capable of processing
operands one at a time.
90. Vector Operations
• Consequently, operations on vectors must be broken down
into single computations with subscripted variables.
• The element Vi of vector V is written as V(I) and the index I
refers to a memory address or register where the number is
stored.
• To examine the difference between a conventional scalar
processor and a vector processor, consider the following
Fortran DO loop:
DO 20 I = 1, 100
20 C(I) = B(I) + A(I)
91. Vector Operations
• This is a program for adding two vectors A and B of length 100 to
produce a vector C. This is implemented in machine language by
the following sequence of operations.
   Initialize I = 0
20 Read A(I)
   Read B(I)
   Store C(I) = A(I) + B(I)
   Increment I = I + 1
   If I <= 100 go to 20
   Continue
92. Vector Operations
• This constitutes a program loop that reads a pair of operands
from arrays A and B and performs a floating-point addition.
• The loop control variable is then updated and the steps
repeat 100 times.
• A computer capable of vector processing eliminates the
overhead associated with the time it takes to fetch and
execute the instructions in the program loop.
• It allows operations to be specified with a single vector
instruction of the form
C(1:100) = A(1:100) + B(1:100)
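The difference can be mimicked in Python (illustrative; real vector hardware executes the whole operation as a single instruction, not as a comprehension):

```python
# Scalar loop versus "vector" form of C = A + B for 100 elements.
A = list(range(100))          # made-up operand vectors
B = list(range(100, 200))

# Scalar form: mirrors the machine-language loop, one element per
# iteration with loop-control overhead (read, add, store, increment, test).
C_loop = [0] * 100
for i in range(100):
    C_loop[i] = A[i] + B[i]

# Vector form: mirrors C(1:100) = A(1:100) + B(1:100), the whole
# operation specified at once.
C_vec = [a + b for a, b in zip(A, B)]

print(C_loop == C_vec)   # True: same result, without per-element control flow
```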
94. Vector Instruction
• The vector instruction includes the initial address of the
operands, the length of the vectors, and the operation to be
performed, all in one composite instruction.
• This is essentially a three-address instruction with three fields
specifying the base address of the operands and an additional
field that gives the length of the data items in the vectors.
This assumes that the vector operands reside in memory.
95. Vector Instruction
1. Operation Code: indicates the operation that has to be
performed in the given instruction. It decides the functional unit for
the specified operation or reconfigures the multifunction unit.
2. Base Address: refers to the memory location from where the
operands are to be fetched or to where the result has to be stored.
The base address is found in the memory reference instructions. When
a vector instruction keeps the operands and the result in vector
registers, the base address refers to the designated vector register.
3. Vector Length: specifies the number of elements in a vector
operand. It identifies the termination of a vector instruction.
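The fields described above could be modeled as a simple record (all field names, the opcode mnemonic, and the addresses are hypothetical, for illustration only):

```python
# Hypothetical encoding of a three-address vector instruction with
# operation code, base addresses, and vector length fields.
from dataclasses import dataclass

@dataclass
class VectorInstruction:
    opcode: str    # operation to perform, e.g. a vector add
    base_a: int    # base address of the first operand vector
    base_b: int    # base address of the second operand vector
    base_c: int    # base address where the result vector is stored
    length: int    # number of elements; terminates the instruction

# One composite instruction for C(1:100) = A(1:100) + B(1:100):
vadd = VectorInstruction("VADD", 0x1000, 0x2000, 0x3000, 100)
print(vadd.length)   # 100
```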