MEDICAPS UNIVERSITY
UNIT - 5
Course Code   Course Name                    Hours Per Week (L T P)   Total Credits
IT3CO20       Computer System Architecture   3 1 2                    5
FACULTY OF ENGINEERING
Department of Information Technology
Syllabus
•Parallel Processing, Pipeline Processing, Instruction
and Arithmetic Pipeline, Pipeline hazards and their
resolution,
•Vector Processing – vector operations, memory
interleaving, matrix multiplication, Supercomputers,
•Array Processors – attached and SIMD array
processors
Parallel Processing
• Parallel processing can be described as a class of techniques
which enables the system to achieve simultaneous data-
processing tasks to increase the computational speed of a
computer system.
• A parallel processing system can carry out simultaneous data-
processing to achieve faster execution time. For instance, while
an instruction is being processed in the ALU component of the
CPU, the next instruction can be read from memory.
• The primary purpose of parallel processing is to enhance the
computer processing capability and increase its throughput.
Parallel Processing
• The term parallel processing indicates that the system is able to
perform several operations at the same time.
• To elaborate on the scenario: a CPU has only one accumulator,
which stores the result of the current operation.
• If we issue only one command, such as “a+b”, the CPU performs
the operation and stores the result in the accumulator.
• With parallel processing, however, we issue two instructions,
“a+b” and “c-d”, at the same time.
Parallel Processing
• Now if the result of the “a+b” operation is stored in the accumulator,
the “c-d” result cannot be stored in the accumulator at the same time.
• Therefore the term parallel processing is not based only on the
arithmetic, logic, or shift operations.
• The above problem can be solved in the following manner:
• Consider registers R1 and R2, which store the operands before the
operation, and register R3, which stores the result after the
operation.
• Now the two instructions “a+b” and “c-d” can be executed in
parallel as follows.
Parallel Processing
• Values of “a” and “b” are fetched into the registers R1 and R2
• The values of R1 and R2 will be sent into the ALU unit to
perform the addition.
• The result will be stored in the Accumulator.
• When the ALU unit is performing the calculation, the next data
“c” and “d” are brought into R1 and R2.
• Finally the value of Accumulator obtained from “a+b” will be
transferred into the R3.
Parallel Processing
• Next, the values of “c” and “d” in R1 and R2 are brought into
the ALU to perform the “c-d” operation.
• Since the result of the previous operation is now held in R3,
the result of “c-d” can safely be stored in the accumulator.
• This is parallel processing within a single CPU.
• Now consider several such CPUs performing calculations
separately.
• This is the concept of parallel processing.
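The register traffic walked through on the last two slides can be sketched as a short Python simulation (the step schedule and the names R1, R2, R3, ACC follow the slides; the function name is ours for illustration):

```python
# Sketch of the overlapped execution described above.
# R1/R2 hold operands, ACC holds the ALU result, and R3 parks the
# previous result so ACC is free for the next operation.

def overlapped_execution(a, b, c, d):
    regs = {}
    # Step 1: fetch the first operand pair into R1 and R2
    regs["R1"], regs["R2"] = a, b
    # Step 2: ALU computes a+b into ACC; meanwhile c, d are fetched
    regs["ACC"] = regs["R1"] + regs["R2"]
    regs["R1"], regs["R2"] = c, d
    # Step 3: park a+b in R3, freeing the accumulator
    regs["R3"] = regs["ACC"]
    # Step 4: ALU computes c-d into ACC
    regs["ACC"] = regs["R1"] - regs["R2"]
    return regs["R3"], regs["ACC"]   # (a+b, c-d)

print(overlapped_execution(5, 3, 10, 4))  # (8, 6)
```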
Parallel Processing
Parallel Processing
• In the above figure we can see that the data stored in the processor registers is
sent to separate functional units based on the operation needed on the data.
• If the data in the processor registers requires an arithmetic operation,
it is sent to the arithmetic unit; if at the same time other data requires a
logical operation, it is sent to the logic unit.
• Arithmetic and logical operations then execute at the same time, in
parallel. This is called parallel processing.
• Instruction Stream: The sequence of instructions read from memory is called
an Instruction Stream.
• Data Stream: The operations performed on the data in the processor are called a
Data Stream.
Parallel Processing
• A system may have two or more processors operating
concurrently.
• For example, the arithmetic, logic, and shift operations can be
separated into three units and the operands diverted to each
unit under the supervision of a control unit.
• Parallel processing is established by distributing the data among
the multiple functional units.
• In this chapter we consider parallel processing under the
following main topics:
• 1. Pipeline processing 2. Vector processing 3. Array processors
Pipelining
• A program consists of a number of instructions.
• These instructions may be executed in the following two ways-
• 1) Non-Pipelined Execution 2) Pipelined Execution
• 1) Non-Pipelined Execution
• All the instructions of a program are executed sequentially one
after the other.
• A new instruction executes only after the previous instruction
has executed completely.
• This style of executing the instructions is highly inefficient.
1) Non-Pipelined Execution
• Consider a program consisting of three instructions.
• In a non-pipelined architecture, these instructions execute one
after the other as-
• If time taken for executing one instruction = t, then-
• Time taken for executing ‘n’ instructions = n x t
1) Non-Pipelined Execution
2) Pipelined Execution
• In pipelined architecture,
• Multiple instructions are executed in parallel.
• This style of executing the instructions is highly efficient.
• A pipelined processor does not wait until the previous
instruction has executed completely.
• Rather, it fetches the next instruction and begins its execution.
• The hardware of the CPU is split up into several functional units.
• Each functional unit performs a dedicated task.
2) Pipelined Execution
2) Pipelined Execution
• The number of functional units may vary from processor to processor.
• These functional units are called as stages of the pipeline.
• Control unit manages all the stages using control signals.
• There is a register associated with each stage that holds the data.
• There is a global clock that synchronizes the working of all the stages.
• At the beginning of each clock cycle, each stage takes its input from
its register.
• Each stage then processes the data and feeds its output to the register
of the next stage.
Pipelining
• Pipelining is a technique of decomposing a sequential process
into sub-operations, with each subprocess being executed in a
special dedicated segment that operates concurrently with all
other segments.
• A pipeline can be visualized as a collection of processing
segments through which binary information flows.
• Each segment performs partial processing dictated by the way
the task is partitioned.
• The result obtained from the computation in each segment is
transferred to the next segment in the pipeline.
Pipelining
• The final result is obtained after the data have passed
through all segments.
• It is characteristic of pipelines that several computations can
be in progress in distinct segments at the same time.
• The overlapping of computation is made possible by
associating a register with each segment in the pipeline.
• The registers provide isolation between each segment so
that each can operate on distinct data simultaneously.
Pipelining
• A clock is applied to all registers after enough time has
elapsed to perform all segment activity.
• Suppose that we want to perform the combined multiply
and add operations with a stream of numbers. Ai * Bi + Ci
for i = 1,2,3,4,5,6,7
• Each sub-operation is to be implemented in a segment
within a pipeline.
• Each segment has one or two registers and a combinational
circuit
Pipelining
•R1 through R5 are registers that receive new data with
every clock pulse.
•The multiplier and adder are combinational circuits. The
sub-operations performed in each segment of the
pipeline are as follows:
•R1 <-- Ai, R2 <-- Bi        Input Ai and Bi
•R3 <-- R1 * R2, R4 <-- Ci   Multiply and input Ci
•R5 <-- R3 + R4              Add Ci to product
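The three-segment multiply-add pipeline above can be simulated clock by clock. In this sketch, seg1 models (R1, R2) together with the Ci waiting to enter the next stage, and seg2 models (R3, R4); in the slides Ci is actually input at segment 2, so carrying it alongside seg1 here is a simplification for illustration:

```python
# Clock-by-clock sketch of the pipeline computing Ai*Bi + Ci.
def multiply_add_pipeline(A, B, C):
    seg1 = seg2 = None
    results = []
    for clock in range(len(A) + 2):      # two extra cycles drain the pipe
        # Update back-to-front so each stage uses last cycle's registers.
        if seg2 is not None:             # segment 3: R5 <- R3 + R4
            results.append(seg2[0] + seg2[1])
        seg2 = (seg1[0] * seg1[1], seg1[2]) if seg1 else None  # segment 2
        seg1 = (A[clock], B[clock], C[clock]) if clock < len(A) else None
    return results

A, B, C = [1, 2, 3, 4, 5, 6, 7], [2] * 7, [10, 20, 30, 40, 50, 60, 70]
print(multiply_add_pipeline(A, B, C))   # one Ai*Bi + Ci per clock, after fill
```

Once the pipeline is full, one result emerges per clock pulse, which is the behavior the space-time diagram below illustrates.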
Pipelining
• The five registers are loaded with new data at every clock
pulse.
• Any operation that can be decomposed into a sequence of
sub-operations of about the same complexity can be
implemented by a pipeline processor.
• The technique is efficient for those applications that need to
repeat the same task many times with different sets of data.
• The operands pass through all segments in a fixed
sequence
Pipelining
• Each segment consists of a combinational circuit Si that
performs a sub-operation over the data stream flowing
through the pipe.
• The segments are separated by registers Ri that hold the
intermediate results between the stages.
• Information flows between adjacent stages under the
control of a common clock applied to all the registers
simultaneously.
• The behavior of a pipeline can be illustrated with a space-
time diagram.
Four-Stage Pipeline-
• In four stage pipelined architecture, the execution of each
instruction is completed in following 4 stages-
1. Instruction fetch (IF)
2. Instruction decode (ID)
3. Instruction Execute (IE)
4. Write back (WB)
• The hardware of the CPU is divided into four functional
units.
• Each functional unit performs a dedicated task.
Four Segment Pipeline
Space-Time Diagram for Pipelining
Pipelining
• Now consider the case where a k-segment pipeline with a
clock cycle time tp is used to execute n tasks.
• The first task T1 requires a time equal to k*tp to complete its
operation since there are k segments in the pipe.
• The remaining n - 1 tasks emerge from the pipe at the rate
of one task per clock cycle and they will be completed after
a time equal to (n - 1)*tp.
• Therefore, to complete n tasks using a k-segment pipeline
requires k + (n - 1) clock cycles.
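The cycle count derived above is a one-line formula; a minimal sketch:

```python
def pipeline_cycles(k, n):
    """Clock cycles for n tasks on a k-segment pipeline:
    k cycles for the first task, then one task per cycle."""
    return k + (n - 1)

print(pipeline_cycles(4, 6))   # the 4-segment, 6-task example below
```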
Pipelining
• For example, the diagram shows four segments and six
tasks.
• The time required to complete all the operations is 4 + (6 - 1)
= 9 clock cycles, as indicated in the diagram.
•Next consider a non-pipeline unit that performs the
same operation and takes a time equal to tn to complete
each task.
•The total time required for n tasks is n*tn.
Pipelining
• The speedup of pipeline processing over an equivalent non-
pipeline processing is defined by the ratio:
• S = (n * tn) / ((k + n - 1) * tp)
• As n grows large, S approaches tn / tp; if a task takes the same
time in both cases (tn = k * tp), the maximum speedup is k.
• Clock cycles Per Instruction (CPI) is then nearly equal to 1.
• Efficiency (Utilization) = Used blocks / Total blocks.
• Speedup = non-pipelined time / pipelined time
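The speedup ratio can be checked numerically (the sample values of tp and tn here are ours, chosen so that tn = k*tp):

```python
def speedup(k, n, tp, tn):
    """Speedup S = (n * tn) / ((k + n - 1) * tp) of a k-segment
    pipeline with clock tp over a non-pipelined unit taking tn per task."""
    non_pipelined = n * tn
    pipelined = (k + n - 1) * tp
    return non_pipelined / pipelined

# With tn = k * tp = 80 ns, the speedup approaches k = 4 for large n.
print(round(speedup(k=4, n=1000, tp=20, tn=80), 2))
```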
Arithmetic Pipelining
• Arithmetic Pipelines are mostly used in high-speed
computers.
• They are used to implement floating-point operations,
multiplication of fixed-point numbers, and similar
computations encountered in scientific problems.
• Floating-point operations are easily decomposed into sub-
operations.
• We will now show an example of a pipeline unit for floating-
point addition and subtraction.
Arithmetic Pipelining
• The inputs to the floating-point adder pipeline are two
normalized floating-point binary numbers.
X = A x 2^a
Y = B x 2^b
• A and B are two fractions that represent the mantissas, and a
and b are the exponents.
• The floating-point addition and subtraction can be
performed in four segments.
• The registers labeled R are placed between the segments to
store intermediate results.
Arithmetic Pipelining
• The sub-operations that are performed in the four segments
are:
• 1. Compare the exponents.
• 2. Align the mantissas.
• 3. Add or subtract the mantissas.
• 4. Normalize the result.
• The exponents are compared by subtracting them to
determine their difference.
• The larger exponent is chosen as the exponent of the result.
Arithmetic Pipelining
• The exponent difference determines how many times the
mantissa associated with the smaller exponent must be shifted
to the right.
• This produces an alignment of the two mantissas.
• It should be noted that the shift must be designed as a
combinational circuit to reduce the shift time.
• The two mantissas are added or subtracted in segment 3.
• The result is normalized in segment 4.
• When an overflow occurs, the mantissa of the sum or difference
is shifted right and the exponent incremented by one.
Arithmetic Pipelining
• If an underflow occurs, the number of leading zeros in the
mantissa determines the number of left shifts in the
mantissa and the number that must be subtracted from the
exponent.
• X = 0.9504 x 10^3
• Y = 0.8200 x 10^2
• The two exponents are subtracted in the first segment to
obtain 3 - 2 = 1.
• The larger exponent, 3, is chosen as the exponent of the
result.
Arithmetic Pipelining
• The next segment shifts the mantissa of Y to the right to
obtain
• X = 0.9504 x 10^3
• Y = 0.0820 x 10^3
• This aligns the two mantissas under the same exponent.
• The addition of the two mantissas in segment 3 produces
the sum
• Z = 1.0324 x 10^3
Arithmetic Pipelining
• The sum is adjusted by normalizing the result so that it has a
fraction with a nonzero first digit.
• This is done by shifting the mantissa once to the right and
incrementing the exponent by one to obtain the normalized
sum.
• Z = 0.10324 x 10^4
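The four segments above can be sketched directly on (mantissa, exponent) pairs. This is a decimal-base illustration of the slides' worked example for addition only (not IEEE 754, and subtraction is omitted); the function name is ours:

```python
def fp_add(x, y):
    """Four-segment floating-point add on (mantissa, exponent) pairs,
    decimal base, mantissa normalized as 0.xxxx."""
    (ma, ea), (mb, eb) = x, y
    # Segment 1: compare exponents; the larger is the result exponent
    diff = ea - eb
    if diff < 0:
        (ma, ea), (mb, eb), diff = (mb, eb), (ma, ea), -diff
    # Segment 2: align mantissas by shifting the smaller one right
    mb = mb / (10 ** diff)
    # Segment 3: add the mantissas
    m = ma + mb
    # Segment 4: normalize (overflow: shift right, increment exponent;
    # leading zeros: shift left, decrement exponent)
    while m >= 1.0:
        m /= 10
        ea += 1
    while 0 < m < 0.1:
        m *= 10
        ea -= 1
    return round(m, 6), ea

print(fp_add((0.9504, 3), (0.8200, 2)))   # (0.10324, 4), as in the slides
```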
Instruction Pipelining
• Pipeline processing can occur not only in the data stream
but in the instruction stream as well.
• Most of the digital computers with complex instructions
require instruction pipeline to carry out operations like
fetch, decode and execute instructions.
• Computers with complex instructions require other phases
in addition to the fetch and execute to process an
instruction completely.
• In the most general case, the computer needs to process
each instruction with the following sequence of steps-
Instruction Pipelining
1. Fetch the instruction from memory.
2. Decode the instruction.
3. Calculate the effective address.
4. Fetch the operands from memory.
5. Execute the instruction.
6. Store the result in the proper place.
4 – Segment Instruction Pipelining
• While an instruction is being executed in segment 4, the next
instruction in sequence is busy fetching an operand from
memory in segment 3.
• The effective address may be calculated in a separate
arithmetic circuit for the third instruction, and whenever the
memory is available, the fourth and all subsequent
instructions can be fetched and placed in an instruction FIFO.
• Thus up to four sub-operations in the instruction cycle can
overlap and up to four different instructions can be in progress
of being processed at the same time.
4 – Segment Instruction Pipelining
• The time in the horizontal axis is divided into steps of equal
duration.
• The four segments are represented in the diagram with an
abbreviated symbol.
• 1. FI is the segment that fetches an instruction.
• 2. DA is the segment that decodes the instruction and
calculates the effective address.
• 3. FO is the segment that fetches the operand.
• 4. EX is the segment that executes the instruction.
4 – Segment Instruction Pipelining
• It is assumed that the processor has separate instruction and
data memories so that the operations in FI and FO can
proceed at the same time.
• In the absence of a branch instruction, each segment operates
on a different instruction.
• Thus, in step 4, instruction 1 is being executed in segment
EX; the operand for instruction 2 is being fetched in
segment FO; instruction 3 is being decoded in segment DA;
and instruction 4 is being fetched from memory in segment
FI.
4 – Segment Instruction Pipelining
• Assume now that instruction 3 is a branch instruction.
• As soon as this instruction is decoded in segment DA in step
4, the transfer from FI to DA of the other instructions is
halted until the branch instruction is executed in step 6.
• If the branch is taken, a new instruction is fetched in step 7.
• If the branch is not taken, the instruction fetched previously
in step 4 can be used.
• The pipeline then continues until a new branch instruction is
encountered.
4 – Segment Instruction Pipelining
• Another delay may occur in the pipeline if the EX segment
needs to store the result of the operation in the data
memory while the FO segment needs to fetch an operand.
• In that case, segment FO must wait until segment EX has
finished its operation.
• In general, there are three major difficulties that cause the
instruction pipeline to deviate from its normal operation.
1. Resource conflicts caused by access to memory by two
segments at the same time. Most of these conflicts can be
resolved by using separate instruction and data memories.
4 – Segment Instruction Pipelining
2. Data dependency conflicts arise when an instruction
depends on the result of a previous instruction, but this result
is not yet available.
3. Branch difficulties arise from branch and other instructions
that change the value of PC.
Timing of Instruction Pipeline
Pipelining Numerical
Q.1 Consider a pipeline having 4 phases with duration 60, 50,
90 and 80 ns. Given latch delay is 10 ns. Calculate-
1. Pipeline cycle time
2. Non-pipeline execution time
3. Speed up ratio
4. Pipeline time for 1000 tasks
5. Sequential time for 1000 tasks
1: Pipeline Cycle Time-
Cycle time = Maximum delay due to any stage + Delay due to
its register
= Max { 60, 50, 90, 80 } + 10 ns
= 90 ns + 10 ns
= 100 ns
2: Non-Pipeline Execution Time-
Non-pipeline execution time for one instruction
= 60 ns + 50 ns + 90 ns + 80 ns
= 280 ns
3: Speed Up Ratio-
Speed up = Non-pipeline execution time / Pipeline execution
time
= 280 ns / Cycle time
= 280 ns / 100 ns
= 2.8
4: Pipeline Time For 1000 Tasks-
Pipeline time for 1000 tasks
= Time taken for 1st task + Time taken for remaining 999
tasks
= 1 x 4 clock cycles + 999 x 1 clock cycle
= 4 x cycle time + 999 x cycle time
= 4 x 100 ns + 999 x 100 ns
= 400 ns + 99900 ns
= 100300 ns
5: Sequential Time For 1000 Tasks-
Non-pipeline time for 1000 tasks
= 1000 x Time taken for one task
= 1000 x 280 ns
= 280000 ns
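The five answers to Q.1 can be reproduced in a few lines:

```python
# Reproducing Q.1's numbers programmatically.
stage_delays = [60, 50, 90, 80]   # ns, the four phase durations
latch = 10                        # ns, latch (register) delay
n = 1000                          # number of tasks
k = len(stage_delays)             # number of pipeline stages

cycle = max(stage_delays) + latch        # 1) pipeline cycle time: 100 ns
non_pipe_one = sum(stage_delays)         # 2) non-pipelined time/task: 280 ns
ratio = non_pipe_one / cycle             # 3) speedup: 2.8
pipe_total = (k + n - 1) * cycle         # 4) pipelined, 1000 tasks: 100300 ns
seq_total = n * non_pipe_one             # 5) sequential, 1000 tasks: 280000 ns

print(cycle, non_pipe_one, ratio, pipe_total, seq_total)
```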
Q.2 Consider a 4-stage pipeline processor. The number of
cycles needed by the four instructions I1, I2, I3 and I4 in
stages IF, ID, EX and WB is shown below. How many clock
cycles will be required to complete these 4 instructions in the
given pipeline system?

       IF  ID  EX  WB
I1      1   3   2   1
I2      2   2   3   1
I3      1   1   1   1
I4      2   1   1   2

      T1  T2  T3  T4  T5  T6  T7  T8  T9  T10 T11 T12 T13
IF    I1  I2  I2  I3  I4  I4  -   -   -   -   -   -   -
ID    -   I1  I1  I1  I2  I2  I3  I4  -   -   -   -   -
EX    -   -   -   -   I1  I1  I2  I2  I2  I3  I4  -   -
WB    -   -   -   -   -   -   I1  -   -   I2  I3  I4  I4

From the diagram, 13 clock cycles are required.
Q.3 Consider a 4-stage pipeline processor. The number of
cycles needed by the four instructions I1, I2, I3 and I4 in
stages S1, S2, S3 and S4 is shown below. What is the number
of cycles needed to execute the following loop?
for (i=1 to 2) { I1; I2; I3; I4; }

       S1  S2  S3  S4
I1      2   1   1   1
I2      1   3   2   2
I3      2   1   1   3
I4      1   2   2   2
From here, number of clock cycles required to execute the
loop = 23 clock cycles.
Pipeline hazards and their resolution
There are mainly three types of Hazards possible in a
pipelined processor.
1) Structural Hazards
2) Control Hazards
3) Data Hazards
These dependencies (Hazards) may introduce stalls in the
pipeline.
Stall : A stall is a cycle in the pipeline without new input.
Structural Hazards
This dependency arises due to the resource conflict in the
pipeline. A resource conflict is a situation when more than
one instruction tries to access the same resource in the same
cycle.A resource can be a register, memory, or ALU.
Inst./Cycle   1        2        3        4        5
I1            IF(Mem)  ID       EX       Mem
I2            -        IF(Mem)  ID       EX       Mem
I3            -        -        IF(Mem)  ID       EX
I4            -        -        -        IF(Mem)  ID
Structural Hazards
• In the previous scenario, in cycle 4, instructions I1 and I4 are trying
to access the same resource (memory), which introduces a resource
conflict.
• To avoid this problem, we have to keep the instruction on wait
until the required resource (memory in our case) becomes
available.
Cycle   1        2        3        4        5        6        7        8
I1      IF(Mem)  ID       EX       Mem
I2      -        IF(Mem)  ID       EX       Mem
I3      -        -        IF(Mem)  ID       EX       Mem
I4      -        -        -        -        -        -        IF(Mem)  ID
Structural Hazards
• Solution for structural dependency
To minimize structural-dependency stalls in the pipeline, we
use a hardware mechanism called Renaming.
• Renaming: the memory is divided into two independent
modules that store instructions and data separately, called
Code Memory (CM) and Data Memory (DM) respectively.
• CM contains all the instructions and DM contains all the
operands required by the instructions.
Structural Hazards
Inst./Cycle  1       2       3       4       5       6       7
I1           IF(CM)  ID      EX      DM      WB
I2           -       IF(CM)  ID      EX      DM      WB
I3           -       -       IF(CM)  ID      EX      DM      WB
I4           -       -       -       IF(CM)  ID      EX      DM
I5           -       -       -       -       IF(CM)  ID      EX
I6           -       -       -       -       -       IF(CM)  ID
I7           -       -       -       -       -       -       IF(CM)
Control Dependency (Branch Hazards)
• This type of dependency occurs during the transfer of
control instructions such as BRANCH, CALL, JMP, etc.
• On many instruction set architectures, the processor does not
know the target address of these instructions when it needs
to insert the next instruction into the pipeline.
• Because of this, unwanted instructions are fed into the pipeline.
• NOTE: Generally, the target address of a JMP instruction
is known only after the ID stage.
Control Dependency (Branch Hazards)
• Consider the following sequence of instructions in the
program:
100: I1
101: I2 (JMP 250)
102: I3
.
.
250: BI1
• Expected output: I1 -> I2 -> BI1
• Output Sequence: I1 -> I2 -> I3 -> BI1
Control Dependency (Branch Hazards)
Inst./Cycle  1        2        3        4        5        6
I1           IF(Mem)  ID       EX       Mem
I2           -        IF(Mem)  ID       EX       Mem
I3           -        -        IF(Mem)  ID       EX       Mem
BI1          -        -        -        IF(Mem)  ID       EX
So, the output sequence is not equal to the expected output,
that means the pipeline is not implemented correctly.
Control Dependency (Branch Hazards)
Inst./Cycle  1        2        3        4        5        6
I1           IF(Mem)  ID       EX       Mem
I2           -        IF(Mem)  ID       EX       Mem
DELAY        -        -        -        -        -        -
BI1          -        -        -        IF(Mem)  ID       EX
To correct the above problem we need to stop the instruction
fetch until we get the target address of the branch instruction. This
can be implemented by introducing a delay slot until we get the
target address.
Control Dependency (Branch Hazards)
• Output Sequence: I1 -> I2 -> Delay (Stall) -> BI1. As the delay
slot performs no operation, this output sequence is equal to
the expected output sequence. But the slot introduces a stall
into the pipeline.
• Solution for control dependency: Branch Prediction is the
method through which stalls due to control dependency can
be eliminated. The outcome of the branch is predicted in the
first stage; when the prediction is correct, the branch penalty
is zero.
• Branch penalty: the number of stalls introduced during the
branch operations in the pipelined processor is known as the
branch penalty.
Control Dependency (Branch Hazards)
• NOTE : As we see that the target address is available after
the ID stage, so the number of stalls introduced in the
pipeline is 1.
• Had the branch target address been available only after the
ALU stage, there would have been 2 stalls.
Generally, if the target address is available after the k-th stage,
there will be (k – 1) stalls in the pipeline.
• Total number of stalls introduced in the pipeline due to
branch instructions = Branch frequency * Branch Penalty
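The two rules above fit in a couple of lines (the 20% branch frequency in the example is an assumed value for illustration):

```python
def penalty_for_stage(k):
    """If the target address is known only after the k-th stage,
    the pipeline suffers (k - 1) stalls per taken branch."""
    return k - 1

def avg_branch_stalls(branch_frequency, branch_penalty):
    """Average stalls per instruction due to branches (slides' formula)."""
    return branch_frequency * branch_penalty

print(penalty_for_stage(2))          # target known after ID -> 1 stall
print(avg_branch_stalls(0.2, 1))     # e.g. 20% branches, penalty 1 -> 0.2
```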
Data Hazards
• Data hazards occur when instructions that exhibit data
dependence, modify data in different stages of a pipeline.
Hazard cause delays in the pipeline. There are mainly three
types of data hazards:
• 1)RAW (Read afterWrite) [Flow/True data dependency]
2)WAR (Write after Read) [Anti-Data dependency]
3)WAW (Write afterWrite) [Output data dependency]
•WAR andWAW hazards occur during the out-of-order
execution of the instructions.
Data Hazards - RAW
• RAW hazard occurs when instruction I2 tries to read data
before instruction I1 writes it. Eg:
I1: R2 <- R1 + R3
I2: R4 <- R2 + R3
• Domain (I1)= R1,R3
• Domain (I2)= R2,R3
• Range (I1)= R2
• Range (I2)= R4
• Range (I1) ∩ Domain (I2) ≠ ɸ
Data Hazards - RAW
Cycle/Inst.  1     2     3     4     5
I1           IF/D  OF    EX    WB
I2           -     IF/D  OF    EX    WB

With stalls inserted so that I2's operand fetch follows I1's write-back:

Cycle/Inst.  1     2     3     4     5     6     7     8
I1           IF/D  OF    EX    WB
I2           -     -     -     IF/D  OF    EX    WB
Data Hazards - WAR
• WAR hazard occurs when instruction I2 tries to write data
before instruction I1 reads it. Eg:
I1: R2 <- R1 + R3
I2: R3 <- R4 + R5
• Domain (I1)= R1,R3
• Domain (I2)= R4,R5
• Range (I1)= R2
• Range (I2)= R3
• Domain (I1) ∩ Range (I2) ≠ ɸ
Data Hazards - WAW
• WAW hazard occurs when instruction I2 tries to write output
before instruction I1 writes it. Eg:
I1: R2 <- R1 + R3
I2: R2 <- R4 + R5
• Domain (I1)= R1,R3
• Domain (I2)= R4,R5
• Range (I1)= R2
• Range (I2)= R2
• Range(I1) ∩ Range (I2) ≠ ɸ
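The three domain/range conditions on these slides can be expressed directly as set intersections. A minimal sketch, with instructions represented as (reads, writes) pairs of register-name sets (the representation and function name are ours):

```python
def classify_hazards(i1, i2):
    """Classify data hazards between instruction I1 and a later I2.
    Each instruction is (domain, range): the registers it reads
    and the registers it writes."""
    d1, r1 = i1
    d2, r2 = i2
    hazards = []
    if r1 & d2: hazards.append("RAW")   # I2 reads what I1 writes
    if d1 & r2: hazards.append("WAR")   # I2 writes what I1 reads
    if r1 & r2: hazards.append("WAW")   # both write the same register
    return hazards

# I1: R2 <- R1 + R3 ; I2: R4 <- R2 + R3  -> RAW
print(classify_hazards(({"R1", "R3"}, {"R2"}), ({"R2", "R3"}, {"R4"})))
```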
Array Processor
• An array processor is a processor that performs computations
on large arrays of data.
• The term is used to refer to two different types of processors.
• An attached array processor is an auxiliary processor
attached to a general-purpose computer.
• It is intended to improve the performance of the host
computer in specific numerical computation tasks.
• An SIMD array processor is a processor that has a single-
instruction multiple-data organization.
Array Processor
• It manipulates vector instructions by means of multiple
functional units responding to a common instruction.
• Although both types of array processors manipulate vectors,
their internal organization is different.
• The system with the attached processor satisfies the needs
for complex arithmetic applications.
• The objective of the attached array processor is to provide
vector manipulation capabilities to a conventional computer
at a fraction of the cost of supercomputers.
Attached Array Processor
• An attached array processor is designed as a peripheral for a
conventional host computer, and its purpose is to enhance
the performance of the computer by providing vector
processing for complex scientific applications.
• It achieves high performance by means of parallel processing
with multiple functional units.
• It includes an arithmetic unit containing one or more
pipelined floating-point adders and multipliers.
• The array processor can be programmed by the user to
accommodate a variety of complex arithmetic problems.
Attached Array Processor
• The host computer is a general-purpose commercial
computer and the attached processor is a back-end machine
driven by the host computer.
• The array processor is connected through an input-output
controller to the computer and the computer treats it like an
external interface.
• The data for the attached processor are transferred from
main memory to a local memory through a high-speed bus.
• The general-purpose computer without the attached
processor serves the users that need conventional data
processing.
Attached Array Processor with Host Computer
SIMD Array Processor
• An SIMD array processor is a computer with multiple
processing units operating in parallel.
• The processing units are synchronized to perform the same
operation under the control of a common control unit, thus
providing a single instruction stream, multiple data stream
(SIMD) organization.
• It contains a set of identical processing elements (PEs), each
having a local memory M.
• Each processor element includes an ALU, a floating-point
arithmetic unit, and working registers.
SIMD Array Processor
• The master control unit controls the operations in the
processor elements.
• The main memory is used for storage of the program.
• The function of the master control unit is to decode the
instructions and determine how the instruction is to be
executed.
• Scalar and program control instructions are directly executed
within the master control unit. Vector instructions are
broadcast to all PEs simultaneously.
• Each PE uses operands stored in its local memory.
SIMD Array Processor
• Consider, for example, the vector addition C = A + B.
• The master control unit first stores the ith components ai and
bi of A and B in local memory M, for i = 1, 2, 3, . . . , n.
• It then broadcasts floating-point add instruction ci = ai + bi to
all PEs, causing the addition to take place simultaneously.
• The components of ci are stored in fixed locations in each
local memory.
• This produces the desired vector sum in one add cycle.
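The vector addition C = A + B described above can be sketched as a simulation of one PE per element, each with its own local memory M (the class and function names are ours; real PEs would execute the broadcast instruction in lock-step hardware, not a Python loop):

```python
# Sketch of the SIMD vector add: the master control unit distributes
# (ai, bi) to each PE's local memory, then broadcasts a single add
# instruction that every active PE executes on its own data.
class PE:
    def __init__(self):
        self.local = {}            # local memory M of this PE

def simd_vector_add(A, B):
    pes = [PE() for _ in A]
    # Master control unit stores ai and bi in PE i's local memory
    for pe, a, b in zip(pes, A, B):
        pe.local["a"], pe.local["b"] = a, b
    # Broadcast: every PE executes ci = ai + bi in the same "add cycle"
    for pe in pes:
        pe.local["c"] = pe.local["a"] + pe.local["b"]
    return [pe.local["c"] for pe in pes]

print(simd_vector_add([1, 2, 3], [10, 20, 30]))   # [11, 22, 33]
```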
SIMD Array Processor
• Masking schemes are used to control the status of each PE
during the execution of vector instructions.
• Each PE has a flag that is set when the PE is active and reset
when the PE is inactive.
• This ensures that only those PEs that need to participate are
active during the execution of the instruction.
• For example, suppose that the array processor contains a set
of 64 PEs. If a vector length of less than 64 data items is to be
processed the control unit selects the proper number of PEs
to be active.
Vector Processing
• There is a class of computational problems that are beyond
the capabilities of a conventional computer.
• These problems are characterized by the fact that they
require a vast number of computations that will take a
conventional computer days or even weeks to complete.
• Computers with vector processing capabilities are in demand
in specialized applications.
• The following are representative application areas where
vector processing is of the utmost importance.
Vector Processing - Applications
• Long-range weather forecasting
• Petroleum explorations
• Medical diagnosis
• Aerodynamics and space flight simulations
• Artificial intelligence and expert systems
• Image processing
Vector Operations
• Many scientific problems require arithmetic operations on
large arrays of numbers.
• These numbers are usually formulated as vectors and
matrices of floating-point numbers.
• A vector is an ordered set of a one-dimensional array of data
items.
• A vector V of length n is represented as a row vector by V =
[V1 V2 V3 · · · Vn]. It may be represented as a column vector if
the data items are listed in a column.
• A conventional sequential computer is capable of processing
operands one at a time.
Vector Operations
• Consequently, operations on vectors must be broken down
into single computations with subscripted variables.
• The element Vi of vector V is written as V(I) and the index I
refers to a memory address or register where the number is
stored.
• To examine the difference between a conventional scalar
processor and a vector processor, consider the following
Fortran DO loop:
DO 20 I = 1, 100
20 C(I) = B(I) + A(I)
Vector Operations
• This is a program for adding two vectors A and B of length 100 to
produce a vector C. This is implemented in machine language by
the following sequence of operations.
   Initialize I = 0
20 Read A(I)
   Read B(I)
   Store C(I) = A(I) + B(I)
   Increment I = I + 1
   If I <= 100 go to 20
   Continue
Vector Operations
• This constitutes a program loop that reads a pair of operands
from arrays A and B and performs a floating-point addition.
• The loop control variable is then updated and the steps
repeat 100 times.
• A computer capable of vector processing eliminates the
overhead associated with the time it takes to fetch and
execute the instructions in the program loop.
• It allows operations to be specified with a single vector
instruction of the form
C(1 : 100) = A(1 : 100) + B(1 : 100)
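The contrast between the scalar loop and the single vector instruction can be illustrated in Python (the list comprehension stands in for the one-shot vector operation; the data values are ours):

```python
# Scalar loop vs. a single "vector-style" whole-array operation,
# mirroring C(1:100) = A(1:100) + B(1:100).
A = list(range(1, 101))        # 1 .. 100
B = list(range(100, 0, -1))    # 100 .. 1

# Scalar processing: explicit loop, one addition per iteration,
# with loop-control overhead on every pass.
C_scalar = []
for i in range(100):
    C_scalar.append(A[i] + B[i])

# Vector-style: one operation specified over the whole range at once.
C_vector = [a + b for a, b in zip(A, B)]

print(C_scalar == C_vector)    # True; every element is 101 here
```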
General Processing Vs Vector Processing
Vector Instruction
• The vector instruction includes the initial address of the
operands, the length of the vectors, and the operation to be
performed, all in one composite instruction.
• This is essentially a three-address instruction with three fields
specifying the base address of the operands and an additional
field that gives the length of the data items in the vectors.
This assumes that the vector operands reside in memory.
Vector Instruction
1. Operation Code:- Operation code indicates the operation that has to
be performed in the given instruction. It decides the functional unit for
the specified operation or reconfigures the multifunction unit.
2. Base Address:- Base address field refers to the memory
location from where the operands are to be fetched or to where the
result has to be stored. The base address is found in the memory
reference instructions. In the vector instruction, the operand and the
result both are stored in the vector registers. Here, the base
address refers to the designated vector register.
3. Vector Length:- Vector length specifies the number of elements in a
vector operand. It identifies the termination of a vector instruction.

Unit - 5 Pipelining.pptx

  • 1. MEDICAPS UNIVERSITY UNIT - 5 Course Code Course Name Hours PerWeek Total Credits L T P IT3CO20 Computer SystemArchitecture 3 1 2 5 FACULTY OF ENGINEERING Department of InformationTechnology
  • 2. Syllabus •Parallel Processing, Pipeline Processing, Instruction and Arithmetic Pipeline, Pipeline hazards and their resolution, •Vector Processing – vector operations, memory interleaving, matrix multiplication, Supercomputers, •Array Processors – attached and SIMD array processors
  • 3. Parallel Processing • Parallel processing can be described as a class of techniques which enables the system to achieve simultaneous data- processing tasks to increase the computational speed of a computer system. • A parallel processing system can carry out simultaneous data- processing to achieve faster execution time. For instance, while an instruction is being processed in the ALU component of the CPU, the next instruction can be read from memory. • The primary purpose of parallel processing is to enhance the computer processing capability and increase its throughput.
  • 4. Parallel Processing • The term parallel processing indicates that the system is able to perform several operations in a single time. • Now we will elaborate the scenario, in a CPU we will be having only one Accumulator which will be storing the results obtained from the current operation. • Now if we are giving only one command such that “a+b” then the CPU performs the operation and stores the result in the accumulator. • Now we are talking about parallel processing, therefore we will be issuing two instructions “a+b” and “c-d” in the same time,
  • 5. Parallel Processing • now if the result of “a+b” operation is stored in the accumulator, then “c-d” result cannot be stored in the accumulator in the same time. • Therefore the term parallel processing in not only based on the Arithmetic, logic or shift operations. • The above problem can be solved in the following manner: • Consider the registers R1 and R2 which will be storing the operands before operation and R3 is the register which will be storing the results after the operations. • Now the above two instructions “a+b” and “c-d” will be done in parallel as follows.
  • 6. Parallel Processing • Values of “a” and “b” are fetched in to the registers R1 and R2 • The values of R1 and R2 will be sent into the ALU unit to perform the addition. • The result will be stored in the Accumulator. • When the ALU unit is performing the calculation, the next data “c” and “d” are brought into R1 and R2. • Finally the value of Accumulator obtained from “a+b” will be transferred into the R3.
  • 7. Parallel Processing • Next the values of C and D from R1 and R2 will be brought into the ALU to perform the “c-d” operation. • Since the accumulator value of the previous operation is present in R3, the result of “c-d” can be safely stored in the Accumulator. • This is the process of parallel processing of only one CPU. • Consider several such CPU performing the calculations separately. • This is the concept of parallel processing.
  • 9. Parallel Processing • In the above figure we can see that the data stored in the processor registers is being sent to separate devices basing on the operation needed on the data. • If the data inside the processor registers is requesting for an arithmetic operation, then the data will be sent to the arithmetic unit and if in the same time another data is requested in the logic unit, then the data will be sent to logic unit for logical operations. • Now in the same time both arithmetic operations and logical operations are executing in parallel.This is called as parallel processing. • Instruction Stream: The sequence of instructions read from the memory is called as an Instruction Stream • Data Stream: The operations performed on the data in the processor is called as a Data Stream.
  • 10. Parallel Processing • A system may have two or more processor operating concurrently. • For example, the arithmetic, logic, and shift operations can be separated into three units and the operands diverted to each unit under the supervision of a control unit. • Parallel processing is established by distributing the data among the multiple functional units. • In this chapter we consider parallel processing under the following main topics: • 1. Pipeline processing 2.Vector processing 3. Array processors
  • 11.
  • 12. Pipelining • A program consists of several number of instructions. • These instructions may be executed in the following two ways- • 1) Non-Pipelined Execution 2) Pipelined Execution • 1) Non-Pipelined Execution • All the instructions of a program are executed sequentially one after the other. • A new instruction executes only after the previous instruction has executed completely. • This style of executing the instructions is highly inefficient.
  • 13. 1.) Non-Pipelined Execution • Consider a program consisting of three instructions. • In a non-pipelined architecture, these instructions execute one after the other as- • If time taken for executing one instruction = t, then- • Time taken for executing ‘n’ instructions = n x t
  • 15. 2) Pipelined Execution • In pipelined architecture, • Multiple instructions are executed parallelly. • This style of executing the instructions is highly efficient. • A pipelined processor does not wait until the previous instruction has executed completely. • Rather, it fetches the next instruction and begins its execution. • The hardware of the CPU is split up into several functional units. • Each functional unit performs a dedicated task.
  • 17. 2) Pipelined Execution • The number of functional units may vary from processor to processor. • These functional units are called as stages of the pipeline. • Control unit manages all the stages using control signals. • There is a register associated with each stage that holds the data. • There is a global clock that synchronizes the working of all the stages. • At the beginning of each clock cycle, each stage takes the input from its register. • Each stage then processes the data and feed its output to the register of the next stage.
  • 18.
  • 19. Pipelining • Pipelining is a technique of decomposing a sequential process into sub-operations, with each subprocess being executed in a special dedicated segment that operates concurrently with all other segments. • A pipeline can be visualized as a collection of processing segments through which binary information flows. • Each segment performs partial processing dictated by the way the task is partitioned. • The result obtained from the computation in each segment is transferred to the next segment in the pipeline.
  • 20. Pipelining • The final result is obtained after the data have passed through all segments. • It is characteristic of pipelines that several computations can be in progress in distinct segments at the same time. • The overlapping of computation is made possible by associating a register with each segment in the pipeline. • The registers provide isolation between each segment so that each can operate on distinct data simultaneously.
  • 21. Pipelining • A clock is applied to all registers after enough time has elapsed to perform all segment activity. • Suppose that we want to perform the combined multiply and add operations with a stream of numbers. Ai * Bi + Ci for i = 1,2,3,4,5,6,7 • Each sub-operation is to be implemented in a segment within a pipeline. • Each segment has one or two registers and a combinational circuit
  • 22. Pipelining •R 1 through RS are registers that receive new data with every clock pulse. •The multiplier and adder are combinational circuits. The sub-operations performed in each segment of the pipeline are as follows: •R 1 <-- Ai, R2 <-- Bi Input Ai and Bi •R3 <--R 1 * R2, R4 <-- Ci Multiply and input Ci •R5 <--R3 + R4 Add Ci to product
  • 23.
  • 24.
  • 25. Pipelining • The five registers are loaded with new data at every clock pulse. • Any operation that can be decomposed into a sequence of sub-operations of about the same complexity can be implemented by a pipeline processor. • The technique is efficient for those applications that need to repeat the same task many times with different sets of data. • The operands pass through all four segments in a fixed sequence
  • 26. Pipelining • Each segment consists of a combinational circuit Si that performs a sub-operation over the data stream flowing through the pipe. • The segments are separated by registers Ri that hold the intermediate results between the stages. • Information flows between adjacent stages under the control of a common clock applied to all the registers simultaneously. • The behavior of a pipeline can be illustrated with a space- time diagram.
  • 27. Four-Stage Pipeline- • In four stage pipelined architecture, the execution of each instruction is completed in following 4 stages- 1. Instruction fetch (IF) 2. Instruction decode (ID) 3. Instruction Execute (IE) 4. Write back (WB) • The hardware of the CPU is divided into four functional units. • Each functional unit perform a dedicated task.
  • 29. SpaceTime Diagram for Pipelining
  • 30. Pipelining • Now consider the case where a k-segment pipeline with a clock cycle time tp is used to execute n tasks. • The first task T1 requires a time equal to k*tp to complete its operation since there are k segments in the pipe. • The remaining n - 1 tasks emerge from the pipe at the rate of one task per clock cycle and they will be completed after a time equal to (n - 1)*tp. • Therefore, to complete n tasks using a k-segment pipeline requires k + (n - 1) clock cycles.
  • 31. Pipelining • For example, the diagram shows four segments and six tasks. • The time required to complete all the operations is 4 + (6 - 1) = 9 clock cycles, as indicated in the diagram. •Next consider a non-pipeline unit that performs the same operation and takes a time equal to tn to complete each task. •The total time required for n tasks is n*tn.
  • 32. Pipelining • The speed up a pipeline processing over an equivalent non pipeline processing is defined by the ratio: • Clock per Instruction CPI nearly equal to 1. • Efficiency (Utilization)= Used block/Total blocks. • Speedup= non-pipelined/pipelined
  • 33. Arithmetic Pipelining • Arithmetic Pipelines are mostly used in high-speed computers. • They are used to implement floating-point operations, multiplication of fixed-point numbers, and similar computations encountered in scientific problems. • Floating-point operations are easily decomposed into sub- operations. • We will now show an example of a pipeline unit for floating- point addition and subtraction.
  • 34. Arithmetic Pipelining • The inputs to the floating-point adder pipeline are two normalized floating-point binary numbers. X=A x 2a Y=B x 2b • A and B are two fractions that represent the mantissas and a and b are the exponents. • The floating-point addition and subtraction can be performed in four segments. • The registers labeled R are placed between the segments to store intermediate results.
  • 35. Arithmetic Pipelining • The sub-operations that are performed in the four segments are: • 1. Compare the exponents. • 2. Align the mantissas. • 3. Add or subtract the mantissas. • 4. Normalize the result. • The exponents are compared by subtracting them to determine their difference. • The larger exponent is chosen as the exponent of the result.
  • 36. Arithmetic Pipelining • The exponent difference determines how many times the mantissa associated with the smaller exponent must be shifted to the right. • This produces an alignment of the two mantissas. • It should be noted that the shift must be designed as a combinational circuit to reduce the shift time. • The two mantissas are added or subtracted in segment 3. • The result is normalized in segment 4. • When an overflow occurs, the mantissa of the sum or difference is shifted right and the exponent incremented by one.
  • 37. Arithmetic Pipelining • If an underflow occurs, the number of leading zeros in the mantissa determines the number of left shifts in the mantissa and the number that must be subtracted from the exponent. • X= 0.9504 X 103 • Y = 0.8200 X 102 • The two exponents are subtracted in the first segment to obtain 3 - 2 = 1. • The larger exponent 3 is chosen as the exponent of the result.
  • 38. Arithmetic Pipelining • The next segment shifts the mantissa ofY to the right to obtain • X= 0.9504 X 103 • Y = 0.0820 X 103 • This aligns the two mantissas under the same exponent. • The addition of the two mantissas in segment 3 produces the sum • Z = 1.0324 X 103
  • 39. Arithmetic Pipelining • The sum is adjusted by normalizing the result so that it has a fraction with a nonzero first digit. • This is done by shifting the mantissa once to the right and incrementing the exponent by one to obtain the normalized sum. • Z = 0.10324 X 104
  • 40.
  • 41.
  • 42.
  • 43. Instruction Pipelining • Pipeline processing can occur not only in the data stream but in the instruction stream as well. • Most of the digital computers with complex instructions require instruction pipeline to carry out operations like fetch, decode and execute instructions. • Computers with complex instructions require other phases in addition to the fetch and execute to process an instruction completely. • In the most general case, the computer needs to process each instruction with the following sequence of steps-
  • 44. Instruction Pipelining 1. Fetch the instruction from memory. 2. Decode the instruction. 3. Calculate the effective address. 4. Fetch the operands from memory. 5. Execute the instruction. 6. Store the result in the proper place.
  • 45. 4 – Segment Instruction Pipelining • While an instruction is being executed in segment 4, the next instruction in sequence is busy fetching an operand from memory in segment 3. • The effective address may be calculated in a separate arithmetic circuit for the third instruction, and whenever the memory is available, the fourth and all subsequent instructions can be fetched and placed in an instruction FIFO. • Thus up to four sub-operations in the instruction cycle can overlap and up to four different instructions can be in progress of being processed at the same time.
  • 46. (Figure: space-time diagram of the four-segment instruction pipeline)
  • 47. 4 – Segment Instruction Pipelining • The time in the horizontal axis is divided into steps of equal duration. • The four segments are represented in the diagram with an abbreviated symbol. • 1. FI is the segment that fetches an instruction. • 2. DA is the segment that decodes the instruction and calculates the effective address. • 3. FO is the segment that fetches the operand. • 4. EX is the segment that executes the instruction.
  • 48. 4 – Segment Instruction Pipelining • It is assumed that the processor has separate instruction and data memories so that the operations in FI and FO can proceed at the same time. • In the absence of a branch instruction, each segment operates on a different instruction. • Thus, in step 4, instruction 1 is being executed in segment EX; the operand for instruction 2 is being fetched in segment FO; instruction 3 is being decoded in segment DA; and instruction 4 is being fetched from memory in segment FI.
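The overlap just described can be reproduced with a short sketch. The segment names follow the slides; the function itself is illustrative:

```python
def space_time(n_instr, segments=("FI", "DA", "FO", "EX")):
    # In an ideal pipeline, instruction i occupies segment s (0-based)
    # at step i + s, so at every step each segment holds a different one.
    table = {}
    for step in range(1, n_instr + len(segments)):
        for s, name in enumerate(segments):
            i = step - s                     # 1-based instruction number
            if 1 <= i <= n_instr:
                table[(name, step)] = f"I{i}"
    return table

t = space_time(4)
# At step 4: I1 in EX, I2 in FO, I3 in DA, I4 in FI.
print(t[("EX", 4)], t[("FO", 4)], t[("DA", 4)], t[("FI", 4)])
```

At step 4 the output matches the slide's description: I1 executing, I2 fetching its operand, I3 decoding, I4 being fetched.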
  • 49. 4 – Segment Instruction Pipelining • Assume now that instruction 3 is a branch instruction. • As soon as this instruction is decoded in segment DA in step 4, the transfer from FI to DA of the other instructions is halted until the branch instruction is executed in step 6. • If the branch is taken, a new instruction is fetched in step 7. • If the branch is not taken, the instruction fetched previously in step 4 can be used. • The pipeline then continues until a new branch instruction is encountered.
  • 50. 4 – Segment Instruction Pipelining • Another delay may occur in the pipeline if the EX segment needs to store the result of the operation in the data memory while the FO segment needs to fetch an operand. • In that case, segment FO must wait until segment EX has finished its operation. • In general, there are three major difficulties that cause the instruction pipeline to deviate from its normal operation. 1. Resource conflicts caused by access to memory by two segments at the same time. Most of these conflicts can be resolved by using separate instruction and data memories.
  • 51. 4 – Segment Instruction Pipelining 2. Data dependency conflicts arise when an instruction depends on the result of a previous instruction, but this result is not yet available. 3. Branch difficulties arise from branch and other instructions that change the value of PC.
  • 53. Pipelining Numerical Q.1 Consider a pipeline having 4 phases with durations 60, 50, 90 and 80 ns. The latch delay is 10 ns. Calculate-
1. Pipeline cycle time
2. Non-pipeline execution time
3. Speed up ratio
4. Pipeline time for 1000 tasks
5. Sequential time for 1000 tasks
  • 54. 1: Pipeline Cycle Time-
Cycle time = Maximum delay due to any stage + Delay due to its register
= Max { 60, 50, 90, 80 } + 10 ns = 90 ns + 10 ns = 100 ns
2: Non-Pipeline Execution Time-
Non-pipeline execution time for one instruction = 60 ns + 50 ns + 90 ns + 80 ns = 280 ns
  • 55. 3: Speed Up Ratio-
Speed up = Non-pipeline execution time / Pipeline execution time
= 280 ns / Cycle time = 280 ns / 100 ns = 2.8
4: Pipeline Time For 1000 Tasks-
Pipeline time for 1000 tasks = Time taken for 1st task + Time taken for remaining 999 tasks
= 1 x 4 clock cycles + 999 x 1 clock cycle
  • 56. = 4 x cycle time + 999 x cycle time = 4 x 100 ns + 999 x 100 ns = 400 ns + 99900 ns = 100300 ns
5: Sequential Time For 1000 Tasks-
Non-pipeline time for 1000 tasks = 1000 x Time taken for one task = 1000 x 280 ns = 280000 ns
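All five quantities from Q.1 can be checked with a few lines of Python (values taken directly from the problem statement):

```python
stage_delays = [60, 50, 90, 80]          # ns
latch = 10                               # ns, register delay per stage
k, n = len(stage_delays), 1000           # number of stages, number of tasks

cycle = max(stage_delays) + latch        # pipeline cycle time
one_task = sum(stage_delays)             # non-pipeline time for one task
speedup = one_task / cycle
pipe_time = (k + (n - 1)) * cycle        # 1st task: k cycles, rest: 1 cycle each
seq_time = n * one_task

print(cycle, one_task, speedup, pipe_time, seq_time)
# 100 280 2.8 100300 280000
```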
  • 57. Q.2 Consider a 4-stage pipeline processor. The table below gives the number of cycles needed by the four instructions I1, I2, I3 and I4 in stages IF, ID, EX and WB. How many clock cycles will be required to complete these 4 instructions in the given pipeline system?

      IF   ID   EX   WB
I1    1    3    2    1
I2    2    2    3    1
I3    1    1    1    1
I4    2    1    1    2
  • 58.

      T1   T2   T3   T4   T5   T6   T7   T8   T9   T10  T11  T12  T13
IF    I1   I2   I2   I3   I4   I4
ID    -    I1   I1   I1   I2   I2   I3   I4
EX    -    -    -    -    I1   I1   I2   I2   I2   I3   I4
WB    -    -    -    -    -    -    I1   -    -    I2   I3   I4   I4
  • 59. Q.3 Consider a 4-stage pipeline processor. The number of cycles needed by the four instructions I1, I2, I3 and I4 in stages S1, S2, S3 and S4 is shown below. What is the number of cycles needed to execute the following loop?

for (i = 1 to 2) { I1; I2; I3; I4; }

      S1   S2   S3   S4
I1    2    1    1    1
I2    1    3    2    2
I3    2    1    1    3
I4    1    2    2    2
  • 60. From here, number of clock cycles required to execute the loop = 23 clock cycles.
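Both answers (13 cycles for Q.2, 23 cycles for Q.3) can be verified with a small simulator. The rule used: an instruction enters a stage only after it has finished the previous stage and the preceding instruction has left that stage. The function name is illustrative:

```python
def pipeline_cycles(instr_cycles):
    # finish[s] = cycle at which the previous instruction left stage s
    finish = [0] * len(instr_cycles[0])
    for cycles in instr_cycles:
        prev = 0                         # when this instr left the prior stage
        for s, c in enumerate(cycles):
            start = max(prev, finish[s])
            finish[s] = prev = start + c
    return finish[-1]

q2 = [[1, 3, 2, 1], [2, 2, 3, 1], [1, 1, 1, 1], [2, 1, 1, 2]]
q3 = [[2, 1, 1, 1], [1, 3, 2, 2], [2, 1, 1, 3], [1, 2, 2, 2]]
print(pipeline_cycles(q2))       # 13
print(pipeline_cycles(q3 * 2))   # 23 (the loop body runs twice)
```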
  • 61. Pipeline hazards and their resolution There are mainly three types of hazards possible in a pipelined processor:
1) Structural Hazards
2) Control Hazards
3) Data Hazards
These dependencies (hazards) may introduce stalls in the pipeline.
Stall: A stall is a cycle in the pipeline without new input.
  • 62. Structural Hazards This dependency arises due to resource conflict in the pipeline. A resource conflict is a situation when more than one instruction tries to access the same resource in the same cycle. A resource can be a register, memory, or ALU.

Instruction / Cycle   1         2         3         4         5
I1                    IF(Mem)   ID        EX        Mem
I2                              IF(Mem)   ID        EX        Mem
I3                                        IF(Mem)   ID        EX
I4                                                  IF(Mem)   ID
  • 63. Structural Hazards • In the previous scenario, in cycle 4, instructions I1 and I4 are trying to access the same resource (memory), which introduces a resource conflict. • To avoid this problem, we have to keep the instruction waiting until the required resource (memory in our case) becomes available.

Cycle   1         2         3         4         5         6         7         8
I1      IF(Mem)   ID        EX        Mem
I2                IF(Mem)   ID        EX        Mem
I3                          IF(Mem)   ID        EX        Mem
I4                                    -         -         -         IF(Mem)   ID
  • 64. Structural Hazards • Solution for structural dependency To minimize structural dependency stalls in the pipeline, we use a hardware mechanism called Renaming. • Renaming : According to renaming, we divide the memory into two independent modules used to store the instruction and data separately called Code memory(CM) and Data memory(DM) respectively. • CM will contain all the instructions and DM will contain all the operands that are required for the instructions.
  • 65. Structural Hazards

Inst./Cycle   1        2        3        4        5        6        7
I1            IF(CM)   ID       EX       DM       WB
I2                     IF(CM)   ID       EX       DM       WB
I3                              IF(CM)   ID       EX       DM       WB
I4                                       IF(CM)   ID       EX       DM
I5                                                IF(CM)   ID       EX
I6                                                         IF(CM)   ID
I7                                                                  IF(CM)
  • 66. Control Dependency (Branch Hazards) • This type of dependency occurs during the transfer of control instructions such as BRANCH, CALL, JMP, etc. • On many instruction architectures, the processor will not know the target address of these instructions when it needs to insert the new instruction into the pipeline. • Due to this, unwanted instructions are fed to the pipeline. • NOTE: Generally, the target address of the JMP instruction is known after ID stage only.
  • 67. Control Dependency (Branch Hazards) • Consider the following sequence of instructions in the program:
100: I1
101: I2 (JMP 250)
102: I3
.
.
250: BI1
• Expected output: I1 -> I2 -> BI1
• Output Sequence: I1 -> I2 -> I3 -> BI1
  • 68. Control Dependency (Branch Hazards)

Inst./Cycle   1         2         3         4         5         6
I1            IF(Mem)   ID        EX        Mem
I2                      IF(Mem)   ID        EX        Mem
I3                                IF(Mem)   ID        EX        Mem
BI1                                         IF(Mem)   ID        EX

So, the output sequence is not equal to the expected output; that means the pipeline is not implemented correctly.
  • 69. Control Dependency (Branch Hazards)

Inst./Cycle   1         2         3         4         5         6
I1            IF(Mem)   ID        EX        Mem
I2                      IF(Mem)   ID        EX        Mem
DELAY         -         -         -         -         -         -
BI1                                         IF(Mem)   ID        EX

To correct the above problem we need to stop the instruction fetch until we get the target address of the branch instruction. This can be implemented by introducing a delay slot until we get the target address.
  • 70. Control Dependency (Branch Hazards) • Output Sequence: I1 -> I2 -> Delay (Stall) -> BI1. As the delay slot performs no operation, this output sequence is equal to the expected output sequence. But this slot introduces a stall in the pipeline. • Solution for control dependency: Branch Prediction is the method through which stalls due to control dependency can be eliminated. In this method, a prediction about whether the branch will be taken is made at the first stage. When the prediction is correct, the branch penalty is zero. • Branch penalty: The number of stalls introduced during branch operations in the pipelined processor is known as the branch penalty.
  • 71. Control Dependency (Branch Hazards) • NOTE : As we see that the target address is available after the ID stage, so the number of stalls introduced in the pipeline is 1. • Suppose, the branch target address would have been present after the ALU stage, there would have been 2 stalls. Generally, if the target address is present after the kth stage, then there will be (k – 1) stalls in the pipeline. • Total number of stalls introduced in the pipeline due to branch instructions = Branch frequency * Branch Penalty
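The formula on this slide can be expressed directly; the instruction count and branch frequency below are made-up numbers for illustration:

```python
def total_branch_stalls(n_instructions, branch_frequency, k):
    # Target address available after stage k  =>  (k - 1) stalls per branch
    branch_penalty = k - 1
    return n_instructions * branch_frequency * branch_penalty

# e.g. 1000 instructions, 20% branches, target known after ID (stage 2)
print(total_branch_stalls(1000, 0.20, 2))
```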
  • 72. Data Hazards • Data hazards occur when instructions that exhibit data dependence modify data in different stages of a pipeline. Hazards cause delays in the pipeline. There are mainly three types of data hazards: • 1) RAW (Read after Write) [Flow/True data dependency] 2) WAR (Write after Read) [Anti-data dependency] 3) WAW (Write after Write) [Output data dependency] • WAR and WAW hazards occur during the out-of-order execution of the instructions.
  • 73. Data Hazards - RAW • RAW hazard occurs when instruction I2 tries to read data before instruction I1 writes it. Eg: I1: R2 <- R1 + R3 I2: R4 <- R2 + R3 • Domain (I1)= R1,R3 • Domain (I2)= R2,R3 • Range (I1)= R2 • Range (I2)= R4 • Range (I1) ∩ Domain (I2) ≠ ɸ
  • 74. Data Hazards - RAW

Cycle/Inst.   1      2      3      4      5
I1            IF/D   OF     EX     WB
I2                   IF/D   OF     EX     WB

Cycle/Inst.   1      2      3      4      5      6      7      8
I1            IF/D   OF     EX     WB
I2                   -      -      -      IF/D   OF     EX     WB
  • 75. Data Hazards - WAR • WAR hazard occurs when instruction I2 tries to write data before instruction I1 reads it. Eg: I1: R2 <- R1 + R3 I2: R3 <- R4 + R5 • Domain (I1)= R1,R3 • Domain (I2)= R4,R5 • Range (I1)= R2 • Range (I2)= R3 • Domain (I1) ∩ Range (I2) ≠ ɸ
  • 76. Data Hazards - WAW • WAW hazard occurs when instruction I2 tries to write output before instruction I1 writes it. Eg: I1: R2 <- R1 + R3 I2: R2 <- R4 + R5 • Domain (I1)= R1,R3 • Domain (I2)= R4,R5 • Range (I1)= R2 • Range (I2)= R2 • Range(I1) ∩ Range (I2) ≠ ɸ
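The domain/range tests on the last three slides can be coded directly. Each instruction is described by the set of registers it reads (its domain) and writes (its range); the helper function is illustrative:

```python
def classify_hazards(domain1, range1, domain2, range2):
    # I1 precedes I2; arguments are sets of register names
    return {
        "RAW": bool(range1 & domain2),   # Range(I1) ∩ Domain(I2) ≠ ∅
        "WAR": bool(domain1 & range2),   # Domain(I1) ∩ Range(I2) ≠ ∅
        "WAW": bool(range1 & range2),    # Range(I1) ∩ Range(I2) ≠ ∅
    }

# I1: R2 <- R1 + R3 ; I2: R4 <- R2 + R3   (the RAW example from the slides)
print(classify_hazards({"R1", "R3"}, {"R2"}, {"R2", "R3"}, {"R4"}))
# {'RAW': True, 'WAR': False, 'WAW': False}
```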
  • 77. Array Processor • An array processor is a processor that performs computations on large arrays of data. • The term is used to refer to two different types of processors. • An attached array processor is an auxiliary processor attached to a general-purpose computer. • It is intended to improve the performance of the host computer in specific numerical computation tasks. • An SIMD array processor is a processor that has a single- instruction multiple-data organization.
  • 78. Array Processor • It manipulates vector instructions by means of multiple functional units responding to a common instruction. • Although both types of array processors manipulate vectors, their internal organization is different. • The system with the attached processor satisfies the needs for complex arithmetic applications. • The objective of the attached array processor is to provide vector manipulation capabilities to a conventional computer at a fraction of the cost of supercomputers.
  • 79. Attached Array Processor • An attached array processor is designed as a peripheral for a conventional host computer, and its purpose is to enhance the performance of the computer by providing vector processing for complex scientific applications. • It achieves high performance by means of parallel processing with multiple functional units. • It includes an arithmetic unit containing one or more pipelined floating-point adders and multipliers. • The array processor can be programmed by the user to accommodate a variety of complex arithmetic problems.
  • 80. Attached Array Processor • The host computer is a general-purpose commercial computer and the attached processor is a back-end machine driven by the host computer. • The array processor is connected through an input-output controller to the computer and the computer treats it like an external interface. • The data for the attached processor are transferred from main memory to a local memory through a high-speed bus. • The general-purpose computer without the attached processor serves the users that need conventional data processing.
  • 81. Attached Array Processor with Host Computer
  • 82. SIMD Array Processor • An SIMD array processor is a computer with multiple processing units operating in parallel. • The processing units are synchronized to perform the same operation under the control of a common control unit, thus providing a single instruction stream, multiple data stream (SIMD) organization. • It contains a set of identical processing elements (PEs), each having a local memory M. • Each processor element includes an ALU, a floating-point arithmetic unit, and working registers.
  • 83. SIMD Array Processor • The master control unit controls the operations in the processor elements. • The main memory is used for storage of the program. • The function of the master control unit is to decode the instructions and determine how the instruction is to be executed. • Scalar and program control instructions are directly executed within the master control unit. Vector instructions are broadcast to all PEs simultaneously. • Each PE uses operands stored in its local memory.
  • 84. (Figure: SIMD array processor organization)
  • 85. SIMD Array Processor • Consider, for example, the vector addition C = A + B. • The master control unit first stores the ith components aᵢ and bᵢ of A and B in local memory M, for i = 1, 2, 3, . . . , n. • It then broadcasts the floating-point add instruction cᵢ = aᵢ + bᵢ to all PEs, causing the addition to take place simultaneously. • The components of cᵢ are stored in fixed locations in each local memory. • This produces the desired vector sum in one add cycle.
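The vector addition described above can be mimicked in a few lines. Each dictionary below stands in for one PE's local memory, and the single loop models the broadcast add (purely illustrative, not real SIMD hardware):

```python
A = [1.0, 2.0, 3.0, 4.0]
B = [10.0, 20.0, 30.0, 40.0]

# Control unit stores the i-th components in each PE's local memory
local_mem = [{"a": a, "b": b} for a, b in zip(A, B)]

# Broadcast c = a + b: every PE executes the same add on its own data
for m in local_mem:
    m["c"] = m["a"] + m["b"]

print([m["c"] for m in local_mem])   # [11.0, 22.0, 33.0, 44.0]
```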
  • 86. SIMD Array Processor • Masking schemes are used to control the status of each PE during the execution of vector instructions. • Each PE has a flag that is set when the PE is active and reset when the PE is inactive. • This ensures that only those PEs that need to participate are active during the execution of the instruction. • For example, suppose that the array processor contains a set of 64 PEs. If a vector length of less than 64 data items is to be processed the control unit selects the proper number of PEs to be active.
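A masking scheme like the one described can be sketched as a per-PE flag. The 64 PEs and the vector length of 40 below are assumed for illustration:

```python
N_PES, VLEN = 64, 40
mask = [i < VLEN for i in range(N_PES)]      # flag set -> PE is active

a = list(range(N_PES))
b = list(range(N_PES))
# Only the PEs whose flag is set participate in the broadcast add
c = [x + y if active else None for x, y, active in zip(a, b, mask)]

print(sum(1 for v in c if v is not None))    # 40 PEs produced a result
```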
  • 87. Vector Processing • There is a class of computational problems that are beyond the capabilities of a conventional computer. • These problems are characterized by the fact that they require a vast number of computations that will take a conventional computer days or even weeks to complete. • Computers with vector processing capabilities are in demand in specialized applications. • The following are representative application areas where vector processing is of the utmost importance.
  • 88. Vector Processing - Applications • Long-range weather forecasting • Petroleum explorations • Medical diagnosis • Aerodynamics and space flight simulations • Artificial intelligence and expert systems • Image processing
  • 89. Vector Operations • Many scientific problems require arithmetic operations on large arrays of numbers. • These numbers are usually formulated as vectors and matrices of floating-point numbers. • A vector is an ordered set of a one-dimensional array of data items. • A vector V of length n is represented as a row vector by V = [V1 V2 V3 · · · Vn]. It may be represented as a column vector if the data items are listed in a column. • A conventional sequential computer is capable of processing operands one at a time.
  • 90. Vector Operations • Consequently, operations on vectors must be broken down into single computations with subscripted variables. • The element Vi of vector V is written as V(I) and the index I refers to a memory address or register where the number is stored. • To examine the difference between a conventional scalar processor and a vector processor, consider the following Fortran DO loop:

      DO 20 I = 1, 100
20    C(I) = B(I) + A(I)
  • 91. Vector Operations • This is a program for adding two vectors A and B of length 100 to produce a vector C. It is implemented in machine language by the following sequence of operations:

      Initialize I = 0
20    Read A(I)
      Read B(I)
      Store C(I) = A(I) + B(I)
      Increment I = I + 1
      If I <= 100 go to 20
      Continue
  • 92. Vector Operations • This constitutes a program loop that reads a pair of operands from arrays A and B and performs a floating-point addition. • The loop control variable is then updated and the steps repeat 100 times. • A computer capable of vector processing eliminates the overhead associated with the time it takes to fetch and execute the instructions in the program loop. • It allows operations to be specified with a single vector instruction of the form C(1 : 100) = A(1 : 100) + B(1 : 100)
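The contrast between the scalar loop and the single vector statement can be sketched in Python; the list comprehension below stands in for the one-instruction vector form C(1:100) = A(1:100) + B(1:100):

```python
A = list(range(100))
B = list(range(100, 200))

# Scalar processor: one element per iteration, with loop-control overhead
# (read, add, store, increment, compare-and-branch) repeated 100 times
C = [0] * 100
for i in range(100):
    C[i] = B[i] + A[i]

# Vector processor: the whole operation expressed as a single statement
C_vec = [b + a for a, b in zip(A, B)]

print(C_vec == C)   # True
```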
  • 94. Vector Instruction • The vector instruction includes the initial address of the operands, the length of the vectors, and the operation to be performed, all in one composite instruction. • This is essentially a three-address instruction with three fields specifying the base address of the operands and an additional field that gives the length of the data items in the vectors. This assumes that the vector operands reside in memory.
  • 95. Vector Instruction 1. Operation Code:- Operation code indicates the operation that has to be performed in the given instruction. It decides the functional unit for the specified operation or reconfigures the multifunction unit. 2. Base Address:- Base address field refers to the memory location from where the operands are to be fetched or to where the result has to be stored. The base address is found in the memory reference instructions. In the vector instruction, the operand and the result both are stored in the vector registers. Here, the base address refers to the designated vector register. 3. Vector Length:-Vector length specifies the number of elements in a vector operand. It identifies the termination of a vector instruction.
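The three fields above can be illustrated with a toy instruction format and interpreter. All names here (the class, the opcode "VADD", the flat `memory` list) are hypothetical, chosen only to show how the base addresses and vector length drive execution:

```python
from dataclasses import dataclass

@dataclass
class VectorInstruction:
    opcode: str      # operation to be performed
    src1: int        # base address of first operand vector
    src2: int        # base address of second operand vector
    dst: int         # base address of the result vector
    length: int      # number of elements; terminates the instruction

def execute(instr, memory):
    if instr.opcode == "VADD":
        for i in range(instr.length):
            memory[instr.dst + i] = (memory[instr.src1 + i]
                                     + memory[instr.src2 + i])

mem = [0] * 16
mem[0:3] = [1, 2, 3]        # vector A at base address 0
mem[4:7] = [10, 20, 30]     # vector B at base address 4
execute(VectorInstruction("VADD", 0, 4, 8, 3), mem)
print(mem[8:11])            # [11, 22, 33]
```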