UNIT-3
Parallel Algorithm
Design Principles and
Programming
Topics to be covered
What is an Algorithm?
• An algorithm is a sequence of steps that takes inputs from the user and, after some computation, produces an output.
• As per the architecture, there are two types of computers:
– Sequential Computer
– Parallel Computer
Cont..
• Depending on the architecture of computers, we have two types of
algorithms
• Sequential Algorithm − An algorithm in which consecutive steps of instructions are executed in chronological order to solve a problem.
• Parallel Algorithm − The problem is divided into sub-problems, which are executed in parallel to get individual outputs. Later on, these individual outputs are combined together to get the final desired output.
OR
• A parallel algorithm is an algorithm that can execute several instructions
simultaneously on different processing devices and then combine all the
individual outputs to produce the final result.
Constructing a Parallel Algorithm
• identify portions of work that can be performed
concurrently
• map concurrent portions of work onto multiple
processes running in parallel
• distribute a program’s input, output, and
intermediate data
• manage accesses to shared data: avoid conflicts
• synchronize the processes at stages of the parallel
program execution
Communications
• Who Needs Communications? The need for communications
between tasks depends upon your problem:
• YOU DON'T NEED COMMUNICATIONS
– Some types of problems can be decomposed and executed in parallel with virtually no need for tasks to share data. These types of problems are often called embarrassingly parallel: little or no communication is required.
• Need for communication: Most parallel applications are not quite so simple, and do require tasks to share data with each other.
• For example, a 2-D heat diffusion problem requires a task to know the temperatures calculated by the tasks that have neighboring data. Changes to neighboring data have a direct effect on that task's data.
COMMUNICATION OVERHEAD
• Inter-task communication virtually always implies overhead.
• Machine cycles and resources that could be used for computation are instead used to
package and transmit data.
• Communications frequently require some type of synchronization between tasks,
which can result in tasks spending time "waiting" instead of doing work.
• Competing communication traffic can saturate the available network bandwidth,
further aggravating performance problems.
SYNCHRONOUS (coordination) VS. ASYNCHRONOUS COMMUNICATIONS
• Synchronous communications require some type of "handshaking" between tasks
that are sharing data. This can be explicitly structured in code by the programmer, or
it may happen at a lower level unknown to the programmer.
• Synchronous communications are often referred to as blocking communications since
other work must wait until the communications have completed.
• Asynchronous communications allow tasks to transfer data independently from
one another. For example, task 1 can prepare and send a message to task 2, and
then immediately begin doing other work. When task 2 actually receives the data
doesn't matter.
– Asynchronous communications are often referred to as non-blocking communications since other work can be done while the communications are taking place.
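As a concrete sketch of the two styles, the snippet below uses mpi4py, which is an assumption on my part; the slides do not name a particular message-passing API. Rank 0 first performs a blocking send, then a non-blocking send that lets it overlap other work before completing the transfer with wait(). It would be launched with something like `mpiexec -n 2 python sync_vs_async.py`.

```python
# Hedged sketch: mpi4py is an assumed choice, not named by the slides.
# Launch with: mpiexec -n 2 python sync_vs_async.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # Blocking send: does not return until the message buffer can safely be
    # reused, so rank 0 does no other work in the meantime.
    comm.send({"step": 1}, dest=1, tag=0)

    # Non-blocking send: returns immediately, so rank 0 can overlap other
    # work and finish the communication later with wait().
    req = comm.isend({"step": 2}, dest=1, tag=1)
    overlapped = sum(range(1_000))      # other useful work done in the meantime
    req.wait()
elif rank == 1:
    print(comm.recv(source=0, tag=0))   # blocking receive
    print(comm.recv(source=0, tag=1))
```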
Types of Synchronization
• BARRIER
• LOCK / SEMAPHORE
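A minimal sketch of the two synchronization types listed above, using Python's standard threading primitives; the worker function and the shared counter are illustrative, not from the slides.

```python
import threading

NUM_TASKS = 4
barrier = threading.Barrier(NUM_TASKS)   # every task must arrive before any proceeds
lock = threading.Lock()                  # lock/semaphore: one task at a time in the critical section
counter = 0

def task(tid):
    global counter
    with lock:                  # protect the shared counter from conflicting updates
        counter += 1
    barrier.wait()              # barrier: wait here until all tasks have updated the counter
    if tid == 0:
        print("counter =", counter)   # guaranteed to print 4

threads = [threading.Thread(target=task, args=(i,)) for i in range(NUM_TASKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```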
TASK DECOMPOSITION
Degree of Concurrency
• Degree of Concurrency: # of tasks that can execute in
parallel
– Maximum degree of concurrency: the largest number of concurrent tasks at any point of the execution
– Average degree of concurrency: the average number of tasks that can be executed concurrently
– Degree of Concurrency vs. Task Granularity
• Speedup = sequential execution time / parallel execution time
• Parallel efficiency = sequential execution time / (parallel execution time × processors used)
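A small worked example of the two formulas above (the timings are made up for illustration):

```python
# speedup    = T_sequential / T_parallel
# efficiency = T_sequential / (T_parallel * p) = speedup / p
def speedup(t_seq, t_par):
    return t_seq / t_par

def efficiency(t_seq, t_par, p):
    return t_seq / (t_par * p)

t_seq, t_par, p = 100.0, 30.0, 4
print(f"speedup    = {speedup(t_seq, t_par):.2f}x")       # 3.33x
print(f"efficiency = {efficiency(t_seq, t_par, p):.2f}")  # 0.83
```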
Task Interaction graph
• The pattern of interaction among tasks is captured by
what is known as Task Interaction graph.
• Node= task
• Edges connect tasks that interact with each other
3.1.3 Processes and Mapping
(1/5)
■ In general, the number of tasks in a decomposition
exceeds the number of processing elements
available.
■ For this reason, a parallel algorithm must also
provide a mapping of tasks to processes.
Processes and Mapping (2/5)
Note: We refer to the mapping as being from tasks to
processes, as opposed to processors. This is because typical
programming APIs, as we shall see, do not allow easy
binding of tasks to physical processors. Rather, we
aggregate tasks into processes and rely on the system to
map these processes to physical processors(CPU).
We use processes, not in the UNIX sense of a process,
rather, simply as a collection of tasks and associated data.
Processes and Mapping (3/5)
■ Appropriate mapping of tasks to processes is critical to the parallel performance of an algorithm.
■ Mappings are determined by both the task dependency and task interaction graphs.
■ Task dependency graphs can be used to ensure that work is equally spread across all processes at any point (minimum idling and optimal load balance).
■ Task interaction graphs can be used to make sure that processes need minimum interaction with other processes (minimum communication).
Processes and Mapping (4/5)
An appropriate mapping must minimize parallel
execution time by:
■ Mapping independent tasks to different processes.
■ Assigning tasks on critical path to processes as soon as they become available.
■ Minimizing interaction between processes by mapping tasks with dense
interactions to the same process.
Processes and Mapping (5/5)
■ Difficulty: criteria often conflict with one
another
■ e.g., no decomposition (a single task) minimizes interactions but gives no speedup!
Processes and Mapping: Example
(1/2)
■ Mapping tasks in the database query
decomposition to processes.
■ These mappings were arrived at by viewing the dependency graph in
terms of levels (no two nodes in a level have dependencies).
Tasks within a single level are then assigned to different processes.
Processes and Mapping: Example
(2/2)
3.1.4 Processes vs. processors (1/2)
■ Processors are physical resources.
■ Processes provide a convenient way of abstracting or virtualizing a multiprocessor.
■ Write parallel programs in terms of processes, not physical processors; mapping processes to processors is a subsequent step.
■ The number of processes does not have to be the same as the number of processors available to the program.
■ If there are more processes, they are multiplexed onto the available processors; if there are fewer processes, then some processors will remain idle.
Processes vs. processors (2/2)
■ The correct OS definition of a process is that of an address space and one or more threads of control that share that address space.
■ Thus, processes and threads are distinguished in that definition.
■ In what follows, assume that a process has only one thread of control.
Decomposition Techniques for Parallel Algorithms
• So how does one decompose a task into various subtasks?
• While there is no single recipe that works for all problems, we
present a set of commonly used techniques that apply to
broad classes of problems. These include:
1. recursive decomposition
2. data decomposition
3. exploratory decomposition
4. speculative decomposition
1. Recursive Decomposition
• Generally suited to problems (tasks) that are solved using the divide-and-conquer strategy.
• A given problem is first decomposed into a set of sub-problems.
• These sub-problems are recursively decomposed further until a desired granularity is reached.
Specific algorithms that fit recursive decomposition (a parallel merge sort sketch follows the list):
• Merge sort
• Quick sort
• Binary search tree
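Below is a minimal sketch of recursive decomposition applied to merge sort, using only Python's standard library. The helper names (`split`, `parallel_merge_sort`) and the fixed recursion depth are illustrative choices, not from the slides: the input is recursively split to a chosen granularity, the leaf sub-lists are sorted as independent tasks, and the results are merged on the way back up.

```python
from concurrent.futures import ProcessPoolExecutor
from heapq import merge

def split(data, depth):
    """Recursively split the input until the desired granularity is reached."""
    if depth == 0 or len(data) <= 1:
        return [data]
    mid = len(data) // 2
    return split(data[:mid], depth - 1) + split(data[mid:], depth - 1)

def parallel_merge_sort(data, depth=2):
    leaves = split(data, depth)                         # recursive decomposition into 2**depth tasks
    with ProcessPoolExecutor() as pool:
        sorted_leaves = list(pool.map(sorted, leaves))  # independent leaf tasks run in parallel
    result = sorted_leaves[0]
    for part in sorted_leaves[1:]:                      # combine step (serial here for brevity)
        result = list(merge(result, part))
    return result

if __name__ == "__main__":
    print(parallel_merge_sort([5, 3, 8, 1, 9, 2, 7, 4]))
```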
3.2.2 Data Decomposition (1/3)
■ Identify the data on which computations are performed.
■ Partition this data across various tasks.
■ This partitioning induces a decomposition of the problem.
■ Data can be partitioned in various ways - this critically impacts the performance of a parallel algorithm.
3.2.2 Data Decomposition (2/3)
■ Is a powerful and commonly used method for deriving
concurrency in algorithms that operate on large data structures.
■ The decomposition of computations is done in two steps.
1. The data on which the computations are performed is
partitioned,
2. This data partitioning is used to induce a partitioning of the
computations into tasks.
Data Decomposition (3/3)
■ The operations that these tasks perform on different data
partitions are usually similar (e.g., matrix multiplication that
follows) or are chosen from a small set of operations (e.g., LU
factorization).
■ One must explore and evaluate all possible ways of partitioning
the data and determine which one yields a natural and efficient
computational decomposition.
3.2.2.1 Partitioning Output Data
(1/10)
■ In many computations, each element of the output can be
computed independently of others as a function of the input.
■ In such computations, a partitioning of the output data
automatically induces a decomposition of the problems into tasks
■ each task is assigned the work of computing a portion of the
output
Consider the problem of multiplying two n x n matrices A and B to yield matrix C. The output matrix C can be partitioned into four blocks, inducing four tasks, as follows:
Partitioning Output Data: Example-1 (2/3)
■ The 4 sub-matrices of C, each roughly of size (n/2 x n/2), are then independently computed by 4 tasks as the sums of the appropriate products of sub-matrices of A and B.
■ Other task decompositions are possible.
Decomposition I
Task 1: C1,1 = A1,1 B1,1
Task 2: C1,1 = C1,1 + A1,2 B2,1
Task 3: C1,2 = A1,1 B1,2
Task 4: C1,2 = C1,2 + A1,2 B2,2
Task 5: C2,1 = A2,1 B1,1
Task 6: C2,1 = C2,1 + A2,2 B2,1
Task 7: C2,2 = A2,1 B1,2
Task 8: C2,2 = C2,2 + A2,2 B2,2

Decomposition II
Task 1: C1,1 = A1,1 B1,1
Task 2: C1,1 = C1,1 + A1,2 B2,1
Task 3: C1,2 = A1,2 B2,2
Task 4: C1,2 = C1,2 + A1,1 B1,2
Task 5: C2,1 = A2,2 B2,1
Task 6: C2,1 = C2,1 + A2,1 B1,1
Task 7: C2,2 = A2,1 B1,2
Task 8: C2,2 = C2,2 + A2,2 B2,2
▪ Which of these tasks are independent of each other?
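A sketch of the four output-block tasks described above, assuming NumPy is available; each task owns one (n/2 x n/2) block of C and folds the two products of Decomposition I for that block into a single task. The helper names and the thread pool are illustrative choices.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def block(M, i, j, h):
    """Return the (i, j) block of size h x h of matrix M (0-based indices)."""
    return M[i*h:(i+1)*h, j*h:(j+1)*h]

def compute_C_block(A, B, i, j):
    h = A.shape[0] // 2
    # C_ij = A_i1 B_1j + A_i2 B_2j : the two products assigned to this output block
    return block(A, i, 0, h) @ block(B, 0, j, h) + block(A, i, 1, h) @ block(B, 1, j, h)

n = 4
A, B = np.random.rand(n, n), np.random.rand(n, n)
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {(i, j): pool.submit(compute_C_block, A, B, i, j)
               for i in range(2) for j in range(2)}        # one task per block of C
C = np.block([[futures[0, 0].result(), futures[0, 1].result()],
              [futures[1, 0].result(), futures[1, 1].result()]])
assert np.allclose(C, A @ B)
```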
Partitioning Output Data : Example -2 (1/6)
■ Computing frequencies of itemsets in a transaction database.
■ Given a set T containing n transactions and a set I
containing m itemsets.
■ Each transaction and itemset contains a small number of items, out of a possible set of items.
Partitioning Output Data :
Example -2 (2/6)
■ Example: T is a grocery store's database of customer sales, with each transaction being an individual shopper's grocery list, and an itemset could be a group of items in the store.
■ If the store desires to find out how many customers bought each of the designated groups of items, then it would need to find the number of times that each itemset in I appears in all the transactions (the number of transactions of which each itemset is a subset).
Partitioning Output Data: Example-2 (3/6)
Partitioning Output Data: Example-2 (4/6)
Partitioning Output Data :
Example -2 (5/6)
■ Fig. (a) shows an example
■ The database shown consists of 10 transactions, and
■ We are interested in computing the frequency of the 8 itemsets shown in
the second column.
■ The actual frequencies of these itemsets in the database (output) are shown
in the third column.
■ For instance, itemset {D, K} appears twice, once in
the second and once in the ninth transaction.
Partitioning Output Data :
Example -2 (6/6)
■ Fig. (b) shows how the computation of
frequencies of the itemsets can be decomposed
into 2 tasks
■ by partitioning the output into 2 parts and having each
task compute its half of the frequencies.
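A toy sketch of this two-task output partitioning (the transactions and itemsets below are made up, not the database in the figure): the itemset list is split in half, and each task counts the frequencies of its own half over the full transaction set.

```python
from concurrent.futures import ThreadPoolExecutor

transactions = [{"A", "B", "D"}, {"D", "K"}, {"A", "C"}, {"B", "D", "K"}]
itemsets = [{"A"}, {"B"}, {"D", "K"}, {"A", "C"}]

def count(itemset_part):
    # frequency of each itemset = number of transactions it is a subset of
    return {frozenset(s): sum(s <= t for t in transactions) for s in itemset_part}

half = len(itemsets) // 2
with ThreadPoolExecutor(max_workers=2) as pool:
    part1 = pool.submit(count, itemsets[:half])   # task 1 owns the first half of the output
    part2 = pool.submit(count, itemsets[half:])   # task 2 owns the second half of the output
frequencies = {**part1.result(), **part2.result()}
print(frequencies)
```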
3.2.2.2 Partitioning Input Data
(1/6)
■ Remark: Partitioning of output data can be performed only if each output can be naturally computed as a function of the input.
■ In many algorithms, it is not possible or desirable to partition the output data. For example:
■ While finding the minimum, maximum, or the sum of a set of numbers, the output is a single unknown value.
■ In a sorting algorithm, the individual elements of the output cannot be efficiently determined in isolation.
Partitioning Input Data (2/6)
■ It is sometimes possible to partition the input data, and then use this partitioning to induce concurrency.
■ A task is created for each partition of the input data, and this task performs as much computation as possible using these local data.
■ Solutions to tasks induced by input partitions may not directly solve the original problem.
■ In such cases, a follow-up computation is needed to combine the results.
Partitioning Input Data
Example -1
■ Example: finding the sum of N numbers using p processes (N > p):
■ We can partition the input into p subsets of nearly equal sizes.
■ Each task then computes the sum of the numbers in one of the subsets.
■ Finally, the p partial results can be added up to yield the final result (see the sketch below).
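A minimal sketch of this input-partitioned sum in Python: the input is split into p nearly equal parts, each task sums its own part, and a follow-up step combines the partial results.

```python
from concurrent.futures import ProcessPoolExecutor

def parallel_sum(numbers, p=4):
    chunk = (len(numbers) + p - 1) // p
    parts = [numbers[i:i + chunk] for i in range(0, len(numbers), chunk)]
    with ProcessPoolExecutor(max_workers=p) as pool:
        partial = pool.map(sum, parts)    # one task per input partition
    return sum(partial)                   # follow-up step combines the partial results

if __name__ == "__main__":
    print(parallel_sum(list(range(1, 101))))  # 5050
```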
Partitioning Input Data
Example -2 (1/3)
■ The problem of computing the frequency of a
set of itemsets
■ Can also be decomposed based on a partitioning
of input data.
■ Fig. shows a decomposition based on a
partitioning of the input set of transactions.
Partitioning Input Data : Example-2
(2/3)
Partitioning Input Data
Example -2 (3/3)
■ Each of the two tasks computes the frequencies of all the
itemsets in its respective subset of transactions.
■ The two sets of frequencies, which are the independent outputs
of the two tasks, represent intermediate results.
■ Combining the intermediate results by pairwise addition yields the final result.
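For contrast with the output-partitioned version, here is a toy sketch of the input-partitioned decomposition just described (again with made-up data): each task counts every itemset over its own subset of transactions, and the two intermediate frequency vectors are added pairwise at the end.

```python
from concurrent.futures import ThreadPoolExecutor

transactions = [{"A", "B", "D"}, {"D", "K"}, {"A", "C"}, {"B", "D", "K"}]
itemsets = [frozenset({"A"}), frozenset({"B"}), frozenset({"D", "K"})]

def count(transaction_part):
    # frequency of every itemset over this task's subset of transactions
    return [sum(s <= t for t in transaction_part) for s in itemsets]

half = len(transactions) // 2
with ThreadPoolExecutor(max_workers=2) as pool:
    f1 = pool.submit(count, transactions[:half])   # intermediate result of task 1
    f2 = pool.submit(count, transactions[half:])   # intermediate result of task 2
frequencies = [a + b for a, b in zip(f1.result(), f2.result())]  # pairwise addition
print(dict(zip(itemsets, frequencies)))
```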
3.2.2.3 Partitioning both Input and Output
Data (1/2)
■ In some cases in which it is possible to partition the output data, partitioning of input data can offer additional concurrency.
■ Example: consider the 4-way decomposition shown in Fig. for computing itemset frequencies.
■ Both the transaction set and the frequencies are divided into two parts, and a different one of the four possible combinations is assigned to each of the four tasks.
■ Each task then computes a local set of frequencies.
■ Finally, the outputs of Tasks 1 and 3 are added together, as are the outputs of Tasks 2 and 4.
Partitioning Input and Output Data
3.2.2.4 Intermediate Data
Partitioning (1/8)
■ Partitioning intermediate data can sometimes lead to higher concurrency than partitioning input or output data.
■ Often, the intermediate data are not generated explicitly in the serial algorithm for solving the problem.
■ Some restructuring of the original algorithm may be required to use intermediate data partitioning to induce a decomposition.
Intermediate Data Partitioning
(2/8)
■ Example: matrix multiplication
■ Recall that the decompositions induced by a 2 x 2 partitioning of the output matrix C have a maximum degree of concurrency of four.
■ Increase the degree of concurrency by introducing an intermediate stage in which 8 tasks compute their respective product submatrices and store the results in a temporary 3-D matrix D, as shown in Fig.
■ Submatrix Dk,i,j is the product of Ai,k and Bk,j.
Intermediate Data Partitioning: (3/8)
Example
Intermediate Data Partitioning
(4/8)
■ A partitioning of the intermediate matrix D induces a decomposition into
eight tasks.
■ After the multiplication phase, a relatively inexpensive matrix addition step
can compute the result matrix C
■ All submatrices D*,i,j with the same second and third dimensions i and j are
added to yield Ci,j.
Intermediate Data Partitioning:
Example (5/8)
A decomposition of the intermediate data structure leads to the following decomposition into 8 + 4 tasks:
Stage I (the eight multiplication tasks), then Stage II (the four addition tasks).
Intermediate Data Partitioning
(6/8)
■ The eight tasks numbered 1 through 8 in Fig. perform O(n³/8) work each in multiplying (n/2 x n/2) submatrices of A and B.
■ Then, four tasks numbered 9 through 12 spend O(n²/4) time each in adding the appropriate (n/2 x n/2) submatrices of the intermediate matrix D to yield the final result matrix C.
■ Second Fig. shows the task dependency graph
Intermediate Data Partitioning:
Example (7/8)
Task 01: D1,1,1 = A1,1 B1,1
Task 02: D2,1,1 = A1,2 B2,1
Task 03: D1,1,2 = A1,1 B1,2
Task 04: D2,1,2 = A1,2 B2,2
Task 05: D1,2,1 = A2,1 B1,1
Task 06: D2,2,1 = A2,2 B2,1
Task 07: D1,2,2 = A2,1 B1,2
Task 08: D2,2,2 = A2,2 B2,2
Task 09: C1,1 = D1,1,1 + D2,1,1
Task 10: C1,2 = D1,1,2 + D2,1,2
Task 11: C2,1 = D1,2,1 + D2,2,1
Task 12: C2,2 = D1,2,2 + D2,2,2
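A sketch of this 8 + 4 task decomposition, assuming NumPy (0-based block indices are used in place of the 1-based indices above): Stage I computes the eight submatrix products D[k, i, j] = A[i, k] B[k, j] concurrently, and Stage II adds the two D blocks for each (i, j) to form C.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

n = 4
A, B = np.random.rand(n, n), np.random.rand(n, n)
h = n // 2
# Pre-slice the 2 x 2 blocks of A and B (0-based indices)
Ab = [[A[i*h:(i+1)*h, k*h:(k+1)*h] for k in range(2)] for i in range(2)]
Bb = [[B[k*h:(k+1)*h, j*h:(j+1)*h] for j in range(2)] for k in range(2)]

with ThreadPoolExecutor(max_workers=8) as pool:
    # Stage I: tasks 1-8, one per entry of the intermediate 3-D structure D
    D = {(k, i, j): pool.submit(np.matmul, Ab[i][k], Bb[k][j])
         for k in range(2) for i in range(2) for j in range(2)}
# Stage II: tasks 9-12, one inexpensive addition per output block
C = np.block([[D[0, i, j].result() + D[1, i, j].result() for j in range(2)]
              for i in range(2)])
assert np.allclose(C, A @ B)
```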
Intermediate Data Partitioning:
Example (8/8)
The task dependency graph for the
decomposition (shown in previous foil) into 12
tasks is as follows:
The Owner Computes Rule (1/2)
■ A decomposition based on partitioning output or input data is also widely referred to as the owner-computes rule.
■ The idea behind this rule is that each partition performs all the computations involving data that it owns.
■ Depending on the nature of the data or the type of data partitioning, the owner-computes rule may mean different things.
The Owner Computes Rule (2/2)
■ When we assign partitions of the input data to
tasks, then the owner computes rule means that a
task performs all the computations that can be
done using these data.
■ If we partition the output data, then the owner-
computes rule means that a task computes all the
data in the partition assigned to it.
3.2.3 Exploratory Decomposition
(1/10)
■ In many cases, the decomposition of the problem goes hand-in-hand with its execution.
■ These problems typically involve the exploration (search) of a state space of solutions.
■ Problems in this class include a variety of discrete optimization problems (0/1 integer programming, QAP (quadratic assignment problem), etc.), theorem proving, game playing, etc.
Exploratory Decomposition-(2/10)
Example-1=> 15-puzzle problem
■ Consists of 15 tiles numbered 1 through 15 and one blank tile placed in a 4 x 4 grid.
■ A tile can be moved into the blank position from a position adjacent to it, thus creating a blank in the tile's original position.
■ Four moves are possible: up, down, left, and right.
■ The initial and final configurations of the tiles are specified.
Exploratory Decomposition-
Example-1 =>15-puzzle problem (1/4)
■ The objective is to determine any sequence, or a shortest sequence, of moves that transforms the initial configuration to the final configuration.
■ Fig. illustrates sample initial and final configurations and a sequence of moves leading from the initial configuration to the final configuration.
Exploratory Decomposition-
Example-1 =>15-puzzle problem (2/4)
■ The puzzle is typically solved using tree-search techniques.
■ From the initial configuration, all possible successor configurations are generated.
■ A configuration may have 2, 3, or 4 possible successors, depending on the occupation of the empty slot by one of its neighbors.
■ The task of finding a path from the initial to the final configuration now translates to finding a path from one of these newly generated configurations to the final configuration.
■ Since one of these newly generated configurations must be closer to the solution by one move, progress has been made towards finding the solution.
Exploratory Decomposition - Example-1 => 15-puzzle problem (3/4)
■ The configuration space generated by the tree search is the state space graph.
■ Each node of the graph is a configuration, and each edge of the graph connects configurations that can be reached from one another by a single move of a tile.
■ One method for solving this problem in parallel (see the sketch below):
– First, a few levels of configurations starting from the initial configuration are generated serially until the search tree has a sufficient number of leaf nodes.
– Now each leaf node is assigned to a task to explore further until at least one of them finds a solution.
– As soon as one of the concurrent tasks finds a solution, it can inform the others to terminate their searches.
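The parallel search strategy just described can be sketched generically (this is not a 15-puzzle solver; the goal test and the integer search ranges below are stand-ins): the search space is split across concurrent tasks, and the first task to find a solution signals the others to stop.

```python
from concurrent.futures import ThreadPoolExecutor
import threading

found = threading.Event()

def is_solution(x):              # hypothetical goal test standing in for the puzzle search
    return x == 123_456

def explore(start, stop):
    for x in range(start, stop):
        if found.is_set():       # another task has already found a solution: terminate
            return None
        if is_solution(x):
            found.set()          # inform the other tasks that a solution exists
            return x
    return None

ranges = [(i * 250_000, (i + 1) * 250_000) for i in range(4)]   # 4 concurrent tasks
with ThreadPoolExecutor(max_workers=4) as pool:
    results = pool.map(lambda r: explore(*r), ranges)
print([r for r in results if r is not None])
```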
Exploratory Decomposition- (6/10)
Example-1 =>15-puzzle problem
(4/4)
Exploratory vs. data
decomposition (7/10)
■ The tasks induced by data decomposition are performed in their entirety, and each task performs useful computations towards the solution of the problem.
■ In exploratory decomposition, unfinished tasks can be terminated as soon as an overall solution is found.
■ The portion of the search space searched (and the aggregate amount of work performed) by a parallel formulation can be different from that searched by a serial algorithm.
■ The work performed by the parallel formulation can be either smaller or greater than that performed by the serial algorithm.
Exploratory Decomposition-
Example-2 (1/3) (8/10)
■ Example: consider a search space that has been partitioned into four concurrent tasks, as shown in Fig.
■ If the solution lies right at the beginning of the search space corresponding to task 3 (Fig. (a)), then it will be found almost immediately by the parallel formulation.
■ The serial algorithm would have found the solution only after searching the portions of the space corresponding to the preceding tasks first.
Exploratory Decomposition-
Example-2 (2/3) (9/10)
■ On the other hand, if the solution lies towards the
end of the search space corresponding to task 1
(Fig (b)), then the parallel formulation will
perform almost four times the work of the serial
algorithm and will yield no speedup.
■ This change results in super- or sub-linear
speedups.
Exploratory Decomposition-
Example-2 (3/3) (10/10)
3.2.4 Speculative Decomposition
(1/9)
■ In some applications, dependencies between tasks are not known a-priori.
■ For such applications, it is impossible to identify independent tasks.
■ This is the case, for example, when a program may take one of many possible computationally significant branches depending on the output of other computations that precede it.
Speculative Decomposition (2/9)
■ There are generally two approaches to dealing with such applications: conservative approaches, which identify independent tasks only when they are guaranteed to not have dependencies, and optimistic approaches, which schedule tasks even when they may potentially be erroneous.
■ Conservative approaches may yield little concurrency, and optimistic approaches may require a roll-back mechanism in the case of an error.
Speculative Decomposition
(3/9)
■ This scenario is similar to evaluating one or more of the branches of a switch statement in C in parallel before the input for the switch is available.
■ While one task is performing the computation that will eventually resolve the switch, other tasks could pick up the multiple branches of the switch in parallel.
■ When the input for the switch has finally been computed, the computation corresponding to the correct branch would be used, while that corresponding to the other branches would be discarded.
■ The parallel run time is smaller than the serial run time by the amount of time required to compute the input of the switch, since the branch computations are overlapped with that computation.
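A sketch of the speculative "switch" described above (the selector function and branch workloads are invented for illustration): while one task computes the value that selects the branch, other tasks speculatively compute every branch, and only the result matching the selector is kept.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def compute_selector():
    time.sleep(0.1)                 # the computation whose output resolves the switch
    return "b"

branches = {
    "a": lambda: sum(range(1_000)),   # speculative work for branch a
    "b": lambda: sum(range(2_000)),   # speculative work for branch b
    "c": lambda: sum(range(3_000)),   # speculative work for branch c
}

with ThreadPoolExecutor() as pool:
    selector_future = pool.submit(compute_selector)
    branch_futures = {name: pool.submit(fn) for name, fn in branches.items()}
    chosen = selector_future.result()
    result = branch_futures[chosen].result()   # results of the other branches are discarded
print(chosen, result)
```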
Speculative Decomposition
■ This parallel formulation of a switch guarantees at least some wasteful computation.
■ In order to minimize the wasted computation, a slightly different formulation of speculative decomposition could be used, especially in situations where one of the outcomes of the switch is more likely than the others.
■ In this case, only the most promising branch is taken up as a task in parallel with the preceding computation.
■ In case the outcome of the switch is different from what was anticipated, the computation is rolled back and the correct branch of the switch is taken.
Speculative Decomposition:
Example-1
A classic example of speculative decomposition is in discrete event simulation.
■ The central data structure in a discrete event simulation is a time-ordered event list.
■ Events are extracted precisely in time order, processed, and, if required, resulting events are inserted back into the event list.
Speculative Decomposition:
Example-1
Consider your day today as a discrete event system - you
get up, get ready, drive to work, work, eat lunch, work
some more, drive back, eat dinner, and sleep.
Each of these events may be processed independently; however, in driving to work, you might meet with an unfortunate accident and not get to work at all.
Therefore, an optimistic scheduling of other events will have to be rolled back.
Speculative Decomposition:
Example-2
Another example is the simulation of a network
of nodes (for instance, an assembly line or a
computer network through which packets pass).
The task is to simulate the behavior of this
network for various inputs and node delay
parameters (note that networks may become unstable for certain values of service rates, queue sizes, etc.).
Speculative Decomposition:
Example -2
Speculative Decomposition:
Example-2
■ The problem of simulating a sequence of input jobs on the network described in this example appears inherently sequential, because the input of a typical component is the output of another.
■ However, we can define speculative tasks that start simulating a subpart of the network, each assuming one of several possible inputs to that stage.
■ When an actual input to a certain stage becomes available (as a result of the completion of another selector task from a previous stage),
– then all or part of the work required to simulate this input would have already been finished if the speculation was correct,
– or the simulation of this stage is restarted with the most recent correct input if the speculation was incorrect.
Speculative vs. exploratory
decomposition
■ In exploratory decomposition:
– the output of the multiple tasks originating at a branch is unknown.
– the serial algorithm may explore different alternatives one after the other, because the branch that may lead to the solution is not known beforehand.
– => the parallel program may perform more, less, or the same amount of aggregate work compared to the serial algorithm, depending on the location of the solution in the search space.
■ In speculative decomposition:
– the input at a branch leading to multiple parallel tasks is unknown.
– the serial algorithm would strictly perform only one of the tasks at a speculative stage, because when it reaches the beginning of that stage, it knows exactly which branch to take.
– => a parallel program employing speculative decomposition performs more aggregate work than its serial counterpart.
3.2.5 Hybrid Decompositions
■ Decomposition techniques are not exclusive, and can often be combined together.
■ Often, a computation is structured into multiple
stages and it is sometimes necessary to apply
different types of decomposition in different stages.
Hybrid Decompositions
Example 1: Finding the minimum.
■ Example 1: while finding the minimum of a large set of n numbers,
■ a purely recursive decomposition may result in far more tasks than the number of processes, P, available.
■ An efficient decomposition would partition the input into P roughly equal parts and have each task compute the minimum of the sequence assigned to it.
Hybrid Decompositions-Finding the
minimum.
Hybrid Decompositions - Example 2: Quicksort in parallel.
■ Recursive decomposition has been used for quicksort.
■ This formulation results in O(n) tasks for the problem of sorting a sequence of size n.
■ But due to the dependencies among these tasks and due to the uneven sizes of the tasks, the effective concurrency is quite limited.
■ For example, the first task, which splits the input list into two parts, takes O(n) time, which puts an upper limit on the performance gain possible via parallelization.
■ The step of splitting lists performed by tasks in parallel quicksort can also be decomposed using the input decomposition technique.
■ The resulting hybrid decomposition, which combines recursive decomposition with input data decomposition, yields a far more concurrent formulation of quicksort.
Input Data Decomposition
• Generally applicable if each output can be naturally computed as a function of the input.
• In many cases, this is the only natural decomposition because the output is not clearly known a-priori (e.g., the problem of finding the minimum in a list, sorting a given list, etc.).
• A task is associated with each input data partition. The task performs as much of the computation as possible with its part of the data. Subsequent processing combines these partial results.
Hybrid Decompositions
Often, a mix of decomposition techniques is necessary for
decomposing a problem. Consider the following examples:
• In quicksort, recursive decomposition alone limits concurrency (Why?). A
mix of data and recursive decompositions is more desirable.
• In discrete event simulation, there might be concurrency in task processing.
A mix of speculative decomposition and data decomposition may work well.
• Even for simple problems like finding a minimum of a list of numbers, a mix
of data and recursive decomposition works well.
Characteristics of Tasks and Interactions
Characteristics of Tasks and Interactions
• We shall discuss the various properties of tasks and inter-task interactions that affect the choice of a good mapping.
• Characteristics of Tasks
• Task Generation
• Task Sizes
• Knowledge of Task Sizes
• Size of Data Associated with Tasks
Mapping Techniques for Load Balancing
Task Generation
• Static task generation: Concurrent tasks can be
identified a-priori. Typical matrix operations, graph
algorithms, image processing applications, and other
regularly structured problems fall in this class. These can
typically be decomposed using data or recursive
decomposition techniques.
• Dynamic task generation: Tasks are generated as we
perform computation. A classic example of this is in
game playing - each 15 puzzle board is generated from
the previous one. These applications are typically
decomposed using exploratory or speculative
decompositions.
Task Sizes
• Task sizes may be uniform (i.e., all tasks are the same
size) or non-uniform.
• Non-uniform task sizes may be such that they can be
determined (or estimated) a-priori or not.
• Examples in this class include discrete optimization
problems, in which it is difficult to estimate the effective
size of a state space.
Size of Data Associated with Tasks
• The size of data associated with a task may be small or
large when viewed in the context of the size of the task.
• A small context of a task implies that an algorithm can
easily communicate this task to other processes
dynamically (e.g., the 15 puzzle).
• A large context ties the task to a process, or alternately, an algorithm may attempt to reconstruct the context at another process as opposed to communicating the context of the task (e.g., 0/1 integer programming).
Characteristics of Task Interactions
• Tasks may communicate with each other in various
ways. The associated dichotomy is:
• Static interactions: The tasks and their interactions are
known a-priori. These are relatively simpler to code into
programs.
• Dynamic interactions: The timing or the set of interacting tasks cannot be determined a-priori. These interactions are harder to code, especially, as we shall see, using message passing APIs.
Characteristics of Task Interactions
• Regular interactions: There is a definite pattern (in the
graph sense) to the interactions. These patterns can be
exploited for efficient implementation.
• Irregular interactions: Interactions lack well-defined
topologies.
Characteristics of Task Interactions: Example
A simple example of a regular static interaction pattern is in image
dithering. The underlying communication pattern is a structured (2-D
mesh) one as shown here:
Characteristics of Task Interactions: Example
The multiplication of a sparse matrix with a vector is a good example
of a static irregular interaction pattern. Here is an example of a
sparse matrix and its associated interaction pattern.
Characteristics of Task Interactions
• Interactions may be read-only or read-write.
• In read-only interactions, tasks just read data items
associated with other tasks.
• In read-write interactions, tasks read as well as modify data items associated with other tasks.
• In general, read-write interactions are harder to code,
since they require additional synchronization primitives.
Characteristics of Task Interactions
• Interactions may be one-way or two-way.
• A one-way interaction can be initiated and accomplished
by one of the two interacting tasks.
• A two-way interaction requires participation from both
tasks involved in an interaction.
• One way interactions are somewhat harder to code in
message passing APIs.
Mapping Techniques
• Once a problem has been decomposed into concurrent
tasks, these must be mapped to processes (that can be
executed on a parallel platform).
• Mappings must minimize overheads.
• Primary overheads are communication and idling.
• Minimizing these overheads often represents
contradicting objectives.
• Assigning all work to one processor trivially minimizes
communication at the expense of significant idling.
Mapping Techniques for Minimum Idling
Mapping must simultaneously minimize idling and balance the load.
Merely balancing the load does not minimize idling.
Mapping Techniques for Minimum Idling
Mapping techniques can be static or dynamic.
• Static Mapping: Tasks are mapped to processes a-priori.
For this to work, we must have a good estimate of the
size of each task. Even in these cases, the problem may
be NP complete.
• Dynamic Mapping: Tasks are mapped to processes at runtime. This may be because the tasks are generated at runtime, or because their sizes are not known until runtime.
Schemes for Static Mapping
• Mappings based on data partitioning.
• Mappings based on task graph partitioning.
• Hybrid mappings.
Mappings Based on Data Partitioning
We can combine data partitioning with the "owner-computes" rule to partition the computation into subtasks. The simplest data decomposition schemes for dense matrices are 1-D block distribution schemes.
Block Array Distribution Schemes
Block distribution schemes can be generalized to
higher dimensions as well.
Block Array Distribution Schemes: Examples
• For multiplying two dense matrices A and B, we can
partition the output matrix C using a block
decomposition.
• For load balance, we give each task the same number of
elements of C. (Note that each element of C
corresponds to a single dot product.)
• The choice of precise decomposition (1-D or 2-D) is
determined by the associated communication overhead.
• In general, a higher-dimensional decomposition allows the use of a larger number of processes.
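A minimal sketch of how a 1-D block distribution can assign row ranges to processes (the function name and the convention of giving the first n mod p processes one extra row are illustrative choices, not from the slides):

```python
def block_bounds(n, p, rank):
    """Half-open row range [start, end) owned by `rank` under a 1-D block distribution."""
    base, extra = divmod(n, p)
    start = rank * base + min(rank, extra)
    size = base + (1 if rank < extra else 0)   # first `extra` processes take one more row
    return start, start + size

n, p = 10, 4
print([block_bounds(n, p, r) for r in range(p)])
# [(0, 3), (3, 6), (6, 8), (8, 10)]
```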
Data Sharing in Dense Matrix Multiplication
Cyclic and Block Cyclic Distributions
• If the amount of computation associated with data items
varies, a block decomposition may lead to significant
load imbalances.
• A simple example of this is in LU decomposition (or
Gaussian Elimination) of dense matrices.
LU Factorization of a Dense Matrix
A decomposition of LU factorization into 14 tasks - notice the
significant load imbalance.
Block Cyclic Distributions
• Variation of the block distribution scheme that can be used to
alleviate the load-imbalance and idling problems.
• Partition an array into many more blocks than the number of
available processes.
• Blocks are assigned to processes in a round-robin manner so that
each process gets several non-adjacent blocks.
Block-Cyclic Distribution for Gaussian
Elimination
The active part of the matrix in Gaussian Elimination changes.
By assigning blocks in a block-cyclic fashion, each processor
receives blocks from different parts of the matrix.
Block-Cyclic Distribution: Examples
One- and two-dimensional block-cyclic distributions among 4
processes.
Block-Cyclic Distribution
• A cyclic distribution is a special case in which block size is one.
• A block distribution is a special case in which block size is n/p ,
where n is the dimension of the matrix and p is the number of
processes.
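A minimal sketch of a 1-D block-cyclic mapping (the function name is illustrative): rows are grouped into small blocks and the blocks are dealt to the p processes in round-robin order, so every process keeps receiving blocks from the active part of the matrix as the elimination proceeds.

```python
def block_cyclic_owner(row, block_size, p):
    """Process that owns the block containing `row` under a 1-D block-cyclic distribution."""
    return (row // block_size) % p   # blocks are assigned round-robin to the p processes

n, block_size, p = 16, 2, 4
owners = [block_cyclic_owner(r, block_size, p) for r in range(n)]
print(owners)   # [0, 0, 1, 1, 2, 2, 3, 3, 0, 0, 1, 1, 2, 2, 3, 3]
```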
Graph Partitioning Based Data Decomposition
• In case of sparse matrices, block decompositions are
more complex.
• Consider the problem of multiplying a sparse matrix with
a vector.
• The graph of the matrix is a useful indicator of the work
(number of nodes) and communication (the degree of
each node).
• In this case, we would like to partition the graph so as to assign an equal number of nodes to each process, while minimizing the edge cut of the graph partition.
Partitioning the Graph of Lake Superior
Random Partitioning
Partitioning for minimum edge-cut.
Mappings Based on Task Partitioning
• Partitioning a given task-dependency graph across
processes.
• Determining an optimal mapping for a general task-
dependency graph is an NP-complete problem.
• Excellent heuristics exist for structured graphs.
Task Partitioning: Mapping a Binary Tree
Dependency Graph
Example illustrates the dependency graph of one view of quick-sort
and how it can be assigned to processes in a hypercube.
Task Partitioning: Mapping a Sparse Graph
Sparse graph for computing a sparse matrix-vector product and
its mapping.
Hierarchical Mappings
• Sometimes a single mapping technique is inadequate.
• For example, the task mapping of the binary tree
(quicksort) cannot use a large number of processors.
• For this reason, task mapping can be used at the top
level and data partitioning within each level.
An example of task partitioning at top level with data
partitioning at the lower level.
Schemes for Dynamic Mapping
• Dynamic mapping is sometimes also referred to as
dynamic load balancing, since load balancing is the
primary motivation for dynamic mapping.
• Dynamic mapping schemes can be centralized or
distributed.
Centralized Dynamic Mapping
• Processes are designated as masters or slaves.
• When a process runs out of work, it requests the master
for more work.
• When the number of processes increases, the master
may become the bottleneck.
• To alleviate this, a process may pick up a number of
tasks (a chunk) at one time. This is called Chunk
scheduling.
• Selecting large chunk sizes may lead to significant load
imbalances as well.
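A sketch of centralized dynamic mapping with chunk scheduling, using a thread-safe queue as the master's work pool (the chunk size and the squaring "task" are placeholders): each worker pulls a chunk of tasks whenever it runs out of work.

```python
import queue
import threading

CHUNK = 4
tasks = queue.Queue()
for i in range(100):
    tasks.put(i)                     # the master's pool of work

results, lock = [], threading.Lock()

def worker():
    while True:
        chunk = []
        try:
            for _ in range(CHUNK):             # request a chunk of tasks at a time
                chunk.append(tasks.get_nowait())
        except queue.Empty:
            if not chunk:
                return                         # no work left: this worker terminates
        local = [t * t for t in chunk]         # stand-in for real task processing
        with lock:
            results.extend(local)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(results))   # 100
```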
Distributed Dynamic Mapping
• Each process can send or receive work from other
processes.
• This alleviates the bottleneck in centralized schemes.
• There are four critical questions: how are sending and receiving processes paired together, who initiates the work transfer, how much work is transferred, and when is a transfer triggered?
• Answers to these questions are generally application
specific. We will look at some of these techniques later in
this class.
Minimizing Interaction Overheads
• Maximize data locality: Where possible, reuse
intermediate data. Restructure computation so that data
can be reused in smaller time windows.
• Minimize volume of data exchange: There is a cost
associated with each word that is communicated. For
this reason, we must minimize the volume of data
communicated.
• Minimize frequency of interactions: There is a startup
cost associated with each interaction. Therefore, try to
merge multiple interactions to one, where possible.
Minimizing Interaction Overheads (continued)
• Overlapping computations with interactions: Use non-
blocking communications, multithreading, and
prefetching to hide latencies.
• Replicating data or computations.
• Using group communications instead of point-to-point
primitives.
• Overlap interactions with other interactions.
Parallel Algorithm Models
An algorithm model is a way of structuring a parallel
algorithm by selecting a decomposition and mapping
technique and applying the appropriate strategy to
minimize interactions.
• Data Parallel Model: Tasks are statically (or semi-
statically) mapped to processes and each task performs
similar operations on different data.
• Task Graph Model: Starting from a task dependency graph, the interrelationships among the tasks are utilized to promote locality or to reduce interaction costs.
Parallel Algorithm Models (continued)
• Master-Slave Model: One or more processes generate
work and allocate it to worker processes. This allocation
may be static or dynamic.
• Pipeline / Producer-Consumer Model: A stream of data is passed through a succession of processes, each of which performs some task on it.
• Hybrid Models: A hybrid model may be composed either
of multiple models applied hierarchically or multiple
models applied sequentially to different phases of a
parallel algorithm.
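A minimal sketch of the pipeline / producer-consumer model with two stages connected by queues (the stage functions and the sentinel shutdown convention are illustrative):

```python
import queue
import threading

SENTINEL = None
q12 = queue.Queue()   # connects stage 1 to stage 2
out = queue.Queue()   # output of stage 2

def stage1(n):
    for i in range(n):
        q12.put(i)                 # producer: emit a stream of data items
    q12.put(SENTINEL)              # signal the end of the stream

def stage2():
    while True:
        item = q12.get()
        if item is SENTINEL:
            out.put(SENTINEL)
            return
        out.put(item * item)       # consumer: perform some task on each item

t1 = threading.Thread(target=stage1, args=(5,))
t2 = threading.Thread(target=stage2)
t1.start(); t2.start(); t1.join(); t2.join()

results = []
while (item := out.get()) is not SENTINEL:
    results.append(item)
print(results)   # [0, 1, 4, 9, 16]
```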
Unit-3.ppt
Unit-3.ppt

Más contenido relacionado

Similar a Unit-3.ppt

Week # 1.pdf
Week # 1.pdfWeek # 1.pdf
Week # 1.pdfgiddy5
 
01-MessagePassingFundamentals.ppt
01-MessagePassingFundamentals.ppt01-MessagePassingFundamentals.ppt
01-MessagePassingFundamentals.pptHarshitPal37
 
Lecture 2 more about parallel computing
Lecture 2   more about parallel computingLecture 2   more about parallel computing
Lecture 2 more about parallel computingVajira Thambawita
 
Parallel Computing - Lec 5
Parallel Computing - Lec 5Parallel Computing - Lec 5
Parallel Computing - Lec 5Shah Zaib
 
parallel Questions & answers
parallel Questions & answersparallel Questions & answers
parallel Questions & answersMd. Mashiur Rahman
 
An Introduction to TensorFlow architecture
An Introduction to TensorFlow architectureAn Introduction to TensorFlow architecture
An Introduction to TensorFlow architectureMani Goswami
 
Report on High Performance Computing
Report on High Performance ComputingReport on High Performance Computing
Report on High Performance ComputingPrateek Sarangi
 
cs1311lecture25wdl.ppt
cs1311lecture25wdl.pptcs1311lecture25wdl.ppt
cs1311lecture25wdl.pptFannyBellows
 
Chapter 5.pptx
Chapter 5.pptxChapter 5.pptx
Chapter 5.pptxJoeBaker69
 
ICS 2410.Parallel.Sytsems.Lecture.Week 3.week5.pptx
ICS 2410.Parallel.Sytsems.Lecture.Week 3.week5.pptxICS 2410.Parallel.Sytsems.Lecture.Week 3.week5.pptx
ICS 2410.Parallel.Sytsems.Lecture.Week 3.week5.pptxjohnsmith96441
 
Design and Analysis of Algorithms.pptx
Design and Analysis of Algorithms.pptxDesign and Analysis of Algorithms.pptx
Design and Analysis of Algorithms.pptxSyed Zaid Irshad
 
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdf
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdfmodule3part-1-bigdata-230301002404-3db4f2a4 (1).pdf
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdfTSANKARARAO
 
Problem-solving and design 1.pptx
Problem-solving and design 1.pptxProblem-solving and design 1.pptx
Problem-solving and design 1.pptxTadiwaMawere
 
Unit 5 Advanced Computer Architecture
Unit 5 Advanced Computer ArchitectureUnit 5 Advanced Computer Architecture
Unit 5 Advanced Computer ArchitectureBalaji Vignesh
 

Similar a Unit-3.ppt (20)

Week # 1.pdf
Week # 1.pdfWeek # 1.pdf
Week # 1.pdf
 
01-MessagePassingFundamentals.ppt
01-MessagePassingFundamentals.ppt01-MessagePassingFundamentals.ppt
01-MessagePassingFundamentals.ppt
 
Lecture 2 more about parallel computing
Lecture 2   more about parallel computingLecture 2   more about parallel computing
Lecture 2 more about parallel computing
 
Lecture1
Lecture1Lecture1
Lecture1
 
Cc module 3.pptx
Cc module 3.pptxCc module 3.pptx
Cc module 3.pptx
 
Parallel Computing - Lec 5
Parallel Computing - Lec 5Parallel Computing - Lec 5
Parallel Computing - Lec 5
 
parallel Questions & answers
parallel Questions & answersparallel Questions & answers
parallel Questions & answers
 
An Introduction to TensorFlow architecture
An Introduction to TensorFlow architectureAn Introduction to TensorFlow architecture
An Introduction to TensorFlow architecture
 
Parallel Computing
Parallel ComputingParallel Computing
Parallel Computing
 
Report on High Performance Computing
Report on High Performance ComputingReport on High Performance Computing
Report on High Performance Computing
 
cs1311lecture25wdl.ppt
cs1311lecture25wdl.pptcs1311lecture25wdl.ppt
cs1311lecture25wdl.ppt
 
Chapter 5.pptx
Chapter 5.pptxChapter 5.pptx
Chapter 5.pptx
 
Chap5 slides
Chap5 slidesChap5 slides
Chap5 slides
 
ICS 2410.Parallel.Sytsems.Lecture.Week 3.week5.pptx
ICS 2410.Parallel.Sytsems.Lecture.Week 3.week5.pptxICS 2410.Parallel.Sytsems.Lecture.Week 3.week5.pptx
ICS 2410.Parallel.Sytsems.Lecture.Week 3.week5.pptx
 
try
trytry
try
 
Design and Analysis of Algorithms.pptx
Design and Analysis of Algorithms.pptxDesign and Analysis of Algorithms.pptx
Design and Analysis of Algorithms.pptx
 
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdf
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdfmodule3part-1-bigdata-230301002404-3db4f2a4 (1).pdf
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdf
 
Big Data.pptx
Big Data.pptxBig Data.pptx
Big Data.pptx
 
Problem-solving and design 1.pptx
Problem-solving and design 1.pptxProblem-solving and design 1.pptx
Problem-solving and design 1.pptx
 
Unit 5 Advanced Computer Architecture
Unit 5 Advanced Computer ArchitectureUnit 5 Advanced Computer Architecture
Unit 5 Advanced Computer Architecture
 

Último

BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxBSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxfenichawla
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingrakeshbaidya232001
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Bookingdharasingh5698
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations120cr0395
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxpranjaldaimarysona
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxupamatechverse
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSISrknatarajan
 
Glass Ceramics: Processing and Properties
Glass Ceramics: Processing and PropertiesGlass Ceramics: Processing and Properties
Glass Ceramics: Processing and PropertiesPrabhanshu Chaturvedi
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdfKamal Acharya
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxupamatechverse
 
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTINGMANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTINGSIVASHANKAR N
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSSIVASHANKAR N
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Dr.Costas Sachpazis
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdfankushspencer015
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Dr.Costas Sachpazis
 
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur EscortsRussian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfKamal Acharya
 

Último (20)

BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxBSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptx
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSIS
 
Glass Ceramics: Processing and Properties
Glass Ceramics: Processing and PropertiesGlass Ceramics: Processing and Properties
Glass Ceramics: Processing and Properties
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdf
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptx
 
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTINGMANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur EscortsRussian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 

  • 17. Degree of Concurrency • Degree of concurrency: the number of tasks that can execute in parallel • Maximum degree of concurrency: the largest number of concurrent tasks at any point of the execution • Average degree of concurrency: the average number of tasks that can be executed concurrently – Degree of concurrency vs. task granularity
  • 22. • Speedup = sequential execution time / parallel execution time • Parallel efficiency = sequential execution time / (parallel execution time × processors used)
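These two definitions are easy to demonstrate with wall-clock timings. The following is a minimal C/OpenMP sketch, not from the slides: the array-sum workload, the problem size, and the variable names are illustrative choices, and OpenMP is used only because it provides a convenient timer and parallel loop.

```c
/* Minimal sketch: measure sequential and parallel time for the same work,
 * then apply the speedup and efficiency formulas from the slide.
 * Compile with: cc -fopenmp speedup.c */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const long N = 20000000;                 /* illustrative problem size */
    double *x = malloc(N * sizeof(double));
    for (long i = 0; i < N; i++) x[i] = 1.0;

    double t0 = omp_get_wtime();
    double ssum = 0.0;                       /* sequential run */
    for (long i = 0; i < N; i++) ssum += x[i];
    double t_seq = omp_get_wtime() - t0;

    t0 = omp_get_wtime();
    double psum = 0.0;                       /* parallel run */
    #pragma omp parallel for reduction(+:psum)
    for (long i = 0; i < N; i++) psum += x[i];
    double t_par = omp_get_wtime() - t0;

    int p = omp_get_max_threads();
    double speedup    = t_seq / t_par;       /* sequential time / parallel time */
    double efficiency = t_seq / (t_par * p); /* = speedup / processors used */
    printf("sums %.0f %.0f  p=%d  speedup=%.2f  efficiency=%.2f\n",
           ssum, psum, p, speedup, efficiency);

    free(x);
    return 0;
}
```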
  • 24. Task Interaction Graph • The pattern of interaction among tasks is captured by what is known as a task-interaction graph. • Nodes = tasks • Edges connect tasks that interact with each other
  • 38. 3.1.3 Processes and Mapping (1/5) ■ In general, the number of tasks in a decomposition exceeds the number of processing elements available. ■ For this reason, a parallel algorithm must also provide a mapping of tasks to processes.
  • 39. Processes and Mapping (2/5) Note: We refer to the mapping as being from tasks to processes, as opposed to processors. This is because typical programming APIs, as we shall see, do not allow easy binding of tasks to physical processors. Rather, we aggregate tasks into processes and rely on the system to map these processes to physical processors (CPUs). We use processes, not in the UNIX sense of a process, but simply as a collection of tasks and associated data.
  • 40. Processes and Mapping (3/5) ■ Appropriate mapping of tasks to processes is critical to the parallel performance of an algorithm. ■ Mappings are determined by both the task-dependency and task-interaction graphs. ■ Task-dependency graphs can be used to ensure that work is equally spread across all processes at any point (minimum idling and optimal load balance). ■ Task-interaction graphs can be used to make sure that processes need minimum interaction with other processes (minimum communication).
  • 41. Processes and Mapping (4/5) An appropriate mapping must minimize parallel execution time by: ■ Mapping independent tasks to different processes. ■ Assigning tasks on the critical path to processes as soon as they become available. ■ Minimizing interaction between processes by mapping tasks with dense interactions to the same process.
  • 42. Processes and Mapping (5/5) ■ Difficulty: these criteria often conflict with one another. ■ For example, a decomposition into a single task minimizes interactions but offers no speedup at all.
  • 43. Processes and Mapping: Example (1/2) ■ Mapping tasks in the database query decomposition to processes. ■ These mappings were arrived at by viewing the dependency graph in terms of levels (no two nodes in a level have dependencies). Tasks within a single level are then assigned to different processes.
  • 44. Processes and Mapping: Example (2/2)
  • 45. 3.1.4 Processes vs. Processors (1/2) ■ Processors are physical resources. ■ Processes provide a convenient way of abstracting or virtualizing a multiprocessor. ■ Write parallel programs in terms of processes, not physical processors; mapping processes to processors is a subsequent step. ■ The number of processes does not have to be the same as the number of processors available to the program. ■ If there are more processes, they are multiplexed onto the available processors; if there are fewer processes, then some processors will remain idle.
  • 46. Processes vs. Processors (2/2) ■ The correct OS definition of a process is that of an address space and one or more threads of control that share that address space. Thus, processes and threads are distinguished in that definition. ■ In what follows, assume that a process has only one thread of control.
  • 47. Decomposition Techniques for Parallel Algorithms • So how does one decompose a task into various subtasks? • While there is no single recipe that works for all problems, we present a set of commonly used techniques that apply to broad classes of problems. These include: 1. recursive decomposition 2. data decomposition 3. exploratory decomposition 4. speculative decomposition
  • 48. 1. Recursive Decomposition • Generally suited to problems (tasks) that are solved using the divide-and-conquer strategy. • A given problem is first decomposed into a set of subproblems. • These subproblems are recursively decomposed further until a desired granularity is reached.
  • 53. Specific algorithms suited to recursive decomposition (a parallel quicksort sketch follows below): • Merge sort • Quicksort • Binary search tree
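The slides do not include code, so here is a minimal sketch of recursive decomposition in C with OpenMP tasks: a parallel quicksort in which each recursive call on a sub-array becomes a task. The cutoff constant, the Lomuto partition, and the helper names are illustrative choices, not part of the original material.

```c
/* Recursive decomposition sketch: parallel quicksort with OpenMP tasks. */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define TASK_CUTOFF 1000   /* below this size, recurse serially */

static void swap(int *a, int *b) { int t = *a; *a = *b; *b = t; }

/* Lomuto partition: returns the final index of the pivot. */
static int partition(int *a, int lo, int hi) {
    int pivot = a[hi], i = lo;
    for (int j = lo; j < hi; j++)
        if (a[j] < pivot) swap(&a[i++], &a[j]);
    swap(&a[i], &a[hi]);
    return i;
}

static void quicksort(int *a, int lo, int hi) {
    if (lo >= hi) return;
    int p = partition(a, lo, hi);
    /* the two sub-problems are independent tasks (recursive decomposition) */
    #pragma omp task shared(a) if (hi - lo > TASK_CUTOFF)
    quicksort(a, lo, p - 1);
    #pragma omp task shared(a) if (hi - lo > TASK_CUTOFF)
    quicksort(a, p + 1, hi);
    #pragma omp taskwait
}

int main(void) {
    enum { N = 1000000 };
    int *a = malloc(N * sizeof(int));
    for (int i = 0; i < N; i++) a[i] = rand();
    #pragma omp parallel      /* create the thread team once */
    #pragma omp single        /* one thread seeds the recursion */
    quicksort(a, 0, N - 1);
    printf("a[0]=%d a[N-1]=%d\n", a[0], a[N - 1]);
    free(a);
    return 0;
}
```

The cutoff keeps very small sub-arrays serial so that task-creation overhead does not swamp the useful work, which echoes the granularity discussion earlier in the unit.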
  • 54. 3.2.2 Data Decomposition (1/3) ■ Identify the data on which computations are performed. ■ Partition this data across various tasks. ■ This partitioning induces a decomposition of the problem. ■ Data can be partitioned in various ways; this critically impacts the performance of a parallel algorithm.
  • 55. 3.2.2 Data Decomposition (2/3) ■ Data decomposition is a powerful and commonly used method for deriving concurrency in algorithms that operate on large data structures. ■ The decomposition of computations is done in two steps. 1. The data on which the computations are performed is partitioned. 2. This data partitioning is used to induce a partitioning of the computations into tasks.
  • 56. Data Decomposition (3/3) ■ The operations that these tasks perform on different data partitions are usually similar (e.g., matrix multiplication that follows) or are chosen from a small set of operations (e.g., LU factorization). ■ One must explore and evaluate all possible ways of partitioning the data and determine which one yields a natural and efficient computational decomposition.
  • 57. 3.2.2.1 Partitioning Output Data (1/10) ■ In many computations, each element of the output can be computed independently of others as a function of the input. ■ In such computations, a partitioning of the output data automatically induces a decomposition of the problem into tasks. ■ Each task is assigned the work of computing a portion of the output.
  • 58. Partitioning Output Data: Example-1 Consider the problem of multiplying two n x n matrices A and B to yield matrix C. The output matrix C can be partitioned into four submatrices, inducing four tasks, as follows:
  • 59. Partitioning Output Data: Example-1 (2/3) ■ The 4 sub-matrices of C, roughly of size (n/2 x n/2) each, are then independently computed by 4 tasks as the sums of the appropriate products of sub-matrices of A and B. ■ Other task decompositions are possible.
  • 60. Partitioning Output Data: Example-1
  Decomposition I: Task 1: C1,1 = A1,1 B1,1; Task 2: C1,1 = C1,1 + A1,2 B2,1; Task 3: C1,2 = A1,1 B1,2; Task 4: C1,2 = C1,2 + A1,2 B2,2; Task 5: C2,1 = A2,1 B1,1; Task 6: C2,1 = C2,1 + A2,2 B2,1; Task 7: C2,2 = A2,1 B1,2; Task 8: C2,2 = C2,2 + A2,2 B2,2
  Decomposition II: Task 1: C1,1 = A1,1 B1,1; Task 2: C1,1 = C1,1 + A1,2 B2,1; Task 3: C1,2 = A1,2 B2,2; Task 4: C1,2 = C1,2 + A1,1 B1,2; Task 5: C2,1 = A2,2 B2,1; Task 6: C2,1 = C2,1 + A2,1 B1,1; Task 7: C2,2 = A2,1 B1,2; Task 8: C2,2 = C2,2 + A2,2 B2,2
  ▪ Which tasks are independent?
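As an illustration of Decomposition I, here is a hedged C/OpenMP sketch in which each of the four coarse tasks owns one (n/2 x n/2) block of C and computes it as the sum of two block products. The row-major layout, the block_mm helper, the tiny test in main, and the assumptions that n is even and C is zero-initialized are mine, not the slides'.

```c
/* Output-data partitioning sketch for C = A*B with a 2x2 block view of C. */
#include <omp.h>
#include <stdio.h>

/* C block (bi,bj) += A block (bi,bk) * B block (bk,bj), blocks of size n/2 */
static void block_mm(const double *A, const double *B, double *C,
                     int n, int bi, int bk, int bj) {
    int h = n / 2;
    for (int i = 0; i < h; i++)
        for (int k = 0; k < h; k++)
            for (int j = 0; j < h; j++)
                C[(bi*h + i)*n + bj*h + j] +=
                    A[(bi*h + i)*n + bk*h + k] * B[(bk*h + k)*n + bj*h + j];
}

static void matmul_output_partitioned(const double *A, const double *B,
                                      double *C, int n) {
    #pragma omp parallel
    #pragma omp single
    for (int bi = 0; bi < 2; bi++)
        for (int bj = 0; bj < 2; bj++) {
            /* one task per output block: C(bi,bj) = A(bi,0)B(0,bj) + A(bi,1)B(1,bj) */
            #pragma omp task firstprivate(bi, bj)
            {
                block_mm(A, B, C, n, bi, 0, bj);
                block_mm(A, B, C, n, bi, 1, bj);
            }
        }
    /* the implicit barrier at the end of the parallel region waits for all tasks */
}

int main(void) {
    int n = 4;
    double A[16], B[16], C[16] = {0};
    for (int i = 0; i < 16; i++) { A[i] = i; B[i] = (i % 4 == i / 4); } /* B = I */
    matmul_output_partitioned(A, B, C, n);
    printf("C[5] = %.1f (expected %.1f)\n", C[5], A[5]);
    return 0;
}
```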
  • 61. Partitioning Output Data: Example-2 (1/6) ■ Computing frequencies of itemsets in a transaction database. ■ Given a set T containing n transactions and a set I containing m itemsets. ■ Each transaction and itemset contains a small number of items, out of a possible set of items.
  • 62. Partitioning Output Data: Example-2 (2/6) ■ Example: T is a grocery store's database of customer sales, with each transaction being an individual grocery list of a shopper, and an itemset could be a group of items in the store. ■ If the store desires to find out how many customers bought each of the designated groups of items, then it would need to find the number of times that each itemset in I appears in all the transactions (the number of transactions of which each itemset is a subset).
  • 63. Partitioning Output Data: Example-2 (3/6)
  • 64. Partitioning Output Data: Example-2 (4/6)
  • 65. Partitioning Output Data: Example-2 (5/6) ■ Fig. (a) shows an example. ■ The database shown consists of 10 transactions, and we are interested in computing the frequency of the 8 itemsets shown in the second column. ■ The actual frequencies of these itemsets in the database (the output) are shown in the third column. ■ For instance, itemset {D, K} appears twice, once in the second and once in the ninth transaction.
  • 66. Partitioning Output Data: Example-2 (6/6) ■ Fig. (b) shows how the computation of frequencies of the itemsets can be decomposed into 2 tasks ■ by partitioning the output into 2 parts and having each task compute its half of the frequencies.
  • 67. 3.2.2.2 Partitioning Input Data (1/6) ■ Remark: Partitioning of output data can be performed only if each output can be naturally computed as a function of the input. ■ In many algorithms, it is not possible or desirable to partition the output data. For example: ■ While finding the minimum, maximum, or the sum of a set of numbers, the output is a single unknown value. ■ In a sorting algorithm, the individual elements of the output cannot be efficiently determined in isolation.
  • 68. Partitioning Input Data (2/6) ■ It is sometimes possible to partition the input data, and then use this partitioning to induce concurrency. ■ A task is created for each partition of the input data, and this task performs as much computation as possible using these local data. ■ Solutions to tasks induced by input partitions may not directly solve the original problem. ■ In such cases, a follow-up computation is needed to combine the results.
  • 69. Partitioning Input Data: Example-1 ■ Example: finding the sum of N numbers using p processes (N > p): ■ We can partition the input into p subsets of nearly equal sizes. ■ Each task then computes the sum of the numbers in one of the subsets. ■ Finally, the p partial results can be added up to yield the final result.
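A minimal C/OpenMP sketch of this input partitioning follows: p tasks each sum one contiguous chunk and write a partial result, and a follow-up loop combines the partial sums. The chunk boundaries and names are illustrative, not from the slides.

```c
/* Input-data partitioning sketch: sum of N numbers with p partial sums. */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const long N = 10000000;
    double *x = malloc(N * sizeof(double));
    for (long i = 0; i < N; i++) x[i] = 1.0;

    int p = omp_get_max_threads();           /* number of input partitions */
    double *partial = calloc(p, sizeof(double));

    #pragma omp parallel num_threads(p)
    {
        int id = omp_get_thread_num();
        long lo = id * N / p, hi = (id + 1) * N / p;  /* this task's chunk */
        double s = 0.0;
        for (long i = lo; i < hi; i++) s += x[i];
        partial[id] = s;                      /* intermediate (partial) result */
    }

    double total = 0.0;                       /* follow-up combining step */
    for (int id = 0; id < p; id++) total += partial[id];
    printf("sum = %f\n", total);

    free(partial); free(x);
    return 0;
}
```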
  • 70. Partitioning Input Data: Example-2 (1/3) ■ The problem of computing the frequency of a set of itemsets can also be decomposed based on a partitioning of input data. ■ Fig. shows a decomposition based on a partitioning of the input set of transactions.
  • 71. Partitioning Input Data: Example-2 (2/3) (figure: each of the two tasks produces an intermediate result, its partial frequency counts)
  • 72. Partitioning Input Data: Example-2 (3/3) ■ Each of the two tasks computes the frequencies of all the itemsets in its respective subset of transactions. ■ The two sets of frequencies, which are the independent outputs of the two tasks, represent intermediate results. ■ Combining the intermediate results by pairwise addition yields the final result.
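A hedged sketch of the same idea in C/OpenMP: transactions and itemsets are encoded as bitmasks over a small item universe (an assumption made here for brevity; the slide's actual data is not reproduced), each task counts frequencies over its share of transactions, and the partial counts are then added.

```c
/* Input-partitioned itemset counting sketch (hypothetical data). */
#include <omp.h>
#include <stdio.h>

#define NTRANS 10
#define NITEMSETS 8

int main(void) {
    /* bit i set means item i is present (illustrative values only) */
    unsigned trans[NTRANS]      = {0x3, 0x5, 0x9, 0x6, 0xC, 0xA, 0x7, 0x1, 0x9, 0xF};
    unsigned itemset[NITEMSETS] = {0x1, 0x2, 0x3, 0x4, 0x8, 0x9, 0xC, 0x5};
    int freq[NITEMSETS] = {0};

    #pragma omp parallel
    {
        int local[NITEMSETS] = {0};           /* this task's partial counts */
        #pragma omp for                        /* partition the input transactions */
        for (int t = 0; t < NTRANS; t++)
            for (int s = 0; s < NITEMSETS; s++)
                if ((trans[t] & itemset[s]) == itemset[s])  /* itemset subset of transaction */
                    local[s]++;
        #pragma omp critical                   /* combine the partial results */
        for (int s = 0; s < NITEMSETS; s++) freq[s] += local[s];
    }

    for (int s = 0; s < NITEMSETS; s++)
        printf("itemset %d: %d\n", s, freq[s]);
    return 0;
}
```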
  • 73. 3.2.2.3 Partitioning both Input and Output Data (1/2) ■ In some cases in which it is possible to partition the output data, partitioning of input data can offer additional concurrency. ■ Example: consider the 4-way decomposition shown in Fig. for computing itemset frequencies. ■ Both the transaction set and the frequencies are divided into two parts, and a different one of the four possible combinations is assigned to each of the four tasks. ■ Each task then computes a local set of frequencies. ■ Finally, the outputs of Tasks 1 and 3 are added together, as are the outputs of Tasks 2 and 4.
  • 75. 3.2.2.4 Intermediate Data Partitioning (1/8) ■ Partitioning intermediate data can sometimes lead to higher concurrency than partitioning input or output data. ■ Often, the intermediate data are not generated explicitly in the serial algorithm for solving the problem. ■ Some restructuring of the original algorithm may be required to use intermediate data partitioning to induce a decomposition.
  • 76. Intermediate Data Partitioning (2/8) ■ Example: matrix multiplication. ■ Recall that the decompositions induced by a 2 x 2 partitioning of the output matrix C have a maximum degree of concurrency of four. ■ Increase the degree of concurrency by introducing an intermediate stage in which 8 tasks compute their respective product submatrices and store the results in a temporary 3-D matrix D, as shown in Fig. ■ Submatrix Dk,i,j is the product of Ai,k and Bk,j.
  • 77. Intermediate Data Partitioning: Example (3/8)
  • 78. Intermediate Data Partitioning (4/8) ■ A partitioning of the intermediate matrix D induces a decomposition into eight tasks. ■ After the multiplication phase, a relatively inexpensive matrix addition step can compute the result matrix C ■ All submatrices D*,i,j with the same second and third dimensions i and j are added to yield Ci,j.
  • 79. Intermediate Data Partitioning: Example (5/8) A decomposition of the intermediate data structure leads to the following decomposition into 8 + 4 tasks (Stage I and Stage II):
  • 80. Intermediate Data Partitioning (6/8) ■ The eight tasks numbered 1 through 8 in Fig. perform O(n^3/8) work each in multiplying (n/2 x n/2) submatrices of A and B. ■ Then, four tasks numbered 9 through 12 spend O(n^2/4) time each in adding the appropriate (n/2 x n/2) submatrices of the intermediate matrix D to yield the final result matrix C. ■ The second Fig. shows the task-dependency graph.
  • 81. Intermediate Data Partitioning: Example (7/8)
  Stage I: Task 01: D1,1,1 = A1,1 B1,1; Task 02: D2,1,1 = A1,2 B2,1; Task 03: D1,1,2 = A1,1 B1,2; Task 04: D2,1,2 = A1,2 B2,2; Task 05: D1,2,1 = A2,1 B1,1; Task 06: D2,2,1 = A2,2 B2,1; Task 07: D1,2,2 = A2,1 B1,2; Task 08: D2,2,2 = A2,2 B2,2
  Stage II: Task 09: C1,1 = D1,1,1 + D2,1,1; Task 10: C1,2 = D1,1,2 + D2,1,2; Task 11: C2,1 = D1,2,1 + D2,2,1; Task 12: C2,2 = D1,2,2 + D2,2,2
  • 82. Intermediate Data Partitioning: Example (8/8) The task-dependency graph for the decomposition into 12 tasks (shown on the previous slide) is as follows:
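Below is a hedged C/OpenMP sketch of this 8 + 4 task structure: stage I spawns eight independent block-product tasks that fill the intermediate array D, and stage II spawns four addition tasks that combine them into C. The memory layout, helper names, the small test in main, and the assumptions that n is even and C is zeroed are illustrative choices of this sketch.

```c
/* Intermediate-data partitioning sketch: D(k,i,j) = A(i,k)B(k,j), then
 * C(i,j) = D(0,i,j) + D(1,i,j). */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

/* multiply (h x h) blocks of row-major n x n matrices: Dblk += A(ai,ak)*B(bk,bj) */
static void block_mult(const double *A, const double *B, double *Dblk,
                       int n, int h, int ai, int ak, int bk, int bj) {
    for (int i = 0; i < h; i++)
        for (int k = 0; k < h; k++)
            for (int j = 0; j < h; j++)
                Dblk[i*h + j] += A[(ai*h + i)*n + ak*h + k] * B[(bk*h + k)*n + bj*h + j];
}

static void matmul_intermediate(const double *A, const double *B, double *C, int n) {
    int h = n / 2;
    double *D = calloc((size_t)8 * h * h, sizeof(double)); /* 8 intermediate blocks */
    #pragma omp parallel
    #pragma omp single
    {
        /* Stage I: tasks 1..8, all independent */
        for (int k = 0; k < 2; k++)
            for (int i = 0; i < 2; i++)
                for (int j = 0; j < 2; j++) {
                    double *Dblk = D + ((k*2 + i)*2 + j) * h * h;
                    #pragma omp task firstprivate(k, i, j, Dblk)
                    block_mult(A, B, Dblk, n, h, i, k, k, j);
                }
        #pragma omp taskwait   /* stage II depends on all stage-I tasks */
        /* Stage II: tasks 9..12 */
        for (int i = 0; i < 2; i++)
            for (int j = 0; j < 2; j++) {
                #pragma omp task firstprivate(i, j)
                {
                    const double *D0 = D + ((0*2 + i)*2 + j) * h * h;
                    const double *D1 = D + ((1*2 + i)*2 + j) * h * h;
                    for (int r = 0; r < h; r++)
                        for (int c = 0; c < h; c++)
                            C[(i*h + r)*n + j*h + c] = D0[r*h + c] + D1[r*h + c];
                }
            }
    }
    free(D);
}

int main(void) {
    int n = 4;
    double A[16], B[16], C[16] = {0};
    for (int i = 0; i < 16; i++) { A[i] = i; B[i] = (i % 4 == i / 4); } /* B = I */
    matmul_intermediate(A, B, C, n);
    printf("C[10] = %.1f (expected %.1f)\n", C[10], A[10]);
    return 0;
}
```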
  • 83. The Owner Computes Rule (1/2) ■ A decomposition based on partitioning output or input data is also widely referred to as the owner-computes rule. ■ The idea behind this rule is that each partition performs all the computations involving data that it owns. ■ Depending on the nature of the data or the type of data partitioning, the owner-computes rule may mean different things.
  • 84. The Owner Computes Rule (2/2) ■ When we assign partitions of the input data to tasks, then the owner-computes rule means that a task performs all the computations that can be done using these data. ■ If we partition the output data, then the owner-computes rule means that a task computes all the data in the partition assigned to it.
  • 85. 3.2.3 Exploratory Decomposition (1/10) ■ In many cases, the decomposition of the problem goes hand-in-hand with its execution. ■ These problems typically involve the exploration (search) of a state space of solutions. ■ Problems in this class include a variety of discrete optimization problems (0/1 integer programming, QAP (quadratic assignment problem), etc.), theorem proving, game playing, etc.
  • 86. Exploratory Decomposition (2/10) Example-1: the 15-puzzle problem ■ Consists of 15 tiles numbered 1 through 15 and one blank tile placed in a 4 x 4 grid. ■ A tile can be moved into the blank position from a position adjacent to it, thus creating a blank in the tile's original position. ■ Four moves are possible: up, down, left, and right. ■ The initial and final configurations of the tiles are specified.
  • 87. Exploratory Decomposition, Example-1: 15-puzzle problem (1/4) ■ The objective is to determine any sequence, or a shortest sequence, of moves that transforms the initial configuration to the final configuration. ■ Fig. illustrates sample initial and final configurations and a sequence of moves leading from the initial configuration to the final configuration.
  • 88. Exploratory Decomposition, Example-1: 15-puzzle problem (2/4) ■ The puzzle is typically solved using tree-search techniques. ■ From the initial configuration, all possible successor configurations are generated. ■ There may be 2, 3, or 4 possible successor configurations, depending on the occupation of the empty slot by one of its neighbors. ■ The task of finding a path from the initial to the final configuration now translates to finding a path from one of these newly generated configurations to the final configuration. ■ Since one of these newly generated configurations must be closer to the solution by one move, progress has been made toward finding the solution.
  • 89. Exploratory Decomposition, Example-1: 15-puzzle problem (3/4) ■ The configuration space generated by the tree search is the state-space graph. ■ Each node of the graph is a configuration, and each edge of the graph connects configurations that can be reached from one another by a single move of a tile. ■ One method for solving this problem in parallel: ■ First, a few levels of configurations starting from the initial configuration are generated serially until the search tree has a sufficient number of leaf nodes. ■ Now each node is assigned to a task to explore further until at least one of them finds a solution. ■ As soon as one of the concurrent tasks finds a solution, it can inform the others to terminate their searches.
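The following C/OpenMP sketch shows the pattern only (it is not a real 15-puzzle solver): a serially generated frontier of states is handed to concurrent tasks, and the first task to reach a goal sets a shared flag that lets the others cut their searches short. The state type and the expand/is_goal stubs are hypothetical placeholders.

```c
/* Exploratory decomposition sketch with early termination. */
#include <omp.h>
#include <stdbool.h>
#include <stdio.h>

typedef struct { int depth; } state_t;                 /* placeholder state */

static bool is_goal(const state_t *s) { return s->depth >= 20; }   /* stub */
static int  expand(const state_t *s, state_t out[4]) {             /* stub */
    for (int i = 0; i < 4; i++) out[i] = (state_t){ s->depth + 1 };
    return 4;
}

static int found = 0;                       /* shared "solution found" flag */

static int solution_found(void) {
    int f;
    #pragma omp atomic read
    f = found;
    return f;
}

static void explore(state_t s) {
    if (solution_found()) return;           /* another task already succeeded */
    if (is_goal(&s)) {
        #pragma omp atomic write
        found = 1;
        return;
    }
    state_t succ[4];
    int n = expand(&s, succ);
    for (int i = 0; i < n && !solution_found(); i++)
        explore(succ[i]);                   /* depth-first search of this subtree */
}

int main(void) {
    state_t frontier[8];                    /* generated serially in a real solver */
    for (int i = 0; i < 8; i++) frontier[i] = (state_t){ 0 };

    #pragma omp parallel for schedule(dynamic, 1)
    for (int i = 0; i < 8; i++)
        explore(frontier[i]);               /* one exploratory task per frontier node */

    printf("solution %s\n", solution_found() ? "found" : "not found");
    return 0;
}
```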
  • 90. Exploratory Decomposition, Example-1: 15-puzzle problem (4/4)
  • 91. Exploratory vs. Data Decomposition (7/10) ■ The tasks induced by data decomposition are performed in their entirety, and each task performs useful computations towards the solution of the problem. ■ In exploratory decomposition, unfinished tasks can be terminated as soon as an overall solution is found. ■ The portion of the search space searched (and the aggregate amount of work performed) by a parallel formulation can be different from that searched by a serial algorithm. ■ The work performed by the parallel formulation can be either smaller or greater than that performed by the serial algorithm.
  • 92. Exploratory Decomposition: Example-2 (1/3) ■ Example: consider a search space that has been partitioned into four concurrent tasks as shown in Fig. ■ If the solution lies right at the beginning of the search space corresponding to task 3 (Fig. (a)), then it will be found almost immediately by the parallel formulation. ■ The serial algorithm would have found the solution only after performing work equivalent to searching the space corresponding to tasks 1 and 2.
  • 93. Exploratory Decomposition: Example-2 (2/3)
  • 94. Exploratory Decomposition: Example-2 (3/3) ■ On the other hand, if the solution lies towards the end of the search space corresponding to task 1 (Fig. (b)), then the parallel formulation will perform almost four times the work of the serial algorithm and will yield no speedup. ■ This behavior can result in super-linear or sub-linear speedups.
  • 95. 3.2.4 Speculative Decomposition (1/9) ■ In some applications, dependencies between tasks are not known a-priori. ■ For such applications, it is impossible to identify independent tasks. ■ This arises when a program may take one of many possible computationally significant branches depending on the output of other computations that precede it.
  • 96. Speculative Decomposition (2/9) ■ There are generally two approaches to dealing with such applications: conservative approaches, which identify independent tasks only when they are guaranteed to not have dependencies, and optimistic approaches, which schedule tasks even when they may potentially be erroneous. ■ Conservative approaches may yield little concurrency, and optimistic approaches may require roll-back.
  • 97. Speculative Decomposition (3/9) ■ This scenario is similar to evaluating one or more of the branches of a switch statement in C in parallel before the input for the switch is available. ■ While one task is performing the computation that will eventually resolve the switch, other tasks could pick up the multiple branches of the switch in parallel. ■ When the input for the switch has finally been computed, the computation corresponding to the correct branch is used, while that corresponding to the other branches is discarded. ■ The parallel run time is smaller than the serial run time by the amount of time required to evaluate the condition on which the next task depends, because the parallel formulation uses this time to perform useful computation speculatively.
  • 98. Speculative Decomposition ■ This parallel formulation of a switch guarantees at least some wasteful computation. ■ In order to minimize the wasted computation, a slightly different formulation of speculative decomposition could be used, especially in situations where one of the outcomes of the switch is more likely than the others. ■ In this case, only the most promising branch is taken up as a task in parallel with the preceding computation. ■ In case the outcome of the switch is different from what was anticipated, the computation is rolled back and the correct branch of the switch is taken.
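A minimal sketch of the first (evaluate-all-branches) formulation in C with OpenMP sections: the condition and both branch computations run concurrently, and only the result selected by the condition is kept. The slow_condition, branch_a, and branch_b functions are hypothetical stand-ins for computationally significant work, and the sleeps only simulate cost.

```c
/* Speculative decomposition sketch: evaluate both "switch" branches while
 * the condition they depend on is still being computed. */
#include <omp.h>
#include <stdio.h>
#include <unistd.h>

static int slow_condition(void) { sleep(1); return 1; }   /* decides the branch */
static double branch_a(void)    { sleep(1); return 3.14; }
static double branch_b(void)    { sleep(1); return 2.71; }

int main(void) {
    int cond = 0;
    double ra = 0.0, rb = 0.0, result;

    #pragma omp parallel sections
    {
        #pragma omp section
        cond = slow_condition();     /* the computation the switch depends on */
        #pragma omp section
        ra = branch_a();             /* speculative work on branch A */
        #pragma omp section
        rb = branch_b();             /* speculative work on branch B */
    }

    result = cond ? ra : rb;         /* keep the correct branch, discard the other */
    printf("result = %f\n", result);
    return 0;
}
```

With at least three threads, the whole thing takes roughly one unit of time instead of two, at the cost of one discarded branch result, which is exactly the wasteful computation the slide describes.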
  • 99. Speculative Decomposition: Example-1 ■ A classic example of speculative decomposition is discrete event simulation. ■ The central data structure in a discrete event simulation is a time-ordered event list. ■ Events are extracted precisely in time order, processed, and if required, resulting events are inserted back into the event list.
  • 100. Speculative Decomposition: Example-1 ■ Consider your day today as a discrete event system: you get up, get ready, drive to work, work, eat lunch, work some more, drive back, eat dinner, and sleep. ■ Each of these events may be processed independently; however, in driving to work, you might meet with an unfortunate accident and not get to work at all. ■ Therefore, an optimistic scheduling of other events will have to be rolled back.
  • 101. Speculative Decomposition: Example-2 ■ Another example is the simulation of a network of nodes (for instance, an assembly line or a computer network through which packets pass). ■ The task is to simulate the behavior of this network for various inputs and node delay parameters (note that networks may become unstable for certain values of service rates, queue sizes, etc.).
  • 102. Speculative Decomposition: Example-2
  • 103. Speculative Decomposition: Example-2 ■ The problem of simulating a sequence of input jobs on the network described in this example appears inherently sequential, because the input of a typical component is the output of another. ■ However, we can define speculative tasks that start simulating a subpart of the network, each assuming one of several possible inputs to that stage. ■ When an actual input to a certain stage becomes available (as a result of the completion of another selector task from a previous stage), then all or part of the work required to simulate this input would have already been finished if the speculation was correct, or the simulation of this stage is restarted with the most recent correct input if the speculation was incorrect.
  • 104. Speculative vs. Exploratory Decomposition ■ In exploratory decomposition: ■ the output of the multiple tasks originating at a branch is unknown; ■ the serial algorithm may explore different alternatives one after the other, because the branch that may lead to the solution is not known beforehand; ■ hence the parallel program may perform more, less, or the same amount of aggregate work compared to the serial algorithm, depending on the location of the solution in the search space. ■ In speculative decomposition: ■ the input at a branch leading to multiple parallel tasks is unknown; ■ the serial algorithm would strictly perform only one of the tasks at a speculative stage, because when it reaches the beginning of that stage, it knows exactly which branch to take; ■ hence a parallel program employing speculative decomposition performs more aggregate work than its serial counterpart.
  • 105. 3.2.5 Hybrid Decompositions ■ Decomposition techniques are not exclusive, and can often be combined. ■ Often, a computation is structured into multiple stages, and it is sometimes necessary to apply different types of decomposition in different stages.
  • 106. Hybrid Decompositions, Example 1: Finding the minimum ■ While finding the minimum of a large set of n numbers, a purely recursive decomposition may result in far more tasks than the number of processes, P, available. ■ An efficient decomposition would partition the input into P roughly equal parts and have each task compute the minimum of the sequence assigned to it.
  • 107. Hybrid Decompositions: Finding the minimum
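A hedged C/OpenMP sketch of this hybrid scheme: stage 1 is a data decomposition (each of p tasks scans one chunk for a local minimum), and stage 2 combines the p partial minima with a tree-style (recursive) reduction. The chunking and the names are illustrative, not from the slides.

```c
/* Hybrid decomposition sketch: data decomposition + recursive combination. */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <float.h>

int main(void) {
    const long N = 10000000;
    double *x = malloc(N * sizeof(double));
    srand(42);
    for (long i = 0; i < N; i++) x[i] = (double)rand() / RAND_MAX;

    int p = omp_get_max_threads();
    double *local = malloc(p * sizeof(double));

    /* Stage 1: data decomposition, one local minimum per chunk */
    #pragma omp parallel num_threads(p)
    {
        int id = omp_get_thread_num();
        long lo = id * N / p, hi = (id + 1) * N / p;
        double m = DBL_MAX;
        for (long i = lo; i < hi; i++)
            if (x[i] < m) m = x[i];
        local[id] = m;
    }

    /* Stage 2: recursive (tree) combination of the p partial minima */
    for (int stride = 1; stride < p; stride *= 2) {
        #pragma omp parallel for
        for (int i = 0; i + stride < p; i += 2 * stride)
            if (local[i + stride] < local[i]) local[i] = local[i + stride];
    }

    printf("min = %f\n", local[0]);
    free(local); free(x);
    return 0;
}
```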
  • 108. Hybrid Decompositions, Example 2: Quicksort in parallel ■ Recursive decomposition has been used in quicksort. ■ This formulation results in O(n) tasks for the problem of sorting a sequence of size n. ■ But due to the dependencies among these tasks and due to uneven sizes of the tasks, the effective concurrency is quite limited. ■ For example, the first task, splitting the input list into two parts, takes O(n) time, which puts an upper limit on the performance gain possible via parallelization. ■ The step of splitting lists performed by tasks in parallel quicksort can also be decomposed using the input-decomposition technique. ■ The resulting hybrid decomposition, which combines recursive decomposition with input (data) decomposition, works much better.
  • 113. Input Data Decomposition • Generally applicable if each output can be naturally computed as a function of the input. In many cases, this is the only natural decomposition because the output is not clearly known a-priori (e.g., the problem of finding the minimum in a list, sorting a given list, etc.). • A task is associated with each input data partition. The task performs as much of the computation with its part of the data. Subsequent processing combines these partial results.
  • 114. Exploratory Decomposition • In many cases, the decomposition of the problem goes hand-in-hand with its execution. • These problems typically involve the exploration (search) of a state space of solutions. • Problems in this class include a variety of discrete optimization problems (0/1 integer programming, QAP, etc.), theorem proving, game playing, etc.
  • 117. Speculative Decomposition: Example Another example is the simulation of a network of nodes (for instance, an assembly line or a computer network through which packets pass). The task is to simulate the behavior of this network for various inputs and node delay parameters (note that networks may become unstable for certain values of service rates, queue sizes, etc.).
  • 118. Hybrid Decompositions Often, a mix of decomposition techniques is necessary for decomposing a problem. Consider the following examples: • In quicksort, recursive decomposition alone limits concurrency (Why?). A mix of data and recursive decompositions is more desirable. • In discrete event simulation, there might be concurrency in task processing. A mix of speculative decomposition and data decomposition may work well. • Even for simple problems like finding a minimum of a list of numbers, a mix of data and recursive decomposition works well.
  • 120. Characteristics of Tasks and Interactions • We shall discuss the various properties of tasks and inter-task interactions that affect the choice of a good mapping. • Characteristics of Tasks • Task Generation • Task Sizes • Knowledge of Task Sizes • Size of Data Associated with Tasks
  • 148. Task Generation • Static task generation: Concurrent tasks can be identified a-priori. Typical matrix operations, graph algorithms, image processing applications, and other regularly structured problems fall in this class. These can typically be decomposed using data or recursive decomposition techniques. • Dynamic task generation: Tasks are generated as we perform computation. A classic example of this is in game playing: each 15-puzzle board is generated from the previous one. These applications are typically decomposed using exploratory or speculative decompositions.
  • 152. Task Sizes • Task sizes may be uniform (i.e., all tasks are the same size) or non-uniform. • Non-uniform task sizes may be such that they can be determined (or estimated) a-priori or not. • Examples in this class include discrete optimization problems, in which it is difficult to estimate the effective size of a state space.
  • 153. Size of Data Associated with Tasks • The size of data associated with a task may be small or large when viewed in the context of the size of the task. • A small context of a task implies that an algorithm can easily communicate this task to other processes dynamically (e.g., the 15 puzzle). • A large context ties the task to a process, or alternately, an algorithm may attempt to reconstruct the context at another process as opposed to communicating the context of the task (e.g., 0/1 integer programming).
  • 154. Characteristics of Task Interactions • Tasks may communicate with each other in various ways. The associated dichotomy is: • Static interactions: The tasks and their interactions are known a-priori. These are relatively simpler to code into programs. • Dynamic interactions: The timing or the set of interacting tasks cannot be determined a-priori. These interactions are harder to code, especially, as we shall see, using message passing APIs.
  • 155. Characteristics of Task Interactions • Regular interactions: There is a definite pattern (in the graph sense) to the interactions. These patterns can be exploited for efficient implementation. • Irregular interactions: Interactions lack well-defined topologies.
  • 156. Characteristics of Task Interactions: Example A simple example of a regular static interaction pattern is in image dithering. The underlying communication pattern is a structured (2-D mesh) one as shown here:
  • 157. Characteristics of Task Interactions: Example The multiplication of a sparse matrix with a vector is a good example of a static irregular interaction pattern. Here is an example of a sparse matrix and its associated interaction pattern.
  • 158. Characteristics of Task Interactions • Interactions may be read-only or read-write. • In read-only interactions, tasks just read data items associated with other tasks. • In read-write interactions, tasks read, as well as modify, data items associated with other tasks. • In general, read-write interactions are harder to code, since they require additional synchronization primitives.
  • 159. Characteristics of Task Interactions • Interactions may be one-way or two-way. • A one-way interaction can be initiated and accomplished by one of the two interacting tasks. • A two-way interaction requires participation from both tasks involved in an interaction. • One-way interactions are somewhat harder to code in message passing APIs.
  • 160. Mapping Techniques • Once a problem has been decomposed into concurrent tasks, these must be mapped to processes (that can be executed on a parallel platform). • Mappings must minimize overheads. • Primary overheads are communication and idling. • Minimizing these overheads often represents contradicting objectives. • Assigning all work to one processor trivially minimizes communication at the expense of significant idling.
  • 161. Mapping Techniques for Minimum Idling Mapping must simultaneously minimize idling and balance the load. Merely balancing load does not minimize idling.
  • 162. Mapping Techniques for Minimum Idling Mapping techniques can be static or dynamic. • Static Mapping: Tasks are mapped to processes a-priori. For this to work, we must have a good estimate of the size of each task. Even in these cases, the problem may be NP-complete. • Dynamic Mapping: Tasks are mapped to processes at runtime. This may be because the tasks are generated at runtime, or because their sizes are not known.
  • 163. Schemes for Static Mapping • Mappings based on data partitioning. • Mappings based on task graph partitioning. • Hybrid mappings.
  • 164. Mappings Based on Data Partitioning We can combine data partitioning with the "owner-computes" rule to partition the computation into subtasks. The simplest data decomposition schemes for dense matrices are 1-D block distribution schemes.
  • 165. Block Array Distribution Schemes Block distribution schemes can be generalized to higher dimensions as well.
  • 166. Block Array Distribution Schemes: Examples • For multiplying two dense matrices A and B, we can partition the output matrix C using a block decomposition. • For load balance, we give each task the same number of elements of C. (Note that each element of C corresponds to a single dot product.) • The choice of precise decomposition (1-D or 2-D) is determined by the associated communication overhead. • In general, higher dimension decomposition allows the use of larger number of processes.
  • 167. Data Sharing in Dense Matrix Multiplication
  • 168. Cyclic and Block Cyclic Distributions • If the amount of computation associated with data items varies, a block decomposition may lead to significant load imbalances. • A simple example of this is in LU decomposition (or Gaussian Elimination) of dense matrices.
  • 169. LU Factorization of a Dense Matrix A decomposition of LU factorization into 14 tasks (the individual task definitions appear in the figure); notice the significant load imbalance.
  • 170. Block Cyclic Distributions • Variation of the block distribution scheme that can be used to alleviate the load-imbalance and idling problems. • Partition an array into many more blocks than the number of available processes. • Blocks are assigned to processes in a round-robin manner so that each process gets several non-adjacent blocks.
  • 171. Block-Cyclic Distribution for Gaussian Elimination The active part of the matrix in Gaussian Elimination changes. By assigning blocks in a block-cyclic fashion, each processor receives blocks from different parts of the matrix.
  • 172. Block-Cyclic Distribution: Examples One- and two-dimensional block-cyclic distributions among 4 processes.
  • 173. Block-Cyclic Distribution • A cyclic distribution is a special case in which the block size is one. • A block distribution is a special case in which the block size is n/p, where n is the dimension of the matrix and p is the number of processes.
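The owner of a global index under these distributions is easy to compute. The small C sketch below (the function and variable names are mine, not the slides') prints the 1-D block-cyclic owner map for block sizes 1 (cyclic), 2 (block-cyclic), and n/p (plain block), which makes the "special case" relationship above concrete.

```c
/* 1-D block-cyclic ownership sketch: block i/b is dealt round-robin to p processes. */
#include <stdio.h>

static int owner_block_cyclic(int i, int b, int p) {
    return (i / b) % p;          /* process that owns global index i */
}

int main(void) {
    int n = 16, p = 4;
    int sizes[] = { 1, 2, n / p };             /* cyclic, block-cyclic, block */
    for (int s = 0; s < 3; s++) {
        printf("block size b = %d: ", sizes[s]);
        for (int i = 0; i < n; i++)
            printf("%d ", owner_block_cyclic(i, sizes[s], p));
        printf("\n");
    }
    return 0;
}
```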
  • 174. Graph Partitioning Based Data Decomposition • In the case of sparse matrices, block decompositions are more complex. • Consider the problem of multiplying a sparse matrix with a vector. • The graph of the matrix is a useful indicator of the work (number of nodes) and communication (the degree of each node). • In this case, we would like to partition the graph so as to assign an equal number of nodes to each process, while minimizing the edge count of the graph partition.
  • 175. Partitioning the Graph of Lake Superior Random Partitioning Partitioning for minimum edge-cut.
  • 176. Mappings Based on Task Partitioning • Partitioning a given task-dependency graph across processes. • Determining an optimal mapping for a general task-dependency graph is an NP-complete problem. • Excellent heuristics exist for structured graphs.
  • 177. Task Partitioning: Mapping a Binary Tree Dependency Graph The example illustrates the dependency graph of one view of quicksort and how it can be assigned to processes in a hypercube.
  • 178. Task Partitioning: Mapping a Sparse Graph Sparse graph for computing a sparse matrix-vector product and its mapping.
  • 179. Hierarchical Mappings • Sometimes a single mapping technique is inadequate. • For example, the task mapping of the binary tree (quicksort) cannot use a large number of processors. • For this reason, task mapping can be used at the top level and data partitioning within each level.
  • 180. An example of task partitioning at top level with data partitioning at the lower level.
  • 181. Schemes for Dynamic Mapping • Dynamic mapping is sometimes also referred to as dynamic load balancing, since load balancing is the primary motivation for dynamic mapping. • Dynamic mapping schemes can be centralized or distributed.
  • 182. Centralized Dynamic Mapping • Processes are designated as masters or slaves. • When a process runs out of work, it requests more work from the master. • When the number of processes increases, the master may become the bottleneck. • To alleviate this, a process may pick up a number of tasks (a chunk) at one time. This is called chunk scheduling. • Selecting large chunk sizes may lead to significant load imbalances as well.
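OpenMP's dynamic loop schedule is a convenient, if simplified, stand-in for centralized chunk scheduling: idle threads fetch the next chunk of iterations from a shared pool. The sketch below uses a deliberately non-uniform, hypothetical work() function so that a static mapping would be load-imbalanced; the task count and chunk size are illustrative.

```c
/* Chunk-scheduling sketch via OpenMP's dynamic schedule. */
#include <omp.h>
#include <stdio.h>

static double work(int i) {                 /* cost grows with the task index */
    double s = 0.0;
    for (int k = 0; k < (i + 1) * 100; k++) s += k * 1e-9;
    return s;
}

int main(void) {
    const int ntasks = 1000, chunk = 16;
    double total = 0.0;

    double t0 = omp_get_wtime();
    #pragma omp parallel for schedule(dynamic, chunk) reduction(+:total)
    for (int i = 0; i < ntasks; i++)
        total += work(i);                   /* an idle thread grabs the next chunk */
    double t1 = omp_get_wtime();

    printf("total=%f  time=%fs  (chunk=%d)\n", total, t1 - t0, chunk);
    return 0;
}
```

Increasing the chunk size reduces how often threads go back to the shared pool (less scheduling overhead) but, as the slide notes, very large chunks reintroduce load imbalance.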
  • 183. Distributed Dynamic Mapping • Each process can send work to or receive work from other processes. • This alleviates the bottleneck in centralized schemes. • There are four critical questions: how are sending and receiving processes paired together, who initiates the work transfer, how much work is transferred, and when is a transfer triggered? • Answers to these questions are generally application specific. We will look at some of these techniques later in this class.
  • 184. Minimizing Interaction Overheads • Maximize data locality: Where possible, reuse intermediate data. Restructure computation so that data can be reused in smaller time windows. • Minimize volume of data exchange: There is a cost associated with each word that is communicated. For this reason, we must minimize the volume of data communicated. • Minimize frequency of interactions: There is a startup cost associated with each interaction. Therefore, try to merge multiple interactions to one, where possible.
  • 185. Minimizing Interaction Overheads (continued) • Overlapping computations with interactions: Use non-blocking communications, multithreading, and prefetching to hide latencies. • Replicating data or computations. • Using group communications instead of point-to-point primitives. • Overlapping interactions with other interactions.
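As one concrete way to overlap computation with interaction, the hedged MPI sketch below posts non-blocking MPI_Irecv/MPI_Isend calls, performs independent work while the transfer is in flight, and only then waits before touching the received buffer. The ring-neighbor pattern and buffer sizes are illustrative, not prescribed by the slides.

```c
/* Overlap sketch with MPI non-blocking communication (compile with mpicc). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    enum { N = 1024 };
    double sendbuf[N], recvbuf[N], local[N];
    for (int i = 0; i < N; i++) { sendbuf[i] = rank; local[i] = i; }

    int right = (rank + 1) % size, left = (rank - 1 + size) % size;
    MPI_Request reqs[2];

    /* start the exchange, but do not wait for it yet */
    MPI_Irecv(recvbuf, N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* useful computation that does not depend on recvbuf overlaps the transfer */
    double acc = 0.0;
    for (int i = 0; i < N; i++) acc += local[i] * local[i];

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);   /* now recvbuf is safe to use */
    acc += recvbuf[0];

    printf("rank %d: acc = %f\n", rank, acc);
    MPI_Finalize();
    return 0;
}
```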
  • 186. Parallel Algorithm Models An algorithm model is a way of structuring a parallel algorithm by selecting a decomposition and mapping technique and applying the appropriate strategy to minimize interactions. • Data Parallel Model: Tasks are statically (or semi-statically) mapped to processes and each task performs similar operations on different data. • Task Graph Model: Starting from a task-dependency graph, the interrelationships among the tasks are utilized to promote locality or to reduce interaction costs.
  • 187. Parallel Algorithm Models (continued) • Master-Slave Model: One or more processes generate work and allocate it to worker processes. This allocation may be static or dynamic. • Pipeline / Producer-Consumer Model: A stream of data is passed through a succession of processes, each of which performs some task on it. • Hybrid Models: A hybrid model may be composed either of multiple models applied hierarchically or multiple models applied sequentially to different phases of a parallel algorithm.