3. What is an Algorithm?
• An algorithm is a sequence of steps that takes inputs
from the user and, after some computation, produces an
output.
• Based on their architecture, there are two types of computers:
– Sequential Computer
– Parallel Computer
4. Cont..
• Depending on the architecture of computers, we have two types of
algorithms
• Sequential Algorithm − An algorithm in which consecutive steps of
instructions are executed in chronological order to solve a problem.
• Parallel Algorithm − An algorithm in which the problem is divided into
sub-problems that are executed in parallel to get individual outputs. Later on,
these individual outputs are combined together to get the final desired output.
OR
• A parallel algorithm is an algorithm that can execute several instructions
simultaneously on different processing devices and then combine all the
individual outputs to produce the final result.
5. Constructing a Parallel Algorithm
• identify portions of work that can be performed
concurrently
• map concurrent portions of work onto multiple
processes running in parallel
• distribute a program’s input, output, and
intermediate data
• manage accesses to shared data: avoid conflicts
• synchronize the processes at stages of the parallel
program execution
7. Communications
• Who Needs Communications? The need for communications
between tasks depends upon your problem:
• YOU DON'T NEED COMMUNICATIONS
– Some types of problems can be decomposed and executed in parallel with
virtually no need for tasks to share data. These types of problems are often
called embarrassingly parallel - little or no communications are required
• NEED FOR COMMUNICATIONS: Most parallel applications are not quite so
simple, and do require tasks to share data with each other.
• For example, a 2-D heat diffusion problem requires a task to know the
temperatures calculated by the tasks that have neighboring data. Changes to
neighboring data have a direct effect on that task's data.
8. COMMUNICATION OVERHEAD
• Inter-task communication virtually always implies overhead.
• Machine cycles and resources that could be used for computation are instead used to
package and transmit data.
• Communications frequently require some type of synchronization between tasks,
which can result in tasks spending time "waiting" instead of doing work.
• Competing communication traffic can saturate the available network bandwidth,
further aggravating performance problems.
9. SYNCHRONOUS (COORDINATION) VS. ASYNCHRONOUS COMMUNICATIONS
• Synchronous communications require some type of "handshaking" between tasks
that are sharing data. This can be explicitly structured in code by the programmer, or
it may happen at a lower level unknown to the programmer.
• Synchronous communications are often referred to as blocking communications since
other work must wait until the communications have completed.
• Asynchronous communications allow tasks to transfer data independently from
one another. For example, task 1 can prepare and send a message to task 2, and
then immediately begin doing other work. When task 2 actually receives the data
doesn't matter.
– Asynchronous communications are often referred to as non-
blocking communications since other work can be done while the
communications are taking place.
17. Degree of Concurrency
• Degree of concurrency: the number of tasks that can execute in
parallel
– Maximum degree of concurrency: the largest number of
concurrent tasks at any point of the execution
– Average degree of concurrency: the average number of
tasks that can be executed concurrently over the whole execution
– Degree of concurrency vs. task granularity: typically, a finer
task granularity yields a higher degree of concurrency
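The two measures above can be computed mechanically from a task dependency graph. The sketch below uses a small hypothetical graph (tasks A–F are illustrative, not from the slides) and groups tasks into "levels" of tasks whose dependencies are already finished:

```python
# Sketch: measuring the degree of concurrency of a task dependency graph.
# The graph below is a hypothetical example; each task maps to the set of
# tasks it depends on.
deps = {
    "A": set(), "B": set(), "C": set(),
    "D": {"A", "B"}, "E": {"B", "C"},
    "F": {"D", "E"},
}

def concurrency_profile(deps):
    """Return the list of 'levels': groups of tasks that can run together."""
    remaining, done, levels = dict(deps), set(), []
    while remaining:
        # every task whose dependencies are all finished can run now
        ready = [t for t, d in remaining.items() if d <= done]
        levels.append(sorted(ready))
        done.update(ready)
        for t in ready:
            del remaining[t]
    return levels

levels = concurrency_profile(deps)
max_degree = max(len(level) for level in levels)  # largest concurrent group
avg_degree = len(deps) / len(levels)              # total tasks / parallel steps
```

For this graph, A, B, and C can start together, so the maximum degree of concurrency is 3, while the average over the three levels is 2.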
22. • Speedup = sequential execution time / parallel execution time
• Parallel efficiency = sequential execution time / (parallel
execution time × processors used)
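The two formulas above translate directly into code. A minimal sketch, with illustrative timings (100 s serial, 30 s parallel on 4 processors):

```python
def speedup(t_serial, t_parallel):
    """Speedup = sequential execution time / parallel execution time."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, p):
    """Parallel efficiency = speedup / processors used."""
    return t_serial / (t_parallel * p)

# Hypothetical measurements: 100 s serially, 30 s on 4 processors.
s = speedup(100.0, 30.0)        # ~3.33x
e = efficiency(100.0, 30.0, 4)  # ~0.83, i.e. 83% of ideal
```

Efficiency near 1 means the processors are doing almost no redundant or idle work; overheads such as communication push it below 1.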
24. Task Interaction Graph
• The pattern of interaction among tasks is captured by
what is known as the task interaction graph.
• Nodes = tasks
• Edges connect tasks that interact with each other
38. 3.1.3 Processes and Mapping (1/5)
■ In general, the number of tasks in a decomposition
exceeds the number of processing elements
available.
■ For this reason, a parallel algorithm must also
provide a mapping of tasks to processes.
39. Processes and Mapping (2/5)
Note: We refer to the mapping as being from tasks to
processes, as opposed to processors. This is because typical
programming APIs, as we shall see, do not allow easy
binding of tasks to physical processors. Rather, we
aggregate tasks into processes and rely on the system to
map these processes to physical processors (CPUs).
We use processes, not in the UNIX sense of a process,
rather, simply as a collection of tasks and associated data.
40. Processes and Mapping (3/5)
Dr. Hanif Durad
■ Appropriate mapping of tasks to processes is critical to the parallel
performance of an algorithm.
■ Mappings are determined by both the task dependency and task interaction
graphs.
■ Task dependency graphs can be used to ensure that work is equally spread
across all processes at any point (minimum idling and optimal load balance).
■ Task interaction graphs can be used to make sure that processes need minimum
interaction with other processes (minimum communication).
41. Processes and Mapping (4/5)
An appropriate mapping must minimize parallel
execution time by:
■ Mapping independent tasks to different processes.
■ Assigning tasks on critical path to processes as soon as they become available.
■ Minimizing interaction between processes by mapping tasks with dense
interactions to the same process.
42. Processes and Mapping (5/5)
■ Difficulty: criteria often conflict with one
another
■ e.g., leaving the problem as a single undecomposed task minimizes
interactions, but gives no speedup at all!
43. Processes and Mapping: Example
(1/2)
■ Mapping tasks in the database query
decomposition to processes.
■ These mappings were arrived at by viewing the dependency graph in
terms of levels (no two nodes in a level have dependencies).
■ Tasks within a single level are then assigned to different processes.
45. 3.1.4 Processes vs. processors (1/2)
■ Processors are physical resources.
■ Processes provide a convenient way of abstracting or virtualizing a multiprocessor.
■ Write parallel programs in terms of processes, not physical processors;
mapping processes to processors is a subsequent step.
■ The number of processes does not have to be the same as the number of processors
available to the program.
■ If there are more processes, they are multiplexed onto the available processors;
if there are fewer processes, then some processors will remain idle.
46. Processes vs. processors (2/2)
■ The correct OS definition of a process is that of an
address space and one or more threads of control that
share that address space.
■ Thus, processes and threads are distinguished in that
definition.
■ In what follows, we assume that a process has only one
thread of control.
47. Decomposition Techniques for Parallel
Algorithms
• So how does one decompose a task into various subtasks?
• While there is no single recipe that works for all problems, we
present a set of commonly used techniques that apply to
broad classes of problems. These include:
1. recursive decomposition
2. data decomposition
3. exploratory decomposition
4. speculative decomposition
48. 1. Recursive Decomposition
• Generally suited to problems that are solved using
the divide-and-conquer strategy.
• A given problem is first decomposed into a set of sub
problems.
• These sub-problems are recursively decomposed
further until a desired granularity is reached.
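The steps above can be sketched concretely. In this illustrative example (the data and `grain` parameter are assumptions, not from the slides), a minimum-finding problem is recursively split in half until sub-problems reach the desired granularity; the resulting leaves are independent and can be solved concurrently:

```python
from concurrent.futures import ThreadPoolExecutor

def split(data, grain):
    """Recursively decompose a problem into sub-problems until the
    desired granularity (leaf size <= grain) is reached."""
    if len(data) <= grain:
        return [data]
    mid = len(data) // 2
    return split(data[:mid], grain) + split(data[mid:], grain)

data = [29, 3, 17, 8, 42, 1, 56, 23]
subproblems = split(data, grain=2)       # independent sub-problems
with ThreadPoolExecutor() as pool:       # leaves are solved concurrently
    partial = list(pool.map(min, subproblems))
result = min(partial)                    # combine step of divide-and-conquer
```

The decomposition itself is serial and cheap; the concurrency comes from the independent leaf tasks and, in a fuller formulation, from combining partial results in a tree.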
54. 3.2.2 Data Decomposition (1/3)
■ Identify the data on which computations are performed.
■ Partition this data across various tasks.
■ This partitioning induces a decomposition of the problem.
■ Data can be partitioned in various ways; this critically impacts the performance
of a parallel algorithm.
55. 3.2.2 Data Decomposition (2/3)
■ A powerful and commonly used method for deriving
concurrency in algorithms that operate on large data structures.
■ The decomposition of computations is done in two steps.
1. The data on which the computations are performed is
partitioned,
2. This data partitioning is used to induce a partitioning of the
computations into tasks.
56. Data Decomposition (3/3)
■ The operations that these tasks perform on different data
partitions are usually similar (e.g., matrix multiplication that
follows) or are chosen from a small set of operations (e.g., LU
factorization).
■ One must explore and evaluate all possible ways of partitioning
the data and determine which one yields a natural and efficient
computational decomposition.
57. 3.2.2.1 Partitioning Output Data
(1/10)
■ In many computations, each element of the output can be
computed independently of others as a function of the input.
■ In such computations, a partitioning of the output data
automatically induces a decomposition of the problems into tasks
■ each task is assigned the work of computing a portion of the
output
58. Partitioning Output Data: Example-1
Consider the problem of multiplying two n x n matrices A and B to
yield matrix C. The output matrix C can be partitioned into four
tasks as follows:
59. Partitioning Output Data: Example-1 (2/3)
■ The 4 sub-matrices of C, roughly of size (n/2 x n/2) each, are then
independently computed by 4 tasks as the sums of the appropriate products
of sub-matrices of A and B.
■ Other task decompositions are possible.
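A minimal sketch of this output partitioning, with a small illustrative 4 x 4 case (the matrices and helper names are assumptions): each of the four tasks owns one (n/2 x n/2) block of C and computes it independently.

```python
from concurrent.futures import ThreadPoolExecutor

n = 4
A = [[i * n + j for j in range(n)] for i in range(n)]
B = [[1 if i == j else 0 for j in range(n)] for i in range(n)]  # identity, so C == A

def block_task(ri, cj):
    """One task: compute the (ri, cj) sub-block of C = A*B, size n/2 x n/2.
    It reads whole rows of A and columns of B but writes only its own block."""
    h = n // 2
    out = [[0] * h for _ in range(h)]
    for i in range(h):
        for j in range(h):
            out[i][j] = sum(A[ri * h + i][k] * B[k][cj * h + j] for k in range(n))
    return out

# Four independent tasks, one per output block: output-data partitioning.
with ThreadPoolExecutor(max_workers=4) as pool:
    blocks = {(r, c): pool.submit(block_task, r, c)
              for r in range(2) for c in range(2)}

# Reassemble C from the four computed blocks.
C = [[blocks[(i // 2, j // 2)].result()[i % 2][j % 2] for j in range(n)]
     for i in range(n)]
```

Because no two tasks write the same block, they need no synchronization beyond the final join.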
61. Partitioning Output Data: Example-2 (1/6)
■ Computing frequencies of itemsets in a transaction database.
■ Given a set T containing n transactions and a set I
containing m itemsets.
■ Each transaction and itemset contains a small number of items, out of a
possible set of items.
62. Partitioning Output Data: Example-2 (2/6)
■ Example: T is a grocery store's database of customer sales, with each
transaction being an individual grocery list of a shopper; an itemset could be a
group of items in the store.
■ If the store desires to find out how many customers bought each of the
designated groups of items, then it would need to find the number of times that
each itemset in I appears in all the transactions (the number of transactions of
which each itemset is a subset).
65. Partitioning Output Data: Example-2 (5/6)
■ Fig. (a) shows an example
■ The database shown consists of 10 transactions, and
■ We are interested in computing the frequency of the 8 itemsets shown in
the second column.
■ The actual frequencies of these itemsets in the database (output) are shown
in the third column.
■ For instance, itemset {D, K} appears twice, once in
the second and once in the ninth transaction.
66. Partitioning Output Data: Example-2 (6/6)
■ Fig. (b) shows how the computation of frequencies of the itemsets can be
decomposed into 2 tasks, by partitioning the output into 2 parts and having
each task compute its half of the frequencies.
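This decomposition can be sketched directly. The tiny transaction database and itemsets below are illustrative stand-ins (not the figure's data): the output, the list of frequencies, is split between two tasks, and each task scans all transactions but counts only its own itemsets.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical transaction database T and itemset list I.
T = [{"A", "B", "C"}, {"D", "K"}, {"A", "C"}, {"B", "D", "K"}, {"A", "B"}]
I = [{"A"}, {"A", "B"}, {"C"}, {"D", "K"}]

def count_itemsets(itemsets):
    """One task: frequencies of its share of the itemsets over ALL transactions
    (an itemset 'appears' in a transaction when it is a subset of it)."""
    return [sum(1 for t in T if s <= t) for s in itemsets]

# Output partitioning: split the itemsets (the output) between two tasks.
half = len(I) // 2
with ThreadPoolExecutor(max_workers=2) as pool:
    lo = pool.submit(count_itemsets, I[:half])
    hi = pool.submit(count_itemsets, I[half:])
freq = lo.result() + hi.result()
```

Each frequency is computed by exactly one task, so no combining step beyond concatenation is needed; this is the defining property of output partitioning.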
Dr. Hanif Durad
(10/10)
67. 3.2.2.2 Partitioning Input Data
(1/6)
■ Remark: Partitioning of output data can be performed only if
each output can be naturally computed as a function of the
input.
■ In many algorithms, it is not possible or desirable to partition
the output data. For example:
■ While finding the minimum, maximum, or the sum of a set of numbers,
the output is a single unknown value.
■ In a sorting algorithm, the individual elements of the output cannot be
efficiently determined in isolation.
68. Partitioning Input Data (2/6)
■ It is sometimes possible to partition the input data, and
then use this partitioning to induce concurrency.
■ A task is created for each partition of the input data, and this task
performs as much computation as possible using these local data.
■ Solutions to tasks induced by input partitions may not directly solve the
original problem. In such cases, a follow-up computation is needed to
combine the results.
69. Partitioning Input Data: Example-1
■ Example: finding the sum of N numbers using p
processes (N > p):
■ We can partition the input into p subsets of nearly equal
sizes.
■ Each task then computes the sum of the numbers in one of
the subsets.
■ Finally, the p partial results can be added up to yield the
final result.
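The three steps above map directly onto code. A minimal sketch, with illustrative values N = 20 and p = 4 (threads stand in for the p processes):

```python
from concurrent.futures import ThreadPoolExecutor

N, p = 20, 4
data = list(range(1, N + 1))             # 1..20; true sum is 210

# Step 1: partition the input into p subsets of nearly equal size.
chunk = (N + p - 1) // p
parts = [data[i:i + chunk] for i in range(0, N, chunk)]

# Step 2: each task computes the sum of its own subset.
with ThreadPoolExecutor(max_workers=p) as pool:
    partial = list(pool.map(sum, parts))

# Step 3: the follow-up combine step adds the p partial results.
total = sum(partial)
```

Note the combine step: unlike output partitioning, input partitioning generally needs this extra phase because a single subset's result does not solve the original problem.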
70. Partitioning Input Data: Example-2 (1/3)
■ The problem of computing the frequency of a set of itemsets
can also be decomposed based on a partitioning of the input data.
■ Fig. shows a decomposition based on a partitioning of the
input set of transactions.
71. Partitioning Input Data: Example-2 (2/3)
[Figure: each of the two tasks produces an intermediate result.]
72. Partitioning Input Data: Example-2 (3/3)
■ Each of the two tasks computes the frequencies of all the
itemsets in its respective subset of transactions.
■ The two sets of frequencies, which are the independent outputs
of the two tasks, represent intermediate results.
■ Combining the intermediate results by pairwise addition yields
the final result.
73. 3.2.2.3 Partitioning both Input and Output
Data (1/2)
■ In some cases in which it is possible to partition the output data, partitioning
of input data can offer additional concurrency.
■ Example: consider the 4-way decomposition shown in Fig. for computing
itemset frequencies.
■ Both the transaction set and the frequencies are divided into two parts, and
a different one of the four possible combinations is assigned to each of the
four tasks.
■ Each task then computes a local set of frequencies.
■ Finally, the outputs of Tasks 1 and 3 are added together, as are the outputs of
Tasks 2 and 4.
75. 3.2.2.4 Intermediate Data
Partitioning (1/8)
■ Partitioning intermediate data can sometimes lead to higher concurrency than
partitioning input or output data.
■ Often, the intermediate data are not generated explicitly in the serial
algorithm for solving the problem.
■ Some restructuring of the original algorithm may be required to use intermediate
data partitioning to induce a decomposition.
76. Intermediate Data Partitioning
(2/8)
■ Example: matrix multiplication.
■ Recall that the decompositions induced by a 2 x 2 partitioning of the output matrix
C have a maximum degree of concurrency of four.
■ We can increase the degree of concurrency by introducing an intermediate stage in
which 8 tasks compute their respective product submatrices and store the results
in a temporary 3-D matrix D, as shown in Fig.
■ Submatrix D[k,i,j] is the product of A[i,k] and B[k,j].
77. Intermediate Data Partitioning: Example (3/8)
78. Intermediate Data Partitioning
(4/8)
■ A partitioning of the intermediate matrix D induces a decomposition into
eight tasks.
■ After the multiplication phase, a relatively inexpensive matrix addition step
can compute the result matrix C.
■ All submatrices D[*,i,j] with the same second and third dimensions i and j are
added to yield C[i,j].
80. Intermediate Data Partitioning
(6/8)
■ The eight tasks numbered 1 through 8 in Fig. perform O(n^3/8)
work each in multiplying (n/2 x n/2) submatrices of A and B.
■ Then, four tasks numbered 9 through 12 spend O(n^2/4) time each
in adding the appropriate (n/2 x n/2) submatrices of the
intermediate matrix D to yield the final result matrix C.
■ The second Fig. shows the task dependency graph.
82. Intermediate Data Partitioning:
Example (8/8)
The task dependency graph for the
decomposition (shown in previous foil) into 12
tasks is as follows:
83. The Owner Computes Rule (1/2)
■ A decomposition based on partitioning output or input
data is also widely referred to as the owner-computes
rule.
■ The idea behind this rule is that each partition performs
all the computations involving data that it owns.
■ Depending on the nature of the data or the type of data-
partitioning, the owner-computes rule may mean
different things.
84. The Owner Computes Rule (2/2)
■ When we assign partitions of the input data to
tasks, then the owner computes rule means that a
task performs all the computations that can be
done using these data.
■ If we partition the output data, then the owner-computes
rule means that a task computes all the data in the
partition assigned to it.
85. 3.2.3 Exploratory Decomposition
(1/10)
■ In many cases, the decomposition of the problem goes hand-in-hand with its execution.
■ These problems typically involve the exploration (search) of a state space of solutions.
■ Problems in this class include a variety of discrete optimization
problems (0/1 integer programming, QAP (quadratic assignment
problem), etc.), theorem proving, game playing, etc.
86. Exploratory Decomposition (2/10)
Example-1 => 15-puzzle problem
■ Consists of 15 tiles numbered 1 through 15 and one blank
tile placed in a 4 x 4 grid.
■ A tile can be moved into the blank position from a
position adjacent to it, thus creating a blank in the tile's
original position.
■ Four moves are possible: up, down, left, and right.
■ The initial and final configurations of the tiles are
specified.
87. Exploratory Decomposition:
Example-1 => 15-puzzle problem (1/4)
■ The objective is to determine any sequence, or a shortest
sequence, of moves that transforms the initial configuration
to the final configuration.
■ Fig. illustrates sample initial and final configurations and a
sequence of moves leading from the initial configuration to
the final configuration.
88. Exploratory Decomposition:
Example-1 => 15-puzzle problem (2/4)
■ The puzzle is typically solved using tree-search techniques.
■ From the initial configuration, all possible successor configurations
are generated.
■ A configuration may have 2, 3, or 4 possible successors, depending on
occupation of the empty slot by one of its neighbors.
■ The task of finding a path from the initial to the final configuration now
translates to finding a path from one of these newly generated
configurations to the final configuration.
■ Since one of these newly generated configurations must be closer to the
solution by one move, we have made progress towards the solution.
89. Exploratory Decomposition:
Example-1 => 15-puzzle problem (3/4)
■ The configuration space generated by the tree search is the state space
graph.
■ Each node of the graph is a configuration, and each edge of the graph connects
configurations that can be reached from one another by a single move of a tile.
■ One method for solving this problem in parallel:
■ First, a few levels of configurations starting from the initial configuration are
generated serially until the search tree has a sufficient number of leaf nodes.
■ Then each leaf node is assigned to a task to explore further until at least one of
them finds a solution.
■ As soon as one of the concurrent tasks finds a solution, it can inform the others
to terminate their searches.
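The early-termination idea can be sketched with a shared flag. Below, the "state space" is an illustrative list of integers (a stand-in for puzzle configurations), split into four disjoint sub-spaces; the task that finds the target sets a `threading.Event` so the others can stop scanning:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

found = threading.Event()

def search(subspace, target):
    """One task: scan its part of the state space, stopping early
    as soon as any task has signalled that a solution was found."""
    for state in subspace:
        if found.is_set():
            return None            # another task already succeeded
        if state == target:
            found.set()            # tell the other tasks to stop
            return state
    return None

space = list(range(1000))                  # hypothetical state space
parts = [space[i::4] for i in range(4)]    # 4 disjoint sub-spaces
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda s: search(s, 762), parts))
solution = next(r for r in results if r is not None)
```

Exactly one sub-space contains the target, so exactly one task returns it; the others either finish their scan or exit early via the flag.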
91. Exploratory vs. data
decomposition (7/10)
■ The tasks induced by data decomposition are performed
in their entirety, and each task performs useful
computations towards the solution of the problem.
■ In exploratory decomposition, unfinished tasks can be
terminated as soon as an overall solution is found.
■ The portion of the search space searched (and the aggregate amount
of work performed) by a parallel formulation can be different
from that searched by a serial algorithm.
■ The work performed by the parallel formulation can be either
smaller or greater than that performed by the serial algorithm.
92. Exploratory Decomposition:
Example-2 (1/3) (8/10)
■ Example: consider a search space that has been partitioned
into four concurrent tasks as shown in Fig.
■ If the solution lies right at the beginning of the search
space corresponding to task 3 (Fig. (a)), then it will be found
almost immediately by the parallel formulation.
■ The serial algorithm would have found the solution only
after searching through the entire space preceding it,
performing far more work.
93. Exploratory Decomposition:
Example-2 (2/3) (9/10)
94. Exploratory Decomposition:
Example-2 (3/3) (10/10)
■ On the other hand, if the solution lies towards the
end of the search space corresponding to task 1
(Fig (b)), then the parallel formulation will
perform almost four times the work of the serial
algorithm and will yield no speedup.
■ This effect can result in super- or sub-linear
speedups.
95. 3.2.4 Speculative Decomposition
(1/9)
■ In some applications, dependencies between tasks are not
known a-priori.
■ For such applications, it is impossible to identify
independent tasks.
■ An example: a program may take one of many possible
computationally significant branches depending on the
output of other computations that precede it.
96. Speculative Decomposition
(2/9)
■ There are generally two approaches to dealing with
such applications: conservative approaches, which
identify independent tasks only when they are
guaranteed to not have dependencies, and optimistic
approaches, which schedule tasks even when they may
potentially be erroneous.
■ Conservative approaches may yield little concurrency,
and optimistic approaches may require roll-back.
97. Speculative Decomposition
(3/9)
■ This scenario is similar to evaluating one or more of the branches
of a switch statement in C in parallel before the input for the
switch is available.
■ While one task is performing the computation that will eventually resolve
the switch, other tasks could pick up the multiple branches of the switch in
parallel.
■ When the input for the switch has finally been computed, the computation
corresponding to the correct branch is used, while that
corresponding to the other branches is discarded.
■ The parallel run time is smaller than the serial run time by the amount of
time it takes to execute the correct branch, since that work is
overlapped with the computation that resolves the switch.
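A minimal sketch of this speculative switch, using threads (all names and the two branches are illustrative assumptions): every branch is started before the selector finishes, and only the result matching the selector's output is kept.

```python
from concurrent.futures import ThreadPoolExecutor

def slow_selector():
    """Stands in for the computation whose output resolves the switch."""
    return "case_b"

branches = {                     # bodies of the switch's branches
    "case_a": lambda: "result A",
    "case_b": lambda: "result B",
}

with ThreadPoolExecutor() as pool:
    # Speculatively start every branch before the selector has finished.
    speculative = {name: pool.submit(body) for name, body in branches.items()}
    choice = pool.submit(slow_selector).result()

# Keep the correct branch's result; the other branches' work is discarded.
result = speculative[choice].result()
```

This is the optimistic formulation: some computation (here, `case_a`) is guaranteed to be wasted, in exchange for overlapping the chosen branch with the selector.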
98. Speculative Decomposition
■ This parallel formulation of a switch guarantees at least some
wasteful computation.
■ In order to minimize the wasted computation, a slightly
different formulation of speculative decomposition could be
used, especially in situations where one of the outcomes of
the switch is more likely than the others.
■ In this case, only the most promising branch is taken up by a task in
parallel with the preceding computation.
■ In case the outcome of the switch is different from what was
anticipated, the computation is rolled back and the correct branch of
the switch is taken.
99. Speculative Decomposition:
Example-1
■ A classic example of speculative decomposition is
in discrete event simulation.
■ The central data structure in a discrete event
simulation is a time-ordered event list.
■ Events are extracted precisely in time order,
processed, and if required, resulting events are
inserted back into the event list.
100. Speculative Decomposition:
Example-1
■ Consider your day today as a discrete event system: you
get up, get ready, drive to work, work, eat lunch, work
some more, drive back, eat dinner, and sleep.
■ Each of these events may be processed independently;
however, in driving to work, you might meet with an
unfortunate accident and not get to work at all.
■ In that case, an optimistic scheduling of the other events will
have to be rolled back.
101. Speculative Decomposition:
Example-2
■ Another example is the simulation of a network
of nodes (for instance, an assembly line or a
computer network through which packets pass).
■ The task is to simulate the behavior of this
network for various inputs and node delay
parameters (note that networks may become
unstable for certain values of service rates,
queue sizes, etc.).
103. Speculative Decomposition:
Example-2
■ The problem of simulating a sequence of input jobs on the
network described in this example appears inherently
sequential, because the input of a typical component is the
output of another.
■ However, we can define speculative tasks that start simulating
a subpart of the network, each assuming one of several possible
inputs to that stage.
■ When an actual input to a certain stage becomes available (as a
result of the completion of another selector task from a previous
stage):
■ then all or part of the work required to simulate this input would have
already been finished if the speculation was correct,
■ or the simulation of this stage is restarted with the most recent correct
input if the speculation was incorrect.
104. Speculative vs. exploratory
decomposition
■ In exploratory decomposition:
■ the output of the multiple tasks originating at a branch is unknown.
■ the serial algorithm may explore different alternatives one after the other,
because the branch that may lead to the solution is not known beforehand.
■ => the parallel program may perform more, less, or the same amount of
aggregate work compared to the serial algorithm, depending on the location of
the solution in the search space.
■ In speculative decomposition:
■ the input at a branch leading to multiple parallel tasks is unknown.
■ the serial algorithm would strictly perform only one of the tasks at a
speculative stage, because when it reaches the beginning of that stage, it
knows exactly which branch to take.
■ => a parallel program employing speculative decomposition performs more
aggregate work than its serial counterpart.
105. 3.2.5 Hybrid Decompositions
■ Decomposition techniques are not exclusive, and can
often be combined together.
■ Often, a computation is structured into multiple
stages and it is sometimes necessary to apply
different types of decomposition in different stages.
106. Hybrid Decompositions
Example 1: Finding the minimum.
■ While finding the minimum of a large set of n numbers,
■ a purely recursive decomposition may result in far
more tasks than the number of processes, P, available.
■ An efficient decomposition would partition the input
into P roughly equal parts and have each task
compute the minimum of the sequence assigned to it.
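A minimal sketch of this hybrid scheme (the data and P = 4 are illustrative): the input is data-partitioned into P parts, each task computes a local minimum, and the P partial results are then combined pairwise in a small recursive tree.

```python
from concurrent.futures import ThreadPoolExecutor

def hybrid_min(data, p=4):
    """Hybrid decomposition: data-partition the input into p parts,
    have each task compute a local minimum, then combine the p
    partial minima with a pairwise (recursive-style) reduction."""
    parts = [data[i::p] for i in range(p)]          # p roughly equal parts
    with ThreadPoolExecutor(max_workers=p) as pool:
        partial = list(pool.map(min, parts))        # one task per part
    # pairwise combination tree over the p partial results
    while len(partial) > 1:
        partial = [min(partial[i:i + 2]) for i in range(0, len(partial), 2)]
    return partial[0]

m = hybrid_min([13, 7, 42, 3, 99, 8, 21, 5])
```

The data-decomposition phase keeps the task count at P instead of O(n), while the recursive combination keeps the final reduction logarithmic in P.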
107. Hybrid Decompositions: Finding the
minimum.
108. Hybrid Decompositions:
Example 2: Quicksort in parallel.
■ Recursive decomposition has been used in quicksort.
■ This formulation results in O(n) tasks for the problem of sorting a
sequence of size n.
■ But due to the dependencies among these tasks and due to uneven
sizes of the tasks, the effective concurrency is quite limited.
■ For example, the first task, splitting the input list into two parts,
takes O(n) time, which puts an upper limit on the performance
gain possible via parallelization.
■ The step of splitting lists performed by tasks in parallel
quicksort can also be decomposed using the input
decomposition technique.
■ The resulting hybrid decomposition, combining recursive
decomposition with input data decomposition, exposes far more
concurrency.
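The recursive-decomposition half of this scheme can be sketched as follows (a simplified illustration, not the full hybrid formulation: the splitting step here is still serial, and `depth` caps how many levels spawn independent tasks):

```python
from concurrent.futures import ThreadPoolExecutor

def psort(data, pool, depth=2):
    """Quicksort with recursive decomposition: the two sub-lists produced
    by a split are sorted as independent tasks, up to a fixed depth."""
    if len(data) <= 1:
        return data
    pivot, rest = data[0], data[1:]
    lo = [x for x in rest if x < pivot]      # the O(n) splitting step
    hi = [x for x in rest if x >= pivot]
    if depth == 0:                           # below the cutoff: stay serial
        return psort(lo, pool, 0) + [pivot] + psort(hi, pool, 0)
    left = pool.submit(psort, lo, pool, depth - 1)   # independent sub-task
    right = psort(hi, pool, depth - 1)
    return left.result() + [pivot] + right

data = [5, 3, 8, 1, 9, 2, 7, 4, 6, 0]
with ThreadPoolExecutor(max_workers=4) as pool:
    out = psort(data, pool)
```

The first split still costs O(n) serially, which is exactly the bottleneck the slides note; the full hybrid decomposition would additionally parallelize that splitting step over partitions of the input.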
113. Input Data Decomposition
• Generally applicable if each output can be naturally computed
as a function of the input.
• In many cases, this is the only natural decomposition because
the output is not clearly known a-priori (e.g., the problem of
finding the minimum in a list, sorting a given list, etc.).
• A task is associated with each input data partition. The task
performs as much of the computation with its part of the data as
possible. Subsequent processing combines these partial results.
114. Exploratory Decomposition
• In many cases, the decomposition of the problem
goes hand-in-hand with its execution.
• These problems typically involve the exploration
(search) of a state space of solutions.
• Problems in this class include a variety of discrete
optimization problems (0/1 integer programming,
QAP, etc.), theorem proving, game playing, etc.
117. Speculative Decomposition: Example
Another example is the simulation of a network of nodes
(for instance, an assembly line or a computer network
through which packets pass). The task is to simulate the
behavior of this network for various inputs and node
delay parameters (note that networks may become
unstable for certain values of service rates, queue sizes,
etc.).
118. Hybrid Decompositions
Often, a mix of decomposition techniques is necessary for
decomposing a problem. Consider the following examples:
• In quicksort, recursive decomposition alone limits concurrency (Why?). A
mix of data and recursive decompositions is more desirable.
• In discrete event simulation, there might be concurrency in task processing.
A mix of speculative decomposition and data decomposition may work well.
• Even for simple problems like finding a minimum of a list of numbers, a mix
of data and recursive decomposition works well.
120. Characteristics of Tasks and Interactions
• We shall discuss the various properties of tasks and inter-
task interactions that affect the choice of a good mapping.
• Characteristics of Tasks
• Task Generation
• Task Sizes
• Knowledge of Task Sizes
• Size of Data Associated with Tasks
148. Task Generation
• Static task generation: Concurrent tasks can be
identified a-priori. Typical matrix operations, graph
algorithms, image processing applications, and other
regularly structured problems fall in this class. These can
typically be decomposed using data or recursive
decomposition techniques.
• Dynamic task generation: Tasks are generated as we
perform computation. A classic example of this is in
game playing - each 15 puzzle board is generated from
the previous one. These applications are typically
decomposed using exploratory or speculative
decompositions.
152. Task Sizes
• Task sizes may be uniform (i.e., all tasks are the same
size) or non-uniform.
• Non-uniform task sizes may be such that they can be
determined (or estimated) a-priori or not.
• Examples in this class include discrete optimization
problems, in which it is difficult to estimate the effective
size of a state space.
153. Size of Data Associated with Tasks
• The size of data associated with a task may be small or
large when viewed in the context of the size of the task.
• A small context of a task implies that an algorithm can
easily communicate this task to other processes
dynamically (e.g., the 15 puzzle).
• A large context ties the task to a process, or alternately,
an algorithm may attempt to reconstruct the context at
another process as opposed to communicating the
context of the task (e.g., 0/1 integer programming).
154. Characteristics of Task Interactions
• Tasks may communicate with each other in various
ways. The associated dichotomy is:
• Static interactions: The tasks and their interactions are
known a-priori. These are relatively simpler to code into
programs.
• Dynamic interactions: The timing of interactions or the sets
of interacting tasks cannot be determined a-priori. These
interactions are harder to code, especially, as we shall see,
using message passing APIs.
155. Characteristics of Task Interactions
• Regular interactions: There is a definite pattern (in the
graph sense) to the interactions. These patterns can be
exploited for efficient implementation.
• Irregular interactions: Interactions lack well-defined
topologies.
156. Characteristics of Task Interactions: Example
A simple example of a regular static interaction pattern is in image
dithering. The underlying communication pattern is a structured (2-D
mesh) one as shown here:
157. Characteristics of Task Interactions: Example
The multiplication of a sparse matrix with a vector is a good example
of a static irregular interaction pattern. Here is an example of a
sparse matrix and its associated interaction pattern.
158. Characteristics of Task Interactions
• Interactions may be read-only or read-write.
• In read-only interactions, tasks just read data items
associated with other tasks.
• In read-write interactions, tasks read as well as modify
data items associated with other tasks.
• In general, read-write interactions are harder to code,
since they require additional synchronization primitives.
159. Characteristics of Task Interactions
• Interactions may be one-way or two-way.
• A one-way interaction can be initiated and accomplished
by one of the two interacting tasks.
• A two-way interaction requires participation from both
tasks involved in an interaction.
• One-way interactions are somewhat harder to code in
message passing APIs.
160. Mapping Techniques
• Once a problem has been decomposed into concurrent
tasks, these must be mapped to processes (that can be
executed on a parallel platform).
• Mappings must minimize overheads.
• Primary overheads are communication and idling.
• Minimizing these overheads often represents
contradicting objectives.
• Assigning all work to one processor trivially minimizes
communication at the expense of significant idling.
161. Mapping Techniques for Minimum Idling
Mapping must simultaneously minimize idling and balance load.
Merely balancing load does not minimize idling.
162. Mapping Techniques for Minimum Idling
Mapping techniques can be static or dynamic.
• Static Mapping: Tasks are mapped to processes a-priori.
For this to work, we must have a good estimate of the
size of each task. Even in these cases, the problem may
be NP-complete.
• Dynamic Mapping: Tasks are mapped to processes at
runtime. This may be because the tasks are generated
at runtime, or that their sizes are not known.
163. Schemes for Static Mapping
• Mappings based on data partitioning.
• Mappings based on task graph partitioning.
• Hybrid mappings.
164. Mappings Based on Data Partitioning
We can combine data partitioning with the "owner-computes" rule to
partition the computation into subtasks. The simplest data
decomposition schemes for dense matrices are 1-D block
distribution schemes.
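A 1-D block distribution can be sketched as follows. This is an illustrative example with assumed names: `n` elements are split into `p` contiguous blocks, and the owner-computes rule assigns each element's computation to the process that holds it.

```python
# Sketch of a 1-D block distribution: n elements over p processes,
# each process owning one contiguous block of ceil(n / p) elements.
def block_owner(i, n, p):
    """Process that owns element i under a 1-D block distribution."""
    block = -(-n // p)          # ceil(n / p) elements per block
    return i // block

# 10 elements over 3 processes: blocks of size 4, 4, 2.
print([block_owner(i, 10, 3) for i in range(10)])
# [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
```

Under owner-computes, process `block_owner(i, n, p)` performs all computation that writes element `i`.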
165. Block Array Distribution Schemes
Block distribution schemes can be generalized to
higher dimensions as well.
166. Block Array Distribution Schemes: Examples
• For multiplying two dense matrices A and B, we can
partition the output matrix C using a block
decomposition.
• For load balance, we give each task the same number of
elements of C. (Note that each element of C
corresponds to a single dot product.)
• The choice of precise decomposition (1-D or 2-D) is
determined by the associated communication overhead.
• In general, a higher-dimensional decomposition allows the
use of a larger number of processes.
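The 2-D block decomposition of the output matrix C can be sketched like this (illustrative names, and for simplicity the grid dimension is assumed to divide the matrix dimension): each task on a q x q process grid owns one equal-sized tile of C, and each element of its tile is a single dot product of a row of A with a column of B.

```python
# Sketch of a 2-D block decomposition of the n x n output matrix C
# across a q x q process grid: C[i][j] belongs to one tile.
def tile_owner(i, j, n, q):
    """Grid coordinates of the process owning C[i][j] (assumes q divides n)."""
    tile = n // q               # side length of each square tile
    return (i // tile, j // tile)

# 6x6 matrix on a 3x3 process grid: 2x2 tiles.
print(tile_owner(0, 5, 6, 3))  # (0, 2)
print(tile_owner(4, 4, 6, 3))  # (2, 2)
```

With q^2 equal tiles, every process computes the same number of dot products, which gives the load balance mentioned above.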
168. Cyclic and Block Cyclic Distributions
• If the amount of computation associated with data items
varies, a block decomposition may lead to significant
load imbalances.
• A simple example of this is in LU decomposition (or
Gaussian Elimination) of dense matrices.
169. LU Factorization of a Dense Matrix
A decomposition of LU factorization into 14 tasks - notice the
significant load imbalance.
170. Block Cyclic Distributions
• Variation of the block distribution scheme that can be used to
alleviate the load-imbalance and idling problems.
• Partition an array into many more blocks than the number of
available processes.
• Blocks are assigned to processes in a round-robin manner so that
each process gets several non-adjacent blocks.
171. Block-Cyclic Distribution for Gaussian
Elimination
The active part of the matrix in Gaussian Elimination changes.
By assigning blocks in a block-cyclic fashion, each processor
receives blocks from different parts of the matrix.
173. Block-Cyclic Distribution
• A cyclic distribution is a special case in which block size is one.
• A block distribution is a special case in which block size is n/p ,
where n is the dimension of the matrix and p is the number of
processes.
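The round-robin assignment, and both special cases from the slide, can be sketched in a few lines (illustrative names assumed):

```python
# Sketch of block-cyclic assignment: rows are grouped into blocks of
# size b, and block k goes to process k mod p, so each process receives
# several non-adjacent blocks from different parts of the matrix.
def block_cyclic_owner(i, b, p):
    """Process owning row i under a block-cyclic distribution, block size b."""
    return (i // b) % p

print([block_cyclic_owner(i, b=2, p=3) for i in range(12)])
# [0, 0, 1, 1, 2, 2, 0, 0, 1, 1, 2, 2]

# b = 1 gives a cyclic distribution:
print([block_cyclic_owner(i, b=1, p=3) for i in range(6)])   # [0, 1, 2, 0, 1, 2]

# b = n // p gives a plain block distribution (n = 12, p = 3):
print([block_cyclic_owner(i, b=4, p=3) for i in range(12)])
# [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
```

Because every process holds blocks scattered across the matrix, it keeps receiving work even as the active part of the matrix shrinks during Gaussian Elimination.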
174. Graph Partitioning Based Data Decomposition
• In case of sparse matrices, block decompositions are
more complex.
• Consider the problem of multiplying a sparse matrix with
a vector.
• The graph of the matrix is a useful indicator of the work
(number of nodes) and communication (the degree of
each node).
• In this case, we would like to partition the graph so as to
assign an equal number of nodes to each process, while
minimizing the edge-cut of the graph partition.
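The quality measure being minimized can be made concrete with a small sketch (names illustrative): given the graph as an adjacency list and a partition mapping each node to a process, count the edge-cut, i.e. the edges whose endpoints fall on different processes and therefore require communication.

```python
# Sketch: compute the edge-cut of a graph partition. Each cut edge
# corresponds to an interaction between two different processes.
def edge_cut(adj, part):
    cut = 0
    for u, neighbours in adj.items():
        for v in neighbours:
            if u < v and part[u] != part[v]:  # count each undirected edge once
                cut += 1
    return cut

# A 4-node path 0-1-2-3 partitioned two different ways:
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(edge_cut(adj, {0: 0, 1: 0, 2: 1, 3: 1}))  # 1 (only edge 1-2 is cut)
print(edge_cut(adj, {0: 0, 1: 1, 2: 0, 3: 1}))  # 3 (every edge is cut)
```

Both partitions above are perfectly load balanced (two nodes each), yet their communication costs differ by a factor of three, which is exactly why minimum edge-cut partitioning matters.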
175. Partitioning the Graph of Lake Superior
Random Partitioning
Partitioning for minimum edge-cut.
176. Mappings Based on Task Partitioning
• Partitioning a given task-dependency graph across
processes.
• Determining an optimal mapping for a general task-
dependency graph is an NP-complete problem.
• Excellent heuristics exist for structured graphs.
177. Task Partitioning: Mapping a Binary Tree
Dependency Graph
Example illustrates the dependency graph of one view of quick-sort
and how it can be assigned to processes in a hypercube.
178. Task Partitioning: Mapping a Sparse Graph
Sparse graph for computing a sparse matrix-vector product and
its mapping.
179. Hierarchical Mappings
• Sometimes a single mapping technique is inadequate.
• For example, the task mapping of the binary tree
(quicksort) cannot use a large number of processors.
• For this reason, task mapping can be used at the top
level and data partitioning within each level.
180. An example of task partitioning at top level with data
partitioning at the lower level.
181. Schemes for Dynamic Mapping
• Dynamic mapping is sometimes also referred to as
dynamic load balancing, since load balancing is the
primary motivation for dynamic mapping.
• Dynamic mapping schemes can be centralized or
distributed.
182. Centralized Dynamic Mapping
• Processes are designated as masters or slaves.
• When a process runs out of work, it requests more work
from the master.
• When the number of processes increases, the master
may become the bottleneck.
• To alleviate this, a process may pick up a number of
tasks (a chunk) at one time. This is called Chunk
scheduling.
• Selecting large chunk sizes may lead to significant load
imbalances as well.
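The scheme above can be sketched with a shared queue standing in for the master (a simplified, illustrative model, not an actual master process): idle workers pull a chunk of tasks at a time rather than one, reducing how often they go back to the central queue.

```python
import queue
import threading

# Sketch of centralized dynamic mapping with chunk scheduling:
# the "master" is modeled by a shared task queue, and each worker
# grabs up to CHUNK tasks per request to cut down contention.
CHUNK = 4
tasks = queue.Queue()
for t in range(20):
    tasks.put(t)

done = []
done_lock = threading.Lock()

def worker():
    while True:
        chunk = []
        for _ in range(CHUNK):             # request a chunk of work
            try:
                chunk.append(tasks.get_nowait())
            except queue.Empty:
                break
        if not chunk:
            return                          # no work left: worker finishes
        results = [t * t for t in chunk]    # "compute" each task
        with done_lock:
            done.extend(results)

workers = [threading.Thread(target=worker) for _ in range(3)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(len(done))  # 20
```

With `CHUNK = 1` every task triggers a round trip to the central queue (the bottleneck the slide describes); with a very large chunk, one worker could grab most of the work and reintroduce load imbalance.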
183. Distributed Dynamic Mapping
• Each process can send or receive work from other
processes.
• This alleviates the bottleneck in centralized schemes.
• There are four critical questions: how are sending and
receiving processes paired together, who initiates work
transfer, how much work is transferred, and when is a
transfer triggered?
• Answers to these questions are generally application
specific. We will look at some of these techniques later in
this class.
184. Minimizing Interaction Overheads
• Maximize data locality: Where possible, reuse
intermediate data. Restructure computation so that data
can be reused in smaller time windows.
• Minimize volume of data exchange: There is a cost
associated with each word that is communicated. For
this reason, we must minimize the volume of data
communicated.
• Minimize frequency of interactions: There is a startup
cost associated with each interaction. Therefore, try to
merge multiple interactions to one, where possible.
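The effect of merging interactions follows directly from a standard linear cost model. The sketch below is illustrative; `startup` and `per_word` are assumed cost parameters, not values from the slides.

```python
# Sketch of the linear communication cost model: each message pays a
# fixed startup cost plus a per-word cost for its payload.
def comm_cost(n_messages, words_per_message, startup=10.0, per_word=1.0):
    return n_messages * (startup + words_per_message * per_word)

# Sending 100 words one at a time vs. as one merged message:
print(comm_cost(100, 1))   # 1100.0
print(comm_cost(1, 100))   # 110.0
```

The total data volume is identical in both cases; merging the 100 interactions into one removes 99 startup costs, a 10x saving under these assumed parameters.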
185. Minimizing Interaction Overheads (continued)
• Overlapping computations with interactions: Use non-
blocking communications, multithreading, and
prefetching to hide latencies.
• Replicating data or computations.
• Using group communications instead of point-to-point
primitives.
• Overlap interactions with other interactions.
186. Parallel Algorithm Models
An algorithm model is a way of structuring a parallel
algorithm by selecting a decomposition and mapping
technique and applying the appropriate strategy to
minimize interactions.
• Data Parallel Model: Tasks are statically (or semi-
statically) mapped to processes and each task performs
similar operations on different data.
• Task Graph Model: Starting from a task dependency
graph, the interrelationships among the tasks are utilized
to promote locality or to reduce interaction costs.
187. Parallel Algorithm Models (continued)
• Master-Slave Model: One or more processes generate
work and allocate it to worker processes. This allocation
may be static or dynamic.
• Pipeline / Producer-Consumer Model: A stream of data
is passed through a succession of processes, each of
which performs some task on it.
• Hybrid Models: A hybrid model may be composed either
of multiple models applied hierarchically or multiple
models applied sequentially to different phases of a
parallel algorithm.