Intel® MPI Library and OpenMP* - Intel Software Conference 2013

© 2013, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S.
and/or other countries. *Other names and brands may be claimed as the property of others.
MPI and OpenMP
Reducing effort for parallel software development
August, 2013
Werner Krotz-Vogel

© 2009 Mathew J. Sottile, Timothy G. Mattson, and Craig E
Objectives
• Design parallel applications from serial codes
• Determine appropriate decomposition strategies for applications
• Choose applicable parallel model for implementation
• MPI and OpenMP

Why MPI and OpenMP?
• Performance ~ √(Die Area) (Pollack’s rule)
- 4x the silicon die area gives 2x the performance in one core, but 4x the performance when dedicated to 4 cores
• Power ~ Voltage² (voltage is roughly proportional to clock frequency)
Conclusion (with respect to Pollack’s rule above)
- Multiple cores are a powerful handle to adjust “Performance/Watt”
→ Parallel Hardware
→ Parallel Software
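As a rough worked version of the numbers above, assuming Pollack's rule in its usual form (single-core performance grows with the square root of die area) and a workload that parallelizes perfectly across cores; the symbols $P$ and $A$ are introduced here for illustration only:

```latex
\begin{align*}
P_{\text{core}}(A) &\propto \sqrt{A} \\
\text{one core of area } 4A:\quad P &\propto \sqrt{4A} = 2\sqrt{A} \\
\text{four cores of area } A \text{ each}:\quad P &\propto 4\sqrt{A}
\end{align*}
```

Under these assumptions the same silicon budget yields twice the throughput when spent on four cores instead of one, provided the software can actually use all four.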

Parallel Programming: Algorithms
Distributed Versus Shared Memory
[Figure: shared memory (several CPUs attached to one memory over a bus) versus distributed memory (CPU/memory pairs connected by a network)]
Message Passing
• Multiple processes
• Share data with messages
• MPI*
Threads
• Single process
• Concurrent execution
• Shared memory and resources
• Explicit threads, OpenMP*

Designing Parallel Programs
• Partition
– Divide problem into tasks
• Communicate
– Determine amount and pattern of communication
• Agglomerate
– Combine tasks
• Map
– Assign agglomerated tasks to physical processors
[Figure: The Problem → Initial tasks → Communication → Combined Tasks → Final Program]

1. Partitioning
• Discover as much parallelism as possible
  • Independent computations and/or data
  • Maximize number of primitive tasks
• Functional decomposition
  • Divide the computation, then associate the data
• Domain decomposition
  • Divide the data into pieces, then associate computation

Decomposition Methods
• Functional decomposition
– Focusing on computations can reveal structure in a problem
[Figure: functional decomposition into Atmosphere Model, Ocean Model, Land Surface Model, and Hydrology Model]
• Domain decomposition
– Focus on largest or most frequently accessed data structure
– Data parallelism: same operation(s) applied to all data
[Figure: computational grid] Grid reprinted with permission of Dr. Phu V. Luong, Coastal and Hydraulics Laboratory, Engineer Research and Development Center (ERDC).

2. Communication
• Determine the communication pattern between primitive tasks
  • What data need to be shared?
• Point-to-point
  • One thread to another
• Collective
  • Groups of threads sharing data
• Execution order dependencies are communication

3. Agglomeration
• Group primitive tasks in order to:
• Improve performance/granularity
– Localize communication
  • Put tasks that communicate in the same group
– Maintain scalability of design
  • Gracefully handle changes in data set size or number of processors
– Simplify programming and maintenance

4. Mapping
•Assign tasks to processors in order to:
– Maximize processor utilization
– Minimize inter-processor communication
•One task or multiple tasks per processor?
•Static or dynamic assignment?
•Most applicable to message passing
– Programmer can map tasks to threads

What Is Not Parallel
• Subprograms with “state” or with side effects
– Pseudo-random number generators
– File I/O routines
– Output on screen
• Loops with data dependencies
– Variables written in one iteration and read in another
– Quick test: Reverse loop iterations
Loop carried – Value carried from one iteration to the next
Induction variables – Incremented each trip through loop
Reductions – Summation; collapse array to single value
Recurrence – Feed information forward
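The "reverse loop iterations" quick test can be tried directly. The following is a minimal sketch in plain C, with hypothetical function names chosen for illustration: an element-wise addition gives the same result in either direction, while a running sum (a loop-carried dependence) does not.

```c
#include <stddef.h>

/* Independent iterations: c[i] depends only on a[i] and b[i]. */
void add_forward(const int *a, const int *b, int *c, size_t n) {
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Same loop with the iteration order reversed. */
void add_reversed(const int *a, const int *b, int *c, size_t n) {
    for (size_t i = n; i-- > 0; )
        c[i] = a[i] + b[i];
}

/* Loop-carried dependence: x[i] reads x[i-1], written one iteration earlier. */
void scan_forward(int *x, size_t n) {
    for (size_t i = 1; i < n; i++)
        x[i] += x[i - 1];
}

/* Reversed order reads x[i-1] before it has been updated. */
void scan_reversed(int *x, size_t n) {
    for (size_t i = n; i-- > 1; )
        x[i] += x[i - 1];
}
```

If reversing the loop changes the result, as it does for the running sum, the iterations are not independent and the loop cannot be parallelized as written.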

Introduction to MPI
What is MPI?
[Figure: Node 0, Node 1, ..., Node n, each with a CPU and its own private memory]

The Distributed-Memory Model
• Characteristics of distributed memory machines
• No common address space
• High-latency interconnection network
• Explicit message exchange

Message Passing Interface (MPI)
• Depending on the interconnection network, clusters exhibit different interfaces to the network, e.g.
• Ethernet: UNIX sockets
• InfiniBand: OFED, Verbs
•MPI provides an abstraction to these interfaces
• Generic communication interface
• Logical ranks (no physical addresses)
• Supportive functions (e.g. parallel file I/O)

“Hello World” in Fortran
      program hello
      include 'mpif.h'
      integer mpierr, rank, procs
      call MPI_Init(mpierr)
      call MPI_Comm_size(MPI_COMM_WORLD, procs, mpierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, mpierr)
      write (*,*) 'Hello world from ', rank, 'of', procs
      call MPI_Finalize(mpierr)
      end program hello

Compilation and Execution
• MPI implementations ship with a compiler wrapper:
  • mpiicc -o helloc hello.c
  • mpiifort -o hellof hello.f
• Wrapper correctly calls native C/Fortran compiler and passes along MPI specifics (e.g. library)
• Wrappers usually accept the same compiler options as the underlying native compiler, e.g.
  • mpiicc -O2 -fast -o module.o -c module.c

Compilation and Execution
• To run the “Hello World”, use:
  • mpirun -np 8 helloc
• mpirun provides portable, transparent application start-up
  • connect to the cluster nodes for execution
  • launch processes on the nodes
  • pass along information on how to reach the other processes
• When mpirun returns, execution has completed.
• Note: mpirun is implementation-specific

Output of “Hello World”
• Hello world from 0 of 8
No particular ordering of process execution! If needed, the programmer must ensure ordering by explicit communication.

Sending Messages (Blocking)
      subroutine master(array, length)
      include 'mpif.h'
      double precision array(*)
      integer length
      double precision sum, globalsum
      integer rank, procs, mpierr, size
      call MPI_Comm_size(MPI_COMM_WORLD, procs, mpierr)
      size = length / procs
      do rank = 1,procs-1
        call MPI_Send(size, 1, MPI_INTEGER, rank, 0,
     &       MPI_COMM_WORLD, mpierr)
        call MPI_Send(array(rank*size+1:rank*size+size), size,
     &       MPI_DOUBLE_PRECISION, rank, 1, MPI_COMM_WORLD, mpierr)
      enddo
The example is correct only if length is a multiple of procs.
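One common way to lift the "multiple of procs" restriction is to give the first `length % procs` ranks one extra element. The sketch below is plain C with no MPI calls; the helper name `block_partition` is hypothetical, and the `counts`/`displs` arrays are merely the kind of per-rank sizes and offsets one could then hand to the send calls or to a routine such as MPI_Scatterv.

```c
/* Hypothetical helper: split `length` elements over `procs` ranks.
 * counts[r] = number of elements owned by rank r
 * displs[r] = index of the first element owned by rank r
 * The first (length % procs) ranks receive one extra element. */
void block_partition(int length, int procs, int *counts, int *displs) {
    int base = length / procs;    /* elements every rank gets */
    int extra = length % procs;   /* leftover elements to spread out */
    int offset = 0;
    for (int r = 0; r < procs; r++) {
        counts[r] = base + (r < extra ? 1 : 0);
        displs[r] = offset;
        offset += counts[r];
    }
}
```

For example, 10 elements over 4 ranks give counts 3, 3, 2, 2 and offsets 0, 3, 6, 8, so no element is dropped.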

MPI_Send
• int MPI_Send(void* buf, int count, MPI_Datatype dtype, int dest, int tag, MPI_Comm comm)
• MPI_SEND(BUF, COUNT, DTYPE, DEST, TAG, COMM, IERR)
  <type> BUF(*)
  INTEGER COUNT, DTYPE, DEST, TAG, COMM, IERR
• Blocking message delivery
  • blocks until the send buffer can safely be reused; depending on the implementation and message size, the message may have been buffered or already received at that point
  • may therefore synchronize sender and receiver, but this is not guaranteed

MPI_Send
buf           Pointer to message data (e.g. pointer to first element of an array)
count         Length of the message in elements
dtype         Data type of the message content (size of data type x count = message size)
dest          Rank of the destination process
tag           “Type” of the message
comm          Handle to the communication group
ierr          Fortran: OUT argument for error code
return value  C/C++: error code

MPI Data Types for C
MPI provides predefined data types that must be specified when passing messages.
MPI Data Type        C Data Type
MPI_BYTE             (none)
MPI_CHAR             char
MPI_DOUBLE           double
MPI_FLOAT            float
MPI_INT              int
MPI_LONG             long
MPI_LONG_DOUBLE      long double
MPI_PACKED           (none)
MPI_SHORT            short
MPI_UNSIGNED_SHORT   unsigned short
MPI_UNSIGNED         unsigned int
MPI_UNSIGNED_LONG    unsigned long
MPI_UNSIGNED_CHAR    unsigned char

Communication Wildcards
• MPI defines a set of wildcards to be specified with communication primitives:
MPI_ANY_SOURCE  Matches any logical rank when receiving a message with MPI_Recv (message status contains actual sender)
MPI_ANY_TAG     Matches any message tag when receiving a message (message status contains actual tag)
MPI_PROC_NULL   Special value indicating non-existent process rank (messages are not delivered or received for this special rank)

Blocking Communication
• MPI_Send and MPI_Recv are blocking operations
[Figure: timelines of Process A (MPI_Send) and Process B (MPI_Recv), distinguishing computation from communication]

Non-blocking Communication
• MPI_Isend and MPI_Irecv are non-blocking operations; each is completed by a matching MPI_Wait
[Figure: timelines of Process A (MPI_Isend, MPI_Wait) and Process B (MPI_Irecv, MPI_Wait), distinguishing computation from communication]

‘Collectives’, e.g. MPI_Reduce
• int MPI_Reduce(void* sendbuf, void* recvbuf, int count, MPI_Datatype dtype, MPI_Op op, int root, MPI_Comm comm)
• MPI_REDUCE(SENDBUF, RECVBUF, COUNT, DTYPE, OP, ROOT, COMM, IERR)
  <type> SENDBUF(*), RECVBUF(*)
  INTEGER COUNT, DTYPE, OP, ROOT, COMM, IERR
• Global operation that accumulates data from all processes into a global result at the root process.
  • All processes have to reach the same MPI_Reduce invocation.
  • Otherwise deadlocks and undefined behavior may occur.

MPI_Reduce – Operators
MPI_MAX               maximum
MPI_MIN               minimum
MPI_SUM               sum
MPI_PROD              product
MPI_LAND / MPI_BAND   logical and / bit-wise and
MPI_LOR / MPI_BOR     logical or / bit-wise or
MPI_LXOR / MPI_BXOR   logical excl. or / bit-wise excl. or
MPI_MAXLOC            max value and location
MPI_MINLOC            min value and location

MPI_Barrier
• int MPI_Barrier(MPI_Comm comm)
• MPI_BARRIER(COMM, IERROR)
  INTEGER COMM, IERROR
• Global operation that synchronizes all participating processes.
  • All processes have to reach an MPI_Barrier invocation.
  • Otherwise deadlocks and undefined behavior may occur.

Stencil Computation example
• Some algorithms (e.g. Jacobi, Gauss-Seidel) process data with a stencil:
  • grid(i,j) = 0.25 * (grid(i+1,j) + grid(i-1,j) + grid(i,j+1) + grid(i,j-1))
• Data access pattern: the update of (i,j) reads its four neighbors (i-1,j), (i+1,j), (i,j-1), and (i,j+1)
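The four-point stencil can be sketched in C as a single Jacobi sweep. This is an illustration under stated assumptions, not code from the slides: the function name `jacobi_sweep`, the row-major layout, and the use of two separate buffers (the Jacobi variant, rather than the in-place update written above) are all choices made here. The OpenMP pragma is optional; without OpenMP support the loop simply runs serially.

```c
/* One Jacobi sweep over the interior of an n x n grid stored row-major.
 * Reads from `in` and writes to `out`, so all updates use old values.
 * Boundary cells are copied unchanged. */
void jacobi_sweep(const double *in, double *out, int n) {
    for (int k = 0; k < n * n; k++)
        out[k] = in[k];                       /* preserves the boundary */

    #pragma omp parallel for
    for (int i = 1; i < n - 1; i++) {
        for (int j = 1; j < n - 1; j++) {
            out[i * n + j] = 0.25 * (in[(i + 1) * n + j] + in[(i - 1) * n + j]
                                   + in[i * n + j + 1] + in[i * n + j - 1]);
        }
    }
}
```

In a distributed-memory version each process would typically own a block of the grid and exchange its edge rows or columns with neighboring ranks before every sweep, since the stencil reads one cell beyond the local block.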

MPI features not covered
• One-sided communication
– MPI_Put, MPI_Get
– Uses Remote Memory Access (RMA)
– Separates communication from synchronization
• User-defined datatypes, strided messages
• Dynamic process spawning: MPI_Comm_spawn
– Collective communication can be used across disjoint intra-communicators
• Parallel I/O
• MPI 3.0 (released Sept 21, 2012)

Introduction to OpenMP
What Is OpenMP?
• Portable, shared-memory threading API
– Fortran, C, and C++
– Multi-vendor support for both Linux and Windows
• Standardizes task & loop-level parallelism
• Supports coarse-grained parallelism
• Combines serial and parallel code in single source
• Standardizes ~ 20 years of compiler-directed threading experience
http://www.openmp.org
Current spec is OpenMP 4.0, July 31, 2013 (combined C/C++ and Fortran)

OpenMP Programming Model
Fork-Join Parallelism:
• Master thread spawns a team of threads as needed
• Parallelism is added incrementally: that is, the sequential program evolves into a parallel program
[Figure: master thread forking into teams of threads at parallel regions and joining afterwards]

A Few Syntax Details to Get Started
• Most of the constructs in OpenMP are compiler directives or pragmas
– For C and C++, the pragmas take the form:
#pragma omp construct [clause [clause]…]
– For Fortran, the directives take one of the forms:
C$OMP construct [clause [clause]…]
!$OMP construct [clause [clause]…]
*$OMP construct [clause [clause]…]
• Header file or Fortran 90 module
#include "omp.h"
use omp_lib

Worksharing
• Worksharing is the general term used in OpenMP to describe distribution of work across threads.
• Three examples of worksharing in OpenMP are:
  • omp for construct
  • omp sections construct
  • omp task construct
• Each automatically divides work among threads

‘omp for’ Construct
• Threads are assigned an independent set of iterations
• Threads must wait at the end of the work-sharing construct
[Figure: iterations i = 1 to i = 12 divided among the threads between #pragma omp parallel / #pragma omp for and the implicit barrier]
// assume N=12
#pragma omp parallel
#pragma omp for
for (i = 1; i < N+1; i++)
    c[i] = a[i] + b[i];
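As a self-contained variant of the loop above, the following sketch wraps it in a function. The name `vector_add` is hypothetical, and the combined `parallel for` directive is used here in place of the separate `parallel` and `for` directives; the 1-based bounds follow the slide, so index 0 of each array is left unused.

```c
/* Element-wise c[i] = a[i] + b[i] for i = 1..n (index 0 is not touched).
 * Each thread receives an independent subset of the iterations, and all
 * threads wait at the implicit barrier at the end of the loop. */
void vector_add(const double *a, const double *b, double *c, int n) {
    #pragma omp parallel for
    for (int i = 1; i < n + 1; i++)
        c[i] = a[i] + b[i];
}
```

Built without OpenMP support, the pragma is ignored and the function runs serially with the same result, which is the "single source" property mentioned earlier.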

New Addition to OpenMP
Tasks
Main change for OpenMP 3.0
• Allows parallelization of irregular problems
  • unbounded loops
  • recursive algorithms
  • producer/consumer
Device Constructs
Main change for OpenMP 4.0
• Allow describing regions of code where data and/or computation should be moved to another computing device.

What are tasks?
• Tasks are independent units of work
• Threads are assigned to perform the work of each task
– Tasks may be deferred
– Tasks may be executed immediately
– The runtime system decides which of the above
• Tasks are composed of:
– code to execute
– data environment
– internal control variables (ICV)
[Figure: serial versus parallel execution of tasks]

Simple Task Example
// assume 8 threads
#pragma omp parallel              // a pool of 8 threads is created here
{
    #pragma omp single private(p) // one thread gets to execute the while loop
    {
        …
        while (p) {
            #pragma omp task      // the single “while loop” thread creates a task
            {                     // for each instance of processwork()
                processwork(p);
            }
            p = p->next;
        }
    }
}

Task Construct – Explicit Task View
– A team of threads is created at the omp parallel construct
– A single thread is chosen to execute the while loop – let’s call this thread “L”
– Thread L operates the while loop, creates tasks, and fetches next pointers
– Each time L crosses the omp task construct it generates a new task and has a thread assigned to it
– Each task is executed by one of the threads in the team
– All tasks complete at the barrier at the end of the parallel region’s single construct
#pragma omp parallel
{
    #pragma omp single
    { // block 1
        node * p = head;
        while (p) { // block 2
            #pragma omp task
            process(p);
            p = p->next; // block 3
        }
    }
}
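A compilable version of this pattern might look like the sketch below. The `node` layout, the `result` field, and the body of `process` are assumptions made purely for illustration; only the parallel/single/task structure comes from the slide.

```c
#include <stddef.h>

/* Hypothetical list node: `result` is filled in by process(). */
typedef struct node {
    int value;
    int result;
    struct node *next;
} node;

/* Hypothetical per-node work: square the stored value. */
static void process(node *p) {
    p->result = p->value * p->value;
}

/* One thread walks the list and creates a task per node; any thread
 * of the team may execute each task. */
void process_list(node *head) {
    #pragma omp parallel
    {
        #pragma omp single
        {
            for (node *p = head; p != NULL; p = p->next) {
                #pragma omp task firstprivate(p)
                process(p);
            }
        } /* implicit barrier: every task has completed past this point */
    }
}
```

The `firstprivate(p)` clause makes explicit that each task captures the pointer value current at task creation, so advancing `p` in the loop does not affect tasks already queued.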

OpenMP* Reduction Clause
• reduction (op : list)
• The variables in “list” must be shared in the
enclosing parallel region
• Inside parallel or work-sharing construct:
• A PRIVATE copy of each list variable is created
and initialized depending on the “op”
• These copies are updated locally by threads
• At end of construct, local copies are combined
through “op” into a single value and combined
with the value in the original SHARED variable

Reduction Example
• Local copy of sum for each thread
• All local copies of sum added together and
stored in “global” variable
#pragma omp parallel for reduction(+:sum)
for(i=0; i<N; i++) {
sum += a[i] * b[i];
}
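Wrapped in a function, the reduction loop above becomes a dot product. This is a minimal sketch; the function name `dot` and its signature are chosen here for illustration.

```c
/* Dot product of a and b (both of length n).
 * Each thread accumulates into its own private copy of `sum`, initialized
 * to 0 for the + operator; the copies are combined into the shared `sum`
 * at the end of the loop. */
double dot(const double *a, const double *b, int n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

Without the reduction clause, concurrent updates to the shared `sum` would race; with it, the only caveat is that floating-point results may differ slightly between runs because the order of the partial sums is not fixed.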

Why Hybrid Programming?
OpenMP/MPI
[Figure: runtime in seconds (logarithmic axis, 10 to 5120) versus number of nodes (1 to 128) for the configurations 1 PPN; 1 PPN / 2 TPP; 1 PPN / 4 TPP; 1 PPN / 8 TPP; 2 PPN; 2 PPN / 2 TPP; 2 PPN / 4 TPP; 4 PPN; 4 PPN / 2 TPP; 8 PPN. Annotation: 53% improvement over MPI]
PPN = processes per node
TPP = threads per process
Simulation of Free-Surface Flows, Finite Element CFD solver written in Fortran and C
Figure kindly provided by HPC group of the Center of Computing and Communication, RWTH Aachen, Germany

The Good, the Bad, and the Ugly
The Good
• OpenMP and MPI blend well with each other if certain rules are respected by programmers.
The Bad
• Programmers need to be aware of the issues of hybrid programming, e.g. using thread-safe libraries and MPI.
The Ugly
• What’s the best setting for PPN and TPP for a given machine?
MPI and OpenMP hybrid programs can greatly improve performance of parallel codes!
