© 2013, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S.
and/or other countries. *Other names and brands may be claimed as the property of others.
MPI and OpenMP
Reducing effort for parallel software development
August, 2013
Werner Krotz-Vogel
© 2009 Mathew J. Sottile, Timothy G. Mattson, and Craig E
Objectives
• Design parallel applications from serial codes
• Determine appropriate decomposition strategies for applications
• Choose an applicable parallel model for implementation
  – MPI and OpenMP
Why MPI and OpenMP?
• Performance ~ √(Die Area) (Pollack's rule)
  – 4x the silicon die area gives 2x the performance in one core, but 4x the performance when dedicated to 4 cores (for perfectly parallel work)
  – Power ~ Voltage² (voltage is roughly proportional to clock frequency)
• Conclusion (with respect to Pollack's rule above)
  – Multiple cores are a powerful lever for adjusting performance per watt
→ Parallel Hardware
→ Parallel Software
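The arithmetic behind these bullets, as a rough sketch. It assumes Pollack's rule (single-core performance grows with the square root of die area $A$), perfectly parallel work, and the usual CMOS dynamic-power model $P \propto C V^2 f$ with voltage $V$ roughly proportional to frequency $f$:

$$
\begin{aligned}
\text{one core on area } 4A &: \quad \text{perf} \propto \sqrt{4A} = 2\sqrt{A} \\
\text{four cores on area } A \text{ each} &: \quad \text{perf} \propto 4\sqrt{A} \\
P \propto C V^2 f,\; V \propto f &\;\Rightarrow\; P \propto f^3 \\
\text{two cores at } f/2 &: \quad \text{perf} \propto 2 \cdot \tfrac{1}{2} = 1, \qquad P \propto 2 \cdot \left(\tfrac{1}{2}\right)^3 = \tfrac{1}{4}
\end{aligned}
$$

Under these assumptions, two cores at half the clock rate match the throughput of one full-speed core (normalized to 1) at about a quarter of its dynamic power.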
Parallel Programming: Algorithms
Distributed Versus Shared Memory
Figure: shared memory (several CPUs attached to one memory over a bus) versus distributed memory (nodes with their own CPU and memory, connected by a network)
Message Passing
• Multiple processes
• Share data with messages
• MPI*
Threads
• Single process
• Concurrent execution
• Shared memory and resources
• Explicit threads, OpenMP*
Designing Parallel Programs
• Partition
  – Divide the problem into tasks
• Communicate
  – Determine the amount and pattern of communication
• Agglomerate
  – Combine tasks
• Map
  – Assign agglomerated tasks to physical processors
Figure: the problem → initial tasks → communication → combined tasks → final program
1. Partitioning
• Discover as much parallelism as possible
  – Independent computations and/or data
  – Maximize the number of primitive tasks
• Functional decomposition
  – Divide the computation, then associate the data
• Domain decomposition
  – Divide the data into pieces, then associate the computation
Decomposition Methods
• Functional decomposition
  – Focusing on computations can reveal structure in a problem
  – Figure: a climate model split into coupled components (Atmosphere Model, Ocean Model, Land Surface Model, Hydrology Model)
• Domain decomposition
  – Focus on the largest or most frequently accessed data structure
  – Data parallelism: the same operation(s) applied to all data
  – Figure: a computational grid. Grid reprinted with permission of Dr. Phu V. Luong, Coastal and Hydraulics Laboratory, Engineer Research and Development Center (ERDC).
2. Communication
• Determine the communication pattern between primitive tasks
  – What data need to be shared?
• Point-to-point
  – One task to another
• Collective
  – Groups of tasks sharing data
• Execution-order dependencies are communication, too
3. Agglomeration
• Group primitive tasks in order to:
  – Improve performance/granularity
  – Localize communication: put tasks that communicate in the same group
  – Maintain scalability of the design: gracefully handle changes in data set size or number of processors
  – Simplify programming and maintenance
4. Mapping
• Assign tasks to processors in order to:
  – Maximize processor utilization
  – Minimize inter-processor communication
• One task or multiple tasks per processor?
• Static or dynamic assignment?
• Most applicable to message passing
  – Programmer can map tasks to threads
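To make the static-assignment question concrete, here is a small C sketch comparing two common static mappings: block (consecutive tasks on the same processor) and cyclic (task t on processor t mod p). The cost model (task t costs t + 1 units, as in a triangular loop) and the helper names are hypothetical, chosen only to show how the mapping affects the load of the busiest processor.

```c
/* Hypothetical cost model: task t (0-based) costs t + 1 units. */
enum mapping { BLOCK, CYCLIC };

#define MAX_PROCS 64

/* Processor that owns task t under a static block or cyclic mapping. */
static int owner(int t, int ntasks, int nprocs, enum mapping m)
{
    int chunk = (ntasks + nprocs - 1) / nprocs;   /* ceil(ntasks / nprocs) */
    return m == BLOCK ? t / chunk : t % nprocs;
}

/* Work assigned to the busiest processor (lower means better balance).
   Returns -1 if nprocs is outside 1..MAX_PROCS. */
int max_load(int ntasks, int nprocs, enum mapping m)
{
    int load[MAX_PROCS] = {0};
    int worst = 0;

    if (nprocs < 1 || nprocs > MAX_PROCS)
        return -1;
    for (int t = 0; t < ntasks; t++)
        load[owner(t, ntasks, nprocs, m)] += t + 1;
    for (int p = 0; p < nprocs; p++)
        if (load[p] > worst)
            worst = load[p];
    return worst;
}
```

With 8 tasks on 2 processors the total work is 36 units: the block mapping leaves one processor with 26 units, while the cyclic mapping gives at most 20, closer to the ideal 18. When task costs are uneven and hard to predict, dynamic assignment is the usual alternative.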
What Is Not Parallel
• Subprograms with "state" or with side effects
  – Pseudo-random number generators
  – File I/O routines
  – Output on screen
• Loops with data dependencies
  – Variables written in one iteration and read in another
  – Quick test: reverse the loop iterations; if the results change, the iterations are not independent
Loop-carried: value carried from one iteration to the next
Induction variables: incremented on each trip through the loop
Reductions: summation; collapse an array to a single value
Recurrence: feed information forward
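The reverse-iteration quick test can be tried directly. In this C sketch (hypothetical function names), the element-wise loop gives the same result whether it runs forward or backward, because no iteration reads what another one writes; the running-sum recurrence does not, because iteration i reads x[i-1], which iteration i-1 writes.

```c
#include <stddef.h>

/* Independent iterations: c[i] depends only on a[i] and b[i]. */
void add_forward(const double *a, const double *b, double *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Same loop with the iteration order reversed (i = n-1 down to 0). */
void add_backward(const double *a, const double *b, double *c, size_t n)
{
    for (size_t i = n; i-- > 0; )
        c[i] = a[i] + b[i];
}

/* Recurrence: iteration i reads x[i-1], written by iteration i-1. */
void prefix_forward(double *x, size_t n)
{
    for (size_t i = 1; i < n; i++)
        x[i] += x[i - 1];
}

/* Reversed order: x[i-1] is read before it has been updated. */
void prefix_backward(double *x, size_t n)
{
    for (size_t i = n; i-- > 1; )
        x[i] += x[i - 1];
}
```

For x = {1, 2, 3, 4}, the forward loop yields {1, 3, 6, 10} and the reversed loop yields {1, 3, 5, 7}, so the recurrence fails the test and cannot be parallelized as written.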
Introduction to MPI
What is MPI ?
Figure: nodes 0, 1, …, n, each with its own CPU and private memory
The Distributed-Memory Model
• Characteristics of distributed-memory machines
  – No common address space
  – High-latency interconnection network
  – Explicit message exchange
Message Passing Interface (MPI)
• Depending on the interconnection network, clusters expose different network interfaces, e.g.
  – Ethernet: UNIX sockets
  – InfiniBand: OFED, Verbs
• MPI provides an abstraction over these interfaces
  – Generic communication interface
  – Logical ranks (no physical addresses)
  – Supporting functions (e.g. parallel file I/O)
“Hello World” in Fortran
      program hello
      include 'mpif.h'
      integer mpierr, rank, procs
      call MPI_Init(mpierr)
      call MPI_Comm_size(MPI_COMM_WORLD, procs, mpierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, mpierr)
      write (*,*) 'Hello world from ', rank, 'of', procs
      call MPI_Finalize(mpierr)
      end program hello
Compilation and Execution
• MPI implementations ship with compiler wrappers (the Intel MPI names are shown; other implementations typically provide mpicc and mpif90):
  – mpiicc -o helloc hello.c
  – mpiifort -o hellof hello.f
• The wrapper calls the native C/Fortran compiler and passes along the MPI specifics (e.g. include paths and libraries)
• Wrappers usually accept the same options as the underlying native compiler, e.g.
  – mpiicc -O2 -fast -o module.o -c module.c
Compilation and Execution
• To run the "Hello World" program with 8 processes, use:
  – mpirun -np 8 ./helloc
• mpirun provides portable, transparent application start-up; it
  – connects to the cluster nodes for execution
  – launches the processes on the nodes
  – passes along the information each process needs to reach the others
• When mpirun returns, execution has completed.
• Note: mpirun is implementation-specific; the MPI standard recommends mpiexec as the portable start-up command.
Output of “Hello World”
Hello world from 0 of 8
Hello world from 1 of 8
Hello world from 4 of 8
Hello world from 6 of 8
Hello world from 5 of 8
Hello world from 7 of 8
Hello world from 2 of 8
Hello world from 3 of 8
No particular ordering of process execution! If needed, the programmer must ensure ordering through explicit communication.
Sending Messages (Blocking)
      subroutine master(array, length)
      include 'mpif.h'
      double precision array(*)
      integer length
      double precision sum, globalsum
      integer rank, procs, mpierr, chunk
      call MPI_Comm_size(MPI_COMM_WORLD, procs, mpierr)
c     elements per process; block 0, array(1:chunk), stays on the master
      chunk = length / procs
      do rank = 1, procs-1
c        tell the worker how many elements follow (tag 0)
         call MPI_Send(chunk, 1, MPI_INTEGER, rank, 0,
     &                 MPI_COMM_WORLD, mpierr)
c        send the worker its contiguous block (tag 1)
         call MPI_Send(array(rank*chunk+1), chunk,
     &                 MPI_DOUBLE_PRECISION, rank, 1,
     &                 MPI_COMM_WORLD, mpierr)
      enddo
c     ...
This example is only correct if length is a multiple of procs.
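One common way to remove the multiple-of-procs restriction is to give the first length mod procs ranks one extra element each. The C sketch below (hypothetical helper `block_partition`, no MPI calls) computes each rank's element count and 0-based starting offset; the two arrays have the form that MPI_Scatterv takes as its sendcounts and displs arguments, so a single collective call could replace the hand-written send loop.

```c
/* Split `length` elements over `procs` ranks as evenly as possible.
   counts[r]: number of elements for rank r
   displs[r]: 0-based offset of rank r's first element
   The first length % procs ranks get one extra element. */
void block_partition(int length, int procs, int *counts, int *displs)
{
    int base = length / procs;     /* elements every rank gets          */
    int extra = length % procs;    /* leftovers, one each to ranks 0..  */
    int offset = 0;

    for (int r = 0; r < procs; r++) {
        counts[r] = base + (r < extra ? 1 : 0);
        displs[r] = offset;
        offset += counts[r];
    }
}
```

For length = 10 and procs = 4 this gives counts {3, 3, 2, 2} and displacements {0, 3, 6, 8}; when length is a multiple of procs it reduces to the equal blocks of the Fortran example.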
MPI_Send
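• int MPI_Send(void* buf, int count, MPI_Datatype dtype, int dest, int tag, MPI_Comm comm)
• MPI_SEND(BUF, COUNT, DTYPE, DEST, TAG, COMM, IERR)
    <type> BUF(*)
    INTEGER COUNT, DTYPE, DEST, TAG, COMM, IERR
• Blocking send
  – returns only when the send buffer can safely be reused
  – depending on the implementation and the message size, the message may be buffered (MPI_Send returns before the matching receive is posted) or MPI_Send may wait until the matching receive has started
  – portable programs must not rely on buffering; MPI_Ssend is the variant that always waits for the matching receive to start, i.e. synchronizes sender and receiver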
MPI_Send
buf: pointer to the message data (e.g. pointer to the first element of an array)
count: length of the message in elements
dtype: data type of the message content (size of data type × count = message size)
dest: rank of the destination process
tag: "type" of the message, used to match sends with receives
comm: handle to the communicator (the group of communicating processes)
ierr: Fortran only, OUT argument for the error code
return value: C/C++, error code
MPI Data Types for C
MPI provides predefined data types that must be specified when passing messages.
MPI data type         C data type
MPI_BYTE              (none: untyped bytes)
MPI_CHAR              char
MPI_DOUBLE            double
MPI_FLOAT             float
MPI_INT               int
MPI_LONG              long
MPI_LONG_DOUBLE       long double
MPI_PACKED            (none: data packed with MPI_Pack)
MPI_SHORT             short
MPI_UNSIGNED_SHORT    unsigned short
MPI_UNSIGNED          unsigned int
MPI_UNSIGNED_LONG     unsigned long
MPI_UNSIGNED_CHAR     unsigned char
Communication Wildcards
• MPI defines a set of wildcards to be specified with communication primitives:
  – MPI_ANY_SOURCE: matches any logical rank when receiving a message with MPI_Recv (the message status contains the actual sender)
  – MPI_ANY_TAG: matches any message tag when receiving a message (the message status contains the actual tag)
  – MPI_PROC_NULL: special value indicating a non-existent process rank (messages are not delivered or received for this special rank)
Blocking Communication
• MPI_Send and MPI_Recv are blocking operations
Figure: computation and communication phases of process A (MPI_Send) and process B (MPI_Recv)
Non-blocking Communication
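• MPI_Isend and MPI_Irecv are non-blocking operations: they start the transfer and return immediately, so computation can overlap communication
• MPI_Wait completes the operation; only then may the send buffer be reused or the receive buffer be read
Figure: process A (MPI_Isend) and process B (MPI_Irecv) keep computing while the message is in flight and later call MPI_Wait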
‘Collectives’, e.g. MPI_Reduce
• int MPI_Reduce(void* sendbuf, void* recvbuf, int count, MPI_Datatype dtype, MPI_Op op, int root, MPI_Comm comm)
• MPI_REDUCE(SENDBUF, RECVBUF, COUNT, DTYPE, OP, ROOT, COMM, IERR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER COUNT, DTYPE, OP, ROOT, COMM, IERR
• Global operation that combines the data of all processes into a global result at the root process.
  – All processes in the communicator have to reach the same MPI_Reduce invocation.
  – Otherwise deadlocks and undefined behavior may occur.
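What the root ends up with can be modeled in plain C. The sketch below (hypothetical `reduce_sum_model`, no MPI calls) computes, element by element, what MPI_Reduce with MPI_SUM leaves in the root's recvbuf: recvbuf[i] is the sum of sendbuf[i] over all ranks. The recvbuf of the other ranks is not significant. A real MPI library may combine the contributions in a different order, so floating-point results can differ in the last bits.

```c
/* Plain-C model of
       MPI_Reduce(sendbuf, recvbuf, count, MPI_DOUBLE, MPI_SUM, root, comm)
   as seen at the root. sendbufs[r] stands in for rank r's sendbuf. */
void reduce_sum_model(const double *const *sendbufs, int nranks,
                      double *recvbuf, int count)
{
    for (int i = 0; i < count; i++) {
        double acc = 0.0;
        for (int r = 0; r < nranks; r++)
            acc += sendbufs[r][i];      /* element i from every rank */
        recvbuf[i] = acc;
    }
}
```

With three ranks contributing {1, 2}, {10, 20} and {100, 200}, the root receives {111, 222}.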
MPI_Reduce – Operators
MPI_MAX               maximum
MPI_MIN               minimum
MPI_SUM               sum
MPI_PROD              product
MPI_LAND / MPI_BAND   logical and / bit-wise and
MPI_LOR / MPI_BOR     logical or / bit-wise or
MPI_LXOR / MPI_BXOR   logical exclusive or / bit-wise exclusive or
MPI_MAXLOC            max value and location
MPI_MINLOC            min value and location
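MPI_MAXLOC and MPI_MINLOC operate on (value, index) pairs, for example with the MPI_DOUBLE_INT datatype in C. For MPI_MAXLOC the result is the largest value together with its index, and if several ranks hold the same maximum, the smallest index wins. A plain-C model of that combine rule (hypothetical names, no MPI calls):

```c
/* Same layout as used with MPI_DOUBLE_INT: a double followed by an int. */
struct double_int {
    double value;
    int index;
};

/* Combine rule of MPI_MAXLOC for two (value, index) pairs. */
struct double_int maxloc_combine(struct double_int a, struct double_int b)
{
    if (a.value > b.value) return a;
    if (b.value > a.value) return b;
    return a.index < b.index ? a : b;   /* tie: keep the smaller index */
}

/* Fold the rule over one pair per rank, as the root would see it.
   Assumes n >= 1. */
struct double_int maxloc_model(const struct double_int *pairs, int n)
{
    struct double_int acc = pairs[0];
    for (int k = 1; k < n; k++)
        acc = maxloc_combine(acc, pairs[k]);
    return acc;
}
```

A typical use is to send (local maximum, own rank) from every process, so that the root learns both the global maximum and which rank holds it.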
MPI_Barrier
• int MPI_Barrier(MPI_Comm comm)
• MPI_BARRIER(COMM, IERROR)
    INTEGER COMM, IERROR
• Global operation that synchronizes all participating processes: no process returns from the call until all processes in the communicator have entered it.
  – All processes in the communicator have to reach an MPI_Barrier invocation.
  – Otherwise deadlocks and undefined behavior may occur.
Stencil Computation example
• Some algorithms (e.g. Jacobi, Gauss-Seidel) process data with a stencil:
  – grid(i,j) = 0.25 * (grid(i+1,j) + grid(i-1,j) + grid(i,j+1) + grid(i,j-1))
• Data access pattern: the point (i,j) is computed from its four neighbors (i-1,j), (i+1,j), (i,j-1), and (i,j+1)
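A minimal C sketch of one Jacobi sweep over a local block (the function name and storage layout are assumptions): the grid is stored row-major with one layer of boundary cells around the nx × ny interior. In an MPI code those outer rows and columns would be ghost cells, filled by exchanging edge data with the neighboring ranks before each sweep; that exchange is not shown. Writing into a separate output array is what makes this Jacobi; updating in place, so that already-updated neighbors are used, gives Gauss-Seidel.

```c
#include <stddef.h>

/* One Jacobi sweep on an (nx+2) x (ny+2) row-major grid.
   Rows/columns 0 and nx+1 / ny+1 hold boundary (or ghost) values and are
   copied unchanged; each interior point becomes the average of its four
   neighbors, read only from `in`. */
void jacobi_sweep(const double *in, double *out, size_t nx, size_t ny)
{
    size_t w = ny + 2;                          /* row width incl. ghosts */

    for (size_t k = 0; k < (nx + 2) * w; k++)
        out[k] = in[k];                         /* keep boundary values */

    for (size_t i = 1; i <= nx; i++)
        for (size_t j = 1; j <= ny; j++)
            out[i * w + j] = 0.25 * (in[(i + 1) * w + j] + in[(i - 1) * w + j]
                                   + in[i * w + (j + 1)] + in[i * w + (j - 1)]);
}
```

For a 3 × 3 grid with a single interior point whose neighbors hold 1, 2, 3 and 4, the sweep sets that point to 0.25 × 10 = 2.5. An iterative solver would swap `in` and `out` after each sweep.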
MPI features not covered
• One-sided communication
  – MPI_Put, MPI_Get
  – Uses Remote Memory Access (RMA)
  – Separates communication from synchronization
• User-defined datatypes, strided messages
• Dynamic process spawning: MPI_Comm_spawn
• Collective communication over inter-communicators (which connect two disjoint groups of processes)
• Parallel I/O
• MPI 3.0 (released Sept 21, 2012)
What Is OpenMP?
• Portable, shared-memory threading API
  – Fortran, C, and C++
  – Multi-vendor support for both Linux and Windows
• Standardizes task- and loop-level parallelism
• Supports coarse-grained parallelism
• Combines serial and parallel code in a single source
• Standardizes ~20 years of compiler-directed threading experience
http://www.openmp.org
Current spec (August 2013) is OpenMP 4.0, July 31, 2013 (combined C/C++ and Fortran)
Introduction to OpenMP
OpenMP Programming Model
Fork-Join Parallelism:
• The master thread spawns a team of threads as needed
• Parallelism is added incrementally: the sequential program evolves into a parallel program
Figure: the master thread forks a team of threads at each parallel region and continues alone after the region ends
A Few Syntax Details to Get Started
• Most of the constructs in OpenMP are compiler directives or pragmas
  – For C and C++, the pragmas take the form:
      #pragma omp construct [clause [clause]…]
  – For Fortran, the directives take one of the forms (C$OMP and *$OMP in fixed-form source only; !$OMP in both fixed and free form):
      C$OMP construct [clause [clause]…]
      !$OMP construct [clause [clause]…]
      *$OMP construct [clause [clause]…]
• Header file or Fortran 90 module
      #include <omp.h>
      use omp_lib
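A short C sketch of the pragma form and header shown above, applied to the fork-join model. Compilers define the `_OPENMP` macro when OpenMP is enabled, so guarding the header and the runtime call with it keeps the same source compiling as a serial program; `count_team_threads` is a hypothetical name.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Returns the number of threads in the team that executes the parallel
   region: the team size with OpenMP enabled, 1 in a serial build. */
int count_team_threads(void)
{
    int nthreads = 1;                   /* serial build: only this thread */
#ifdef _OPENMP
    #pragma omp parallel                /* fork: the master creates a team */
    {
        #pragma omp single              /* one thread records the size */
        nthreads = omp_get_num_threads();
    }                                   /* join: implicit barrier */
#endif
    return nthreads;
}
```

Built with OpenMP enabled (for example `-fopenmp` with GCC), the function returns the team size, which can be set with the OMP_NUM_THREADS environment variable; built without it, it returns 1.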
Worksharing
• Worksharing is the general term used in OpenMP to describe the distribution of work across threads.
• Three ways of distributing work among threads in OpenMP are:
  – omp for construct
  – omp sections construct
  – omp task construct
These constructs automatically divide the work among threads.
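A small C sketch of the sections construct (hypothetical function `min_and_max`, assuming n ≥ 1): two independent computations are placed in separate sections, and each section is executed exactly once by some thread of the team. Without OpenMP the pragmas are ignored and the two blocks simply run one after the other.

```c
/* Assumes n >= 1. The two sections do not depend on each other. */
void min_and_max(const double *a, int n, double *mn, double *mx)
{
    #pragma omp parallel sections
    {
        #pragma omp section
        {                               /* section 1: minimum */
            double m = a[0];
            for (int i = 1; i < n; i++)
                if (a[i] < m) m = a[i];
            *mn = m;
        }
        #pragma omp section
        {                               /* section 2: maximum */
            double m = a[0];
            for (int i = 1; i < n; i++)
                if (a[i] > m) m = a[i];
            *mx = m;
        }
    }                                   /* implicit barrier */
}
```

Sections suit a small, fixed number of different jobs; for many similar jobs, omp for or tasks scale better.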
‘omp for’ Construct
• Threads are assigned independent sets of iterations
• Threads must wait at the end of the worksharing construct (implicit barrier)
Figure: iterations i = 1 to 12 divided among the threads of the team, followed by the implicit barrier

// assume N = 12; arrays a, b, c have at least N+1 elements
int i;
#pragma omp parallel
#pragma omp for
for (i = 1; i < N+1; i++)
    c[i] = a[i] + b[i];
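The same loop as a complete C function, as a sketch (hypothetical name `vadd`, 0-based arrays). `#pragma omp parallel for` is the combined form of the two pragmas above, and `schedule(static)` divides the iterations into contiguous chunks of roughly equal size, at most one chunk per thread.

```c
/* c[i] = a[i] + b[i] for i = 0 .. n-1; the iterations are independent. */
void vadd(const double *a, const double *b, double *c, int n)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)         /* loop variable i is private */
        c[i] = a[i] + b[i];
    /* implicit barrier: all iterations are done before the return */
}
```

The combined construct is a convenient shorthand when the parallel region contains nothing but the loop.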
New Addition to OpenMP
Tasks: main change for OpenMP 3.0
• Allow parallelization of irregular problems
  – unbounded loops
  – recursive algorithms
  – producer/consumer
Device constructs: main change for OpenMP 4.0
• Allow describing regions of code whose data and/or computation should be moved to another computing device.
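A minimal sketch of an OpenMP 4.0 device construct in C (hypothetical function `vadd_target`): the `map` clauses copy `a` and `b` to the device and `c` back to the host, and the loop runs on the device. If no device is available, the target region executes on the host, so the function also works on ordinary machines.

```c
/* OpenMP 4.0 device construct: copy a and b to the device, run the loop
   there, copy c back. Runs on the host if no device is available. */
void vadd_target(const float *a, const float *b, float *c, int n)
{
    #pragma omp target map(to: a[0:n], b[0:n]) map(from: c[0:n])
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```

The array-section syntax `a[0:n]` (start, length) tells the runtime how much data to transfer, which a plain pointer alone does not convey.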
What are tasks?
• Tasks are independent units of work
• Threads are assigned to perform the work of each task
  – Tasks may be deferred
  – Tasks may be executed immediately
  – The runtime system decides which of the above
• Tasks are composed of:
  – code to execute
  – data environment
  – internal control variables (ICVs)
Figure: serial versus parallel execution of tasks
Simple Task Example
• A pool of 8 threads is created at the parallel construct
• One thread gets to execute the while loop
• The single "while loop" thread creates a task for each instance of processwork()

#pragma omp parallel
// assume 8 threads
{
  #pragma omp single private(p)
  {
    …
    while (p) {
      #pragma omp task
      {
        processwork(p);
      }
      p = p->next;
    }
  }
}
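A complete, compilable variant of the pattern above, as a sketch. The list node type and the body of `processwork` (squaring a value) are hypothetical stand-ins; the loop variable is declared inside the single block, so it is private to the thread that walks the list, and `firstprivate(p)` gives each task its own copy of the current pointer.

```c
#include <stddef.h>

/* Hypothetical list node: processwork() stores value squared in result. */
struct node {
    int value;
    int result;
    struct node *next;
};

static void processwork(struct node *p)
{
    p->result = p->value * p->value;
}

void process_list(struct node *head)
{
    #pragma omp parallel                /* team of threads is created */
    {
        #pragma omp single              /* one thread walks the list */
        {
            for (struct node *p = head; p != NULL; p = p->next) {
                #pragma omp task firstprivate(p)
                processwork(p);         /* run by any thread of the team */
            }
        }                               /* implicit barrier: all tasks done */
    }
}
```

Because each task writes only to its own node, the tasks need no further synchronization.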
Task Construct – Explicit Task View
– A team of threads is created at the omp parallel construct
– A single thread is chosen to execute the while loop; let's call this thread "L"
– Thread L runs the while loop, creates tasks, and fetches the next pointers
– Each time L crosses the omp task construct, it generates a new task, which is executed by some thread of the team (possibly L itself)
– Tasks do not get threads of their own: the threads of the team pick up pending tasks, so one thread may execute many tasks
– All tasks are complete at the implicit barrier at the end of the single construct

#pragma omp parallel
{
  #pragma omp single
  { // block 1
    node * p = head;
    while (p) { // block 2
      #pragma omp task
      process(p);
      p = p->next; // block 3
    }
  }
}
OpenMP* Reduction Clause
• reduction(op : list)
• The variables in "list" must be shared in the enclosing parallel region
• Inside a parallel or worksharing construct:
  – A private copy of each list variable is created and initialized according to "op" (e.g. 0 for +, 1 for *)
  – These copies are updated locally by the threads
  – At the end of the construct, the local copies are combined through "op" into a single value, which is then combined with the value of the original shared variable
Reduction Example
• Each thread has a local copy of sum
• All local copies of sum are added together and stored in the "global" variable

#pragma omp parallel for reduction(+:sum)
for (i = 0; i < N; i++) {
    sum += a[i] * b[i];
}
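A runnable C version of the example plus a second operator, as a sketch (hypothetical function names). The `max` operator for C/C++ reductions requires OpenMP 3.1 or later. Because the partial results may be combined in a different order than in a serial loop, floating-point sums can differ slightly between runs with different thread counts.

```c
/* Dot product: each thread accumulates into a private copy of sum
   (initialized to 0); the copies are added into the original sum. */
double dot(const double *a, const double *b, int n)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Largest absolute value: private copies of m start at the least
   representable value; the maximum of all copies and the original
   m (0.0) is the result. */
double max_abs(const double *a, int n)
{
    double m = 0.0;
    #pragma omp parallel for reduction(max:m)
    for (int i = 0; i < n; i++) {
        double v = a[i] < 0 ? -a[i] : a[i];
        if (v > m)
            m = v;
    }
    return m;
}
```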
Why Hybrid Programming? OpenMP/MPI
Figure: runtime in seconds (logarithmic axis, 10 to 5120) versus number of nodes (1 to 128) for the configurations 1 PPN; 1 PPN / 2 TPP; 1 PPN / 4 TPP; 1 PPN / 8 TPP; 2 PPN; 2 PPN / 2 TPP; 2 PPN / 4 TPP; 4 PPN; 4 PPN / 2 TPP; 8 PPN. Annotation: 53% improvement over MPI.
PPN = processes per node
TPP = threads per process
Simulation of Free-Surface Flows, Finite Element CFD solver written in Fortran and C
Figure kindly provided by HPC group of the Center of Computing and Communication, RWTH Aachen, Germany
The Good, the Bad, and the Ugly
The Good
• OpenMP and MPI blend well with each other if programmers respect certain rules.
The Bad
• Programmers need to be aware of the issues of hybrid programming, e.g. using thread-safe libraries and initializing MPI with the required thread-support level (MPI_Init_thread).
The Ugly
• What's the best setting for PPN and TPP on a given machine?
MPI and OpenMP hybrid programs can greatly improve the performance of parallel codes!
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO
ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND
INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR
WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT,
COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.
Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software,
operations and functions. Any change to any of those factors may cause the results to vary. You should consult other
information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of
that product when combined with other products.
Copyright © , Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, Xeon Phi, VTune, and Cilk are trademarks
of Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that
are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and
other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on
microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended
for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for
Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information
regarding the specific instruction sets covered by this notice.
Notice revision #20110804
Legal Disclaimer & Optimization Notice
Copyright© 2012, Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.
More explosions, more chaos, and definitely more blowing stuff upIntel® Software
 
Accelerate Your Python* Code through Profiling, Tuning, and Compilation Part ...
Accelerate Your Python* Code through Profiling, Tuning, and Compilation Part ...Accelerate Your Python* Code through Profiling, Tuning, and Compilation Part ...
Accelerate Your Python* Code through Profiling, Tuning, and Compilation Part ...Intel® Software
 
Make your unity game faster, faster
Make your unity game faster, fasterMake your unity game faster, faster
Make your unity game faster, fasterIntel® Software
 
What are latest new features that DPDK brings into 2018?
What are latest new features that DPDK brings into 2018?What are latest new features that DPDK brings into 2018?
What are latest new features that DPDK brings into 2018?Michelle Holley
 
Linux on Z13 and Simulatenus Multithreading - Sebastien Llaurency
Linux on Z13 and Simulatenus Multithreading - Sebastien LlaurencyLinux on Z13 and Simulatenus Multithreading - Sebastien Llaurency
Linux on Z13 and Simulatenus Multithreading - Sebastien LlaurencyNRB
 
Droidcon2013 ndk cpu_architecture_optimization_weggerle_intel
Droidcon2013 ndk cpu_architecture_optimization_weggerle_intelDroidcon2013 ndk cpu_architecture_optimization_weggerle_intel
Droidcon2013 ndk cpu_architecture_optimization_weggerle_intelDroidcon Berlin
 
Droidcon ndk cpu_architecture_optimization
Droidcon ndk cpu_architecture_optimizationDroidcon ndk cpu_architecture_optimization
Droidcon ndk cpu_architecture_optimizationDroidcon Berlin
 
Light-weighted HDFS disaster recovery
Light-weighted HDFS disaster recoveryLight-weighted HDFS disaster recovery
Light-weighted HDFS disaster recoveryDataWorks Summit
 
Intel XDK - Philly JS
Intel XDK - Philly JSIntel XDK - Philly JS
Intel XDK - Philly JSIan Maffett
 
Using JavaScript to Build HTML5 Tools (Ian Maffett)
Using JavaScript to Build HTML5 Tools (Ian Maffett)Using JavaScript to Build HTML5 Tools (Ian Maffett)
Using JavaScript to Build HTML5 Tools (Ian Maffett)Future Insights
 
Кирилл Мавродиев, Intel – Обзор современных возможностей по распараллеливанию...
Кирилл Мавродиев, Intel – Обзор современных возможностей по распараллеливанию...Кирилл Мавродиев, Intel – Обзор современных возможностей по распараллеливанию...
Кирилл Мавродиев, Intel – Обзор современных возможностей по распараллеливанию...Media Gorod
 
Intel: How to Use Alluxio to Accelerate BigData Analytics on the Cloud and Ne...
Intel: How to Use Alluxio to Accelerate BigData Analytics on the Cloud and Ne...Intel: How to Use Alluxio to Accelerate BigData Analytics on the Cloud and Ne...
Intel: How to Use Alluxio to Accelerate BigData Analytics on the Cloud and Ne...Alluxio, Inc.
 
17294_HiperSockets.pdf
17294_HiperSockets.pdf17294_HiperSockets.pdf
17294_HiperSockets.pdfEeszt
 
Efficient Rendering with DirectX* 12 on Intel® Graphics
Efficient Rendering with DirectX* 12 on Intel® GraphicsEfficient Rendering with DirectX* 12 on Intel® Graphics
Efficient Rendering with DirectX* 12 on Intel® GraphicsGael Hofemeier
 
How to create a high quality, fast texture compressor using ISPC
How to create a high quality, fast texture compressor using ISPC How to create a high quality, fast texture compressor using ISPC
How to create a high quality, fast texture compressor using ISPC Gael Hofemeier
 
Development of Signal Processing Algorithms using OpenCL for FPGA based Archi...
Development of Signal Processing Algorithms using OpenCL for FPGA based Archi...Development of Signal Processing Algorithms using OpenCL for FPGA based Archi...
Development of Signal Processing Algorithms using OpenCL for FPGA based Archi...Pradeep Singh
 

Similar to Intel® MPI Library e OpenMP* - Intel Software Conference 2013 (20)

Open CL For Speedup Workshop
Open CL For Speedup WorkshopOpen CL For Speedup Workshop
Open CL For Speedup Workshop
 
NFF-GO (YANFF) - Yet Another Network Function Framework
NFF-GO (YANFF) - Yet Another Network Function FrameworkNFF-GO (YANFF) - Yet Another Network Function Framework
NFF-GO (YANFF) - Yet Another Network Function Framework
 
In The Trenches Optimizing UE4 for Intel
In The Trenches Optimizing UE4 for IntelIn The Trenches Optimizing UE4 for Intel
In The Trenches Optimizing UE4 for Intel
 
Introduction ciot workshop premeetup
Introduction ciot workshop premeetupIntroduction ciot workshop premeetup
Introduction ciot workshop premeetup
 
More explosions, more chaos, and definitely more blowing stuff up
More explosions, more chaos, and definitely more blowing stuff upMore explosions, more chaos, and definitely more blowing stuff up
More explosions, more chaos, and definitely more blowing stuff up
 
Accelerate Your Python* Code through Profiling, Tuning, and Compilation Part ...
Accelerate Your Python* Code through Profiling, Tuning, and Compilation Part ...Accelerate Your Python* Code through Profiling, Tuning, and Compilation Part ...
Accelerate Your Python* Code through Profiling, Tuning, and Compilation Part ...
 
Make your unity game faster, faster
Make your unity game faster, fasterMake your unity game faster, faster
Make your unity game faster, faster
 
What are latest new features that DPDK brings into 2018?
What are latest new features that DPDK brings into 2018?What are latest new features that DPDK brings into 2018?
What are latest new features that DPDK brings into 2018?
 
Linux on Z13 and Simulatenus Multithreading - Sebastien Llaurency
Linux on Z13 and Simulatenus Multithreading - Sebastien LlaurencyLinux on Z13 and Simulatenus Multithreading - Sebastien Llaurency
Linux on Z13 and Simulatenus Multithreading - Sebastien Llaurency
 
Droidcon2013 ndk cpu_architecture_optimization_weggerle_intel
Droidcon2013 ndk cpu_architecture_optimization_weggerle_intelDroidcon2013 ndk cpu_architecture_optimization_weggerle_intel
Droidcon2013 ndk cpu_architecture_optimization_weggerle_intel
 
Droidcon ndk cpu_architecture_optimization
Droidcon ndk cpu_architecture_optimizationDroidcon ndk cpu_architecture_optimization
Droidcon ndk cpu_architecture_optimization
 
Light-weighted HDFS disaster recovery
Light-weighted HDFS disaster recoveryLight-weighted HDFS disaster recovery
Light-weighted HDFS disaster recovery
 
Intel XDK - Philly JS
Intel XDK - Philly JSIntel XDK - Philly JS
Intel XDK - Philly JS
 
Using JavaScript to Build HTML5 Tools (Ian Maffett)
Using JavaScript to Build HTML5 Tools (Ian Maffett)Using JavaScript to Build HTML5 Tools (Ian Maffett)
Using JavaScript to Build HTML5 Tools (Ian Maffett)
 
Кирилл Мавродиев, Intel – Обзор современных возможностей по распараллеливанию...
Кирилл Мавродиев, Intel – Обзор современных возможностей по распараллеливанию...Кирилл Мавродиев, Intel – Обзор современных возможностей по распараллеливанию...
Кирилл Мавродиев, Intel – Обзор современных возможностей по распараллеливанию...
 
Intel: How to Use Alluxio to Accelerate BigData Analytics on the Cloud and Ne...
Intel: How to Use Alluxio to Accelerate BigData Analytics on the Cloud and Ne...Intel: How to Use Alluxio to Accelerate BigData Analytics on the Cloud and Ne...
Intel: How to Use Alluxio to Accelerate BigData Analytics on the Cloud and Ne...
 
17294_HiperSockets.pdf
17294_HiperSockets.pdf17294_HiperSockets.pdf
17294_HiperSockets.pdf
 
Efficient Rendering with DirectX* 12 on Intel® Graphics
Efficient Rendering with DirectX* 12 on Intel® GraphicsEfficient Rendering with DirectX* 12 on Intel® Graphics
Efficient Rendering with DirectX* 12 on Intel® Graphics
 
How to create a high quality, fast texture compressor using ISPC
How to create a high quality, fast texture compressor using ISPC How to create a high quality, fast texture compressor using ISPC
How to create a high quality, fast texture compressor using ISPC
 
Development of Signal Processing Algorithms using OpenCL for FPGA based Archi...
Development of Signal Processing Algorithms using OpenCL for FPGA based Archi...Development of Signal Processing Algorithms using OpenCL for FPGA based Archi...
Development of Signal Processing Algorithms using OpenCL for FPGA based Archi...
 

More from Intel Software Brasil

Modernização de código em Xeon® e Xeon Phi™
Modernização de código em Xeon® e Xeon Phi™  Modernização de código em Xeon® e Xeon Phi™
Modernização de código em Xeon® e Xeon Phi™ Intel Software Brasil
 
Escreva sua App sem gastar energia, agora no KitKat
Escreva sua App sem gastar energia, agora no KitKatEscreva sua App sem gastar energia, agora no KitKat
Escreva sua App sem gastar energia, agora no KitKatIntel Software Brasil
 
Desafios do Desenvolvimento Multiplataforma
Desafios do Desenvolvimento MultiplataformaDesafios do Desenvolvimento Multiplataforma
Desafios do Desenvolvimento MultiplataformaIntel Software Brasil
 
Desafios do Desenvolvimento Multi-plataforma
Desafios do Desenvolvimento Multi-plataformaDesafios do Desenvolvimento Multi-plataforma
Desafios do Desenvolvimento Multi-plataformaIntel Software Brasil
 
Getting the maximum performance in distributed clusters Intel Cluster Studio XE
Getting the maximum performance in distributed clusters Intel Cluster Studio XEGetting the maximum performance in distributed clusters Intel Cluster Studio XE
Getting the maximum performance in distributed clusters Intel Cluster Studio XEIntel Software Brasil
 
Principais conceitos técnicas e modelos de programação paralela
Principais conceitos técnicas e modelos de programação paralelaPrincipais conceitos técnicas e modelos de programação paralela
Principais conceitos técnicas e modelos de programação paralelaIntel Software Brasil
 
Principais conceitos e técnicas em vetorização
Principais conceitos e técnicas em vetorizaçãoPrincipais conceitos e técnicas em vetorização
Principais conceitos e técnicas em vetorizaçãoIntel Software Brasil
 
Benchmarking para sistemas de alto desempenho
Benchmarking para sistemas de alto desempenhoBenchmarking para sistemas de alto desempenho
Benchmarking para sistemas de alto desempenhoIntel Software Brasil
 
Yocto no 1 IoT Day da Telefonica/Vivo
Yocto no 1 IoT Day da Telefonica/VivoYocto no 1 IoT Day da Telefonica/Vivo
Yocto no 1 IoT Day da Telefonica/VivoIntel Software Brasil
 
Otávio Salvador - Yocto project reduzindo -time to market- do seu próximo pr...
Otávio Salvador - Yocto project  reduzindo -time to market- do seu próximo pr...Otávio Salvador - Yocto project  reduzindo -time to market- do seu próximo pr...
Otávio Salvador - Yocto project reduzindo -time to market- do seu próximo pr...Intel Software Brasil
 
Desenvolvimento e análise de performance de jogos Android com Coco2d-HTML5
Desenvolvimento e análise de performance de jogos Android com Coco2d-HTML5Desenvolvimento e análise de performance de jogos Android com Coco2d-HTML5
Desenvolvimento e análise de performance de jogos Android com Coco2d-HTML5Intel Software Brasil
 
O uso de tecnologias Intel na implantação de sistemas de alto desempenho
O uso de tecnologias Intel na implantação de sistemas de alto desempenhoO uso de tecnologias Intel na implantação de sistemas de alto desempenho
O uso de tecnologias Intel na implantação de sistemas de alto desempenhoIntel Software Brasil
 
Escreva sua App Android sem gastar energia - Intel Sw Day
Escreva sua App Android sem gastar energia - Intel Sw DayEscreva sua App Android sem gastar energia - Intel Sw Day
Escreva sua App Android sem gastar energia - Intel Sw DayIntel Software Brasil
 

More from Intel Software Brasil (20)

Modernização de código em Xeon® e Xeon Phi™
Modernização de código em Xeon® e Xeon Phi™  Modernização de código em Xeon® e Xeon Phi™
Modernização de código em Xeon® e Xeon Phi™
 
Escreva sua App sem gastar energia, agora no KitKat
Escreva sua App sem gastar energia, agora no KitKatEscreva sua App sem gastar energia, agora no KitKat
Escreva sua App sem gastar energia, agora no KitKat
 
Desafios do Desenvolvimento Multiplataforma
Desafios do Desenvolvimento MultiplataformaDesafios do Desenvolvimento Multiplataforma
Desafios do Desenvolvimento Multiplataforma
 
Desafios do Desenvolvimento Multi-plataforma
Desafios do Desenvolvimento Multi-plataformaDesafios do Desenvolvimento Multi-plataforma
Desafios do Desenvolvimento Multi-plataforma
 
Yocto - 7 masters
Yocto - 7 mastersYocto - 7 masters
Yocto - 7 masters
 
Getting the maximum performance in distributed clusters Intel Cluster Studio XE
Getting the maximum performance in distributed clusters Intel Cluster Studio XEGetting the maximum performance in distributed clusters Intel Cluster Studio XE
Getting the maximum performance in distributed clusters Intel Cluster Studio XE
 
Intel tools to optimize HPC systems
Intel tools to optimize HPC systemsIntel tools to optimize HPC systems
Intel tools to optimize HPC systems
 
Principais conceitos técnicas e modelos de programação paralela
Principais conceitos técnicas e modelos de programação paralelaPrincipais conceitos técnicas e modelos de programação paralela
Principais conceitos técnicas e modelos de programação paralela
 
Principais conceitos e técnicas em vetorização
Principais conceitos e técnicas em vetorizaçãoPrincipais conceitos e técnicas em vetorização
Principais conceitos e técnicas em vetorização
 
Notes on NUMA architecture
Notes on NUMA architectureNotes on NUMA architecture
Notes on NUMA architecture
 
Benchmarking para sistemas de alto desempenho
Benchmarking para sistemas de alto desempenhoBenchmarking para sistemas de alto desempenho
Benchmarking para sistemas de alto desempenho
 
Yocto no 1 IoT Day da Telefonica/Vivo
Yocto no 1 IoT Day da Telefonica/VivoYocto no 1 IoT Day da Telefonica/Vivo
Yocto no 1 IoT Day da Telefonica/Vivo
 
Html5 fisl15
Html5 fisl15Html5 fisl15
Html5 fisl15
 
IoT FISL15
IoT FISL15IoT FISL15
IoT FISL15
 
IoT TDC Floripa 2014
IoT TDC Floripa 2014IoT TDC Floripa 2014
IoT TDC Floripa 2014
 
Otávio Salvador - Yocto project reduzindo -time to market- do seu próximo pr...
Otávio Salvador - Yocto project  reduzindo -time to market- do seu próximo pr...Otávio Salvador - Yocto project  reduzindo -time to market- do seu próximo pr...
Otávio Salvador - Yocto project reduzindo -time to market- do seu próximo pr...
 
Html5 tdc floripa_2014
Html5 tdc floripa_2014Html5 tdc floripa_2014
Html5 tdc floripa_2014
 
Desenvolvimento e análise de performance de jogos Android com Coco2d-HTML5
Desenvolvimento e análise de performance de jogos Android com Coco2d-HTML5Desenvolvimento e análise de performance de jogos Android com Coco2d-HTML5
Desenvolvimento e análise de performance de jogos Android com Coco2d-HTML5
 
O uso de tecnologias Intel na implantação de sistemas de alto desempenho
O uso de tecnologias Intel na implantação de sistemas de alto desempenhoO uso de tecnologias Intel na implantação de sistemas de alto desempenho
O uso de tecnologias Intel na implantação de sistemas de alto desempenho
 
Escreva sua App Android sem gastar energia - Intel Sw Day
Escreva sua App Android sem gastar energia - Intel Sw DayEscreva sua App Android sem gastar energia - Intel Sw Day
Escreva sua App Android sem gastar energia - Intel Sw Day
 

Recently uploaded

Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 

Recently uploaded (20)

Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 

Intel® MPI Library and OpenMP* - Intel Software Conference 2013

  • 1. © 2013, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.
    MPI and OpenMP: Reducing effort for parallel software development
    Werner Krotz-Vogel, August 2013
  • 2. Objectives
    © 2009 Mathew J. Sottile, Timothy G. Mattson, and Craig E
    • Design parallel applications from serial codes
    • Determine appropriate decomposition strategies for applications
    • Choose an applicable parallel model for implementation: MPI and OpenMP
  • 3. Why MPI and OpenMP?
    • Performance ~ √(die area) (Pollack's rule)
      - 4x the silicon die area gives 2x the performance in one core, but 4x the performance when dedicated to 4 cores
      - Power ~ Voltage² (voltage is roughly proportional to clock frequency)
    • Conclusion (with respect to Pollack's rule above)
      - Multiple cores are a powerful handle for adjusting performance per watt
    ⇒ Parallel hardware
    ⇒ Parallel software
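The two scaling rules on this slide can be checked with a short worked example. It follows the slide's simplified model: single-core performance grows with the square root of die area A, power grows with V², and V is roughly proportional to clock frequency f. It also assumes a perfectly parallel workload and ignores static power, and real chips deviate from both assumptions.

```latex
\begin{aligned}
\text{Perf}_{1\,\text{core}}(4A) &= \sqrt{4}\,\text{Perf}_{1\,\text{core}}(A) = 2\,\text{Perf}_{1\,\text{core}}(A) \\
\text{Perf}_{4\,\text{cores}}(4 \times A) &= 4\,\text{Perf}_{1\,\text{core}}(A) \quad \text{(ideal, perfectly parallel work)} \\
P \propto V^2,\; V \propto f \;&\Rightarrow\; P \propto f^2 \\
\text{2 cores at } f/2:\quad \text{throughput} &\propto 2 \cdot \tfrac{1}{2} = 1, \qquad P \propto 2\left(\tfrac{1}{2}\right)^2 = \tfrac{1}{2}
\end{aligned}
```

Under these assumptions, two half-speed cores match the throughput of one full-speed core at half the power. If the linear f term of dynamic power (P ∝ C·V²·f) is also included, the saving is larger still.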
  • 4. Parallel Programming: Algorithms / Distributed Versus Shared Memory
    [Figure: distributed-memory system (each CPU with its own memory, CPU/memory pairs connected by a network) and shared-memory system (several CPUs accessing one memory over a common bus)]
    • Message passing: multiple processes; share data with messages; MPI*
    • Threads: single process; concurrent execution; shared memory and resources; explicit threads, OpenMP*
  • 5. Parallel Programming: Algorithms / Designing Parallel Programs
    • Partition: divide the problem into tasks
    • Communicate: determine the amount and pattern of communication
    • Agglomerate: combine tasks
    • Map: assign agglomerated tasks to physical processors
    [Figure: The Problem → Initial tasks → Communication → Combined Tasks → Final Program]
  • 6. Parallel Programming: Algorithms / 1. Partitioning
    • Discover as much parallelism as possible
      • Independent computations and/or data
      • Maximize the number of primitive tasks
    • Functional decomposition
      • Divide the computation, then associate the data
    • Domain decomposition
      • Divide the data into pieces, then associate the computation
    [Figure: Initial tasks]
  • 7. Parallel Programming: Algorithms / Decomposition Methods
    • Functional decomposition
      • Focusing on computations can reveal structure in a problem
      [Figure: model components: Atmosphere Model, Ocean Model, Land Surface Model, Hydrology Model]
    • Domain decomposition
      • Focus on the largest or most frequently accessed data structure
      • Data parallelism
      • Same operation(s) applied to all data
      [Figure: computational grid. Grid reprinted with permission of Dr. Phu V. Luong, Coastal and Hydraulics Laboratory, Engineer Research and Development Center (ERDC).]
  • 8. Parallel Programming: Algorithms / 2. Communication
    • Determine the communication pattern between primitive tasks
      • What data need to be shared?
    • Point-to-point
      • One thread to another
    • Collective
      • Groups of threads sharing data
    • Execution-order dependencies are also a form of communication
    [Figure: Communication]
  • 9. Parallel Programming: Algorithms / 3. Agglomeration
    • Group primitive tasks in order to:
      - Improve performance/granularity
      - Localize communication: put tasks that communicate in the same group
      - Maintain scalability of the design: gracefully handle changes in data set size or number of processors
      - Simplify programming and maintenance
    [Figure: Combined Tasks]
  • 10. Parallel Programming: Algorithms / 4. Mapping
    • Assign tasks to processors in order to:
      - Maximize processor utilization
      - Minimize inter-processor communication
    • One task or multiple tasks per processor?
    • Static or dynamic assignment?
    • Most applicable to message passing
      - Programmer can map tasks to threads
    [Figure: Final Program]
  • 11. Parallel Programming: Algorithms / What Is Not Parallel
    • Subprograms with "state" or with side effects
      - Pseudo-random number generators
      - File I/O routines
      - Output on screen
    • Loops with data dependencies
      - Variables written in one iteration and read in another
      - Quick test: reverse the loop iterations; if the result changes, the loop has a dependency
    • Common kinds of loop dependency:
      - Loop-carried: value carried from one iteration to the next
      - Induction variables: incremented on each trip through the loop
      - Reductions: summation; collapse an array to a single value
      - Recurrence: feed information forward
  • 12. Introduction to MPI: What is MPI?
    [Figure: Node 0, Node 1, ..., Node n, each a CPU with its own private memory]
  • 13. Introduction to MPI: The Distributed-Memory Model
    • Characteristics of distributed-memory machines:
      – No common address space
      – High-latency interconnection network
      – Explicit message exchange
  • 14. Introduction to MPI: Message Passing Interface (MPI)
    • Depending on the interconnection network, clusters expose different interfaces to the network, e.g.
      – Ethernet: UNIX sockets
      – InfiniBand: OFED, Verbs
    • MPI provides an abstraction over these interfaces:
      – Generic communication interface
      – Logical ranks (no physical addresses)
      – Supporting functions (e.g. parallel file I/O)
  • 15. Introduction to MPI: "Hello World" in Fortran
      program hello
      include 'mpif.h'
      integer mpierr, rank, procs
      call MPI_Init(mpierr)
      call MPI_Comm_size(MPI_COMM_WORLD, procs, mpierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, mpierr)
      write (*,*) 'Hello world from ', rank, 'of', procs
      call MPI_Finalize(mpierr)
      end program hello
  • 16. Introduction to MPI: Compilation and Execution
    • MPI implementations ship with compiler wrappers (the Intel MPI names are shown; other implementations typically use mpicc and mpif90):
      – mpiicc -o helloc hello.c
      – mpiifort -o hellof hello.f
    • The wrapper calls the native C/Fortran compiler and passes along the MPI specifics (e.g. include paths and libraries)
    • Wrappers usually accept the same compiler options as the underlying native compiler, e.g.
      – mpiicc -O2 -fast -o module.o -c module.c
  • 17. Introduction to MPI: Compilation and Execution
    • To run "Hello World" with 8 processes, use:
      – mpirun -np 8 helloc
    • mpirun provides portable, transparent application start-up:
      – connects to the cluster nodes for execution
      – launches the processes on the nodes
      – passes along the information each process needs to reach the others
    • When mpirun returns, execution has completed.
    • Note: mpirun and its options are implementation-specific (the MPI standard recommends mpiexec as the portable launcher).
  • 18. Introduction to MPI: Output of "Hello World"
        Hello world from 0 of 8
        Hello world from 1 of 8
        Hello world from 4 of 8
        Hello world from 6 of 8
        Hello world from 5 of 8
        Hello world from 7 of 8
        Hello world from 2 of 8
        Hello world from 3 of 8
    • There is no particular ordering of process execution! If ordering is needed, the programmer must enforce it through explicit communication.
  • 19. Introduction to MPI: Sending Messages (Blocking)
      subroutine master(array, length)
      include 'mpif.h'
      double precision array(*)
      integer length
      double precision sum, globalsum
      integer rank, procs, mpierr, size
      call MPI_Comm_size(MPI_COMM_WORLD, procs, mpierr)
      size = length / procs
      do rank = 1, procs-1
         call MPI_Send(size, 1, MPI_INTEGER, rank, 0,
     &                 MPI_COMM_WORLD, mpierr)
         call MPI_Send(array(rank*size+1:rank*size+size), size,
     &                 MPI_DOUBLE_PRECISION, rank, 1, MPI_COMM_WORLD, mpierr)
      enddo
    • This example is only correct if length is a multiple of procs.
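When length is not a multiple of procs, one common fix is to give the first length % procs ranks one extra element. A minimal C sketch of that split follows; block_counts is a hypothetical helper, and it is written so that counts and displs have the layout MPI_Scatterv expects for its sendcounts and displs arguments (element counts and 0-based offsets).

```c
#include <assert.h>

/* Split `length` elements over `procs` ranks, remainder included:
   rank r receives counts[r] elements starting at 0-based offset displs[r]. */
void block_counts(int length, int procs, int counts[], int displs[])
{
    assert(procs > 0 && length >= 0);
    int base = length / procs;    /* elements every rank gets */
    int extra = length % procs;   /* ranks 0..extra-1 get one more */
    int offset = 0;
    for (int r = 0; r < procs; r++) {
        counts[r] = base + (r < extra ? 1 : 0);
        displs[r] = offset;
        offset += counts[r];
    }
}
```

In the Fortran master routine above, rank r would then be sent counts(r) elements starting at array(displs(r)+1), since Fortran arrays are 1-based.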
  • 20. Introduction to MPI: MPI_Send
    • C: int MPI_Send(void* buf, int count, MPI_Datatype dtype, int dest, int tag, MPI_Comm comm)
    • Fortran: MPI_SEND(BUF, COUNT, DTYPE, DEST, TAG, COMM, IERR)
        <type> BUF(*)
        INTEGER COUNT, DTYPE, DEST, TAG, COMM, IERR
    • Blocking send:
      – returns once the send buffer can safely be reused; by then the message may have been delivered to the receiver or merely buffered by the MPI library
      – therefore it does not necessarily synchronize sender and receiver (MPI_Ssend does: it completes only after the matching receive has started)
  • 21. Introduction to MPI: MPI_Send arguments
      buf     Pointer to the message data (e.g. pointer to the first element of an array)
      count   Length of the message in elements
      dtype   Data type of the message content (size of data type × count = message size)
      dest    Rank of the destination process
      tag     "Type" of the message (an integer label the receiver can match on)
      comm    Handle to the communicator (communication group)
      ierr    Fortran only: OUT argument for the error code; in C/C++ the error code is the return value
  • 22. Introduction to MPI: MPI Data Types for C
    • MPI provides predefined data types that must be specified when passing messages.
      MPI data type         C data type
      MPI_BYTE              (none)
      MPI_CHAR              signed char
      MPI_DOUBLE            double
      MPI_FLOAT             float
      MPI_INT               int
      MPI_LONG              long
      MPI_LONG_DOUBLE       long double
      MPI_PACKED            (none)
      MPI_SHORT             short
      MPI_UNSIGNED_SHORT    unsigned short
      MPI_UNSIGNED          unsigned int
      MPI_UNSIGNED_LONG     unsigned long
      MPI_UNSIGNED_CHAR     unsigned char
  • 23. Introduction to MPI: Communication Wildcards
    • MPI defines a set of wildcards and special values to be used with communication primitives:
      MPI_ANY_SOURCE   Matches any source rank when receiving a message with MPI_Recv (the message status contains the actual sender)
      MPI_ANY_TAG      Matches any message tag when receiving a message (the message status contains the actual tag)
      MPI_PROC_NULL    Special value denoting a non-existent process rank (sends to and receives from this rank complete immediately without transferring any data)
  • 24. Introduction to MPI: Blocking Communication
    • MPI_Send and MPI_Recv are blocking operations: each call returns only when its buffer may be reused (MPI_Send) or holds the received message (MPI_Recv).
    [Figure: timeline of Process A calling MPI_Send and Process B calling MPI_Recv, showing computation and communication phases]
  • 25. Introduction to MPI: Non-blocking Communication
    • MPI_Isend and MPI_Irecv are non-blocking operations: they start the transfer and return immediately, so computation can overlap communication. MPI_Wait completes the operation; only then may the buffer be reused (send) or read (receive).
    [Figure: timeline of Process A (MPI_Isend, computation, MPI_Wait) and Process B (MPI_Irecv, computation, MPI_Wait), with computation overlapping communication]
  • 26. Introduction to MPI: 'Collectives', e.g. MPI_Reduce
    • C: int MPI_Reduce(void* sendbuf, void* recvbuf, int count, MPI_Datatype dtype, MPI_Op op, int root, MPI_Comm comm)
    • Fortran: MPI_REDUCE(SENDBUF, RECVBUF, COUNT, DTYPE, OP, ROOT, COMM, IERR)
        <type> SENDBUF(*), RECVBUF(*)
        INTEGER COUNT, DTYPE, OP, ROOT, COMM, IERR
    • Global operation that combines the data contributed by all processes into a single result at the root process.
      – All processes of the communicator have to call the matching MPI_Reduce.
      – Otherwise deadlocks and undefined behavior may occur.
  • 27. Introduction to MPI: MPI_Reduce Operators
      MPI_MAX               maximum
      MPI_MIN               minimum
      MPI_SUM               sum
      MPI_PROD              product
      MPI_LAND / MPI_BAND   logical and / bit-wise and
      MPI_LOR / MPI_BOR     logical or / bit-wise or
      MPI_LXOR / MPI_BXOR   logical exclusive or / bit-wise exclusive or
      MPI_MAXLOC            maximum value and its location
      MPI_MINLOC            minimum value and its location
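MPI_MAXLOC works on (value, index) pairs: the larger value wins, and on a tie the smaller index wins. The serial C sketch below only illustrates that combining rule; val_idx, maxloc_combine and maxloc_fold are hypothetical names, with val_idx laid out like the struct that MPI_DOUBLE_INT describes.

```c
#include <stddef.h>

/* A (value, index) pair, e.g. one per rank with the rank number as index. */
typedef struct { double value; int index; } val_idx;

/* MAXLOC rule: larger value wins; equal values keep the smaller index. */
static val_idx maxloc_combine(val_idx a, val_idx b)
{
    if (a.value > b.value) return a;
    if (a.value < b.value) return b;
    return a.index <= b.index ? a : b;
}

/* Serial fold over all contributions (n must be at least 1). */
val_idx maxloc_fold(const val_idx *pairs, size_t n)
{
    val_idx acc = pairs[0];
    for (size_t i = 1; i < n; i++)
        acc = maxloc_combine(acc, pairs[i]);
    return acc;
}
```

Because the rule is associative and commutative, an MPI library is free to apply it in any order (for instance along a tree of processes) and still produce the same pair at the root.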
  • 28. Introduction to MPI: MPI_Barrier
    • C: int MPI_Barrier(MPI_Comm comm)
    • Fortran: MPI_BARRIER(COMM, IERROR)
        INTEGER COMM, IERROR
    • Global operation that synchronizes all processes of the communicator: no process leaves the barrier until all of them have entered it.
      – All processes have to call MPI_Barrier.
      – Otherwise deadlocks and undefined behavior may occur.
  • 29. Introduction to MPI: Stencil Computation Example
    • Some algorithms (e.g. Jacobi, Gauss-Seidel) process data with a stencil:
        grid(i,j) = 0.25 * (grid(i+1,j) + grid(i-1,j) + grid(i,j+1) + grid(i,j-1))
    • Data access pattern: each point (i,j) reads its four neighbors (i-1,j), (i+1,j), (i,j-1) and (i,j+1)
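Updated in place, as in the formula above, each point already sees some new neighbor values (Gauss-Seidel style); a Jacobi sweep instead reads only the previous iterate and writes a second array. A minimal serial C sketch of one Jacobi sweep, assuming a row-major nx-by-ny grid with fixed boundary values (jacobi_sweep is a hypothetical name):

```c
#include <stddef.h>

/* One Jacobi sweep: interior points become the average of their four
   neighbors in `old`; boundary points are copied unchanged. Reading only
   from `old` makes every interior update independent of the others. */
void jacobi_sweep(size_t nx, size_t ny, const double *old, double *new_grid)
{
    for (size_t i = 0; i < nx; i++)
        for (size_t j = 0; j < ny; j++) {
            size_t k = i * ny + j;                 /* row-major index of (i,j) */
            if (i == 0 || j == 0 || i == nx - 1 || j == ny - 1)
                new_grid[k] = old[k];              /* fixed boundary value */
            else
                new_grid[k] = 0.25 * (old[k + ny] + old[k - ny]   /* (i+1,j), (i-1,j) */
                                    + old[k + 1] + old[k - 1]);   /* (i,j+1), (i,j-1) */
        }
}
```

In a distributed MPI version, each process would typically own a block of rows plus one "ghost" row on each side and exchange those boundary rows with its neighbors before every sweep; that exchange is not shown here.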
  • 30. Introduction to MPI: MPI Features Not Covered
    • One-sided communication
      – MPI_Put, MPI_Get
      – Uses Remote Memory Access (RMA)
      – Separates communication from synchronization
    • User-defined datatypes, strided messages
    • Dynamic process spawning: MPI_Comm_spawn
    • Collective communication across inter-communicators (connecting two disjoint groups of processes)
    • Parallel I/O
    • MPI 3.0 (released Sept 21, 2012)
  • 31. Introduction to OpenMP: What Is OpenMP?
    • Portable, shared-memory threading API
      – Fortran, C, and C++
      – Multi-vendor support for both Linux and Windows
    • Standardizes task- and loop-level parallelism
    • Supports coarse-grained parallelism
    • Combines serial and parallel code in a single source
    • Standardizes ~20 years of compiler-directed threading experience
    • http://www.openmp.org
    • Current spec is OpenMP 4.0, July 31, 2013 (combined C/C++ and Fortran)
  • 32. Introduction to OpenMP: OpenMP Programming Model
    • Fork-join parallelism:
      – The master thread spawns a team of threads as needed
      – Parallelism is added incrementally: that is, the sequential program evolves into a parallel program
    [Figure: master thread forking a team of threads at each parallel region and joining again afterwards]
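A minimal C sketch of one fork-join cycle: the master thread forks a team at `omp parallel`, every team member runs the block once, and the team joins at the implicit barrier. The function name is hypothetical; the `_OPENMP` guard lets the same source compile serially, where the region simply runs on one thread.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Returns how many threads executed the parallel region
   (1 when compiled without OpenMP support). */
int fork_join_team_size(void)
{
    int members = 0;    /* incremented once by every team member */
    int reported = 1;   /* team size as reported by the runtime */

    #pragma omp parallel            /* fork: the master thread creates a team */
    {
        #pragma omp atomic
        members++;
#ifdef _OPENMP
        #pragma omp single
        reported = omp_get_num_threads();
#endif
    }                               /* join: implicit barrier, master continues alone */

    return members == reported ? members : -1;
}
```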
  • 33. Introduction to OpenMP: A Few Syntax Details to Get Started
    • Most of the constructs in OpenMP are compiler directives or pragmas
      – For C and C++, the pragmas take the form:
          #pragma omp construct [clause [clause]…]
      – For Fortran, the directives take one of the forms (C$OMP and *$OMP only in fixed-form source):
          C$OMP construct [clause [clause]…]
          !$OMP construct [clause [clause]…]
          *$OMP construct [clause [clause]…]
    • Header file or Fortran 90 module:
          #include "omp.h"
          use omp_lib
  • 34. Introduction to OpenMP: Worksharing
    • Worksharing is the general term used in OpenMP to describe the distribution of work across threads.
    • Three examples of worksharing in OpenMP are:
      – omp for construct
      – omp sections construct
      – omp task construct
    • These constructs automatically divide work among threads.
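The `omp sections` construct hands each independent block to one thread of the team. A minimal C sketch (min_and_max is a hypothetical name) with two sections, each writing only its own shared result variable:

```c
/* Minimum and maximum of a[0..n-1] (n >= 1), computed in two sections
   that may run concurrently on different threads. */
void min_and_max(const int *a, int n, int *lo, int *hi)
{
    int mn = a[0], mx = a[0];   /* shared; each section writes only one of them */

    #pragma omp parallel sections
    {
        #pragma omp section
        for (int i = 1; i < n; i++)
            if (a[i] < mn) mn = a[i];

        #pragma omp section
        for (int i = 1; i < n; i++)
            if (a[i] > mx) mx = a[i];
    }                           /* implicit barrier: both sections are done */

    *lo = mn;
    *hi = mx;
}
```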
  • 35. Introduction to OpenMP: 'omp for' Construct
    • Threads are assigned an independent set of iterations
    • Threads must wait at the end of the work-sharing construct (implicit barrier)
        // assume N=12; a, b and c have at least N+1 elements
        #pragma omp parallel
        #pragma omp for
        for (i = 1; i < N+1; i++)
            c[i] = a[i] + b[i];
    [Figure: iterations i = 1 to 12 divided among the threads, followed by the implicit barrier]
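The loop above can be packaged as a complete, 0-based C function; this is a sketch with a hypothetical name, in which the loop variable declared in the `for` statement is automatically private to each thread.

```c
/* c[i] = a[i] + b[i] for i = 0..n-1, iterations divided among the team. */
void vector_add(const double *a, const double *b, double *c, int n)
{
    #pragma omp parallel
    {
        #pragma omp for             /* each thread gets a subset of the iterations */
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }                               /* implicit barriers at the end of for and parallel */
}
```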
  • 36. Introduction to OpenMP: New Additions to OpenMP
    • Tasks: main change for OpenMP 3.0
      – Allow parallelization of irregular problems: unbounded loops, recursive algorithms, producer/consumer
    • Device constructs: main change for OpenMP 4.0
      – Allow describing regions of code where data and/or computation should be moved to another computing device.
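A minimal sketch of the OpenMP 4.0 device constructs, assuming a compiler with OpenMP 4.0 support (saxpy_target is a hypothetical name). The `target` construct moves the loop to a device, and the `map` clauses say which arrays travel to the device and back; if no device is available, the region executes on the host.

```c
/* y = a*x + y, described for execution on an attached device. */
void saxpy_target(int n, float a, const float *x, float *y)
{
    #pragma omp target map(to: x[0:n]) map(tofrom: y[0:n])  /* copy x in, y in and out */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```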
  • 37. Introduction to OpenMP: What are tasks?
    • Tasks are independent units of work
    • Threads are assigned to perform the work of each task
      – Tasks may be deferred
      – Tasks may be executed immediately
      – The runtime system decides which of the above
    • Tasks are composed of:
      – code to execute
      – data environment
      – internal control variables (ICV)
    [Figure: serial versus parallel execution of tasks]
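Recursive algorithms are one of the irregular problems tasks were added for. The classic illustration is recursive Fibonacci with one task per recursive call, sketched here in C (function names hypothetical). `taskwait` waits for the two child tasks, and `shared(x)` lets a child write its result into the parent's variable.

```c
/* One task per recursive call; deliberately fine-grained, so production
   code usually switches to plain recursion below some cutoff. */
static long fib_task(int n)
{
    if (n < 2) return n;
    long x, y;
    #pragma omp task shared(x)
    x = fib_task(n - 1);
    #pragma omp task shared(y)
    y = fib_task(n - 2);
    #pragma omp taskwait        /* wait for both child tasks */
    return x + y;
}

long fib(int n)
{
    long result = 0;
    #pragma omp parallel
    #pragma omp single          /* one thread starts the recursion; the team runs the tasks */
    result = fib_task(n);
    return result;
}
```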
  • 38. Introduction to OpenMP: Simple Task Example
        #pragma omp parallel              // assume 8 threads: a pool of 8 threads is created here
        {
          #pragma omp single private(p)   // one thread gets to execute the while loop
          {
            …
            while (p) {
              #pragma omp task            // the single "while loop" thread creates a task
              {                           // for each instance of processwork()
                processwork(p);
              }
              p = p->next;
            }
          }
        }
  • 39. Introduction to OpenMP: Task Construct, Explicit Task View
    • A team of threads is created at the omp parallel construct
    • A single thread is chosen to execute the while loop; let's call this thread "L"
    • Thread L operates the while loop, creates tasks, and fetches next pointers
    • Each time L crosses the omp task construct, it generates a new task, which is executed by some thread of the team (possibly L itself, possibly later)
    • Each task runs on one thread, but one thread may execute many tasks
    • All tasks are complete at the implicit barrier at the end of the single construct
        #pragma omp parallel
        {
          #pragma omp single
          { // block 1
            node * p = head;
            while (p) { // block 2
              #pragma omp task
              process(p);
              p = p->next; // block 3
            }
          }
        }
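A self-contained C version of the list traversal, assuming a hypothetical node type and a hypothetical process() that squares each node's value. The explicit `firstprivate(p)` makes visible what the default already does: each task captures the pointer value current at its creation.

```c
#include <stddef.h>

/* Hypothetical list node and per-node work. */
typedef struct node { int value; int result; struct node *next; } node;

static void process(node *p) { p->result = p->value * p->value; }

void process_list(node *head)
{
    #pragma omp parallel
    {
        #pragma omp single              /* thread "L" walks the list ... */
        {
            for (node *p = head; p != NULL; p = p->next) {
                #pragma omp task firstprivate(p)   /* ... creating one task per node */
                process(p);
            }
        }   /* implicit barrier at the end of single: all tasks have completed */
    }
}
```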
  • 40. Introduction to OpenMP: OpenMP* Reduction Clause
    • reduction (op : list)
    • The variables in "list" must be shared in the enclosing parallel region
    • Inside a parallel or work-sharing construct:
      – A PRIVATE copy of each list variable is created and initialized depending on the "op" (e.g. 0 for +, 1 for *)
      – These copies are updated locally by the threads
      – At the end of the construct, the local copies are combined through "op" into a single value, which is then combined with the value of the original SHARED variable
  • 41. Introduction to OpenMP: Reduction Example
        #pragma omp parallel for reduction(+:sum)
        for (i = 0; i < N; i++) {
          sum += a[i] * b[i];
        }
    • Each thread has a local copy of sum
    • All local copies of sum are added together and stored in the "global" (shared) variable
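The same reduction as a complete C function (dot is a hypothetical name). Because the private partial sums are added in an order that depends on the number of threads, floating-point results may differ in the last bits between runs with different thread counts.

```c
/* Dot product of a[0..n-1] and b[0..n-1] using a sum reduction. */
double dot(const double *a, const double *b, int n)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)   /* private sum per thread, starts at 0 */
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;                                 /* partial sums combined at loop end */
}
```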
  • 42. Why Hybrid Programming? OpenMP/MPI
    [Figure: runtime in seconds (log scale, 10 to 5120) versus number of nodes (1 to 128) for the configurations 1 PPN; 1 PPN / 2, 4, 8 TPP; 2 PPN; 2 PPN / 2, 4 TPP; 4 PPN; 4 PPN / 2 TPP; 8 PPN (PPN = processes per node, TPP = threads per process). Annotation: 53% improvement over MPI. Application: simulation of free-surface flows, a finite element CFD solver written in Fortran and C. Figure kindly provided by the HPC group of the Center of Computing and Communication, RWTH Aachen, Germany.]
  • 43. The Good, the Bad, and the Ugly
    • The Good: OpenMP and MPI blend well with each other if programmers respect certain rules.
    • The Bad: programmers need to be aware of the issues of hybrid programming, e.g. using thread-safe libraries and an MPI library initialized with the required thread-support level (MPI_Init_thread).
    • The Ugly: what is the best setting for PPN and TPP on a given machine?
    • MPI and OpenMP hybrid programs can greatly improve the performance of parallel codes!