With the diversity of platforms, it is impossible for MPI libraries to automatically provide the best performance for all existing applications. In this session, we demonstrate that the Intel® MPI Library is not a black box and contains several features that allow users to enhance MPI applications. From basic features (process mapping, collective tuning) to advanced ones (unreliable datagram, kernel-assisted approaches), this session covers a large spectrum of possibilities offered by the Intel MPI Library to improve the performance of parallel applications on high-performance computing (HPC) systems.
This session introduces tuning flags and explains features available in the Intel MPI Library, highlighting results obtained on the Stampede* cluster. It is designed to help beginner and intermediate Intel MPI Library users better understand all of the library's capabilities.
3. OFIWG: develop interfaces aligned with application needs
Scalable, implementation-agnostic, application-centric, open source: libfabric
§ Software interfaces aligned with application requirements
• Careful analysis of requirements
§ Expand the open source community
• Inclusive development effort
• App and HW developers
§ Good impedance match with multiple fabric hardware
• InfiniBand*, iWarp, RoCE, Ethernet, UDP offload, Intel®, Cray*, IBM*, others
§ Optimized SW path to HW
• Minimize cache/memory footprint
• Reduce instruction count
• Minimize memory accesses
* Other names and brands may be claimed as the property of others
4. OFI APPLICATION REQUIREMENTS
"Give us a high-level interface!" vs. "Give us a low-level interface!" (MPI developers)
OFI strives to meet both requirements
5. OFI SOFTWARE DEVELOPMENT STRATEGIES: One Size Does Not Fit All
Application / OFI / Provider stacks over Fabric Services, combined in different ways:
• App uses OFI features
• App optimizes based on supported features
• Provider optimizes for OFI features
• Provider supports low-level features only
• Common optimization for all apps/providers
6. OFI DEVELOPMENT STATUS
Application / libfabric / Provider stack over Fabric Services:
• Many apps use OFI features; few apps optimize based on supported features (see the sketch below)
• Providers either optimize for OFI features or support low-level features only (provider's choice)
• Common optimization for all apps/providers (in libfabric)
• The mismatch between the two sides is the OFI-provider gap
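The "app optimizes based on supported features" path typically starts with fi_getinfo(): the application describes what it would like in a hints structure and then inspects what the matching providers actually report. Below is a minimal, hedged sketch of that pattern using the public libfabric API; the specific capability choices (preferring FI_TAGGED and falling back to plain FI_MSG) are illustrative assumptions, not something stated on the slides.

```c
/* Hedged sketch: ask libfabric what is available and adapt.
 * Build with: cc probe.c -lfabric  (assumes libfabric headers/library are installed) */
#include <stdio.h>
#include <rdma/fabric.h>

int main(void)
{
    struct fi_info *hints = fi_allocinfo();
    struct fi_info *info = NULL;

    if (!hints)
        return 1;

    /* Describe what the app would like: reliable, connectionless endpoints
     * with tagged messaging (illustrative choice, not from the slides). */
    hints->ep_attr->type = FI_EP_RDM;
    hints->caps = FI_TAGGED;

    int ret = fi_getinfo(FI_VERSION(1, 1), NULL, NULL, 0, hints, &info);
    if (ret) {
        /* No provider offers tagged messaging: fall back to plain messages
         * and let the application do its own matching. */
        hints->caps = FI_MSG;
        ret = fi_getinfo(FI_VERSION(1, 1), NULL, NULL, 0, hints, &info);
    }

    if (ret == 0) {
        for (struct fi_info *cur = info; cur; cur = cur->next)
            printf("provider %s offers caps 0x%llx\n",
                   cur->fabric_attr->prov_name,
                   (unsigned long long)cur->caps);
        fi_freeinfo(info);
    } else {
        fprintf(stderr, "fi_getinfo: %s\n", fi_strerror(-ret));
    }

    fi_freeinfo(hints);
    return ret ? 1 : 0;
}
```

If neither query succeeds, the application can still choose a completely different transport, which is exactly the "few apps" branch in the diagram above.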
7. OFI LIBFABRIC COMMUNITY
libfabric-enabled middleware: Intel® MPI Library, MPICH Netmod/CH4, Open MPI MTL/BTL, Open MPI SHMEM, Sandia SHMEM, GASNet, Clang UPC, rsocket, ES-API
libfabric services (see the object-level sketch below):
• Control Services – Discovery (fi_info)
• Communication Services – Connection Management, Address Vectors
• Completion Services – Event Queues, Event Counters
• Data Transfer Services – Message Queue, Tag Matching, RMA, Atomics
Providers (supported and experimental): Sockets (TCP, UDP), Verbs, Cisco usNIC, Intel® OPA PSM, Cray GNI, Mellanox MXM, IBM Blue Gene, A3Cube RONNIE
Because of the OFI-provider gap, not all apps work with all providers
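The service groups listed above map onto a small set of libfabric objects: discovery returns an fi_info, which is then used to open a fabric, a domain, completion queues, address vectors, and endpoints. The sketch below, which assumes a provider offering reliable-datagram (FI_EP_RDM) endpoints is installed, only shows how these objects relate; a real application would add address exchange, data transfer calls, and proper error handling.

```c
/* Hedged sketch of the libfabric object hierarchy named on this slide:
 * fabric -> domain -> (CQ, AV) -> endpoint. Error handling is abbreviated. */
#include <stdlib.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>

#define CHK(x) do { if ((x)) exit(1); } while (0)

int main(void)
{
    struct fi_info *hints = fi_allocinfo(), *info = NULL;
    struct fid_fabric *fabric;
    struct fid_domain *domain;
    struct fid_cq *cq;
    struct fid_av *av;
    struct fid_ep *ep;
    struct fi_cq_attr cq_attr = { .format = FI_CQ_FORMAT_CONTEXT };
    struct fi_av_attr av_attr = { .type = FI_AV_MAP };

    hints->ep_attr->type = FI_EP_RDM;            /* connectionless, reliable     */
    hints->caps = FI_MSG;                        /* plain send/recv              */
    CHK(fi_getinfo(FI_VERSION(1, 1), NULL, NULL, 0, hints, &info));

    CHK(fi_fabric(info->fabric_attr, &fabric, NULL));    /* control services    */
    CHK(fi_domain(fabric, info, &domain, NULL));
    CHK(fi_cq_open(domain, &cq_attr, &cq, NULL));         /* completion services */
    CHK(fi_av_open(domain, &av_attr, &av, NULL));         /* communication svcs  */
    CHK(fi_endpoint(domain, info, &ep, NULL));            /* data transfer       */

    CHK(fi_ep_bind(ep, &cq->fid, FI_SEND | FI_RECV));
    CHK(fi_ep_bind(ep, &av->fid, 0));
    CHK(fi_enable(ep));

    /* ... fi_av_insert(), fi_send()/fi_recv() or fi_tsend()/fi_trecv() here ... */

    fi_close(&ep->fid);
    fi_close(&av->fid);
    fi_close(&cq->fid);
    fi_close(&domain->fid);
    fi_close(&fabric->fid);
    fi_freeinfo(info);
    fi_freeinfo(hints);
    return 0;
}
```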
8. LIBFABRIC SCALABILITY: Blue Gene/Q
Developed to evaluate the Aurora software stack at scale and to assist applications in the transition from Mira to Aurora
Native provider implementation that directly uses the Blue Gene/Q hardware and network interfaces for communication
Photo courtesy of Argonne* National Laboratory, CC BY 2.0, https://commons.wikimedia.org/w/index.php?curid=24653857
9. LIBFABRIC SCALABILITY: Blue Gene/Q
Completely subjective software stack comparison, PAMI vs. libfabric performance on 32 nodes of the ALCF Vesta machine
§ IBM* MPICH / PAMI
• IBM XL C compiler for BG, v12.1
• Optimized for single-threaded latency
• …/comm/xl.legacy.ndebug/bin/mpicc
• v1r2m2
• Stack: MPICH / PAMID / PAMI / BG/Q hardware
§ MPICH / CH4 / libfabric
• gcc 4.4.7
• Global locks, inline, direct, etc.
• Provider not optimized for performance
• Stack: MPICH / CH4 / OFI (libfabric) / BG/Q provider / hardware
10. LIBFABRIC SCALABILITY: Blue Gene/Q
Charts (OSU* MPI Performance Tests v5.0): latency (us), bandwidth (MB/s), and message rate (msgs/s) vs. message size in bytes, IBM vs. OFI (a minimal ping-pong sketch follows after this slide)
MPI scale-out testing:
- cpi – 1M ranks
- ISx benchmark – 0.5M ranks
Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.
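The latency and bandwidth figures above come from ping-pong style microbenchmarks. For readers who want to reproduce the general shape of such a measurement (this is not the OSU suite itself), a minimal two-rank MPI ping-pong latency sketch follows; the message sizes and iteration count are arbitrary assumptions.

```c
/* Minimal ping-pong latency sketch (not the OSU benchmark); run with 2 ranks:
 *   mpicc pingpong.c -o pingpong && mpirun -n 2 ./pingpong */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    const int iters = 1000;
    int rank;
    char *buf = malloc(4096);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int bytes = 1; bytes <= 4096; bytes *= 8) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();
        if (rank == 0)   /* one-way latency = round trip / 2 */
            printf("%5d bytes: %.2f us\n", bytes,
                   (t1 - t0) * 1e6 / (2.0 * iters));
    }

    MPI_Finalize();
    free(buf);
    return 0;
}
```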
11. LIBFABRIC SCALABILITY: SHMEM, Cray* XC40
Evaluate libfabric SHMEM performance on a high-performance interconnect
Provider implementation that uses the Cray* uGNI hardware and network interface for communication
Computing Sciences, Lawrence Berkeley National Laboratory
12. LIBFABRIC SCALABILITY: SHMEM, Cray* XC40
Comparison on 1630 nodes of a Cray* XC40 (Cori) – see the put sketch below
§ Cray* SHMEM
• Cray* Aries, Dragonfly* topology
• CLE (Cray* Linux*), SLURM*
• DMAPP
• Designed for PGAS
• Optimized for small messages
• Stack: Cray SHMEM / DMAPP / Aries interconnect
§ Sandia* OpenSHMEM / libfabric
• uGNI
• Designed for MPI and PGAS
• Optimized for large messages
• Stack: OpenSHMEM / OFI (libfabric) / uGNI / Aries interconnect
§ https://www.nersc.gov/users/computational-systems/cori/configuration
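The comparison above exercises one-sided put/get operations. As a point of reference (not the benchmark used on the slides), a minimal blocking put using the standard OpenSHMEM API looks like this:

```c
/* Minimal OpenSHMEM blocking put sketch (illustrative only):
 *   oshcc put.c -o put && oshrun -n 2 ./put */
#include <stdio.h>
#include <shmem.h>

int main(void)
{
    shmem_init();
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    /* Symmetric heap allocation: the same remote address is valid on every PE. */
    long *dst = shmem_malloc(sizeof(long));
    *dst = -1;
    shmem_barrier_all();

    if (me == 0) {
        long value = 42;
        /* Blocking put into the target PE's symmetric memory. */
        shmem_long_put(dst, &value, 1, 1 % npes);
    }
    shmem_barrier_all();

    if (me == 1 % npes)
        printf("PE %d received %ld\n", me, *dst);

    shmem_free(dst);
    shmem_finalize();
    return 0;
}
```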
13. LIBFABRIC SCALABILITY: SHMEM, Cray XC40
Blocking Get/Put bandwidth:
• Put – up to 61% improvement
• Get – within 2%
14. LIBFABRIC SCALABILITY: SHMEM, Cray XC40
• XPMEM – improved scalability
• GUPS scaling – slight improvement (lower is better)
• NAS ISx (Integer Sort) – weak scaling
15. ADDRESSING THE OFI-PROVIDER GAP
Libfabric framework: assist in provider development
• libfabric API
• Components: templates, lists, rbtree, hash table, free pool, ring buffer, stack, …
• Base class implementations: fabric, domain, EQ, wait sets, AV, CQ, …
• SHM primitives
• Provider services: logging, environment variables
Utility provider layered over a core provider:
• Interface 'extensions' – for consistency
• Enhance the core provider
17. MOVING FORWARD
Analyze requests to expand the OFI community:
§ Beyond HPC: Enterprise, Cloud, Storage (NVM); stronger engagement with these communities
§ Beyond Linux*: Sockets (TCP/UDP), NetworkDirect
18. TARGET SCHEDULE
§ Driven by implementation feedback
§ Improve error handling, flow control
§ Better support for non-traditional fabrics
§ Optimize completion handling
§ Address deferred features
Roadmap across 2016 Q2–Q4 and 2017 Q2–Q4: RDM-over-DGRAM utility, RDM-over-MSG utility, shared memory, new core providers, ABI 1.1 (see the sketch below)
Utility provider work is ongoing; traditional and non-traditional RDMA providers
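The RDM-over-DGRAM and RDM-over-MSG items refer to utility providers that layer reliable-datagram semantics over simpler core endpoints. From the application side nothing changes: the app keeps requesting FI_EP_RDM and libfabric decides whether a native or a layered provider satisfies it. The sketch below is a hedged illustration; the FI_PROVIDER environment variable and the prov_name filter are standard libfabric mechanisms, but the provider names mentioned ("verbs", "sockets") are only examples of what may be installed, and the RESTRICT_TO_VERBS knob is purely illustrative.

```c
/* Hedged sketch: request RDM endpoints and report which provider (native or
 * utility-layered) would satisfy the request. Provider availability depends
 * on the system; "sockets"/"verbs" are only examples. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <rdma/fabric.h>

int main(void)
{
    struct fi_info *hints = fi_allocinfo(), *info = NULL;

    hints->ep_attr->type = FI_EP_RDM;   /* reliable datagram semantics */
    hints->caps = FI_MSG;

    /* Optionally restrict the core provider, e.g. via hints or the
     * FI_PROVIDER environment variable (FI_PROVIDER=sockets ./a.out). */
    if (getenv("RESTRICT_TO_VERBS"))                 /* illustrative knob only */
        hints->fabric_attr->prov_name = strdup("verbs");

    if (fi_getinfo(FI_VERSION(1, 1), NULL, NULL, 0, hints, &info) == 0) {
        for (struct fi_info *cur = info; cur; cur = cur->next)
            printf("RDM endpoints available from provider: %s\n",
                   cur->fabric_attr->prov_name);
        fi_freeinfo(info);
    } else {
        printf("No provider offers FI_EP_RDM on this system\n");
    }

    fi_freeinfo(hints);
    return 0;
}
```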
19. SUMMARY
§ OFIWG development model working well
§ Interest in OFI and libfabric is high
§ Growing community
§ Significant effort being made to simplify the lives of developers
• Applications and providers
OFI is so good