The document discusses several challenges in designing HPC runtimes for exascale systems, including energy awareness, accelerator support, and virtualization, and how the MVAPICH2 project addresses them. MVAPICH2 provides integrated support for GPUs and MICs, virtualization using SR-IOV and containers, and energy awareness. It also achieves high performance for GPU-aware MPI through features such as GPUDirect RDMA. Application-level evaluations with HOOMD-blue and COSMO show clear gains from MVAPICH2's GPU support.
Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

• Scalability for million to billion processors
• Collective communication
• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …)
• InfiniBand Network Analysis and Monitoring (INAM)
• Integrated Support for GPGPUs
  – CUDA-Aware MPI
  – GPUDirect RDMA (GDR) Support
  – CUDA-Aware Non-Blocking Collectives
  – Support for Managed Memory
  – Efficient Datatype Processing
  – Supporting Streaming Applications with GDR
  – Efficient Deep Learning with MVAPICH2-GDR
• Integrated Support for MICs
• Virtualization (SR-IOV and Containers)
• Energy-Awareness
• Best Practice: Set of Tunings for Common Applications
MPI + CUDA - Advanced

[Diagram: GPU and CPU connected over PCIe; NIC attached to the switch]

At Sender:
    /* Stage GPU data into a host buffer in pipelined chunks */
    for (j = 0; j < pipeline_len; j++)
        cudaMemcpyAsync(s_hostbuf + j * blksz, s_devbuf + j * blksz, …);
    /* Send each chunk as soon as its copy completes, overlapping
       copies of later chunks with sends of earlier ones */
    for (j = 0; j < pipeline_len; j++) {
        cudaError_t result = cudaErrorNotReady;
        while (result != cudaSuccess) {
            result = cudaStreamQuery(…);
            if (j > 0) MPI_Test(…);    /* progress earlier sends */
        }
        MPI_Isend(s_hostbuf + j * blksz, blksz, …);
    }
    MPI_Waitall(…);
<<Similar at receiver>>

• Pipelining at user level with non-blocking MPI and CUDA interfaces

Low Productivity and High Performance
GPU-Aware MPI Library: MVAPICH2-GPU

At Sender:
    MPI_Send(s_devbuf, size, …);
At Receiver:
    MPI_Recv(r_devbuf, size, …);
(data movement handled inside MVAPICH2)

• Standard MPI interfaces used for unified data movement
• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
• Overlaps data movement from GPU with RDMA transfers

High Performance and High Productivity
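For illustration, a minimal sketch of this usage model (the buffer size N and the use of float data are assumptions, not from the slides): two ranks exchange a GPU-resident buffer by passing device pointers straight to MPI, with no explicit cudaMemcpy(), assuming a CUDA-aware build such as MVAPICH2-GPU.

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        int rank;
        const size_t N = 1 << 20;   /* illustrative buffer size */
        float *devbuf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        cudaMalloc((void **)&devbuf, N * sizeof(float));

        if (rank == 0)        /* device pointer passed directly to MPI */
            MPI_Send(devbuf, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(devbuf, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

        cudaFree(devbuf);
        MPI_Finalize();
        return 0;
    }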
GPU-Direct RDMA (GDR) with CUDA

• OFED with support for GPUDirect RDMA is developed by NVIDIA and Mellanox
• OSU has a design of MVAPICH2 using GPUDirect RDMA
  – Hybrid design using GPUDirect RDMA
    • GPUDirect RDMA and host-based pipelining
    • Alleviates P2P bandwidth bottlenecks on SandyBridge and IvyBridge
  – Support for communication using multi-rail
  – Support for Mellanox Connect-IB and ConnectX VPI adapters
  – Support for RoCE with Mellanox ConnectX VPI adapters

[Diagram: IB adapter and GPU attached through the chipset to the CPU, with system memory and GPU memory; SNB E5-2670 / IVB E5-2680V2]
P2P bandwidth through the chipset: SNB E5-2670 - write 5.2 GB/s, read < 1.0 GB/s; IVB E5-2680V2 - write 6.4 GB/s, read 3.5 GB/s
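The hybrid design can be pictured as a size-based dispatch: small messages take the direct GDR path, while large messages fall back to host-staged pipelining to sidestep the weak P2P read bandwidth. A purely illustrative sketch follows; the threshold value and both helper functions are hypothetical stand-ins, not MVAPICH2 internals.

    #include <stddef.h>

    /* Hypothetical internal paths; prototypes only, for illustration */
    void send_via_gpudirect_rdma(const void *d_buf, size_t len);
    void send_via_host_pipeline(const void *d_buf, size_t len);

    enum { GDR_THRESHOLD = 32 * 1024 };          /* assumed cutoff */

    void hybrid_send(const void *d_buf, size_t len)
    {
        if (len <= GDR_THRESHOLD)
            send_via_gpudirect_rdma(d_buf, len); /* HCA reads GPU memory */
        else
            send_via_host_pipeline(d_buf, len);  /* staged cudaMemcpyAsync
                                                    chunks, avoiding the weak
                                                    P2P read path */
    }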
Application-Level Evaluation (HOOMD-blue)

• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
• HOOMD-blue version 1.0.5
• GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384

[Charts: average time steps per second (TPS) vs. number of processes (4-32), for 64K and 256K particles; MV2 vs. MV2+GDR; MV2+GDR delivers roughly 2X the TPS]
CUDA-Aware Non-Blocking Collectives

[Charts: medium/large-message overlap (%) vs. message size (4K-1M bytes) on 64 GPU nodes, for Ialltoall and Igather, each with 1 process/node and with 2 processes/node (1 process/GPU)]

Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
Available since MVAPICH2-GDR 2.2a

A. Venkatesh, K. Hamidouche, H. Subramoni, and D. K. Panda, Offloaded GPU Collectives Using CORE-Direct and CUDA Capabilities on IB Clusters, HiPC 2015.
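As a hedged illustration of what these overlap numbers measure (the function, buffer names, and the compute callback are assumptions; a CUDA-aware MPI with non-blocking collectives on GPU buffers, such as MVAPICH2-GDR 2.2a, is assumed):

    #include <mpi.h>

    /* Overlap an Ialltoall on device buffers with independent GPU work */
    void exchange_with_overlap(float *d_send, float *d_recv, int count,
                               MPI_Comm comm, void (*do_gpu_work)(void))
    {
        MPI_Request req;
        MPI_Ialltoall(d_send, count, MPI_FLOAT,
                      d_recv, count, MPI_FLOAT, comm, &req);
        do_gpu_work();                      /* compute proceeds while the
                                               collective makes progress */
        MPI_Wait(&req, MPI_STATUS_IGNORE);  /* overlap = fraction of the
                                               collective's time hidden
                                               behind the compute */
    }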
Communication Runtime with GPU Managed Memory

• With CUDA 6.0, NVIDIA introduced CUDA Managed (or Unified) Memory, which allows a common memory allocation for GPU or CPU through the cudaMallocManaged() call
• Significant productivity benefits due to abstraction of explicit allocation and cudaMemcpy()
• Extended MVAPICH2 to perform communication directly from managed buffers (available in MVAPICH2-GDR 2.2b)
• OSU Micro-benchmarks extended to evaluate the performance of point-to-point and collective communication using managed buffers
• Available in OMB 5.2

D. S. Banerjee, K. Hamidouche, and D. K. Panda, Designing High Performance Communication Runtime for GPU Managed Memory: Early Experiences, GPGPU-9 Workshop (held in conjunction with PPoPP 2016), Barcelona, Spain.

[Charts: latency (us) and bandwidth (MB/s) vs. message size (1 B-16 KB); host-to-host (H-H) vs. managed (MH-MH) latency, and device-to-device (D-D) vs. managed (MD-MD) bandwidth]
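A minimal sketch of the pattern this enables (the buffer size, the scale_kernel, and the function names are assumptions; MVAPICH2-GDR 2.2b or later is assumed to handle the managed pointer inside MPI_Send/MPI_Recv): the same managed pointer is touched by a GPU kernel and then handed to MPI with no explicit cudaMemcpy().

    #include <mpi.h>
    #include <cuda_runtime.h>

    __global__ void scale_kernel(float *buf, int n)   /* hypothetical */
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) buf[i] *= 2.0f;
    }

    void managed_exchange(int rank, int n)
    {
        float *buf;
        cudaMallocManaged(&buf, n * sizeof(float)); /* one pointer, valid
                                                       on CPU and GPU */
        if (rank == 0) {
            scale_kernel<<<(n + 255) / 256, 256>>>(buf, n);
            cudaDeviceSynchronize();
            MPI_Send(buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
        cudaFree(buf);
    }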
Application-Level Evaluation (HaloExchange - COSMO)

[Charts: normalized execution time vs. number of GPUs on the CSCS GPU cluster (16-96 GPUs) and the Wilkes GPU cluster (4-32 GPUs); Default vs. Callback-based vs. Event-based designs]

• 2X improvement on 32 GPU nodes
• 30% improvement on 96 GPU nodes (8 GPUs/node)

C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS '16.
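For context, a hedged sketch of the non-contiguous halo-exchange pattern such a benchmark exercises (the grid dimensions, neighbor ranks, and function name are assumptions; a CUDA-aware MPI is assumed to process the vector datatype straight from the device pointer):

    #include <mpi.h>

    /* Exchange one strided column of an nx-by-ny field, directly
       from GPU memory, using an MPI vector datatype */
    void halo_exchange_x(double *d_field, int nx, int ny,
                         int east, int west, MPI_Comm comm)
    {
        MPI_Datatype col;
        MPI_Type_vector(ny, 1, nx, MPI_DOUBLE, &col); /* ny elems,
                                                         stride nx */
        MPI_Type_commit(&col);
        MPI_Sendrecv(d_field + (nx - 1), 1, col, east, 0, /* send east edge */
                     d_field,            1, col, west, 0, /* recv west halo */
                     comm, MPI_STATUS_IGNORE);
        MPI_Type_free(&col);
    }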
SGL-based Design for Efficient Broadcast Operation on GPU Systems

• The current design is limited by expensive copies from/to GPUs
• Several alternative designs proposed to avoid the copy overhead
  – Loopback, GDRCOPY, and hybrid
  – High performance and scalability
  – Still uses PCIe resources for host-GPU copies
• Proposed SGL-based design
  – Combines IB MCAST and GPUDirect RDMA features
  – High performance and scalability for D-D broadcast
  – Direct code path between HCA and GPU
  – Frees PCIe resources
• 3X improvement in latency

A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, A High Performance Broadcast Design with Hardware Multicast and GPUDirect RDMA for Streaming Applications on InfiniBand Clusters, IEEE Int'l Conf. on High Performance Computing (HiPC '14).
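From the application side, the fast path is still reached through the standard collective. A minimal sketch of the streaming pattern (the producer function, counts, and parameter names are assumptions; the MCAST + GDR path is selected inside the library, not by the caller):

    #include <mpi.h>

    void produce_frame(float *d_buf, int count);  /* hypothetical producer */

    /* Streaming pattern: the root repeatedly broadcasts a GPU-resident
       frame; a CUDA-aware MPI keeps the broadcast device-to-device */
    void stream_frames(float *d_buf, int count, int nframes, int root,
                       int rank, MPI_Comm comm)
    {
        for (int frame = 0; frame < nframes; frame++) {
            if (rank == root)
                produce_frame(d_buf, count);
            MPI_Bcast(d_buf, count, MPI_FLOAT, root, comm);
        }
    }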
Can HPC and Virtualization be Combined?

• Virtualization has many benefits
  – Fault tolerance
  – Job migration
  – Compaction
• It has not been very popular in HPC due to the overhead associated with virtualization
• New SR-IOV (Single Root I/O Virtualization) support available with Mellanox InfiniBand adapters changes the field
• Enhanced MVAPICH2 support for SR-IOV
• MVAPICH2-Virt 2.1 (with and without OpenStack) is publicly available
• What about container support?

J. Zhang, X. Lu, J. Jose, R. Shi, and D. K. Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters?, Euro-Par '14.
J. Zhang, X. Lu, J. Jose, M. Li, R. Shi, and D. K. Panda, High Performance MPI Library over SR-IOV Enabled InfiniBand Clusters, HiPC '14.
J. Zhang, X. Lu, M. Arnold, and D. K. Panda, MVAPICH2 over OpenStack with SR-IOV: An Efficient Approach to Build HPC Clouds, CCGrid '15.
Overview of MVAPICH2-Virt with SR-IOV and IVSHMEM

• Redesign MVAPICH2 to make it virtual machine aware (see the sketch after the diagram)
  – SR-IOV shows near-native performance for inter-node point-to-point communication
  – IVSHMEM offers zero-copy access to data in the shared memory of co-resident VMs
  – Locality Detector: maintains the locality information of co-resident virtual machines
  – Communication Coordinator: selects the communication channel (SR-IOV, IVSHMEM) adaptively

[Diagram: two guests on one host, each running an MPI process in user space with a VF driver bound to a Virtual Function of the InfiniBand adapter (SR-IOV channel); co-resident guests also map a /dev/shm-backed IV-SHM region (IV-Shmem channel); the hypervisor manages the Physical Function through the PF driver]
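As a hedged illustration of the Locality Detector / Communication Coordinator pair described above (the type, function, and host-ID test are hypothetical, not MVAPICH2-Virt internals):

    typedef enum { CHANNEL_SR_IOV, CHANNEL_IVSHMEM } channel_t;

    /* Co-resident VMs (same physical host) get the zero-copy IVSHMEM
       path; all other peers go through the SR-IOV virtual function */
    channel_t select_channel(int my_host_id, int peer_host_id)
    {
        return (my_host_id == peer_host_id) ? CHANNEL_IVSHMEM
                                            : CHANNEL_SR_IOV;
    }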
MVAPICH2-Virt with SR-IOV and IVSHMEM over OpenStack

• OpenStack is one of the most popular open-source solutions to build clouds and manage virtual machines
• Deployment with OpenStack
  – Supporting SR-IOV configuration
  – Supporting IVSHMEM configuration
  – Virtual-machine-aware design of MVAPICH2 with SR-IOV
• An efficient approach to build HPC clouds with MVAPICH2-Virt and OpenStack

[Diagram: OpenStack services around a VM - Nova provisions the VM, Glance provides images (stored in Swift), Neutron provides the network, Cinder provides volumes (backed up in Swift), Keystone provides authentication, Ceilometer monitors, Horizon provides the UI, Heat orchestrates the cloud]
Application-Level Performance on Chameleon

[Charts: SPEC MPI2007 execution time (s) for milc, leslie3d, pop2, GAPgeofem, zeusmp2, and lu, and Graph500 execution time (ms) vs. problem size (Scale, Edgefactor); MV2-SR-IOV-Def vs. MV2-SR-IOV-Opt vs. MV2-Native]

• 32 VMs, 6 cores/VM
• Compared to Native, 2-5% overhead for Graph500 with 128 procs
• Compared to Native, 1-9.5% overhead for SPEC MPI2007 with 128 procs
NSF Chameleon Cloud: A Powerful and Flexible Experimental Instrument

• Large-scale instrument
  – Targeting Big Data, Big Compute, Big Instrument research
  – ~650 nodes (~14,500 cores), 5 PB disk over two sites, 2 sites connected with 100G network
• Reconfigurable instrument
  – Bare-metal reconfiguration, operated as single instrument, graduated approach for ease of use
• Connected instrument
  – Workload and Trace Archive
  – Partnerships with production clouds: CERN, OSDC, Rackspace, Google, and others
  – Partnerships with users
• Complementary instrument
  – Complementing GENI, Grid'5000, and other testbeds
• Sustainable instrument
  – Industry connections

http://www.chameleoncloud.org/
Containers Support: MVAPICH2 Intra-node Point-to-Point Performance on Chameleon

[Charts: latency (us) and bandwidth (MBps) vs. message size (1 B-64 KB); Container-Def vs. Container-Opt vs. Native; Container-Opt improves latency by up to 81% and bandwidth by up to 191%]

• Intra-node inter-container
• Compared to Container-Def, up to 81% and 191% improvement in latency and bandwidth
• Compared to Native, minor overhead in latency and bandwidth
Containers Support: Application-Level Performance on Chameleon

[Charts: Graph 500 execution time (ms) vs. problem size (Scale, Edgefactor) and NAS (MG.D, FT.D, EP.D, LU.D, CG.D) execution time (s); Container-Def vs. Container-Opt vs. Native]

• 64 containers across 16 nodes, pinning 4 cores per container
• Compared to Container-Def, up to 11% and 16% execution time reduction for NAS and Graph 500
• Compared to Native, less than 9% and 4% overhead for NAS and Graph 500
• Optimized container support will be available with the next release of MVAPICH2-Virt
Energy-Aware MVAPICH2 & OSU Energy Management Tool (OEMT)

• MVAPICH2-EA 2.1 (Energy-Aware)
  – A white-box approach
  – New energy-efficient communication protocols for point-to-point and collective operations
  – Intelligently applies the appropriate energy-saving techniques
  – Application-oblivious energy saving
• OEMT
  – A library utility to measure energy consumption of MPI applications
  – Works with all MPI runtimes
  – PRELOAD option for precompiled applications
  – Does not require root permission: a safe kernel module reads only a subset of MSRs
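As a hedged illustration of the kind of measurement such a utility performs (the MSR device path and register 0x611, MSR_PKG_ENERGY_STATUS, are standard Intel RAPL interfaces, but this is not OEMT's actual code; reading the device node normally needs root, which is exactly the restriction OEMT's kernel module works around):

    #include <stdint.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Read the package-energy counter (Intel RAPL, MSR 0x611) */
    uint64_t read_pkg_energy(void)
    {
        uint64_t val = 0;
        int fd = open("/dev/cpu/0/msr", O_RDONLY);
        if (fd >= 0) {
            pread(fd, &val, sizeof(val), 0x611);
            close(fd);
        }
        return val & 0xFFFFFFFFu;  /* 32-bit counter, in energy-status
                                      units defined by MSR 0x606 */
    }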
MVAPICH2-EA: Application-Oblivious Energy-Aware MPI (EAM)

• An energy-efficient runtime that provides energy savings without application knowledge
• Automatically and transparently uses the best energy lever
• Provides guarantees on maximum degradation, with 5-41% savings at <= 5% degradation
• Pessimistic MPI applies an energy-reduction lever to each MPI call (see the sketch below)

A. Venkatesh, A. Vishnu, K. Hamidouche, N. Tallent, D. K. Panda, D. Kerbyson, and A. Hoisie, A Case for Application-Oblivious Energy-Efficient MPI Runtime, Supercomputing '15, Nov 2015 [Best Student Paper Finalist].
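A minimal sketch of the pessimistic policy EAM is contrasted with (enter_low_power/exit_low_power are hypothetical stand-ins for an energy lever such as DVFS; the PMPI profiling interface used for the interposition is standard MPI):

    #include <mpi.h>

    void enter_low_power(void);   /* hypothetical energy lever */
    void exit_low_power(void);

    /* Pessimistic policy: apply the lever around every MPI call,
       regardless of how long the call will actually block */
    int MPI_Recv(void *buf, int n, MPI_Datatype t, int src, int tag,
                 MPI_Comm comm, MPI_Status *st)
    {
        enter_low_power();
        int rc = PMPI_Recv(buf, n, t, src, tag, comm, st);
        exit_low_power();
        return rc;
    }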
MPI-3 RMA Energy Savings with Proxy Applications

[Charts: Graph500 execution time (s) and energy usage (Joules) at 128, 256, and 512 processes; optimistic vs. pessimistic vs. EAM-RMA; up to 46% energy savings]

• MPI_Win_fence dominates application execution time in Graph500
• Between 128 and 512 processes, EAM-RMA yields between 31% and 46% savings with no degradation in execution time in comparison with the default optimistic MPI runtime
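A minimal sketch of the access pattern behind this result (the window, counts, and target are assumptions): the fences that bound each epoch are where ranks sit waiting, and where EAM-RMA harvests energy.

    /* Fence-bounded RMA epoch, as in Graph500's communication phase */
    MPI_Win_fence(0, win);                 /* open epoch */
    MPI_Put(sendbuf, n, MPI_LONG, target, disp, n, MPI_LONG, win);
    MPI_Win_fence(0, win);                 /* close epoch: ranks idle here,
                                              so EAM-RMA applies the energy
                                              lever during the wait */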
MPI-3 RMA Energy Savings with Proxy Applications

[Charts: SCF execution time (s) and energy usage (Joules) at 128, 256, and 512 processes; optimistic vs. pessimistic vs. EAM-RMA; up to 42% energy savings]

• The SCF (self-consistent field) calculation spends nearly 75% of its total time in the MPI_Win_unlock call
• With 256 and 512 processes, EAM-RMA yields 42% and 36% savings at 11% degradation (close to the permitted degradation ρ = 10%)
• 128 processes is an exception due to the interaction of two-sided and one-sided communication
• MPI-3 RMA energy-efficient support will be available in an upcoming MVAPICH2-EA release
Applications-Level Tuning: Compilation of Best Practices

• The MPI runtime has many parameters
• Tuning a set of parameters can help you extract higher performance
• Compiled a list of such contributions through the MVAPICH website
  – http://mvapich.cse.ohio-state.edu/best_practices/
• Initial list of applications
  – Amber
  – HOOMD-blue
  – HPCG
  – LULESH
  – MILC
  – MiniAMR
  – Neuron
  – SMG2000
• Soliciting additional contributions; send your results to mvapich-help at cse.ohio-state.edu and we will link them with credit to you.
MVAPICH2 - Plans for Exascale

• Performance and memory scalability toward 1M cores
• Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF, …)
• Support for task-based parallelism (UPC++)*
• Enhanced optimization for GPU support and accelerators
• Taking advantage of advanced features of Mellanox InfiniBand
  – On-Demand Paging (ODP)
  – Switch-IB2 SHArP
  – GID-based support
• Enhanced inter-node and intra-node communication schemes for upcoming architectures
  – OpenPower*
  – OmniPath-PSM2*
  – Knights Landing
• Extended topology-aware collectives
• Extended energy-aware designs and virtualization support
• Extended support for MPI Tools Interface (as in MPI 3.0)
• Extended checkpoint-restart and migration support with SCR
• Support for * features will be available in MVAPICH2-2.2 RC1
Looking into the Future ….

• Exascale systems will be constrained by
  – Power
  – Memory per core
  – Data-movement cost
  – Faults
• Programming models and runtimes for HPC need to be designed for
  – Scalability
  – Performance
  – Fault-resilience
  – Energy-awareness
  – Programmability
  – Productivity
• Highlighted some of the issues and challenges
• Need continuous innovation on all these fronts
Personnel Acknowledgments

Current Students
– A. Augustine (M.S.)
– A. Awan (Ph.D.)
– S. Chakraborthy (Ph.D.)
– C.-H. Chu (Ph.D.)
– N. Islam (Ph.D.)
– K. Kulkarni (M.S.)
– M. Li (Ph.D.)
– M. Rahman (Ph.D.)
– D. Shankar (Ph.D.)
– A. Venkatesh (Ph.D.)
– J. Zhang (Ph.D.)

Past Students
– P. Balaji (Ph.D.)
– S. Bhagvat (M.S.)
– A. Bhat (M.S.)
– D. Buntinas (Ph.D.)
– L. Chai (Ph.D.)
– B. Chandrasekharan (M.S.)
– N. Dandapanthula (M.S.)
– V. Dhanraj (M.S.)
– T. Gangadharappa (M.S.)
– K. Gopalakrishnan (M.S.)
– W. Huang (Ph.D.)
– W. Jiang (M.S.)
– J. Jose (Ph.D.)
– K. Kandalla (Ph.D.)
– S. Kini (M.S.)
– M. Koop (Ph.D.)
– S. Krishnamoorthy (M.S.)
– R. Kumar (M.S.)
– P. Lai (M.S.)
– J. Liu (Ph.D.)
– M. Luo (Ph.D.)
– A. Mamidala (Ph.D.)
– G. Marsh (M.S.)
– V. Meshram (M.S.)
– A. Moody (M.S.)
– S. Naravula (Ph.D.)
– R. Noronha (Ph.D.)
– X. Ouyang (Ph.D.)
– S. Pai (M.S.)
– S. Potluri (Ph.D.)
– R. Rajachandrasekar (Ph.D.)
– G. Santhanaraman (Ph.D.)
– A. Singh (Ph.D.)
– J. Sridhar (M.S.)
– H. Subramoni (Ph.D.)
– S. Sur (Ph.D.)
– K. Vaidyanathan (Ph.D.)
– A. Vishnu (Ph.D.)
– J. Wu (Ph.D.)
– W. Yu (Ph.D.)

Past Research Scientist
– S. Sur

Current Research Scientists
– H. Subramoni
– X. Lu

Current Senior Research Associate
– K. Hamidouche

Current Post-Docs
– J. Lin
– D. Banerjee

Past Post-Docs
– H. Wang
– X. Besseron
– H.-W. Jin
– M. Luo
– E. Mancini
– S. Marcarelli
– J. Vienne

Current Programmer
– J. Perkins

Past Programmers
– D. Bureddy

Current Research Specialist
– M. Arnold