Optimizing GPU to GPU Communication on Cray XK7
Jeff Larkin
What Amdahl says about GPU communication
If you make your GPU computation infinitely fast, performance
will be bound by your communication.
GPU-to-GPU communication has:
Higher latency (an additional hop over PCIe)
Lower bandwidth (limited by the lowest-bandwidth link)
G2G communication cannot be an afterthought when running at
scale.
How do GPUs communicate?
MPI. But what really happens here?
[Diagram: GPU0 attached to CPU0, GPU1 attached to CPU1; CPU0 calls MPI_Send(), CPU1 calls MPI_Recv().]
Until recently…
[Diagram: each rank stages data explicitly: GPU0 → cudaMemcpy() → CPU0 → MPI_Send(), and MPI_Recv() → CPU1 → cudaMemcpy() → GPU1.]
Unified Virtual Addressing
[Diagram: "No UVA: Multiple Memory Spaces" shows system memory (CPU) and GPU memory across PCI-e, each with its own 0x0000–0xFFFF address range; "UVA: Single Address Space" shows one 0x0000–0xFFFF range spanning both.]
Unified Virtual Addressing
One address space for all CPU and GPU memory
Determine physical memory location from a pointer value
Enable libraries to simplify their interfaces (e.g. MPI and
cudaMemcpy)
Supported on Tesla GPUs starting with Fermi, for 64-bit applications on Linux and Windows (TCC driver)
[Diagram: GPU buffer → cudaMemcpy → host buffer → memcpy → pinned fabric buffer.]
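A minimal sketch (not from the slides) of what UVA enables: given any pointer, a library such as MPI can ask the CUDA runtime where the memory actually lives. Buffer names and sizes are illustrative.

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    double *h_buf, *d_buf;
    cudaMallocHost((void **)&h_buf, 1024 * sizeof(double)); /* pinned host allocation */
    cudaMalloc((void **)&d_buf, 1024 * sizeof(double));     /* device allocation */

    /* With UVA, both pointers live in one address space, so the runtime
     * can report where each one resides. */
    struct cudaPointerAttributes attr;
    cudaPointerGetAttributes(&attr, d_buf);
    /* CUDA 4.x-era toolkits (the Fermi/XK7 time frame) expose attr.memoryType;
     * current toolkits expose attr.type instead. */
    printf("d_buf resides in %s memory\n",
           attr.memoryType == cudaMemoryTypeDevice ? "device" : "host");

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}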
MPI+CUDA
No UVA and regular MPI:
//MPI rank 0
cudaMemcpy(s_buf_h,s_buf_d,size,…);
MPI_Send(s_buf_h,size,…);
//MPI rank n-1
MPI_Recv(r_buf_h,size,…);
cudaMemcpy(r_buf_d,r_buf_h,size,…);
With UVA and CUDA-aware MPI:
//MPI rank 0
MPI_Send(s_buf_d,size,…);
//MPI rank n-1
MPI_Recv(r_buf_d,size,…);
CUDA-aware MPI makes MPI+CUDA easier.
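For illustration, a self-contained sketch of the CUDA-aware path, assuming a GPU-aware MPI library (for example Cray MPI with MPICH_RDMA_ENABLED_CUDA set); the buffer names, message size, and tags are illustrative, not from the slides.

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank, size = 1 << 20;           /* 1M doubles, illustrative */
    double *s_buf_d, *r_buf_d;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaMalloc((void **)&s_buf_d, size * sizeof(double));
    cudaMalloc((void **)&r_buf_d, size * sizeof(double));
    cudaMemset(s_buf_d, 0, size * sizeof(double));

    if (rank == 0) {
        /* Device pointers go straight into MPI; the library stages,
         * pipelines, or uses RDMA underneath. */
        MPI_Send(s_buf_d, size, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(r_buf_d, size, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        MPI_Recv(r_buf_d, size, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(s_buf_d, size, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD);
    }

    cudaFree(s_buf_d);
    cudaFree(r_buf_d);
    MPI_Finalize();
    return 0;
}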
CUDA-Aware MPI Libraries May
Use RDMA to completely remove CPU from the picture.
Stage GPU buffers through CPU memory automatically
Pipeline messages
All the programmer needs to know is that they’ve passed a GPU pointer to MPI; the library developer can optimize the rest.
[Diagram: GPU 0 ↔ GPU 1 shown three ways: direct RDMA between the GPUs, staging through CPU 0 and CPU 1 memory, and pipelined transfers.]
GPU-Awareness in Cray MPI
Cray began supporting GPU-awareness in version 5.6.3
Functional on XK7, but not performing optimally
Expected to work very well on XC30
Must be explicitly enabled via run-time environment variable
MPICH_RDMA_ENABLED_CUDA
Works with both CUDA and OpenACC
Version 5.6.4 adds a pipelining feature that should help large
messages
Enabled with MPICH_G2G_PIPELINE
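Since GPU awareness is a run-time switch, an application can check for the variable and fall back to explicit host staging when it is absent. A hedged sketch follows; the helper and fallback logic are illustrative, not part of Cray MPI.

#include <stdlib.h>
#include <string.h>
#include <mpi.h>
#include <cuda_runtime.h>

/* Hypothetical helper: true when MPICH_RDMA_ENABLED_CUDA is set to a
 * non-zero value in the job environment. */
static int gpu_aware_mpi_enabled(void)
{
    const char *v = getenv("MPICH_RDMA_ENABLED_CUDA");
    return v != NULL && strcmp(v, "0") != 0;
}

void send_field(double *d_buf, double *h_buf, int n, int dest, MPI_Comm comm)
{
    if (gpu_aware_mpi_enabled()) {
        MPI_Send(d_buf, n, MPI_DOUBLE, dest, 0, comm);   /* device pointer passed directly */
    } else {
        cudaMemcpy(h_buf, d_buf, n * sizeof(double), cudaMemcpyDeviceToHost);
        MPI_Send(h_buf, n, MPI_DOUBLE, dest, 0, comm);   /* staged through host memory */
    }
}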
OMB Latency
Host-to-Host will always
have the lowest latency
(fewest hops)
Staging through host
memory explicitly adds
significant latency
The GPU-aware library falls in the middle.
Note: 2 nodes on separate
blades.
OMB Bandwidth
Once again, H2H wins out
(probably by a difference of
latency)
Direct RDMA suffers badly
with this benchmark.
Setting
MPICH_G2G_PIPELINE=64
pipelines messages and
opens up more concurrency.
[Chart annotation: a hand-pipelined implementation reached approximately this bandwidth.]
OMB Bandwidth, Varying Pipelines
OMB sends messages in a
window of 64, so that is
naturally optimal.
Counter-intuitively, no intermediate values seemed to help.
Additionally tried varying chunk sizes, with no benefit.
Optimizing Performance of a Message
MPI vendors know how to optimize performance for an
interconnect
Different approaches for different message sizes
Multiple algorithms
Unfortunately, on the XK7, this may not be the optimal approach.
MPI Lacks the Ability to Express Dependencies
One way to negate the cost of G2G communication is to overlap it with computation.
Restructuring the order of computation may allow such overlapping.
Some communication patterns have a natural concurrency that isn’t easily exploited.
[Diagram: a domain exchanging halo cells with its N, S, E, and W neighbors.]
Exploiting Communication Concurrency
This cannot be expressed in GPU-aware MPI today.
[Diagram: four independent pipelines, one per direction (North, East, South, West), each consisting of Pack → D2H copy → Transfer → H2D copy → Unpack.]
Concurrent Messaging Pseudocode
! Post receives for all four directions up front
do i=N,W
  MPI_Irecv(i)
enddo
! Launch packing kernels and D2H copies asynchronously, one direction at a time
do i=N,W
  packBufferAsync(i)
  copyBufferD2HAsync(i)
enddo
! Drain: as each piece completes, launch its next stage
while(more to send/recv)
  if Irecv(i) completed
    copyBufferH2DAsync(i)
    unpackBufferAsync(i)
  endif
  if D2H copy(i) completed
    MPI_Isend(i)
  endif
done
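Below is a sketch of how the pseudocode above might look in C with CUDA streams and non-blocking MPI. Only the structure comes from the slide (post receives early, pack and copy per direction on its own stream, send as soon as the D2H copy finishes, unpack as soon as data arrives); the kernel launchers, buffer layout, and neighbor/tag handling are hypothetical.

#include <mpi.h>
#include <cuda_runtime.h>

#define NDIR 4  /* N, E, S, W */

/* Hypothetical application-specific pack/unpack kernels, launched on a stream. */
extern void launch_pack(int dir, double *d_buf, cudaStream_t s);
extern void launch_unpack(int dir, double *d_buf, cudaStream_t s);

void exchange_halos(double *d_send[NDIR], double *d_recv[NDIR],
                    double *h_send[NDIR], double *h_recv[NDIR],
                    int count, int neighbor[NDIR], MPI_Comm comm)
{
    cudaStream_t stream[NDIR];
    cudaEvent_t  d2h_done[NDIR];
    MPI_Request  recv_req[NDIR], send_req[NDIR];
    int sent[NDIR] = {0}, unpacked[NDIR] = {0};

    /* 1. Post all receives into pinned host buffers up front.
     *    Assumes the four neighbors are distinct ranks, so tag 0 is unambiguous. */
    for (int d = 0; d < NDIR; d++)
        MPI_Irecv(h_recv[d], count, MPI_DOUBLE, neighbor[d], 0, comm, &recv_req[d]);

    /* 2. For each direction: pack on the GPU and copy D2H on its own stream. */
    for (int d = 0; d < NDIR; d++) {
        cudaStreamCreate(&stream[d]);
        cudaEventCreate(&d2h_done[d]);
        launch_pack(d, d_send[d], stream[d]);
        cudaMemcpyAsync(h_send[d], d_send[d], count * sizeof(double),
                        cudaMemcpyDeviceToHost, stream[d]);
        cudaEventRecord(d2h_done[d], stream[d]);
    }

    /* 3. Drain: start each send as soon as its D2H copy finishes, and start
     *    each H2D copy + unpack as soon as its receive completes. */
    int remaining = 2 * NDIR;
    while (remaining > 0) {
        for (int d = 0; d < NDIR; d++) {
            if (!sent[d] && cudaEventQuery(d2h_done[d]) == cudaSuccess) {
                MPI_Isend(h_send[d], count, MPI_DOUBLE, neighbor[d], 0, comm, &send_req[d]);
                sent[d] = 1; remaining--;
            }
            if (!unpacked[d]) {
                int flag;
                MPI_Test(&recv_req[d], &flag, MPI_STATUS_IGNORE);
                if (flag) {
                    cudaMemcpyAsync(d_recv[d], h_recv[d], count * sizeof(double),
                                    cudaMemcpyHostToDevice, stream[d]);
                    launch_unpack(d, d_recv[d], stream[d]);
                    unpacked[d] = 1; remaining--;
                }
            }
        }
    }

    MPI_Waitall(NDIR, send_req, MPI_STATUSES_IGNORE);
    for (int d = 0; d < NDIR; d++) {
        cudaStreamSynchronize(stream[d]);
        cudaEventDestroy(d2h_done[d]);
        cudaStreamDestroy(stream[d]);
    }
}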
Talks Related to this Optimization
HOMME – Matt Norman – CUG2012
S3D – John Levesque – CUG2013
Summary
Optimizing kernels will only take you so far as you scale; communication cannot be an afterthought.
GPU-aware MPI libraries are becoming available
Easier to program
Can optimize performance of individual message transfers
Some communication patterns have a natural concurrency that
can be exploited to make communication “free”, but this takes
additional effort.