Progress Toward Accelerating CAM-SE
Jeff Larkin <larkin@cray.com>
Along with: Rick Archibald, Ilene Carpenter, Kate Evans, Paulius Micikevicius, Jim Rosinski, Jim Schwarzmeier, Mark Taylor
Background
- In 2009 ORNL asked many of their top users: what sort of science would you do on a 20-petaflops machine in 2012? (Answer to come on the next slide.)
- The Center for Accelerated Application Research (CAAR) was established to determine whether a set of codes from various disciplines can be made to use GPU accelerators effectively, with the combined efforts of domain scientists and vendors.
- Each team has a science lead, a code lead, and members from ORNL, Cray, NVIDIA, and elsewhere.
CAM-SE Target Problem
- 1/8-degree CAM, using the CAM-SE dynamical core and MOZART tropospheric chemistry.
- Why is acceleration needed to “do” the problem? With all the tracers associated with MOZART atmospheric chemistry included, the simulation is too expensive to run at high resolution on today’s systems.
- What unrealized parallelism needs to be exposed? In many parts of the dynamics, parallelism needs to include levels (k) and chemical constituents (q).
Profile of Runtime
[Chart: % of runtime by routine; the editor's notes at the end explain how the bars map to the kernels discussed below.]
Next Steps
- Once the dominant routines were identified, standalone kernels were created for each.
- Early efforts tested PGI and HMPP directives, plus CUDA C, CUDA Fortran, and OpenCL.
- Directives-based compilers were too immature at the time: poor support for Fortran modules and derived types, and they did not allow implementation at a high enough level.
- CUDA Fortran provided good performance while allowing us to remain in Fortran.
Identifying Parallelism
- HOMME parallelizes with both MPI and OpenMP over elements.
- Most of the tracer advection can also parallelize over tracers (q) and levels (k); vertical remap is the exception, due to the vertical dependence across levels.
- Parallelizing over tracers, and sometimes levels, while threading over quadrature points (nv) provides ample parallelism within each element to utilize the GPU effectively (see the sketch below).
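To make that mapping concrete, here is a minimal CUDA Fortran sketch of the decomposition just described: one thread block per (element, tracer, level) combination, with the nv x nv threads of each block covering that element's quadrature points. The names, shapes, and the stand-in update are assumptions for illustration, not the HOMME source.

```fortran
! Minimal sketch, not the HOMME source: names, shapes, and the update
! itself are placeholders. One thread block handles one (element,
! tracer, level) triple; its nv x nv threads cover the quadrature points.
module advect_sketch
  use cudafor
  implicit none
  integer, parameter :: nv = 4          ! quadrature points per edge (assumed)
contains
  attributes(global) subroutine advect_tracers(qdp, scal, nlev, qsize, nelem)
    integer, value :: nlev, qsize, nelem
    real(8) :: qdp(nv, nv, nlev, qsize, nelem)
    real(8) :: scal(nv, nv, nlev, nelem)
    integer :: i, j, k, q, ie
    i  = threadIdx%x                    ! quadrature point within the element
    j  = threadIdx%y
    ie = blockIdx%x                     ! parallel over elements...
    q  = blockIdx%y                     ! ...over tracers (q)...
    k  = blockIdx%z                     ! ...and over levels (k)
    qdp(i, j, k, q, ie) = qdp(i, j, k, q, ie) * scal(i, j, k, ie)
  end subroutine advect_tracers
end module advect_sketch
```

A launch such as call advect_tracers<<<dim3(nelem, qsize, nlev), dim3(nv, nv, 1)>>>(qdp_d, scal_d, nlev, qsize, nelem) then exposes element, tracer, and level parallelism at once.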
Status
- euler_step and laplace_sphere_wk were straightforward to rewrite in CUDA Fortran.
- Vertical remap was rewritten to be more amenable to the GPU (made it vectorize); the resulting code is 2X faster on the CPU than the original and has been given back to the community.
- Edge packing/unpacking for the boundary exchange needs to be rewritten (Ilene talked about this already). It was designed for one element per MPI rank, but we plan to run with more. Once it is node-aware, it can also be device-aware and greatly reduce PCIe transfers.
- Someone said yesterday: “As with many kernels, the ratio of FLOPS per byte transferred determines successful acceleration.”
Status (cont.)
- Kernels were put back into HOMME, and validation tests were run and passed. This version did nothing to reduce data movement; it only tested kernel accuracy.
- We are in the process of porting forward to the current trunk and doing more intelligent data movement.
- We are currently reevaluating directives now that compilers have matured: a directives-based vertical remap now slightly outperforms the hand-tuned CUDA version (see the sketch below), though we are still working around derived-type issues.
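As an illustration of why the remap suits directives once the compiler cooperates, the loop nest has roughly this shape: columns are independent, so the outer loops can be spread across threads while the vertically dependent k loop stays sequential inside each thread. This sketch uses OpenACC-style syntax (the style the PGI and HMPP models converged toward), with placeholder names and stand-in math, not the actual remap source.

```fortran
! Illustrative shape of a directives-friendly vertical remap; the math
! here is a stand-in with a deliberate vertical dependence, and all
! names and array shapes are assumptions.
subroutine remap_sketch(qdp, dp_src, dp_tgt, npts, nlev, qsize, nelem)
  implicit none
  integer, intent(in) :: npts, nlev, qsize, nelem
  real(8), intent(inout) :: qdp(npts, nlev, qsize, nelem)
  real(8), intent(in)    :: dp_src(npts, nlev, nelem), dp_tgt(npts, nlev, nelem)
  integer :: ie, q, ij, k
  real(8) :: acc
  !$acc parallel loop collapse(3) copy(qdp) copyin(dp_src, dp_tgt) private(acc)
  do ie = 1, nelem
    do q = 1, qsize
      do ij = 1, npts
        acc = 0.0d0
        do k = 1, nlev                  ! vertical dependence: stays sequential
          acc = acc + qdp(ij, k, q, ie) * dp_src(ij, k, ie)
          qdp(ij, k, q, ie) = acc / dp_tgt(ij, k, ie)
        end do
      end do
    end do
  end do
end subroutine remap_sketch
```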
Challenges
- Data structures (object-oriented Fortran): every node has an array of element derived types, each of which contains more arrays. We only care about some of these arrays, so data movement isn't very natural; we must essentially gather many non-contiguous CPU arrays into one contiguous GPU array (see the sketch below).
- Parallelism occurs at various levels of the call tree, not just in leaf routines, so the compiler must be able to inline the leaves in order to use directives. The Cray compiler handles this via whole-program analysis; the PGI compiler may support it via an inline library.
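A minimal sketch of that gather-and-copy step, assuming a drastically simplified element type (HOMME's real derived types carry many more fields): the scattered per-element arrays are gathered into one contiguous host buffer, which is then copied to the device in a single transfer.

```fortran
! Minimal sketch of the gather-and-copy pattern, assuming a simplified
! element type; HOMME's real derived types hold many more fields.
module pack_sketch
  use cudafor
  implicit none
  type element_t
    real(8), allocatable :: qdp(:,:,:)  ! (npts, nlev, qsize) per element (assumed)
  end type element_t
contains
  subroutine qdp_to_device(elem, qdp_dev, npts, nlev, qsize, nelem)
    type(element_t), intent(in) :: elem(:)
    integer, intent(in) :: npts, nlev, qsize, nelem
    real(8), device, intent(out) :: qdp_dev(npts, nlev, qsize, nelem)
    real(8), allocatable :: buf(:,:,:,:)
    integer :: ie
    allocate(buf(npts, nlev, qsize, nelem))
    do ie = 1, nelem
      buf(:, :, :, ie) = elem(ie)%qdp   ! gather the scattered CPU arrays
    end do
    qdp_dev = buf                       ! one contiguous host-to-device copy
    deallocate(buf)
  end subroutine qdp_to_device
end module pack_sketch
```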
Challenges (cont.)
- CUDA Fortran requires everything to live in the same module, so we must duplicate some routines and data structures from several modules in our “cuda_mod”.
- We insert ifdefs that hijack CPU routine calls and forward the request to the matching cuda_mod routine (see the sketch below). This is simple for the user, but the developer must maintain duplicate routines.
- Hey Dave, when will this get changed? ;)
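The forwarding pattern looks roughly like the following; the module and routine names are placeholders rather than the actual HOMME/cuda_mod symbols.

```fortran
! Sketch of the ifdef hijack: the CPU entry point forwards to the
! duplicated GPU routine when built for CUDA. Names are placeholders.
subroutine euler_step(elem, dt)
  use element_mod, only: element_t
#ifdef _CUDA
  use cuda_mod, only: euler_step_cuda
#endif
  implicit none
  type(element_t), intent(inout) :: elem(:)
  real(8), intent(in) :: dt
#ifdef _CUDA
  call euler_step_cuda(elem, dt)   ! forward to the matching cuda_mod routine
  return
#endif
  ! ... original CPU implementation continues here ...
end subroutine euler_step
```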
Until the boundary exchange is rewritten, euler_step performance is hampered by data movement. Streaming over elements helps (see the sketch below), but may not be realistic for the full code.
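For reference, “streaming over elements” means cycling batches of elements through several CUDA streams so the PCIe transfer for one batch overlaps the kernel for another, as in this hedged sketch (all names are placeholders; asynchronous copies also require pinned host memory, which connects to the CUDA 4.0 item under Future Work).

```fortran
! Sketch only: overlap host-to-device copies with compute by cycling
! element batches through several CUDA streams. Names are placeholders.
subroutine stream_over_elements(nbatches, batch_len)
  use cudafor
  implicit none
  integer, intent(in) :: nbatches, batch_len
  real(8), allocatable, pinned :: batch_host(:,:)  ! pinned memory for async copies
  real(8), allocatable, device :: batch_dev(:,:)
  integer, parameter :: nstreams = 4
  integer(kind=cuda_stream_kind) :: streams(nstreams)
  integer :: s, ib, istat
  allocate(batch_host(batch_len, nbatches), batch_dev(batch_len, nbatches))
  ! ... fill batch_host from the element structures ...
  do s = 1, nstreams
    istat = cudaStreamCreate(streams(s))
  end do
  do ib = 1, nbatches
    s = mod(ib - 1, nstreams) + 1
    istat = cudaMemcpyAsync(batch_dev(:, ib), batch_host(:, ib), batch_len, &
                            cudaMemcpyHostToDevice, streams(s))
    ! a kernel for this batch would launch in the same stream, e.g.
    ! call euler_step_batch<<<grid, tblk, 0, streams(s)>>>(batch_dev(:, ib))
  end do
  istat = cudaDeviceSynchronize()
end subroutine stream_over_elements
```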
With data transfer included, laplace_sphere_wk is a wash, but since all the necessary data is already resident from euler_step, the kernel-only time is realistic.
The vertical remap rewrite is 2X faster on the CPU and still faster on the GPU. All data is already resident on the device from euler_step, so the kernel-only time is realistic.
Future Work
- Use CUDA 4.0's dynamic pinning of memory to allow overlapping and better PCIe performance (see the sketch below).
- Move forward to CAM5/CESM1; there is no chance of our work being used otherwise.
- Some additional small kernels are needed to allow data to remain resident; it is cheaper to run these on the GPU than to copy the data.
- Reprofile the accelerated application to identify the next most important routines. The chemistry implicit solver is expected to be next; the physics is expected to require a mature directives-based compiler.
- Rinse, repeat.
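What CUDA 4.0 adds is cudaHostRegister, which pins memory that was allocated normally, so existing model arrays can participate in asynchronous, overlapped transfers without being reallocated. A minimal sketch, assuming the cudafor bindings for these runtime calls:

```fortran
! Sketch: pin an existing, normally allocated array for fast async PCIe
! transfers, then release it. Assumes the cudafor bindings for the
! CUDA 4.0 cudaHostRegister / cudaHostUnregister runtime calls.
subroutine with_pinned(a, n)
  use cudafor
  use iso_c_binding
  implicit none
  integer, intent(in) :: n
  real(8), target, intent(inout) :: a(n)
  integer :: istat
  ! register n real(8) elements, i.e. 8*n bytes, as pinned memory
  istat = cudaHostRegister(c_loc(a), int(n, 8) * 8_8, cudaHostRegisterPortable)
  ! ... asynchronous copies of a(...) can now overlap kernel execution ...
  istat = cudaHostUnregister(c_loc(a))
end subroutine with_pinned
```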
Conclusions
- Much has been done; much remains.
- For a fairly new, cleanly written code, CUDA Fortran was tractable. HOMME has very similar loop nests throughout, which was key to making this possible.
- This still results in multiple code paths to maintain, so we'd prefer to move to directives in the long run.
- We believe GPU accelerators will be beneficial for the selected problem, and we hope they will also benefit a wider audience (CAM5 should help this).


Editor's Notes

1. Added outlines to show how these bars relate to our kernels.
   - Edge packing & unpacking are part of the “Boundary Exchange”, which was designed for maximum MPI scaling, with one element per task and one task per core. It needs to be redesigned for a smaller number of more powerful nodes and a lower surface/volume ratio.
   - Verremap2 is the “Vertical Remap”.
   - “Euler Step” consists of euler_step, divergence_sphere, and limiter2d_zero. Note: in the application, a boundary exchange occurs inside euler_step.
   - “Laplace Sphere Weak” is a call to divergence_sphere_wk and gradient_sphere.