AcuSolve
Performance Benchmark and Profiling
The HPC Advisory Council

• World-wide HPC organization (240+ members)

• Bridges the gap between HPC usage and its potential

• Provides best practices and a support/development center

• Explores future technologies and future developments

• Working Groups – HPC|Cloud, HPC|Scale, HPC|GPU, HPC|Storage

• Leading edge solutions and technology demonstrations




HPC Advisory Council Members




HPC Advisory Council HPC Center


[Photos: HPC Advisory Council HPC Center systems – Lustre storage, a GPU cluster, and compute clusters of 192, 528, and 456 cores]
2012 HPC Advisory Council Workshops



•   Germany Conference – June 17
•   Spain Conference – Sept 13
•   China Conference – October
•   US Stanford Conference – December

• For more information
  – www.hpcadvisorycouncil.com
  – info@hpcadvisorycouncil.com




AcuSolve

 • AcuSolve
  – AcuSolve™ is a leading general-purpose finite element-based
    Computational Fluid Dynamics (CFD) flow solver with superior robustness,
    speed, and accuracy
  – AcuSolve can be used by designers and research engineers with all levels
    of expertise, either as a standalone product or seamlessly integrated into a
    powerful design and analysis application
  – With AcuSolve, users can quickly obtain quality solutions without iterating
    on solution procedures or worrying about mesh quality or topology




Test Cluster Configuration
•   Dell™ PowerEdge™ M610 38-node (456-core) cluster
    – Six-Core Intel X5670 @ 2.93 GHz CPUs

    – Memory: 24GB memory, DDR3 1333 MHz

    – OS: RHEL 5.5, OFED 1.5.2 InfiniBand SW stack

•   Intel Cluster Ready certified cluster

•   Mellanox ConnectX-2 InfiniBand adapters and non-blocking switches

•   MPI: Intel MPI 3.0, MVAPICH2 1.0, Platform MPI 7.1

•   InfiniBand-based Lustre Storage: Lustre 1.8.5

•   Application: AcuSolve 1.8a

•   Benchmark datasets:
    – Pipe_fine (700 axial nodes, 3.04 million mesh points total, 17.8 million tetrahedral elements)

    – The test computes the steady state flow conditions for the turbulent flow (Re = 30000) of water in a
       pipe with heat transfer. The pipe is 1 meter in length and 150 cm in diameter. Water enters the inlet
       at room temperature conditions.
AcuSolve Performance – Interconnects
• InfiniBand QDR enables higher cluster productivity
   – Delivers more than 36% higher job productivity than a 1GigE network on this benchmark
   – The productivity advantage grows as the cluster size increases
• The slower 1GigE network has only a limited effect on performance for this benchmark
   – Suggesting that the application is not highly sensitive to network latency
• The 1GigE test stops at 16 nodes due to a switch port limitation




[Chart: job productivity by node count (higher is better); InfiniBand QDR shows the 36% advantage over 1GigE]
AcuSolve Performance – MPI Implementations
• Intel MPI performs better than Platform MPI
   – Around 16% higher performance at 32 nodes
   – Reflects that Intel MPI handles the MPI data transfers more efficiently
• The MVAPICH2 executable is built only with ch3:sock support for TCP networks
   – It therefore does not reflect true InfiniBand verbs performance the way the other MPI implementations do




[Chart: job performance by node count (higher is better); Intel MPI ~16% ahead of Platform MPI at 32 nodes; InfiniBand QDR]
AcuSolve Performance – MPI & OpenMP Hybrid
• On a single node, the OpenMP hybrid mode performs better than pure MPI
  – OpenMP provides faster results starting at 6 CPU cores (i.e., 6 OpenMP threads)
  – OpenMP threads are a lighter-weight alternative to MPI processes
• The hybrid model enables scalability by minimizing the number of processes and communications
  – MPI communications are handled by a single MPI-OpenMP hybrid process on each node
  – The hybrid process is responsible for the communications and for spawning the worker threads
  – The OpenMP worker threads are then responsible for the computation (see the sketch below)
• The graphs compare Platform MPI to the Platform MPI/OpenMP hybrid


[Charts: Platform MPI vs. Platform MPI/OpenMP hybrid performance (higher is better); a 16% difference is highlighted; InfiniBand QDR]
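To make the hybrid layout concrete, here is a minimal MPI + OpenMP sketch, assuming one MPI process per node with OpenMP threads doing the local work; the reduction and the work inside the parallel region are placeholders, not AcuSolve code.

/* Minimal sketch of the hybrid model described above (not AcuSolve code):
 * one MPI process per node handles the MPI communication, while the
 * OpenMP worker threads perform the local computation.
 * Build with an MPI compiler wrapper plus OpenMP flags (compiler-dependent). */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;

    /* FUNNELED: only the master thread of each process makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local_sum = 0.0, global_sum = 0.0;

    /* The OpenMP worker threads do the computation... */
    #pragma omp parallel reduction(+:local_sum)
    {
        int tid = omp_get_thread_num();
        /* ...e.g., each thread works on its own slice of the local mesh. */
        local_sum += (double)(rank + tid);
    }

    /* ...and only the hybrid MPI process on each node communicates. */
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);

    if (rank == 0)
        printf("MPI thread support %d, global sum = %g\n", provided, global_sum);

    MPI_Finalize();
    return 0;
}

Launching one such process per node (with the thread count set to the cores per node) mirrors the hybrid configuration compared in the charts.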
AcuSolve Profiling – MPI/User Time Ratio
• Time spent in computation dominates over MPI communication
  – MPI time accounts for only around 40% at 32 nodes
  – The actual computation run time shrinks as the cluster scales
• The OpenMP hybrid mode reduces overheads and leaves more time for computation
  – Computation time rises from 60% in pure MPI mode to 77% in OpenMP hybrid mode (a simple way to measure such a split is sketched below)




[Chart: MPI time vs. user (computation) time ratio by node count; InfiniBand QDR]
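As a hedged illustration only (the profile above was presumably gathered with an MPI profiling tool), MPI_Wtime can bracket the compute and communication phases to estimate an MPI-time vs. computation-time split; the "computation" loop below is a placeholder workload.

/* Illustrative sketch: time a compute phase and a communication phase
 * separately with MPI_Wtime and report the MPI share of the runtime. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();

    /* "Computation" phase: placeholder local work. */
    double local = 0.0;
    for (long i = 0; i < 10000000L; i++)
        local += 1e-7 * (double)i;

    double t1 = MPI_Wtime();

    /* "Communication" phase: one of the collectives dominating the profile. */
    double global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    double t2 = MPI_Wtime();
    double t_comp = t1 - t0, t_mpi = t2 - t1;

    if (rank == 0)
        printf("compute %.3f s, MPI %.3f s, MPI share %.1f%%\n",
               t_comp, t_mpi, 100.0 * t_mpi / (t_comp + t_mpi));

    MPI_Finalize();
    return 0;
}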
AcuSolve Profiling – MPI Calls
• MPI_Recv and MPI_Isend are the most used MPI calls
  – Each accounts for ~42-43% of the MPI function calls in a 32-node job
• A large percentage of AcuSolve's MPI calls are non-blocking data transfers
  – The non-blocking APIs allow data transfers to overlap with computation (see the sketch below)
  – The OpenMP hybrid mode further minimizes the number of communications
  – Together, these two measures allow even a slow network to maintain decent productivity
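A minimal sketch of the non-blocking overlap pattern referred to above; the buffer size and the ring-neighbor exchange are illustrative assumptions, not AcuSolve's actual communication pattern.

/* Post the transfers with MPI_Irecv/MPI_Isend, do local computation while
 * the messages are in flight, then complete them with MPI_Waitall. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    enum { N = 512 };                      /* small message, as in the profile */
    double sendbuf[N], recvbuf[N], work = 0.0;
    for (int i = 0; i < N; i++) sendbuf[i] = rank + i;

    int right = (rank + 1) % nranks;       /* ring neighbors (illustrative) */
    int left  = (rank - 1 + nranks) % nranks;

    MPI_Request reqs[2];
    MPI_Irecv(recvbuf, N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* Overlap: local computation proceeds while the transfer is in flight. */
    for (int i = 0; i < N; i++) work += sendbuf[i] * 0.5;

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    if (rank == 0)
        printf("overlapped work = %g, first received value = %g\n",
               work, recvbuf[0]);

    MPI_Finalize();
    return 0;
}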




AcuSolve Profiling – Time Spent by MPI Calls
• The majority of the MPI time is spent in MPI_Barrier and MPI_Allreduce
  – MPI_Barrier (43%), MPI_Allreduce (40%), MPI_Waitall (14%) at 32 nodes
• MPI communication time drops as the cluster scales
  – The total runtime shortens because more CPUs are working on completing the job
  – Which in turn reduces the time spent in each of the MPI calls




AcuSolve Profiling – MPI Message Sizes
• Most of the MPI messages are small to medium in size
  – Most messages are smaller than 4KB
• The volume of MPI messages in pure MPI mode is significantly higher than in hybrid mode
  – While the distribution of message sizes stays within the same range




AcuSolve Profiling – MPI Data Transfer

• As the cluster grows, substantially less data is transferred between MPI processes
  – Data communications drop from 20-30GB for a single-node simulation
  – To around 6GB for a 32-node simulation




AcuSolve Profiling – MPI Data Transfer
• The communications become more concentrated in hybrid mode
  – One hybrid process is launched per node and is responsible for the communications
  – Leaving the OpenMP worker threads to run the parallel computational routines
• As a result, the hybrid mode is the more efficient mode at scale
  – Even though larger data transfers take place between the MPI processes on each node




AcuSolve Profiling – Aggregated Transfer

• Aggregated data transfer refers to:
  – The total amount of data transferred over the network between all MPI ranks collectively (a way to tally such a total is sketched below)
• A large amount of data transfer takes place in AcuSolve
  – Around 2.5TB of data is exchanged between the nodes at 32 nodes in pure MPI mode
• The OpenMP hybrid mode reduces the overall traffic between the MPI processes
  – Less than 870GB of data is transferred in hybrid mode, compared to 2.5TB for the pure MPI case




[Chart: aggregated data transfer by node count; InfiniBand QDR]
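A hedged sketch of one way an aggregated transfer total can be tallied, assuming each rank counts the bytes it sends and the counts are summed with MPI_Reduce; the ring exchange and buffer size are illustrative only, and the figures above come from profiling AcuSolve itself.

/* Each rank counts its own outgoing bytes; MPI_Reduce sums them
 * to the aggregated total across all ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    enum { N = 4096 };
    double sendbuf[N], recvbuf[N];
    for (int i = 0; i < N; i++) sendbuf[i] = rank;

    int right = (rank + 1) % nranks;
    int left  = (rank - 1 + nranks) % nranks;

    /* Example exchange: every rank sends one buffer around a ring. */
    MPI_Sendrecv(sendbuf, N, MPI_DOUBLE, right, 0,
                 recvbuf, N, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    long long bytes_sent = (long long)N * (long long)sizeof(double);

    /* Aggregate across all MPI ranks collectively. */
    long long total = 0;
    MPI_Reduce(&bytes_sent, &total, 1, MPI_LONG_LONG, MPI_SUM, 0,
               MPI_COMM_WORLD);

    if (rank == 0)
        printf("aggregated data transferred: %lld bytes\n", total);

    MPI_Finalize();
    return 0;
}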
AcuSolve – Summary

• Performance
  – AcuSolve is designed for superior performance and scalability
  – InfiniBand allows AcuSolve to run at its most efficient rate
  – Intel MPI produces higher parallel job efficiency than Platform MPI
  – The MVAPICH2 executable does not support communications over InfiniBand verbs
• MPI
  – AcuSolve deploys non-blocking MPI calls to overlap computation with in-flight communications
  – Allowing it to achieve higher job performance while reducing the communication required
• OpenMP hybrid mode
  – With the hybrid model, less data needs to be exchanged between the nodes in a cluster
  – Allowing the job to finish faster, as more resources are available for the computation
• Profiling
  – MPI_Isend and MPI_Recv are the most used MPI functions
  – The OpenMP hybrid mode reduces the amount of network data transfer that needs to take place




Thank You
                                                           HPC Advisory Council




     All trademarks are property of their respective owners. All information is provided “As-Is” without any kind of warranty. The HPC Advisory Council makes no representation to the accuracy and
     completeness of the information contained herein. HPC Advisory Council Mellanox undertakes no duty and assumes no obligation to update or correct any information presented herein


