SlideShare una empresa de Scribd logo
1 de 16
Descargar para leer sin conexión
Multi-GPU simulations in OpenFOAM with
SpeedIT technology.
Attempt I: SpeedIT
¡ GPU-based library of iterative solvers for Sparse Linear
   Algebra and CFD. 
 ¡ Current version: 2.2. Version 1.0 in 2008.
 ¡ CMRS format for fast SpMV and other storage formats. 
 ¡ BiCGSTAB and CG iterative solvers. 
 ¡ Preconditioners: Jacobi, AMG. 
 ¡ Runs on one or more GPU cards.
 ¡ OpenFOAM compatibile.

¡ Possible Applications:
 ¡  Quantum Chemistry, 
 ¡  Seminconductor and Power Network Design
 ¡  CFD (OpenFOAM)
 ¡  etc.
SpeedIT and OpenFOAM



                         Plugin to OpenFOAM	





                                                      ¡ GPL-based generic Plugin to
¡ Provides conversion between CSR                      OpenFOAM.
  and OpenFOAM LDU data formats.
¡ Easy substitution of CG/BCG with                   ¡ Any acceleration toolkit can be
  their GPU versions.
                                  integrated now.
                                                      
                                                 1.  Edit system/controlDict file and add libSpeedIT.so	

¡ OpenFOAM source code is                       2.  Edit system/fvSolution and replace CG with CG_accel	

  unchanged due to a plugin                      3.  Compile the Plugin to OpenFOAM (wmake)	

  concept. No recompilation of                   4.  Complete library dependencies: $FOAM_USER_LIBBIN
                                                     should contain GPU-based libraries.	

  OpenFOAM is necessary.
0.2
                 10         100   1000                                         1           10                 100   1000
                        µ                                                                                 µ


                      SpeedIT vs. CUDA 5.0
nline) The speedup of our CMRS implementation of
 gainst scalar, vector and HYB kernels as a function
                                                       Figure 2: (color online) The speedup of our CMRS implem
                                                       against the best of all five alternative SpMV implementat
ber of nonzero elements per row (µ). Results for       given matrix.
are omitted for clarity.               SpMV is a key operation in scientific software.	


                                                                         3.5
         vector                                                                                     1.4
rid kernels give the shortest SpMV times
    10   scalar                                                           3
           HYB
 their efficiency decreases as µ is increased,                                                        1.2
                                                                     2.5
r kernel being very inefficient for large µ.
  CMRS speedup




                                                                                     CMRS speedup
                                                                                                1
of the scalar kernel is related to its inability




                                                               speedup
                                                                       2
a transfers if µ is large. The vector kernel                                                  0.8       2
                                                                     1.5
  the opposite way: its efficiency relative to                                                          1.5
       1
   very good for large µ, but it decreases as                          1                      0.6       1

 ≈ 100. This is related to the fact that the                                                          0.5
                                                                     0.5                      0.4
 ocesses each row with a warp of 32 threads,                                                            0
                                                                                                          0      2     4
 ost 0.2
      efficient for sparse matrices with µ far                           0                      0.2
                                                                         0        10 20   30    401 50     60   70
                                                                                                                 10 80   90 100
   Interestingly, for 10 120 the speedup of
         1                 µ           100              1000                                                                 100

r the vector kernel can be approximated as
                                     µ                                                              σ                      µ
           Fig. Speedup of our CMRS implementation against               Fig. Speedup of our CMRS implementation against
  a power law.
           scalar, vector (CSR) and HYB kernels as a function of                                            	

                                                              Figure 3: (color (HYB/CSR formats). online) The speedup
 Figure 1: (color online) The speedup of our CMRS implementation of
                                                                         CUDA 5.0 online) The speedup of CMRS against
                                                                                        Figure 2: (color
  it can be immediately seen that the CMRS (µ). plementation as a function of σ. Only matrices such that
           the mean number of nonzero elements per row        	

	

          	

	

 the SpMV kernel against scalar, vector and HYB kernels as a function                   against the best of all five alternative
ly does not yield much improvement over                       taken into account. given matrix.
                                                                                        Inset: enlargement of the plot for σ ≤
 of the mean number of nonzero elements per row (µ). Results for
at for µ
 lp1-like matrices are omitted forto be system-
                  150, and tends clarity.
 than the HYB format for µ                20. Hence
 t the advantages of the CMRS will be most                    number of nonzero elements (ni ) per row
                                                                                             3.5
OpenFOAM & single GPU

Low-cost hardware
                     GPU cost: $200	


Test cases:


¡ Cavity 3D, 512K cells, icoFoam. 
¡ Human aorta, 500K cells, simpleFoam.
¡ CPU: Intel Core 2 Duo E8400, 3GHz, 8GB RAM @ 800MHz

  GPU: NVIDIA GTX 460, VRAM 1GB

  Ubuntu 11.04 x64, OpenFOAM 2.0.1, CUDA Toolkit 4.1, 

¡ Pressure equation solved with CG preconditioned with DIC, GAMG (CPU)
  and CG preconditioned with AMG (GPU).
OpenFOAM & single GPU

Validation




Fig.Velocity and pressure profiles along x axis for aorta (top left) and pressure profile along x axis for cavity3D (top right).
Solution for all preconditioners.
Cross-section for velocity in X and Y direction for cavity3D run with OpenFOAM and SpeedIT (lines) and from literature for
Re=100 (dotted line) and Re=400 (solid line) without turbulence model (icoFoam).
OpenFOAM  single GPU

 Performance

2000
                                                                          2500



1600
                                                                          2000



1200
                                                                               1500



 800
                                                                               1000


 400
                                                                                500


   0
        CG+DIC 1C
    CG+DIC 2C
       GAMG 1C
      SpeedIT
       GAMG 2C
      0
                                                                                         CG+DIC 2C
   CG+DIC 1C
    SpeedIT
      GAMG 2C
      GAMG 1C
4.00
        3.77x
3.50
                                                                                  3.5
                                                                                                           2.88x
3.00
                         2.40x
                                                                                         3
     2.68x
2.50
                                                                                  2.5

2.00
                                                                                    2

1.50
                                       1.27x
                                     1.5
1.00
                                                           0.76
                    1
                                                                                                                          0.87
         0.91
0.50
                                                                                       0.5
0.00
                                                                                         0
         CG+DIC 1C
       CG+DIC 2C
          GAMG 1C
          GAMG 2C
                                                                                              CG+DIC 1C
   CG+DIC 2C
    GAMG 2C
      GAMG 1C


    Fig. Execution times of 50 time steps (up) and acceleration factor for simpleFoam (left) and pisoFoam (right). Performed at
    Intel E8400@3GHz and GTX460. 1C and 2C stand for one, two cores respectively.
OpenFoam  multi-GPU
¡ Motivation: large geometries ( 8M celss) do not fit to a
  single GPU 

  card (with max. 6GB RAM).

¡ OpenFOAM performed the decomposition. MPI was used
  for communication between GPU cards.

¡ Tests were performed at IBM PLX CINECA with 2 six-cores
  Intel Westmere 2.40 GHz and 2 Tesla M2070 per node
  (supported by NVIDIA). One thread manages one GPU card.

¡ Tests focused on multi-GPU precondtioned CG with AMG
  to solve pressure equation in steady-state flows.
OpenFoam  multi-GPU
6000

5000
                                                               CPU	

                  GPU	


4000

                                                                     CPU, PCG with GAMG
3000
                                                                     CPU, PCG with diagonal
            x1.59	

         x1.63	

                                SpeedIT
2000
              1353
                             1207
        x1.62	

     x1.15	

1000
                                       699
       394
                  848
          736
                                            429
          342
   0
             6
            8
            16
            32

Fig. Time in sec. of first 10 time steps for N GPUs and CPU cores. Motorbike test, simpleFoam with diagonal
(blue) and GAMG (green) preconditioners, SpeedIT is marked in red. Geometry is 32 millions cells. Tests
were performed at PLX CINECA.
Scaling of OpenFOAM  multi-GPU
      ¡ Complex car geometry, external steady state flow, simpleFOAM, 32M
        cells, pressure: GAMG, velocity: smoothGS. 
      
                                                                                     SpeedIT multiGPU
                Ideal Scaling
                                                                  12.00
140000
                                                                                                                                                 11
120000
                                                           10.00

                                                                                                                                                 9.23
100000
                                                                   8.00
                                                           8
                                                                                                                                   7.11
 80000
                                                                   6.00

 60000

                                                                   4.00
                                           4
 40000
                                                                                                              3.63

 20000
                                                                   2.00
                        2
                                                                                                     1.93

     0
                                                            0.00
           8
         16
        32
        64
        88
                    8
          16
                32
             64
           88




     Fig. Execution times of 2950 time steps (left) and scaling relative to 8 GPU (right).
OpenFOAM  GPU for industries


                                                                                                  Authors: Saeed Iqbal and Kevin Tubbs	





                                                                                                     Full report can be
                                                                                                     obtained at
                                                                                                     Dell Tech Center .	





    Figure 1: OpenFOAM performance of 3D cavity case using 4 million cells on a single node.
OpenFOAM  GPU for industries


                                                                                                          Authors: Saeed Iqbal and Kevin Tubbs	





    Figure 2: Total Power and Power Efficiency of 3D cavity case on 4 million cells on a single node.
OpenFOAM  GPU for industries


                                                                                                  Authors: Saeed Iqbal and Kevin Tubbs	





    Figure 1: OpenFOAM performance of 3D cavity case using 8 million cells on a single node.
OpenFOAM  GPU for industries


                                                                                                          Authors: Saeed Iqbal and Kevin Tubbs	





    Figure 2: Total Power and Power Efficiency of 3D cavity case on 4 million cells on a single node.
SpeedIT 3.0

What’s new:


¡ New ILU based preconditioner.

¡ Support for Keppler NVIDIA cards.

    ¡ much higher bandwidth (300GB/s).

    ¡ 5x faster intra-GPU communication thanks to GPU Direct 2.0.

¡ Release in the begining of 2013.
Acknowledgments
    ¡ Saeed Iqbal and Kevin Tubbs (DELL)
    ¡ Stan Posey, Edmondo Orlotti (NVIDIA)
    ¡ CINECA, PRACE
    





                                    Vratis Ltd.
                        Muchoborska 18, Wroclaw, Poland
                        Email: lukasz.miroslaw@vratis.com


              More information: speed-it.vratis.com, vratis.com/blog

Más contenido relacionado

La actualidad más candente

Performance Evaluation of SAR Image Reconstruction on CPUs and GPUs
Performance Evaluation of SAR Image Reconstruction on CPUs and GPUsPerformance Evaluation of SAR Image Reconstruction on CPUs and GPUs
Performance Evaluation of SAR Image Reconstruction on CPUs and GPUsFisnik Kraja
 
Simple asynchronous remote invocations for distributed real-time Java
Simple asynchronous remote invocations for distributed real-time JavaSimple asynchronous remote invocations for distributed real-time Java
Simple asynchronous remote invocations for distributed real-time JavaUniversidad Carlos III de Madrid
 
Ethernet Alliance Tech Exploration Forum Keynote
Ethernet Alliance Tech Exploration Forum KeynoteEthernet Alliance Tech Exploration Forum Keynote
Ethernet Alliance Tech Exploration Forum Keynotebkoley
 
Research Inventy : International Journal of Engineering and Science
Research Inventy : International Journal of Engineering and ScienceResearch Inventy : International Journal of Engineering and Science
Research Inventy : International Journal of Engineering and Scienceresearchinventy
 
Ieee 802.11 distributed coordination function with exponential increase expon...
Ieee 802.11 distributed coordination function with exponential increase expon...Ieee 802.11 distributed coordination function with exponential increase expon...
Ieee 802.11 distributed coordination function with exponential increase expon...ambitlick
 
Dimemas and Multi-Level Cache Simulations
Dimemas and Multi-Level Cache SimulationsDimemas and Multi-Level Cache Simulations
Dimemas and Multi-Level Cache SimulationsMário Almeida
 
Ofdm sim-matlab-code-tutorial web for EE students
Ofdm sim-matlab-code-tutorial web for EE studentsOfdm sim-matlab-code-tutorial web for EE students
Ofdm sim-matlab-code-tutorial web for EE studentsMike Martin
 
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)ijceronline
 
Ad hoc routing
Ad hoc routingAd hoc routing
Ad hoc routingits
 
Performance Analysis of Interconnect Drivers for Ultralow Power Applications
Performance Analysis of Interconnect Drivers for Ultralow Power ApplicationsPerformance Analysis of Interconnect Drivers for Ultralow Power Applications
Performance Analysis of Interconnect Drivers for Ultralow Power ApplicationsIDES Editor
 
A Low Control Overhead Cluster Maintenance Scheme for Mobile Ad hoc NETworks ...
A Low Control Overhead Cluster Maintenance Scheme for Mobile Ad hoc NETworks ...A Low Control Overhead Cluster Maintenance Scheme for Mobile Ad hoc NETworks ...
A Low Control Overhead Cluster Maintenance Scheme for Mobile Ad hoc NETworks ...IDES Editor
 

La actualidad más candente (19)

Performance Evaluation of SAR Image Reconstruction on CPUs and GPUs
Performance Evaluation of SAR Image Reconstruction on CPUs and GPUsPerformance Evaluation of SAR Image Reconstruction on CPUs and GPUs
Performance Evaluation of SAR Image Reconstruction on CPUs and GPUs
 
Evaluation aodv
Evaluation aodvEvaluation aodv
Evaluation aodv
 
Simple asynchronous remote invocations for distributed real-time Java
Simple asynchronous remote invocations for distributed real-time JavaSimple asynchronous remote invocations for distributed real-time Java
Simple asynchronous remote invocations for distributed real-time Java
 
2011.jtr.pbasanta.
2011.jtr.pbasanta.2011.jtr.pbasanta.
2011.jtr.pbasanta.
 
No Heap Remote Objects for Distributed real-time Java
No Heap Remote Objects for Distributed real-time JavaNo Heap Remote Objects for Distributed real-time Java
No Heap Remote Objects for Distributed real-time Java
 
Ethernet Alliance Tech Exploration Forum Keynote
Ethernet Alliance Tech Exploration Forum KeynoteEthernet Alliance Tech Exploration Forum Keynote
Ethernet Alliance Tech Exploration Forum Keynote
 
Research Inventy : International Journal of Engineering and Science
Research Inventy : International Journal of Engineering and ScienceResearch Inventy : International Journal of Engineering and Science
Research Inventy : International Journal of Engineering and Science
 
Ieee 802.11 distributed coordination function with exponential increase expon...
Ieee 802.11 distributed coordination function with exponential increase expon...Ieee 802.11 distributed coordination function with exponential increase expon...
Ieee 802.11 distributed coordination function with exponential increase expon...
 
Dimemas and Multi-Level Cache Simulations
Dimemas and Multi-Level Cache SimulationsDimemas and Multi-Level Cache Simulations
Dimemas and Multi-Level Cache Simulations
 
Ofdm sim-matlab-code-tutorial web for EE students
Ofdm sim-matlab-code-tutorial web for EE studentsOfdm sim-matlab-code-tutorial web for EE students
Ofdm sim-matlab-code-tutorial web for EE students
 
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)
 
Gpu archi
Gpu archiGpu archi
Gpu archi
 
Ad hoc routing
Ad hoc routingAd hoc routing
Ad hoc routing
 
Bs24461465
Bs24461465Bs24461465
Bs24461465
 
Performance Analysis of Interconnect Drivers for Ultralow Power Applications
Performance Analysis of Interconnect Drivers for Ultralow Power ApplicationsPerformance Analysis of Interconnect Drivers for Ultralow Power Applications
Performance Analysis of Interconnect Drivers for Ultralow Power Applications
 
A0960104
A0960104A0960104
A0960104
 
FlexRay Product Days RTaW
FlexRay Product Days RTaWFlexRay Product Days RTaW
FlexRay Product Days RTaW
 
A Low Control Overhead Cluster Maintenance Scheme for Mobile Ad hoc NETworks ...
A Low Control Overhead Cluster Maintenance Scheme for Mobile Ad hoc NETworks ...A Low Control Overhead Cluster Maintenance Scheme for Mobile Ad hoc NETworks ...
A Low Control Overhead Cluster Maintenance Scheme for Mobile Ad hoc NETworks ...
 
Fw3111471149
Fw3111471149Fw3111471149
Fw3111471149
 

Similar a SpeedIT : GPU-based acceleration of sparse linear algebra

Pushing the limits of CAN - Scheduling frames with offsets provides a major p...
Pushing the limits of CAN - Scheduling frames with offsets provides a major p...Pushing the limits of CAN - Scheduling frames with offsets provides a major p...
Pushing the limits of CAN - Scheduling frames with offsets provides a major p...Nicolas Navet
 
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...MLconf
 
Types Of Window Being Used For The Selected Granule
Types Of Window Being Used For The Selected GranuleTypes Of Window Being Used For The Selected Granule
Types Of Window Being Used For The Selected GranuleLeslie Lee
 
Implementation of Area Effective Carry Select Adders
Implementation of Area Effective Carry Select AddersImplementation of Area Effective Carry Select Adders
Implementation of Area Effective Carry Select AddersKumar Goud
 
Reproducible Emulation of Analog Behavioral Models
Reproducible Emulation of Analog Behavioral ModelsReproducible Emulation of Analog Behavioral Models
Reproducible Emulation of Analog Behavioral Modelsfnothaft
 
SRAM read and write and sense amplifier
SRAM read and write and sense amplifierSRAM read and write and sense amplifier
SRAM read and write and sense amplifierSoumyajit Langal
 
Pretzel: optimized Machine Learning framework for low-latency and high throug...
Pretzel: optimized Machine Learning framework for low-latency and high throug...Pretzel: optimized Machine Learning framework for low-latency and high throug...
Pretzel: optimized Machine Learning framework for low-latency and high throug...NECST Lab @ Politecnico di Milano
 
Grant Reaber “Wavenet and Wavenet 2: Generating high-quality audio with neura...
Grant Reaber “Wavenet and Wavenet 2: Generating high-quality audio with neura...Grant Reaber “Wavenet and Wavenet 2: Generating high-quality audio with neura...
Grant Reaber “Wavenet and Wavenet 2: Generating high-quality audio with neura...Lviv Startup Club
 
PERFORMANCE ANALYSIS OF SRAM CELL USING REVERSIBLE LOGIC GATES
PERFORMANCE ANALYSIS OF SRAM CELL USING REVERSIBLE LOGIC GATESPERFORMANCE ANALYSIS OF SRAM CELL USING REVERSIBLE LOGIC GATES
PERFORMANCE ANALYSIS OF SRAM CELL USING REVERSIBLE LOGIC GATESBUKYABALAJI
 
Parallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMPParallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMPAnil Bohare
 
Process Variation and Radiation-Immune Single Ended 6T SRAM Cell
Process Variation and Radiation-Immune Single Ended 6T SRAM CellProcess Variation and Radiation-Immune Single Ended 6T SRAM Cell
Process Variation and Radiation-Immune Single Ended 6T SRAM CellIDES Editor
 
MARC ONERA Toulouse2012 Altreonic
MARC ONERA Toulouse2012 AltreonicMARC ONERA Toulouse2012 Altreonic
MARC ONERA Toulouse2012 AltreonicEric Verhulst
 
Adaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and EigensolversAdaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and Eigensolversinside-BigData.com
 
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORSAFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORScscpconf
 
Affect of parallel computing on multicore processors
Affect of parallel computing on multicore processorsAffect of parallel computing on multicore processors
Affect of parallel computing on multicore processorscsandit
 
IRJET- Comparative Analysis of High Speed SRAM Cell for 90nm CMOS Technology
IRJET- Comparative Analysis of High Speed SRAM Cell for 90nm CMOS TechnologyIRJET- Comparative Analysis of High Speed SRAM Cell for 90nm CMOS Technology
IRJET- Comparative Analysis of High Speed SRAM Cell for 90nm CMOS TechnologyIRJET Journal
 

Similar a SpeedIT : GPU-based acceleration of sparse linear algebra (20)

Pushing the limits of CAN - Scheduling frames with offsets provides a major p...
Pushing the limits of CAN - Scheduling frames with offsets provides a major p...Pushing the limits of CAN - Scheduling frames with offsets provides a major p...
Pushing the limits of CAN - Scheduling frames with offsets provides a major p...
 
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
 
ADC
ADCADC
ADC
 
Types Of Window Being Used For The Selected Granule
Types Of Window Being Used For The Selected GranuleTypes Of Window Being Used For The Selected Granule
Types Of Window Being Used For The Selected Granule
 
Implementation of Area Effective Carry Select Adders
Implementation of Area Effective Carry Select AddersImplementation of Area Effective Carry Select Adders
Implementation of Area Effective Carry Select Adders
 
Reproducible Emulation of Analog Behavioral Models
Reproducible Emulation of Analog Behavioral ModelsReproducible Emulation of Analog Behavioral Models
Reproducible Emulation of Analog Behavioral Models
 
SRAM read and write and sense amplifier
SRAM read and write and sense amplifierSRAM read and write and sense amplifier
SRAM read and write and sense amplifier
 
Pretzel: optimized Machine Learning framework for low-latency and high throug...
Pretzel: optimized Machine Learning framework for low-latency and high throug...Pretzel: optimized Machine Learning framework for low-latency and high throug...
Pretzel: optimized Machine Learning framework for low-latency and high throug...
 
Grant Reaber “Wavenet and Wavenet 2: Generating high-quality audio with neura...
Grant Reaber “Wavenet and Wavenet 2: Generating high-quality audio with neura...Grant Reaber “Wavenet and Wavenet 2: Generating high-quality audio with neura...
Grant Reaber “Wavenet and Wavenet 2: Generating high-quality audio with neura...
 
Solution(1)
Solution(1)Solution(1)
Solution(1)
 
PERFORMANCE ANALYSIS OF SRAM CELL USING REVERSIBLE LOGIC GATES
PERFORMANCE ANALYSIS OF SRAM CELL USING REVERSIBLE LOGIC GATESPERFORMANCE ANALYSIS OF SRAM CELL USING REVERSIBLE LOGIC GATES
PERFORMANCE ANALYSIS OF SRAM CELL USING REVERSIBLE LOGIC GATES
 
Parallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMPParallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMP
 
Process Variation and Radiation-Immune Single Ended 6T SRAM Cell
Process Variation and Radiation-Immune Single Ended 6T SRAM CellProcess Variation and Radiation-Immune Single Ended 6T SRAM Cell
Process Variation and Radiation-Immune Single Ended 6T SRAM Cell
 
MARC ONERA Toulouse2012 Altreonic
MARC ONERA Toulouse2012 AltreonicMARC ONERA Toulouse2012 Altreonic
MARC ONERA Toulouse2012 Altreonic
 
rbm_hls
rbm_hlsrbm_hls
rbm_hls
 
Adaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and EigensolversAdaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and Eigensolvers
 
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORSAFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
 
Affect of parallel computing on multicore processors
Affect of parallel computing on multicore processorsAffect of parallel computing on multicore processors
Affect of parallel computing on multicore processors
 
IRJET- Comparative Analysis of High Speed SRAM Cell for 90nm CMOS Technology
IRJET- Comparative Analysis of High Speed SRAM Cell for 90nm CMOS TechnologyIRJET- Comparative Analysis of High Speed SRAM Cell for 90nm CMOS Technology
IRJET- Comparative Analysis of High Speed SRAM Cell for 90nm CMOS Technology
 
D031201021027
D031201021027D031201021027
D031201021027
 

Más de University of Zurich

Blood Flow Simulations in the Cloud
Blood Flow Simulations in the CloudBlood Flow Simulations in the Cloud
Blood Flow Simulations in the CloudUniversity of Zurich
 
Multimodal Image Processing in Cytology
Multimodal Image Processing in CytologyMultimodal Image Processing in Cytology
Multimodal Image Processing in CytologyUniversity of Zurich
 
Evolutionary-driven Optimization in Computational Chemistry
Evolutionary-driven Optimization in Computational ChemistryEvolutionary-driven Optimization in Computational Chemistry
Evolutionary-driven Optimization in Computational ChemistryUniversity of Zurich
 
Evolution-based Reaction Path Following
Evolution-based Reaction Path FollowingEvolution-based Reaction Path Following
Evolution-based Reaction Path FollowingUniversity of Zurich
 
Artificial Intelligence in High Content Screening and Cervical Cancer Diagnosis
Artificial Intelligence in High Content Screening and Cervical Cancer DiagnosisArtificial Intelligence in High Content Screening and Cervical Cancer Diagnosis
Artificial Intelligence in High Content Screening and Cervical Cancer DiagnosisUniversity of Zurich
 

Más de University of Zurich (7)

Blood Flow Simulations in the Cloud
Blood Flow Simulations in the CloudBlood Flow Simulations in the Cloud
Blood Flow Simulations in the Cloud
 
Multimodal Image Processing in Cytology
Multimodal Image Processing in CytologyMultimodal Image Processing in Cytology
Multimodal Image Processing in Cytology
 
Content-based Image Retrieval
Content-based Image RetrievalContent-based Image Retrieval
Content-based Image Retrieval
 
Evolutionary-driven Optimization in Computational Chemistry
Evolutionary-driven Optimization in Computational ChemistryEvolutionary-driven Optimization in Computational Chemistry
Evolutionary-driven Optimization in Computational Chemistry
 
Evolution-based Reaction Path Following
Evolution-based Reaction Path FollowingEvolution-based Reaction Path Following
Evolution-based Reaction Path Following
 
Artificial Intelligence in High Content Screening and Cervical Cancer Diagnosis
Artificial Intelligence in High Content Screening and Cervical Cancer DiagnosisArtificial Intelligence in High Content Screening and Cervical Cancer Diagnosis
Artificial Intelligence in High Content Screening and Cervical Cancer Diagnosis
 
SpeedIT FLOW
SpeedIT FLOWSpeedIT FLOW
SpeedIT FLOW
 

Último

Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 

Último (20)

Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 

SpeedIT : GPU-based acceleration of sparse linear algebra

  • 1. Multi-GPU simulations in OpenFOAM with SpeedIT technology.
  • 2. Attempt I: SpeedIT ¡ GPU-based library of iterative solvers for Sparse Linear Algebra and CFD. ¡ Current version: 2.2. Version 1.0 in 2008. ¡ CMRS format for fast SpMV and other storage formats. ¡ BiCGSTAB and CG iterative solvers. ¡ Preconditioners: Jacobi, AMG. ¡ Runs on one or more GPU cards. ¡ OpenFOAM compatibile. ¡ Possible Applications: ¡  Quantum Chemistry, ¡  Seminconductor and Power Network Design ¡  CFD (OpenFOAM) ¡  etc.
  • 3. SpeedIT and OpenFOAM Plugin to OpenFOAM ¡ GPL-based generic Plugin to ¡ Provides conversion between CSR OpenFOAM. and OpenFOAM LDU data formats. ¡ Easy substitution of CG/BCG with ¡ Any acceleration toolkit can be their GPU versions. integrated now. 1.  Edit system/controlDict file and add libSpeedIT.so ¡ OpenFOAM source code is 2.  Edit system/fvSolution and replace CG with CG_accel unchanged due to a plugin 3.  Compile the Plugin to OpenFOAM (wmake) concept. No recompilation of 4.  Complete library dependencies: $FOAM_USER_LIBBIN should contain GPU-based libraries. OpenFOAM is necessary.
  • 4. 0.2 10 100 1000 1 10 100 1000 µ µ SpeedIT vs. CUDA 5.0 nline) The speedup of our CMRS implementation of gainst scalar, vector and HYB kernels as a function Figure 2: (color online) The speedup of our CMRS implem against the best of all five alternative SpMV implementat ber of nonzero elements per row (µ). Results for given matrix. are omitted for clarity. SpMV is a key operation in scientific software. 3.5 vector 1.4 rid kernels give the shortest SpMV times 10 scalar 3 HYB their efficiency decreases as µ is increased, 1.2 2.5 r kernel being very inefficient for large µ. CMRS speedup CMRS speedup 1 of the scalar kernel is related to its inability speedup 2 a transfers if µ is large. The vector kernel 0.8 2 1.5 the opposite way: its efficiency relative to 1.5 1 very good for large µ, but it decreases as 1 0.6 1 ≈ 100. This is related to the fact that the 0.5 0.5 0.4 ocesses each row with a warp of 32 threads, 0 0 2 4 ost 0.2 efficient for sparse matrices with µ far 0 0.2 0 10 20 30 401 50 60 70 10 80 90 100 Interestingly, for 10 120 the speedup of 1 µ 100 1000 100 r the vector kernel can be approximated as µ σ µ Fig. Speedup of our CMRS implementation against Fig. Speedup of our CMRS implementation against a power law. scalar, vector (CSR) and HYB kernels as a function of Figure 3: (color (HYB/CSR formats). online) The speedup Figure 1: (color online) The speedup of our CMRS implementation of CUDA 5.0 online) The speedup of CMRS against Figure 2: (color it can be immediately seen that the CMRS (µ). plementation as a function of σ. Only matrices such that the mean number of nonzero elements per row the SpMV kernel against scalar, vector and HYB kernels as a function against the best of all five alternative ly does not yield much improvement over taken into account. given matrix. Inset: enlargement of the plot for σ ≤ of the mean number of nonzero elements per row (µ). Results for at for µ lp1-like matrices are omitted forto be system- 150, and tends clarity. than the HYB format for µ 20. Hence t the advantages of the CMRS will be most number of nonzero elements (ni ) per row 3.5
  • 5. OpenFOAM & single GPU
 Low-cost hardware GPU cost: $200 Test cases:
 ¡ Cavity 3D, 512K cells, icoFoam. ¡ Human aorta, 500K cells, simpleFoam. ¡ CPU: Intel Core 2 Duo E8400, 3GHz, 8GB RAM @ 800MHz
 GPU: NVIDIA GTX 460, VRAM 1GB
 Ubuntu 11.04 x64, OpenFOAM 2.0.1, CUDA Toolkit 4.1, ¡ Pressure equation solved with CG preconditioned with DIC, GAMG (CPU) and CG preconditioned with AMG (GPU).
  • 6. OpenFOAM & single GPU
 Validation Fig.Velocity and pressure profiles along x axis for aorta (top left) and pressure profile along x axis for cavity3D (top right). Solution for all preconditioners. Cross-section for velocity in X and Y direction for cavity3D run with OpenFOAM and SpeedIT (lines) and from literature for Re=100 (dotted line) and Re=400 (solid line) without turbulence model (icoFoam).
  • 7. OpenFOAM single GPU
 Performance 2000 2500 1600 2000 1200 1500 800 1000 400 500 0 CG+DIC 1C CG+DIC 2C GAMG 1C SpeedIT GAMG 2C 0 CG+DIC 2C CG+DIC 1C SpeedIT GAMG 2C GAMG 1C 4.00 3.77x 3.50 3.5 2.88x 3.00 2.40x 3 2.68x 2.50 2.5 2.00 2 1.50 1.27x 1.5 1.00 0.76 1 0.87 0.91 0.50 0.5 0.00 0 CG+DIC 1C CG+DIC 2C GAMG 1C GAMG 2C CG+DIC 1C CG+DIC 2C GAMG 2C GAMG 1C Fig. Execution times of 50 time steps (up) and acceleration factor for simpleFoam (left) and pisoFoam (right). Performed at Intel E8400@3GHz and GTX460. 1C and 2C stand for one, two cores respectively.
  • 8. OpenFoam multi-GPU ¡ Motivation: large geometries ( 8M celss) do not fit to a single GPU 
 card (with max. 6GB RAM). ¡ OpenFOAM performed the decomposition. MPI was used for communication between GPU cards. ¡ Tests were performed at IBM PLX CINECA with 2 six-cores Intel Westmere 2.40 GHz and 2 Tesla M2070 per node (supported by NVIDIA). One thread manages one GPU card. ¡ Tests focused on multi-GPU precondtioned CG with AMG to solve pressure equation in steady-state flows.
  • 9. OpenFoam multi-GPU 6000 5000 CPU GPU 4000 CPU, PCG with GAMG 3000 CPU, PCG with diagonal x1.59 x1.63 SpeedIT 2000 1353 1207 x1.62 x1.15 1000 699 394 848 736 429 342 0 6 8 16 32 Fig. Time in sec. of first 10 time steps for N GPUs and CPU cores. Motorbike test, simpleFoam with diagonal (blue) and GAMG (green) preconditioners, SpeedIT is marked in red. Geometry is 32 millions cells. Tests were performed at PLX CINECA.
  • 10. Scaling of OpenFOAM multi-GPU ¡ Complex car geometry, external steady state flow, simpleFOAM, 32M cells, pressure: GAMG, velocity: smoothGS. SpeedIT multiGPU Ideal Scaling 12.00 140000 11 120000 10.00 9.23 100000 8.00 8 7.11 80000 6.00 60000 4.00 4 40000 3.63 20000 2.00 2 1.93 0 0.00 8 16 32 64 88 8 16 32 64 88 Fig. Execution times of 2950 time steps (left) and scaling relative to 8 GPU (right).
  • 11. OpenFOAM GPU for industries
 Authors: Saeed Iqbal and Kevin Tubbs Full report can be obtained at Dell Tech Center . Figure 1: OpenFOAM performance of 3D cavity case using 4 million cells on a single node.
  • 12. OpenFOAM GPU for industries
 Authors: Saeed Iqbal and Kevin Tubbs Figure 2: Total Power and Power Efficiency of 3D cavity case on 4 million cells on a single node.
  • 13. OpenFOAM GPU for industries
 Authors: Saeed Iqbal and Kevin Tubbs Figure 1: OpenFOAM performance of 3D cavity case using 8 million cells on a single node.
  • 14. OpenFOAM GPU for industries
 Authors: Saeed Iqbal and Kevin Tubbs Figure 2: Total Power and Power Efficiency of 3D cavity case on 4 million cells on a single node.
  • 15. SpeedIT 3.0 What’s new: ¡ New ILU based preconditioner. ¡ Support for Keppler NVIDIA cards. ¡ much higher bandwidth (300GB/s). ¡ 5x faster intra-GPU communication thanks to GPU Direct 2.0. ¡ Release in the begining of 2013.
  • 16. Acknowledgments ¡ Saeed Iqbal and Kevin Tubbs (DELL) ¡ Stan Posey, Edmondo Orlotti (NVIDIA) ¡ CINECA, PRACE Vratis Ltd. Muchoborska 18, Wroclaw, Poland Email: lukasz.miroslaw@vratis.com More information: speed-it.vratis.com, vratis.com/blog