SpeedIT provides partial acceleration of sparse linear solvers. Acceleration is achieved with a single, reasonably priced NVIDIA Graphics Processing Unit (GPU) that supports CUDA, combined with proprietary advanced optimisation techniques.
Also check out SpeedIT FLOW, our RANS single-phase flow solver that runs fully on the GPU: vratis.com/blog
2. Attempt I: SpeedIT
¡ GPU-based library of iterative solvers for Sparse Linear Algebra and CFD.
¡ Current version: 2.2. Version 1.0 in 2008.
¡ CMRS format for fast SpMV and other storage formats.
¡ BiCGSTAB and CG iterative solvers.
¡ Preconditioners: Jacobi, AMG.
¡ Runs on one or more GPU cards.
¡ OpenFOAM compatible.
¡ Possible Applications:
¡ Quantum Chemistry,
¡ Semiconductor and Power Network Design,
¡ CFD (OpenFOAM)
¡ etc.
3. SpeedIT and OpenFOAM
Plugin to OpenFOAM
¡ GPL-based generic Plugin to OpenFOAM.
¡ Provides conversion between CSR and OpenFOAM LDU data formats.
¡ Easy substitution of CG/BCG with their GPU versions.
¡ Any acceleration toolkit can be integrated now.
¡ OpenFOAM source code is unchanged due to the plugin concept. No recompilation of OpenFOAM is necessary.
1. Edit the system/controlDict file and add libSpeedIT.so
2. Edit system/fvSolution and replace CG with CG_accel
3. Compile the Plugin to OpenFOAM (wmake)
4. Complete library dependencies: $FOAM_USER_LIBBIN should contain the GPU-based libraries.
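Steps 1 and 2 above amount to two small dictionary edits. A sketch of what they might look like follows; the `libs` entry and the `CG_accel` solver name come from the text, while the preconditioner keyword shown here is a guess — check the SpeedIT documentation for the exact entries of your version.

```
// system/controlDict — load the plugin (step 1)
libs ( "libSpeedIT.so" );

// system/fvSolution — swap the CPU solver for its GPU version (step 2)
solvers
{
    p
    {
        solver          CG_accel;   // was: CG (runs on the GPU via the plugin)
        preconditioner  AMG;        // hypothetical keyword; consult SpeedIT docs
        tolerance       1e-06;
        relTol          0;
    }
}
```

Because the substitution happens entirely through the dictionary, OpenFOAM itself needs no recompilation, exactly as the plugin concept above promises.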
4. SpeedIT vs. CUDA 5.0
SpMV is a key operation in scientific software.
Fig. (color online) Speedup of our CMRS implementation of the SpMV kernel against the scalar, vector (CSR) and HYB kernels as a function of the mean number of nonzero elements per row (µ). Results for lp1-like matrices are omitted for clarity.
Fig. (color online) Speedup of our CMRS implementation against the best of all five alternative SpMV implementations as a function of σ, the standard deviation of the number of nonzero elements (ni) per row. Inset: enlargement of the plot for small σ.
¡ The hybrid kernels give the shortest SpMV times for small µ, but their efficiency decreases as µ is increased, the scalar kernel being very inefficient for large µ.
¡ The inefficiency of the scalar kernel is related to its inability to coalesce data transfers if µ is large.
¡ The vector kernel behaves the opposite way: its efficiency relative to CMRS is very good for large µ, but it decreases as µ drops below ≈ 100. This is related to the fact that the vector kernel processes each row with a warp of 32 threads and so is most efficient for sparse matrices with µ far above 32.
¡ Interestingly, for 10 ≲ µ ≲ 120 the speedup of CMRS over the vector kernel can be approximated as a power law.
¡ CMRS does not yield much improvement over the vector kernel for µ ≳ 150, and tends to be systematically faster than the HYB format for µ ≳ 20; hence the advantages of CMRS are most visible at intermediate µ.
5. OpenFOAM & single GPU
Low-cost hardware
GPU cost: $200
Test cases:
¡ Cavity 3D, 512K cells, icoFoam.
¡ Human aorta, 500K cells, simpleFoam.
¡ CPU: Intel Core 2 Duo E8400, 3 GHz, 8 GB RAM @ 800 MHz
¡ GPU: NVIDIA GTX 460, 1 GB VRAM
¡ Software: Ubuntu 11.04 x64, OpenFOAM 2.0.1, CUDA Toolkit 4.1
¡ Pressure equation solved with CG preconditioned with DIC or GAMG (CPU), and CG preconditioned with AMG (GPU).
6. OpenFOAM & single GPU
Validation
Fig. Velocity and pressure profiles along the x axis for the aorta (top left) and pressure profile along the x axis for cavity3D (top right). Solutions are shown for all preconditioners.
Cross-sections of velocity in the X and Y directions for cavity3D run with OpenFOAM and SpeedIT (lines) and from the literature for Re=100 (dotted line) and Re=400 (solid line), without a turbulence model (icoFoam).
7. OpenFOAM & single GPU
Performance
[Bar charts: execution times in seconds for CG+DIC (1C, 2C), GAMG (1C, 2C) and SpeedIT. Acceleration factors: 3.77x, 2.88x, 2.68x, 2.40x, 1.27x, 0.91, 0.87 and 0.76, depending on the reference solver, core count and application.]
Fig. Execution times of 50 time steps (top) and acceleration factors for simpleFoam (left) and pisoFoam (right). Performed on an Intel E8400 @ 3 GHz and a GTX 460. 1C and 2C stand for one and two cores, respectively.
8. OpenFOAM multi-GPU
¡ Motivation: large geometries (> 8M cells) do not fit on a single GPU card (with at most 6 GB RAM).
¡ OpenFOAM performed the decomposition. MPI was used for communication between GPU cards.
¡ Tests were performed on IBM PLX at CINECA, with 2 six-core Intel Westmere 2.40 GHz CPUs and 2 Tesla M2070 cards per node (supported by NVIDIA). One thread manages one GPU card.
¡ Tests focused on multi-GPU preconditioned CG with AMG to solve the pressure equation in steady-state flows.
9. OpenFOAM multi-GPU
[Bar chart: execution times in seconds for N = 6, 8, 16 and 32 cores/GPUs. CPU bars show PCG with GAMG and PCG with a diagonal preconditioner; SpeedIT is marked separately. Reported times include 1353, 1207, 848, 736, 699, 429, 394 and 342 s; SpeedIT speedups: x1.15, x1.59, x1.62 and x1.63.]
Fig. Time in seconds of the first 10 time steps for N GPUs and CPU cores. Motorbike test, simpleFoam with diagonal (blue) and GAMG (green) preconditioners; SpeedIT is marked in red. The geometry has 32 million cells. Tests were performed at PLX CINECA.
10. Scaling of OpenFOAM multi-GPU
¡ Complex car geometry, external steady-state flow, simpleFoam, 32M cells; pressure: GAMG, velocity: smoothGS.
[Charts: execution times in seconds (left) and scaling relative to 8 GPUs (right) for 8, 16, 32, 64 and 88 GPUs. Measured scaling: 1.93 (16), 3.63 (32), 7.11 (64) and 9.23 (88) vs. ideal values of 2, 4, 8 and 11. Legend: SpeedIT multiGPU, Ideal Scaling.]
Fig. Execution times of 2950 time steps (left) and scaling relative to 8 GPUs (right).
11. OpenFOAM GPU for industries
Authors: Saeed Iqbal and Kevin Tubbs
Full report can be obtained at Dell Tech Center.
Figure 1: OpenFOAM performance of 3D cavity case using 4 million cells on a single node.
12. OpenFOAM GPU for industries
Figure 2: Total Power and Power Efficiency of 3D cavity case on 4 million cells on a single node.
13. OpenFOAM GPU for industries
Figure 1: OpenFOAM performance of 3D cavity case using 8 million cells on a single node.
14. OpenFOAM GPU for industries
Figure 2: Total Power and Power Efficiency of 3D cavity case on 4 million cells on a single node.
15. SpeedIT 3.0
What’s new:
¡ New ILU-based preconditioner.
¡ Support for Kepler NVIDIA cards:
¡ much higher bandwidth (300 GB/s),
¡ 5x faster inter-GPU communication thanks to GPUDirect 2.0.
¡ Release in the beginning of 2013.
16. Acknowledgments
¡ Saeed Iqbal and Kevin Tubbs (DELL)
¡ Stan Posey, Edmondo Orlotti (NVIDIA)
¡ CINECA, PRACE
Vratis Ltd.
Muchoborska 18, Wroclaw, Poland
Email: lukasz.miroslaw@vratis.com
More information: speed-it.vratis.com, vratis.com/blog