SpeedIT provides partial acceleration of sparse linear solvers. Acceleration is achieved with a single, reasonably priced NVIDIA Graphics Processing Unit (GPU) that supports CUDA, combined with proprietary advanced optimisation techniques.
Also check out SpeedIT FLOW, our RANS single-phase flow solver that runs fully on the GPU: vratis.com/blog
2. Attempt I: SpeedIT
¡ GPU-based library of iterative solvers for Sparse Linear Algebra and CFD.
¡ Current version: 2.2. Version 1.0 in 2008.
¡ CMRS format for fast SpMV and other storage formats.
¡ BiCGSTAB and CG iterative solvers.
¡ Preconditioners: Jacobi, AMG.
¡ Runs on one or more GPU cards.
¡ OpenFOAM compatible.
¡ Possible Applications:
¡ Quantum Chemistry,
¡ Semiconductor and Power Network Design,
¡ CFD (OpenFOAM)
¡ etc.
3. SpeedIT and OpenFOAM
Plugin to OpenFOAM
¡ GPL-based generic Plugin to OpenFOAM.
¡ Provides conversion between CSR and OpenFOAM LDU data formats.
¡ Easy substitution of CG/BCG with their GPU versions.
¡ Any acceleration toolkit can be integrated now.
¡ OpenFOAM source code is unchanged due to the plugin concept. No recompilation of OpenFOAM is necessary.
1. Edit the system/controlDict file and add libSpeedIT.so
2. Edit system/fvSolution and replace CG with CG_accel
3. Compile the Plugin to OpenFOAM (wmake)
4. Complete library dependencies: $FOAM_USER_LIBBIN should contain the GPU-based libraries.
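Steps 1 and 2 above amount to two small dictionary edits. A sketch of what they might look like follows; the `libs` entry and the `CG_accel` solver name come from the text, while the preconditioner keyword shown here is a guess — check the SpeedIT documentation for the exact entries of your version.

```
// system/controlDict — load the plugin (step 1)
libs ( "libSpeedIT.so" );

// system/fvSolution — swap the CPU solver for its GPU version (step 2)
solvers
{
    p
    {
        solver          CG_accel;   // was: CG (runs on the GPU via the plugin)
        preconditioner  AMG;        // hypothetical keyword; consult SpeedIT docs
        tolerance       1e-06;
        relTol          0;
    }
}
```

Because the substitution happens entirely through the dictionary, OpenFOAM itself needs no recompilation, exactly as the plugin concept above promises.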
4. SpeedIT vs. CUDA 5.0
SpMV is a key operation in scientific software.
Fig. (color online) Speedup of our CMRS implementation of the SpMV kernel against the scalar, vector (CSR) and HYB kernels as a function of the mean number of nonzero elements per row (µ). Results for lp1-like matrices are omitted for clarity.
Fig. (color online) Speedup of our CMRS implementation against the best of all five alternative SpMV implementations as a function of σ, the standard deviation of the number of nonzero elements (ni) per row. Inset: enlargement of the plot for small σ.
¡ The hybrid kernels give the shortest SpMV times for small µ, but their efficiency decreases as µ is increased, the scalar kernel being very inefficient for large µ.
¡ The inefficiency of the scalar kernel is related to its inability to coalesce data transfers if µ is large.
¡ The vector kernel behaves the opposite way: its efficiency relative to CMRS is very good for large µ, but it decreases as µ drops below ≈ 100. This is related to the fact that the vector kernel processes each row with a warp of 32 threads and so is most efficient for sparse matrices with µ far above 32.
¡ Interestingly, for 10 ≲ µ ≲ 120 the speedup of CMRS over the vector kernel can be approximated as a power law.
¡ CMRS does not yield much improvement over the vector kernel for µ ≳ 150, and tends to be systematically faster than the HYB format for µ ≳ 20; hence the advantages of CMRS are most visible at intermediate µ.
5. OpenFOAM & single GPU
Low-cost hardware
GPU cost: $200
Test cases:
¡ Cavity 3D, 512K cells, icoFoam.
¡ Human aorta, 500K cells, simpleFoam.
¡ CPU: Intel Core 2 Duo E8400, 3 GHz, 8 GB RAM @ 800 MHz
¡ GPU: NVIDIA GTX 460, 1 GB VRAM
¡ Software: Ubuntu 11.04 x64, OpenFOAM 2.0.1, CUDA Toolkit 4.1
¡ Pressure equation solved with CG preconditioned with DIC or GAMG (CPU), and CG preconditioned with AMG (GPU).
6. OpenFOAM & single GPU
Validation
Fig. Velocity and pressure profiles along the x axis for the aorta (top left) and pressure profile along the x axis for cavity3D (top right). Solutions are shown for all preconditioners.
Cross-sections of velocity in the X and Y directions for cavity3D run with OpenFOAM and SpeedIT (lines) and from the literature for Re=100 (dotted line) and Re=400 (solid line), without a turbulence model (icoFoam).
7. OpenFOAM & single GPU
Performance
[Bar charts: execution times in seconds for CG+DIC (1C, 2C), GAMG (1C, 2C) and SpeedIT. Acceleration factors: 3.77x, 2.88x, 2.68x, 2.40x, 1.27x, 0.91, 0.87 and 0.76, depending on the reference solver, core count and application.]
Fig. Execution times of 50 time steps (top) and acceleration factors for simpleFoam (left) and pisoFoam (right). Performed on an Intel E8400 @ 3 GHz and a GTX 460. 1C and 2C stand for one and two cores, respectively.
8. OpenFOAM multi-GPU
¡ Motivation: large geometries (> 8M cells) do not fit on a single GPU card (with at most 6 GB RAM).
¡ OpenFOAM performed the decomposition. MPI was used for communication between GPU cards.
¡ Tests were performed on IBM PLX at CINECA, with 2 six-core Intel Westmere 2.40 GHz CPUs and 2 Tesla M2070 cards per node (supported by NVIDIA). One thread manages one GPU card.
¡ Tests focused on multi-GPU preconditioned CG with AMG to solve the pressure equation in steady-state flows.
9. OpenFOAM multi-GPU
[Bar chart: execution times in seconds for N = 6, 8, 16 and 32 cores/GPUs. CPU bars show PCG with GAMG and PCG with a diagonal preconditioner; SpeedIT is marked separately. Reported times include 1353, 1207, 848, 736, 699, 429, 394 and 342 s; SpeedIT speedups: x1.15, x1.59, x1.62 and x1.63.]
Fig. Time in seconds of the first 10 time steps for N GPUs and CPU cores. Motorbike test, simpleFoam with diagonal (blue) and GAMG (green) preconditioners; SpeedIT is marked in red. The geometry has 32 million cells. Tests were performed at PLX CINECA.
10. Scaling of OpenFOAM multi-GPU
¡ Complex car geometry, external steady-state flow, simpleFoam, 32M cells; pressure: GAMG, velocity: smoothGS.
[Charts: execution times in seconds (left) and scaling relative to 8 GPUs (right) for 8, 16, 32, 64 and 88 GPUs. Measured scaling: 1.93 (16), 3.63 (32), 7.11 (64) and 9.23 (88) vs. ideal values of 2, 4, 8 and 11. Legend: SpeedIT multiGPU, Ideal Scaling.]
Fig. Execution times of 2950 time steps (left) and scaling relative to 8 GPUs (right).
11. OpenFOAM GPU for industries
Authors: Saeed Iqbal and Kevin Tubbs
Full report can be obtained at Dell Tech Center.
Figure 1: OpenFOAM performance of 3D cavity case using 4 million cells on a single node.
12. OpenFOAM GPU for industries
Figure 2: Total Power and Power Efficiency of 3D cavity case on 4 million cells on a single node.
13. OpenFOAM GPU for industries
Figure 1: OpenFOAM performance of 3D cavity case using 8 million cells on a single node.
14. OpenFOAM GPU for industries
Figure 2: Total Power and Power Efficiency of 3D cavity case on 4 million cells on a single node.
15. SpeedIT 3.0
What’s new:
¡ New ILU-based preconditioner.
¡ Support for Kepler NVIDIA cards:
¡ much higher bandwidth (300 GB/s),
¡ 5x faster inter-GPU communication thanks to GPUDirect 2.0.
¡ Release in the beginning of 2013.
16. Acknowledgments
¡ Saeed Iqbal and Kevin Tubbs (DELL)
¡ Stan Posey, Edmondo Orlotti (NVIDIA)
¡ CINECA, PRACE
Vratis Ltd.
Muchoborska 18, Wroclaw, Poland
Email: lukasz.miroslaw@vratis.com
More information: speed-it.vratis.com, vratis.com/blog