Extreme‐Scale Parallel Symmetric Eigensolver for Very Small‐Size Matrices Using A Communication-Avoiding for Pivot Vectors

Takahiro Katagiri
(Information Technology Center, The University of Tokyo)
Jun'ichi Iwata and Kazuyuki Uchida
(Department of Applied Physics School of Engineering,
The University of Tokyo)
Thursday, February 20, Room: Salon A, 10:35‐10:55
MS34 Auto‐tuning Technologies for Extreme‐Scale Solvers ‐ Part I of III
SIAM PP14, Feb.18‐21, 2014, Marriott Portland Downtown Waterfront, Portland, OR., USA

Outline
• Target Application: RSDFT
• Parallel Algorithm of Symmetric
Eigensolver for Small Matrices
• Performance Evaluation with 76,800
cores of the Fujitsu FX10
• Conclusion

RSDFT (Real Space Density Functional Theory)
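$$\left[ -\frac{1}{2}\nabla^{2} + v_{ion}(\mathbf{r}) + \int \frac{\rho(\mathbf{r}')}{|\mathbf{r}-\mathbf{r}'|}\, d\mathbf{r}' + \frac{\delta E_{XC}[\rho]}{\delta \rho(\mathbf{r})} \right] \phi_j(\mathbf{r}) = \varepsilon_j \phi_j(\mathbf{r})$$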
The Kohn-Sham equation is solved as a finite-difference equation.
J.-I. Iwata et al., J. Comp. Phys. 229, 2339 (2010).
[Figure: 10,648-atom cell of Si crystal and its electron density.]
[Figure: Structural properties of Si crystal. Volume of Si crystal vs. total energy. x-axis: volume/atom (18 to 21); y-axis: energy/atom (eV) (0 to 0.07); series: 10,648 atoms and 21,952 atoms.]

Requirements of
Mathematical Software from RSDFT
• An FFT-free algorithm.
• Computation of all eigenvalues and eigenvectors of
a dense real symmetric matrix.
– Standard eigenproblem.
– Executed O(100) times during the SCF (Self-Consistent Field) process.
• Re-orthogonalization of the eigenvectors.
• Because of their computational complexity, the eigensolver
and orthogonalization parts become a bottleneck.
– These parts require O(N^3) computations, while the others require O(N^2)
computations.
• The matrix and eigenvalues are distributed to obtain
parallelism for the parts other than the eigensolver.
– It is difficult to obtain the whole data, even if it is small.

Requirements of
Mathematical Software from RSDFT (Cont’d)
• Parts of the application other than the eigensolver are also time-consuming.
Source: Y. Hasegawa et al.: First-principles calculations of electron states of
a silicon nanowire with 100,000 atoms on the K computer, SC11 (2011)
RSDFT Processes Breakdown
Process                     Execution cost relative to whole time [%]   Order
SCF                         99.6                                         O(N^3)
SD                          47.2                                         O(N^3)
Subspace Diag.              44.2                                         O(N^3)
MatE                        10.0                                         O(N^3), DGEMM
Eigensolve                  19.6                                         O(N^3)
Rot V                       14.6                                         O(N^3)
CG (Conjugate Gradient)     26.0                                         O(N^2)
GS (Gram-Schmidt Ort.)      25.8                                         O(N^3), DGEMM
Others                      0.6                                          -
The Eigensolve and GS parts will be the
bottleneck in large-scale computation,
but the other processes also need
to be considered.
• The required memory space also needs to be considered.
– Because of the API of numerical libraries (for example, re-distribution of data), the actual problem
size is limited to small sizes by the remaining memory space.

Our Assumption
• Target: The eigensolver part in RSDFT.
• Exa-scale computing: The total number of nodes is
on the order of 1,000,000 (a million).
• Since the matrix is two-dimensional (2D),
the size of the matrix required on exa-scale computers
reaches the order of
10,000 * sqrt(1,000,000) = 10,000,000 (ten million)
if each node holds a matrix of N=10,000.
• Since most dense solvers have O(N^3)
computational complexity, the execution time
for a matrix of
N=10,000,000 (ten million) is unrealistic
in actual applications (in the production-run phase).
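The estimate on this slide is simple arithmetic and can be reproduced in a few lines. The sketch below is illustrative only: the function names are hypothetical (not from the talk), and a square node grid is assumed, matching the sqrt(1,000,000) factor used on the slide.

```python
import math


def global_order(n_per_node, nodes):
    """Order N of the global matrix when each node of a square grid holds an n x n block."""
    side = math.isqrt(nodes)
    if side * side != nodes:
        raise ValueError("this sketch assumes a square node grid")
    return n_per_node * side


def cubic_cost_ratio(n_large, n_small):
    """How much more O(N^3) work n_large needs than n_small (constant factors cancel)."""
    return (n_large / n_small) ** 3


if __name__ == "__main__":
    n = global_order(10_000, 1_000_000)
    print("global order:", n)
    print("work relative to one N=10,000 problem:", cubic_cost_ratio(n, 10_000))
```

Under these assumptions the total O(N^3) work grows by a factor of 10^9 relative to a single N=10,000 problem, while the node count grows by only 10^6.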

Our Assumption (Cont’d)
• We presume that N=1,000 per node is the
maximum size. The size at exa-scale is then on the
order of N=1,000,000 (a million).
• The memory used by the matrix on each node is
only on the order of 8 MB.
– Note: This is for the eigensolver part only.
• This is about the cache size of current CPUs.
– Next-generation CPUs may have caches on the order of
100 MB.
• For example, the IBM Power8 with e-DRAM (3D stacked memory)
for its L4 cache.
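A quick check of the 8 MB figure is sketched below. It assumes 8-byte double-precision elements (the slide does not state the element size), and the helper name is hypothetical.

```python
def dense_matrix_bytes(n, bytes_per_element=8):
    """Storage for a dense n x n matrix.

    8 bytes per element is an assumption (double precision).
    """
    return n * n * bytes_per_element


if __name__ == "__main__":
    for n in (1_000, 10_000):
        print("N =", n, "->", dense_matrix_bytes(n) / 1e6, "MB")
```

The same helper shows why N=10,000 per node would not fit in cache: the footprint grows with the square of N.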

Originalities of Our Eigensolver
1. Non-blocking (unblocked) computation algorithm
– The data fits in cache under our assumption for exa-scale
computing.
2. Communication-reducing and
communication-avoiding algorithm
– For the tridiagonalization and Householder inverse
transformation of symmetric eigensolvers.
– By duplicating Householder vectors.
3. Hybrid MPI-OpenMP execution
– With the full system of a peta-scale supercomputer
(the Fujitsu FX10) consisting of 4,800 nodes
(76,800 cores).

A Classical Householder Algorithm
(Standard Eigenproblem $Ax = \lambda x$)
A: symmetric dense matrix
1. Householder transformation (tridiagonalization): $Q^{T} A Q = T$, where $Q = H_1 H_2 \cdots H_{n-2}$ and T is a tridiagonal matrix. Cost: $O(n^3)$.
2. Bisection on the tridiagonal matrix T: all eigenvalues Λ.
3. Inverse iteration on the tridiagonal matrix T: all eigenvectors Y.
Steps 2 and 3 cost $O(n^2)$ to $O(n^3)$; MRRR: $O(n^2)$.
4. Householder inverse transformation for the dense matrix A: all eigenvectors $X = QY$. Cost: $O(n^3)$.
Whole Parallel Processes on the Eigensolver
Data distribution: 2D cyclic-cyclic distribution.
1. Tridiagonalization: A → T.
2. Gather all elements of T, so that the processes hold copies of T.
3. Compute the upper and lower limits for the eigenvalues.
4. Compute the eigenvalues Λ: 1, 2, 3, 4, ... (rising order).
5. Compute the eigenvectors Y: 1, 2, 3, 4, ... (corresponding to the rising order of the eigenvalues).
6. Gather all eigenvalues Λ.
7. Householder inverse transformation.
Data Duplication in Tridiagonalization
[Figure: matrix A on a grid of p processes by q processes. Labels: vectors uk, xk (uk: Householder vector) with duplication of uk; vectors yk with duplication of yk.]

Transposed yk in Tridiagonalization (The case of p < q)
[Figure: process grid with p = 2 and q = 4, with a rectangular processor grid [Katagiri and Itoh, 2010]. Labels: yk; duplication of yk; root processes; multi-casting with MPI_ALLREDUCE. Annotation: communication avoiding by using the duplications.]
Parallel Householder Inverse Transformation
do k = n-2, 1, -1
  Gather the vector $u_k$ and scalar $\alpha_k$
  by using multiple MPI_BCASTs.
  do i = nstart, nend
    $\sigma_i = \alpha_k u_k^{T} A^{(k)}_{k:n,i}$
    $A^{(k)}_{k:n,i} = A^{(k)}_{k:n,i} - \sigma_i u_k$
  enddo
enddo
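The inner update of the inverse-transformation loop can be sketched serially. The code below applies reflectors of the form $H_k = I - \alpha_k u_k u_k^T$ to the columns of Y in reverse order, so that the result is $X = H_1 H_2 \cdots H_m Y$. The MPI gathering of $u_k$ is omitted, and the function name, the data layout (lists of rows), and the symbol names sigma and alpha are assumptions made for this example.

```python
def householder_inverse_transform(y, reflectors):
    """Return X = H_1 H_2 ... H_m Y without modifying y.

    reflectors is a list of (u, alpha) pairs ordered k = 1..m, where
    H_k = I - alpha * u u^T and u is a full-length vector (zero-padded
    above its first active entry).
    """
    n = len(y)
    ncols = len(y[0])
    x = [row[:] for row in y]
    for u, alpha in reversed(reflectors):   # k = m, ..., 1 as in the outer loop
        for i in range(ncols):              # on one process: columns nstart..nend
            sigma = alpha * sum(u[r] * x[r][i] for r in range(n))
            for r in range(n):
                x[r][i] -= sigma * u[r]
    return x


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    u = [3.0, 4.0]
    alpha = 2.0 / (u[0] ** 2 + u[1] ** 2)   # makes H an orthogonal reflector
    print(householder_inverse_transform(identity, [(u, alpha)]))
```

Each column is updated independently, which is why the columns can be split across processes once every process holds $u_k$ and $\alpha_k$.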
　 

Gathering vector uk for Inverse Transformation
: Non-packing messages for gathering uk
[Figure: grid of p processes by q processes with p = 2 and q = 4, showing uk and the duplication of uk. Step ①: multi-casting with MPI_BCAST. Step ②: multi-casting with MPI_BCAST. Annotation: communication avoiding by using the duplications.]

Gathering vector uk for Inverse Transformation
: Packing messages for gathering uk
[Figure: grid of p processes by q processes with p = 2 and q = 4, showing uk, uk+1, and the duplication of uk. Step ①: multi-casting with MPI_BCAST. Step ②: multi-casting with MPI_BCAST. Annotation: communication avoiding and reducing by packing the messages uk. The two vectors are sent by one communication (communication blocking). Communication blocking length = 2.]
Oakleaf-FX (ITC, U. Tokyo), The Fujitsu PRIMEHPC FX10
4,800 Nodes (76,800 Cores)

Whole System
Contents                               Specifications
Total Performance                      1.135 PFLOPS
Total Memory Amounts                   150 TB
Total #nodes                           4,800
Interconnection                        The TOFU (6-dimension mesh / torus)
Local File System Amounts              1.1 PB
Shared File System Amounts             2.1 PB

Node
Contents                               Specifications
Theoretical Peak Performance           236.5 GFLOPS
#Processors (#Cores)                   16
Main Memory Amounts                    32 GB

Processor
Contents                               Specifications
Processor Name                         SPARC64 IX-fx
Frequency                              1.848 GHz
Theoretical Peak Performance (Core)    14.78 GFLOPS

COMMUNICATION AVOIDING
EFFECT

(4096 Nodes (65,536 Cores), 64x64), N=38,400, Hybrid
[Figure: stacked bar chart of time in seconds (0 to 90) by communication implementation. x-axis labels as extracted: "MPI_BCAST Binary Tree MPI_Isend Block MPI_BCAST", annotated with "Non-packing Sending", "Packing Sending", and "Non-blocking MPI". Legend: Other, HIT Ker, Send Piv. Annotation: 1.57x.]
The best parameters: #Processes = 4096, #Threads = 16/node, Comm. Block = 12.

Pure MPI vs. Hybrid MPI-OpenMP
(64 Nodes (1024 Cores)), N=4800, Total Time
[Figure: stacked bar chart of time in seconds (0 to 3.5) by process organization: 16x64 (Pure MPI) and 8x8 (Hybrid MPI; 64 MPI processes, 16 OMP threads per MPI process). Legend: Householder Inv, Calculating Eigenvectors, Re-distribution, Tridiagonalization. Annotation: 1.61x.]

(64 Nodes (1024 Cores)), N=4800, Tridiagonalization
[Figure: stacked bar chart of time in seconds (0 to 2.5). Legend: Other, Update, MatVec, MatVec Reduce, Send xt, Send yt, Send Piv. Communication and computation shares as labeled: 27.9%, 46.1%, 72.1%, 53.9%. Annotation: 18.2-point reduction.]

(64 Nodes (1024 Cores)), N=4800,
[Figure: stacked bar chart of time in seconds (0 to 0.7). Legend: Other, HIT Ker, Send Piv. Communication and computation shares as labeled: 15.6%, 44.6%, 84.4%, 55.4%. Annotation: 29-point reduction.]

FX10 76800 CORES (4800 NODES)
RESULTS

Hybrid MPI-OpenMP Execution
in 4800 nodes (76,800 Cores) (40x120)
[Figure: stacked bar chart of time in seconds (0 to 1600) for N=41568, N=83138, and N=166276. Legend: Householder Inv, Calculating Eigenvec, Re-dist, Tridiag. Data labels as extracted: 31.8, 83.3, 429.9, 34.3, 180.1, 904.0. Bar annotations: HIT comm. block=6, HIT comm. block=4, HIT comm. block=2. Ratio annotations: 2.61x, 5.24x, 5.16x, 5.01x, 3.97x, 5.05x. Other annotations: "Inner L1 Cache Size"; "Only 4x increase with 2x problem size in O(N^3) algorithm".]
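The growth annotations on this chart can be compared with the growth a pure O(N^3) cost would predict. The sketch below is only a rough check: it assumes that the data labels 31.8, 83.3, and 429.9 form one series ordered by N (the extracted slide text does not say which chart component they belong to), and the function names are hypothetical.

```python
def observed_growth(t_small, t_large):
    """Ratio of two measured times."""
    return t_large / t_small


def expected_growth(n_small, n_large, exponent=3):
    """Growth predicted for a cost proportional to N**exponent."""
    return (n_large / n_small) ** exponent


if __name__ == "__main__":
    sizes = [41568, 83138, 166276]
    times = [31.8, 83.3, 429.9]  # assumption: one series, ordered by N
    for k in range(1, len(sizes)):
        print(sizes[k],
              "observed:", round(observed_growth(times[k - 1], times[k]), 2),
              "cubic prediction:", round(expected_growth(sizes[k - 1], sizes[k]), 2))
```

Doubling N predicts roughly an 8x increase for a cubic cost, so observed factors well below 8 are consistent with the slide's remark about the modest increase.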

Execution Time in Pure MPI:
ScaLAPACK PDSYEVD vs. Ours
ScaLAPACK (version 1.8) on the Fujitsu FX10. Fujitsu-optimized BLAS is used.
The best block size for each ScaLAPACK execution is selected from
1, 8, 16, 32, 64, 128, and 256.
[Figure: bar chart of time in seconds (0 to 30), with a "Better" direction marker. Categories: N=4800 (8x8), 64 cores; N=9600 (16x16), 256 cores; N=19200 (32x32), 1024 cores. Series: ScaLAPACK, Ours. Data labels as extracted: 4.26, 10.96, 25.76 and 1.79, 4.61, 15.52.]
Conclusion
• Our eigensolver is effective for very small matrices because it
utilizes communication-reducing and communication-avoiding
techniques:
– Duplicating Householder vectors in the
tridiagonalization and Householder inverse
transformation phases.
– Using reduced communication for multiple sends
with 2D splitting of the process grid.
– Packing messages in the Householder inverse
transformation part.
• Selecting the implementation of the communication
processes is the target of auto-tuning (AT).
– The best implementation depends on the process grid, the
number of processors, and the block size for data packing.

Conclusion (Cont’d)
• One drawback is the increase in memory space.
– The memory space is $O(N^2/p)$, where the process grid is p * q.
– Since the memory space for the matrix fits within the cache size, the
increase in memory space can be ignored.
• Comparison with new blocking algorithms is
future work.
– 2-step method with block Householder
tridiagonalization.
• Eigen-K (Riken)
• ELPA (Technische Universität München)
• A new implementation of PLASMA and MAGMA

Acknowledgements
• The computational resources of the Fujitsu FX10
were awarded by the
“Large-scale HPC Challenge” Project,
Information Technology Center,
The University of Tokyo.
This topic was submitted to Parallel Computing
(as of December 2013).
