We have developed a parallel eigensolver for very small matrices. Unlike conventional solvers, our design focuses on non-blocking computation and reduced communication. A communication-avoiding approach for Householder pivot vectors is used to implement part of the Householder inverse transformation. In addition, we reduce communication in the tridiagonalization part by using non-blocking communications. The performance of the solver using all nodes of the Fujitsu FX10 (76,800 cores) is also presented.
4. RSDFT (Real Space Density Functional Theory)
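$$\left[ -\frac{1}{2}\nabla^{2} + v_{ion}(\mathbf{r}) + \int \frac{\rho(\mathbf{r}')}{|\mathbf{r}-\mathbf{r}'|}\, d\mathbf{r}' + \frac{\delta E_{XC}[\rho]}{\delta \rho(\mathbf{r})} \right] \phi_{j}(\mathbf{r}) = \varepsilon_{j}\, \phi_{j}(\mathbf{r})$$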
The Kohn-Sham equation is solved as a finite-difference equation.
J.-I. Iwata et al., J. Comp. Phys. 229, 2339 (2010).
10648-atom cell of Si crystal and its electron density
[Figure: volume of Si crystal vs. total energy. Horizontal axis: volume/atom; vertical axis: energy/atom (eV); curves for 10,648 atoms and 21,952 atoms.]
Structural properties of Si crystal
5. Requirements of Mathematical Software from RSDFT
• An FFT‐free algorithm.
• Computation of all eigenvalues and eigenvectors of a dense real symmetric matrix.
– Standard eigenproblem.
– Executed O(100) times in the SCF (Self-Consistent Field) process.
• Re-orthogonalization of eigenvectors.
• Due to their computational complexity, the eigensolver and orthogonalization parts become a bottleneck.
– These parts require O(N^3) computations, while the others require O(N^2) computations.
• The matrix and eigenvalues are distributed to obtain parallelism in the parts other than the eigensolver.
– It is difficult to gather the whole data, even if it is small.
7. Our Assumption
• Target: the eigensolver part in RSDFT.
• Exa-scale computing: the total number of nodes is on the order of 1,000,000 (one million).
• Since the matrix is two-dimensional (2D), the size of the matrix required on exa-scale computers reaches the order of
10,000 * sqrt(1,000,000) = 10,000,000 (ten million),
if each node holds a matrix of N = 10,000.
• Since most dense solvers have O(N^3) computational complexity, the execution time for a matrix of N = 10,000,000 (ten million) is unrealistic in actual applications (in the production-run phase).
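The arithmetic behind this assumption can be checked with a short sketch. The function names, the 1e18 flop/s machine rate, and the constant factor of 1 in front of N^3 are illustrative assumptions, not figures from the slides; real dense solvers have larger constants and run well below peak, so the time printed here is only an optimistic lower bound.

```python
import math


def global_matrix_dimension(local_n, num_nodes):
    """Global dimension N when a square grid of nodes each holds a local_n x local_n block."""
    grid_side = math.isqrt(num_nodes)
    if grid_side * grid_side != num_nodes:
        raise ValueError("num_nodes must be a perfect square for a square node grid")
    return local_n * grid_side


def dense_solver_seconds(n, flops_per_second):
    """Idealized time for n**3 floating-point operations (constant factor 1 is an assumption)."""
    return n ** 3 / flops_per_second


# Values from the slide: N = 10,000 per node, one million nodes.
N = global_matrix_dimension(10_000, 1_000_000)

# Hypothetical machine sustaining 1e18 flop/s on the whole solve.
seconds_per_solve = dense_solver_seconds(N, 1e18)

# The eigensolver is called O(100) times in the SCF process.
seconds_per_scf = 100 * seconds_per_solve

print(N, seconds_per_solve, seconds_per_scf)
```

Even under these idealized assumptions, the cubic growth makes a weak-scaled matrix far more expensive than the O(N^2) parts of the application, which is the motivation for targeting small matrices instead.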
11. A Classical Householder Algorithm
Standard eigenproblem: $A x = \lambda x$, where A is a symmetric dense matrix.
1. Householder Transformation (tridiagonalization)
– $Q^{T} A Q = T$, where T is a tridiagonal matrix.
– Cost: O(n^3)
2. Bisection
– T: tridiagonal matrix
– All eigenvalues: Λ
3. Inverse Iteration
– T: tridiagonal matrix
– All eigenvectors: Y
– Cost: O(n^2) ~ O(n^3); MRRR: O(n^2)
4. Householder Inverse Transformation
– A: dense matrix
– All eigenvectors: X = QY, where Q = H1 H2 … Hn-2
– Cost: O(n^3)
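Steps 1 and 4 can be illustrated with a minimal serial sketch. This is an unblocked textbook Householder reduction written for readability, not the parallel implementation described in these slides; the function names and the 4 x 4 test matrix are illustrative assumptions. It accumulates Q = H1 H2 … Hn-2 explicitly so that T = Q^T A Q, and eigenvectors Y of T map back to eigenvectors of A as X = QY.

```python
import math


def matmul(a, b):
    """Plain dense matrix product for lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def householder_tridiagonalize(a):
    """Return (t, q) with t = q^T a q tridiagonal and q orthogonal, for symmetric a."""
    n = len(a)
    t = [[float(v) for v in row] for row in a]
    q = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for k in range(n - 2):
        # Part of column k below the diagonal; the reflector maps it to alpha * e1.
        x = [t[i][k] for i in range(k + 1, n)]
        norm_x = math.sqrt(sum(v * v for v in x))
        if norm_x == 0.0:
            continue  # column is already in tridiagonal form
        alpha = -math.copysign(norm_x, x[0])  # sign chosen to avoid cancellation
        u = x[:]
        u[0] -= alpha
        norm_u = math.sqrt(sum(v * v for v in u))
        u = [v / norm_u for v in u]  # unit Householder vector, H = I - 2 u u^T
        m = len(u)
        # t <- H t: update rows k+1 .. n-1.
        for j in range(n):
            s = sum(u[i] * t[k + 1 + i][j] for i in range(m))
            for i in range(m):
                t[k + 1 + i][j] -= 2.0 * u[i] * s
        # t <- t H and q <- q H: update columns k+1 .. n-1.
        for mat in (t, q):
            for i in range(n):
                s = sum(mat[i][k + 1 + j] * u[j] for j in range(m))
                for j in range(m):
                    mat[i][k + 1 + j] -= 2.0 * s * u[j]
    return t, q


a = [[4, 1, -2, 2], [1, 2, 0, 1], [-2, 0, 3, -2], [2, 1, -2, -1]]
t, q = householder_tridiagonalize(a)
# Inverse transformation applied to T itself: q t q^T reproduces a up to rounding.
back = matmul(matmul(q, t), transpose(q))
```

Forming Q explicitly is a simplification for this sketch; production codes usually keep the Householder vectors uk and apply them one by one during the inverse transformation.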
12. Whole Parallel Processes on the Eigensolver
• A: 2D cyclic-cyclic distribution.
• Tridiagonalization of A gives T.
• Gather all elements of T (a copy of T on each process).
• Compute upper and lower limits for eigenvalues 1, 2, 3, 4, … (rising order), giving Λ.
• Compute eigenvectors 1, 2, 3, 4, … (corresponding to the rising order of the eigenvalues), giving Y.
• Gather all eigenvalues Λ.
• Householder inverse transformation.
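The 2D cyclic-cyclic distribution can be sketched as an index mapping. This sketch assumes a cyclic layout with block size 1 in both dimensions on a p x q process grid with 0-based indices; the function names are illustrative and do not come from the solver itself.

```python
def owner_and_local_index(i, j, p, q):
    """Map global entry (i, j) to its owning process and its local position.

    Returns ((process_row, process_col), (local_row, local_col)).
    """
    return (i % p, j % q), (i // p, j // q)


def local_shape(n, p, q, proc_row, proc_col):
    """Number of rows and columns of an n x n matrix stored on one process."""
    rows = (n - proc_row + p - 1) // p
    cols = (n - proc_col + q - 1) // q
    return rows, cols


# Example with the grid shape used in the later slides (p = 2, q = 4).
owner, local = owner_and_local_index(5, 6, 2, 4)
print(owner, local)  # → (1, 2) (2, 1)
```

Because consecutive rows and columns land on different processes, the shrinking trailing submatrix in tridiagonalization stays spread over the whole grid, which is the usual reason for choosing a cyclic layout.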
13. Data Duplication in Tridiagonalization
[Figure: matrix A on a grid of p processes by q processes, showing the duplication of the vectors uk and xk and the duplication of the vector yk. uk: Householder vector.]
14. Transposed yk in Tridiagonalization (The case of p < q)
With Rectangle Processor Grid [Katagiri and Itoh, 2010]
[Figure: grid of p processes by q processes (p = 2, q = 4) with the root processes marked; yk is multi-cast with MPI_ALLREDUCE, giving the duplication of yk.]
Communication avoiding by using the duplications.
16. Gathering vector uk for Inverse Transformation: Non-packing messages for gathering uk
[Figure: grid of p processes by q processes (p = 2, q = 4), showing the duplication of uk. ① Multi-casting with MPI_BCAST; ② multi-casting with MPI_BCAST.]
Communication avoiding by using the duplications.
17. Gathering vector uk for Inverse Transformation: Packing messages for gathering uk
[Figure: grid of p processes by q processes (p = 2, q = 4), showing the duplication of uk. ① Multi-casting with MPI_BCAST; ② multi-casting with MPI_BCAST.]
Communication avoiding and reducing by using packing of messages.
• uk, uk+1: send the two vectors by one communication → communication blocking.
• Communication blocking length = 2.
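The effect of the communication blocking length on the number of messages can be sketched without MPI. The helper names are hypothetical, and the sketch assumes all vectors have the same length (in practice the Householder vectors shrink with k, so a real code would pad them or send lengths alongside); each packed buffer stands for one MPI_BCAST call.

```python
def pack_vectors(vectors, blocking_length):
    """Group consecutive vectors into flat send buffers, one buffer per broadcast."""
    messages = []
    for start in range(0, len(vectors), blocking_length):
        group = vectors[start:start + blocking_length]
        messages.append([x for v in group for x in v])  # concatenate uk, uk+1, ...
    return messages


def unpack_vectors(message, vector_length):
    """Split one received buffer back into its individual vectors."""
    return [message[s:s + vector_length] for s in range(0, len(message), vector_length)]


def count_broadcasts(num_vectors, blocking_length):
    """Number of broadcasts needed: ceil(num_vectors / blocking_length)."""
    return (num_vectors + blocking_length - 1) // blocking_length


vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
messages = pack_vectors(vectors, 2)   # blocking length = 2, as on the slide
print(len(messages))                  # → 2
print(unpack_vectors(messages[0], 2)) # → [[1.0, 2.0], [3.0, 4.0]]
```

With blocking length 2, the number of broadcasts is roughly halved while the total data volume is unchanged, so the saving comes from paying the per-message latency fewer times.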
30. Conclusion (Cont’d)
• One drawback is the increase in memory space.
– O(N^2/p), where the process grid is p * q.
– Since the memory space for the matrix fits within the cache size, the increase in memory space can be ignored.
• Comparison with new blocking algorithms is future work.
– 2-step method with block Householder tridiagonalization.
  • Eigen-K (RIKEN)
  • ELPA (Technische Universität München)
  • A new implementation of PLASMA and MAGMA