Scratch to Supercomputers: Bottoms-up Build of Large-scale Computational Lensing Software

Lenstool-HPC
From scratch to supercomputers: building a large-scale
strong lensing computational software bottom-up
HPC Advisory Council, April 2018
Christoph Schäfer and Markus Rexroth (LASTRO)
Gilles Fourestey (SCITAS)

Gravitational lensing: Einstein ring (credit: NASA/Hubble)

Gravitational lensing
Light deflection caused by a distribution
of matter, according to Albert Einstein's general
theory of relativity (1916)
- Einstein's article on lensing by a star, 1936
(A. Einstein, Science)
- Fritz Zwicky posited in 1937 that the
effect could allow galaxy clusters to act
like lenses
- First observed in 1979: the "Twin QSO" SBS
0957+561
Twin QSO (center), Credit: ESA/Hubble & NASA

Gravitational lensing
Optical artefacts created by dense
mass distributions
- Galaxies
- Dark matter
- Black holes
Parametric Lens Model
- the ellipticity of the projected mass distribution
- ω the finite core radius
- the normalized surface mass density
- (x,y) the lens position...
“Reverse engineer” the lenses:
Recompose far-away objects by computing
the lenses’ mass
Typical search space dimension: > 10^10
https://briankoberlein.com/2014/08/01/bend-like-newton/
With the “vanilla” version of Lenstool, finding the optimal solution takes months!

Lenstool-HPC Motivations
Lenstool-HPC was developed:
- based on Lenstool (Prof. Kneib et al., from 1996 onwards),
- in 6 person-months (FTE),
- by two field scientists and one application expert,
- bottom-up, from scratch.
No separation of concerns:
- Field scientists define algorithmic constraints at every step
- Computer scientists provide the most optimized implementation on
specific hardware
- eXtreme(-ish) programming
Performance is scaled bottom-up:
- Focus on algorithms/kernels and data structures
- Performance scaling from core to full machine

Formalism
Source’s position on the source plane: The lens equation:
2D lensing potential:
Example of gradient (SIS):
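The formulas for this slide are not present in the text. As a hedged reminder, the standard thin-lens expressions are sketched below; the symbols (β for the source position, θ for the image position, ψ for the potential, κ for the normalized surface mass density, θ_E for the Einstein radius) are the usual textbook choices and are assumptions here, not necessarily the notation used on the slide.

```latex
% Lens equation: source position \beta obtained from image position \theta
\boldsymbol{\beta} = \boldsymbol{\theta} - \nabla\psi(\boldsymbol{\theta})

% 2D lensing potential generated by the normalized surface mass density \kappa
\psi(\boldsymbol{\theta}) = \frac{1}{\pi} \int \kappa(\boldsymbol{\theta}')\,
    \ln\lvert\boldsymbol{\theta} - \boldsymbol{\theta}'\rvert \, d^2\theta'

% Singular isothermal sphere (SIS) with Einstein radius \theta_E
\psi_{\mathrm{SIS}}(\boldsymbol{\theta}) = \theta_E\,\lvert\boldsymbol{\theta}\rvert,
\qquad
\nabla\psi_{\mathrm{SIS}}(\boldsymbol{\theta})
    = \theta_E\,\frac{\boldsymbol{\theta}}{\lvert\boldsymbol{\theta}\rvert}
```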

Strong Lensing Algorithm - Step 0
Model parameters: ellipticity of the projected mass distribution,
finite core radius, normalized surface mass density, position
Given a parametric model for all the lens types:
Step 0: Compute all the gradients for each pixel of the image
(~90% of the time to solution, TTS), using data-level parallelism (DLP)
Mapping algorithms to the hardware:
- High performance data structures (SOA)
- Implicit and Explicit (hand-coded) vectorization
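To illustrate the data-structure point, here is a minimal Python sketch (not the Lenstool-HPC code) that contrasts an array-of-structures (AOS) layout with a structure-of-arrays (SOA) layout for the per-pixel gradient. It assumes the simplest lens type, an SIS with gradient θ_E·θ/|θ|; the names `lenses_aos`, `to_soa`, `sis_gradient_aos` and `sis_gradient_soa` are hypothetical.

```python
import math
from array import array

# AOS: one record per lens (hypothetical example values).
lenses_aos = [
    {"x": 0.0, "y": 0.0, "theta_e": 1.0},
    {"x": 2.0, "y": 1.0, "theta_e": 0.5},
]

def to_soa(lenses):
    """SOA: one contiguous array of doubles per lens parameter."""
    return {
        "x": array("d", (lens["x"] for lens in lenses)),
        "y": array("d", (lens["y"] for lens in lenses)),
        "theta_e": array("d", (lens["theta_e"] for lens in lenses)),
    }

def sis_gradient_aos(px, py, lenses):
    """Sum the SIS gradients of all lenses at pixel (px, py), AOS layout."""
    gx = gy = 0.0
    for lens in lenses:
        dx, dy = px - lens["x"], py - lens["y"]
        r = math.hypot(dx, dy)
        if r > 0.0:  # the SIS gradient is undefined exactly at the lens centre
            gx += lens["theta_e"] * dx / r
            gy += lens["theta_e"] * dy / r
    return gx, gy

def sis_gradient_soa(px, py, soa):
    """Same sum, but each parameter is read from its own contiguous array."""
    gx = gy = 0.0
    for lx, ly, theta_e in zip(soa["x"], soa["y"], soa["theta_e"]):
        dx, dy = px - lx, py - ly
        r = math.hypot(dx, dy)
        if r > 0.0:
            gx += theta_e * dx / r
            gy += theta_e * dy / r
    return gx, gy
```

Python itself does not vectorize this loop. The point of the sketch is the layout: in a compiled language, the SOA form gives the compiler unit-stride loads over `x`, `y` and `theta_e`, which is what makes implicit or hand-coded vectorization effective.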

Strong Lensing Algorithm - Step 1
Figure: unlensing and relensing
Given a parametric model for all the lens types
Step 0: Compute all the gradients
Step 1a: unlensing (linear transformation), using thread-level parallelism (TLP)
- Lens the green dots (images) back to the source plane (yellow dots)
- Compute the barycenter of the yellow dots
Step 1b: relensing (non-linear transformation), using TLP
- Decompose the image plane into triangles
- Lens the triangles to the source plane
- If a lensed triangle contains the barycenter, a predicted image
is found (red triangles in the image plane)
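Steps 1a and 1b can be sketched as follows. This is a minimal illustration under stated assumptions, not the Lenstool-HPC implementation: the lens is a single SIS at the origin with a hypothetical Einstein radius `THETA_E`, the triangles are supplied by the caller, and all function names are invented for the example.

```python
import math

THETA_E = 1.0  # hypothetical Einstein radius of a single SIS lens at the origin

def gradient(x, y):
    """SIS gradient at image-plane position (x, y)."""
    r = math.hypot(x, y)
    if r == 0.0:
        return 0.0, 0.0
    return THETA_E * x / r, THETA_E * y / r

def to_source_plane(x, y):
    """Lens equation: source position = image position - gradient."""
    gx, gy = gradient(x, y)
    return x - gx, y - gy

def barycenter(points):
    """Mean position of the unlensed images (the yellow dots)."""
    n = len(points)
    return sum(p[0] for p in points) / n, sum(p[1] for p in points) / n

def triangle_contains(a, b, c, p):
    """True if p lies inside or on triangle abc (sign-of-cross-product test)."""
    def cross(o, u, v):
        return (u[0] - o[0]) * (v[1] - o[1]) - (u[1] - o[1]) * (v[0] - o[0])
    d1, d2, d3 = cross(a, b, p), cross(b, c, p), cross(c, a, p)
    has_neg = d1 < 0 or d2 < 0 or d3 < 0
    has_pos = d1 > 0 or d2 > 0 or d3 > 0
    return not (has_neg and has_pos)

def predicted_image_triangles(triangles, source):
    """Step 1b: keep the image-plane triangles whose lensed copy contains the source."""
    hits = []
    for tri in triangles:
        lensed = [to_source_plane(x, y) for x, y in tri]
        if triangle_contains(lensed[0], lensed[1], lensed[2], source):
            hits.append(tri)
    return hits

# Step 1a: two observed images of the same source, unlensed and averaged.
images = [(1.5, 0.0), (-0.5, 0.0)]
source = barycenter([to_source_plane(x, y) for x, y in images])

# Step 1b: one triangle around the first image, one far away from it.
near = [(1.4, -0.1), (1.6, -0.1), (1.5, 0.1)]
far = [(3.0, 3.0), (3.2, 3.0), (3.0, 3.2)]
found = predicted_image_triangles([near, far], source)
```

Each triangle is tested independently of the others, which is why this step maps naturally onto threads.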

Strong Lensing Algorithm - Step 2 & 3
Given a parametric model for all the lens types
Step 2: (MPI)
- Compute the Chi2
Step 3: Pass the Chi2 to a Bayesian MCMC code (MPI)
- Restart with a new set of parameters until “close” to reality
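The restart loop of Step 3 can be pictured with a plain Metropolis sampler. This is a swapped-in textbook technique chosen for brevity, not the Bayesian MCMC code used by Lenstool; the function `metropolis`, the toy `toy_chi2` and every parameter value below are hypothetical.

```python
import math
import random

def metropolis(chi2, start, step, n_steps, seed=0):
    """Random-walk Metropolis over model parameters, driven by a Chi2 function.

    A proposal is accepted with probability min(1, exp(-(chi2_new - chi2_old) / 2)),
    i.e. the likelihood is assumed to be proportional to exp(-chi2 / 2).
    Returns the best parameter set visited and its Chi2.
    """
    rng = random.Random(seed)
    current = list(start)
    current_chi2 = chi2(current)
    best, best_chi2 = list(current), current_chi2
    for _ in range(n_steps):
        # "Restart with a new set of parameters": perturb the current set.
        proposal = [p + rng.gauss(0.0, step) for p in current]
        proposal_chi2 = chi2(proposal)
        delta = proposal_chi2 - current_chi2
        if delta <= 0.0 or rng.random() < math.exp(-0.5 * delta):
            current, current_chi2 = proposal, proposal_chi2
            if current_chi2 < best_chi2:
                best, best_chi2 = list(current), current_chi2
    return best, best_chi2

def toy_chi2(params):
    """Stand-in for Steps 0 to 2: a quadratic bowl with its minimum at (1, -2)."""
    sigma = 0.1
    return ((params[0] - 1.0) ** 2 + (params[1] + 2.0) ** 2) / sigma ** 2

best, best_chi2 = metropolis(toy_chi2, start=[0.0, 0.0], step=0.05, n_steps=5000)
```

In the real pipeline every call to the Chi2 function runs Steps 0 to 2 on the whole machine, so the cost of each MCMC iteration is dominated by the kernels described above.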

Strong Lensing Algorithm
Performance scaling

Gradient Benchmark Results (Step 0)
Gradient benchmark computation: 5000x5000 pixels image, 69 sources, 203 constraints
                         AVX2*             AVX512F*
Code                     TTS     Factor    TTS     Factor
Lenstool 6.8.1           1.0s    1X        4.8s    1X
LenstoolHPC AOS          0.8s    1.3X      5.6s    0.9X
LenstoolHPC SOA          0.5s    2.0X      3.3s    1.4X
LenstoolHPC SOA + DLP    0.2s    4.5X      0.4s    11.4X
*AVX2: Broadwell Intel Xeon CPU E5-2630 v4 @ 2.20GHz, Intel compilers 17
*AVX512F: Intel Xeon Phi CPU 7210 @ 1.30GHz, Intel compilers 17
Performance on Broadwell:
- IACA: ~ 6 Flops/cycle
- Intel Advisor: ~25% of peak

Distributed Grid Gradient
Grid Gradient computation distribution
(step 1):
- Images split into regular
subdomains with MPI
- Subdomains are handled using
OpenMP/CUDA
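A minimal sketch of the subdomain split, assuming a one-dimensional decomposition into contiguous blocks of image rows (the slide only says "regular subdomains", so the row-block choice and the helper name `split_rows` are assumptions). In an MPI program each rank would evaluate only its own entry, indexed by its rank number, and hand that block to OpenMP threads or a CUDA kernel.

```python
def split_rows(n_rows, n_ranks):
    """Return one [start, stop) row range per rank, as evenly sized as possible."""
    base, extra = divmod(n_rows, n_ranks)
    ranges = []
    start = 0
    for rank in range(n_ranks):
        # The first `extra` ranks take one additional row each.
        stop = start + base + (1 if rank < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges

# Example: a 6000-row image split over 4 ranks.
subdomains = split_rows(6000, 4)
```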

Grid Gradient Benchmark (Step 1)
Single node Grid Gradient benchmark computation: 6000x6000 pixels image, 69 sources, 203 constraints.
- TLP gives the best bang for the buck
- SOA alone gives a nice boost (and is mandatory for efficient DLP)
- DLP improves with wider vectors (AVX512 is ~2x faster than AVX2)
- V100 is much faster than P100 (0.24 s vs. 0.68 s)
Grid Gradient benchmark (TTS, in s)
                                       AVX2     AVX2         AVX512      AVX512     SIMT       SIMT
Code                                   2630v4   2695v4       SKL Plat.   KNL        P100       V100
                                                (PizDaint)   8170 HT     (greina)   (greina)
lenstool 6.8.1 (TLP)                   11.5     9.3          8.6         10.7       NA         NA
lenstool-HPC (SOA + TLP)               5.6      2.1          1.8         5.8        NA         NA
lenstool-HPC (SOA + TLP + DLP/SIMT)    3.0      1.67         0.72        0.84       0.68       0.24

Chi2 computation
The blue dots correspond to the same image
in the source plane
- The distances for the same source
(in blue) are reduced to rank 0 using
MPI_Pack
- The Chi2 is computed on rank 0
The Chi2 is computed from the distances between the original images and their unlensed/relensed projections
from steps 1a and 1b
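The distance-based Chi2 can be sketched as below. This is a serial stand-in for the MPI exchange, not the actual communication code: each inner list plays the role of one rank's packed distances, and the positional uncertainty `sigma` and both function names are assumptions made for the example.

```python
def squared_distances(observed, predicted):
    """Squared image-plane distances between observed and predicted images."""
    return [
        (ox - px) ** 2 + (oy - py) ** 2
        for (ox, oy), (px, py) in zip(observed, predicted)
    ]

def chi2_on_rank0(distances_per_rank, sigma):
    """Rank 0 collects every rank's squared distances and sums them into the Chi2."""
    gathered = [d2 for rank_d2 in distances_per_rank for d2 in rank_d2]
    return sum(gathered) / sigma ** 2

# Two hypothetical ranks, one image each.
rank0 = squared_distances([(0.0, 0.0)], [(0.3, 0.4)])
rank1 = squared_distances([(1.0, 1.0)], [(1.0, 1.5)])
chi2 = chi2_on_rank0([rank0, rank1], sigma=0.5)
```

Only a handful of distances per source travel over the network, which is consistent with the very small MPI reduction times reported in the scaling tables.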

Daint-GPU: Chi2 (Step 2) Strong Scaling
Num. nodes   Grid Gradient Comp.   Quadrant unlensing   MPI reduction   TTS
1            1.39                  24.8                 0               26.2
2            0.83                  12.4                 0.00005         13.3
4            0.54                  6.21                 0.00006         6.79
8            0.41                  3.12                 0.00011         3.57
16           0.34                  1.57                 0.00034         1.96
32           0.3                   0.81                 0.00065         1.14
64           0.28                  0.4                  0.00133         0.77
128          0.27                  0.33                 0.00275         0.66
256          0.28                  0.17                 0.00567         0.56
512          0.3                   0.12                 0.01251         0.61
Scalability of the Chi2 benchmark using an 8k x 8k image, 69 sources, 203 constraints on Piz Daint multicore, 1 MPI process and 18 threads per socket; all times in seconds

Daint-MC: Chi2 (Step 2) Strong Scaling
Num. nodes   Grid Gradient Comp.   Quadrant unlensing   MPI reduction   TTS
1            10.51                 19.25                0.00            29.83
2            5.24                  10.11                0.06            15.45
4            2.74                  4.87                 0.11            7.75
8            1.41                  2.51                 0.01            3.95
16           0.75                  1.41                 0.03            2.20
32           0.43                  0.72                 0.01            1.17
64           0.24                  0.37                 0.01            0.63
128          0.14                  0.20                 0.02            0.37
256          0.14                  0.12                 0.04            0.31
512          0.45                  0.09                 0.14            0.69
This represents a 50X speedup compared to Lenstool 6.8.1, achieved in 6 months FTE
Scalability of the Chi2 benchmark using an 8k x 8k image, 69 sources, 203 constraints on Piz Daint multicore, 1 MPI process and 18 threads per socket; all times in seconds
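Strong-scaling speedup and parallel efficiency follow directly from the TTS column. The short sketch below applies the usual definitions (speedup = TTS on 1 node / TTS on N nodes, efficiency = speedup / N) to the Daint-MC figures; the helper name `strong_scaling` is hypothetical.

```python
# Node counts and TTS (in seconds) copied from the Daint-MC table.
nodes = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
tts_mc = [29.83, 15.45, 7.75, 3.95, 2.20, 1.17, 0.63, 0.37, 0.31, 0.69]

def strong_scaling(nodes, tts):
    """Return (nodes, speedup, efficiency) rows relative to the first entry."""
    base = tts[0]
    rows = []
    for n, t in zip(nodes, tts):
        speedup = base / t
        rows.append((n, speedup, speedup / n))
    return rows

rows = strong_scaling(nodes, tts_mc)
fastest = min(rows, key=lambda row: tts_mc[nodes.index(row[0])])
```

With these numbers the fastest run is the 256-node one, about 96 times faster than a single node, which corresponds to a parallel efficiency of roughly 38%. At 512 nodes the TTS rises again.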

Current Status and Next Steps
Development:
- Code on c4science, with unit tests for each kernel (lensing, unlensing, Chi2…)
- Large development project on CSCS’ Piz Daint
- Aries network tuning
- GPU tuning: lensing, unlensing and Chi2 computation are (very) regular
- Development of a parallel MCMC framework, which could lead to a 500X speedup, e.g.
- Pi4u: http://www.cse-lab.ethz.ch/research/projects/pi4u/ (P. E. Hadjidoukas et al., ETHZ)
Papers:
High Performance Computing for gravitational lens modeling: single vs double precision on GPUs and CPUs
Markus Rexroth, Christoph Schäfer, Gilles Fourestey, Jean-Paul Kneib
To be submitted
High Performance Strong Lensing Map Generation for Lenstool
Christoph Schäfer, Gilles Fourestey, Jean-Paul Kneib
In Preparation

Lensing Map Generation
Maps based on second derivative of lensing potentials
(Mass, Amplification, Shear)
● Used for calculation of statistical errors of the MCMC
method
○ Sampling of parameter space
○ Compute average and standard deviation for every pixel
○ Added to best prediction, gives asymmetric error bars
● Fast Map generation crucial
○ The current process takes months
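The per-pixel statistics can be sketched as follows. The average and standard deviation come straight from the list above; how they are combined with the best prediction is not spelled out on the slide, so `asymmetric_errors` shows one possible reading (the band mean ± std measured from the best map) and is an assumption, as are the function names and the use of the population standard deviation.

```python
import math

def pixel_statistics(maps):
    """Per-pixel mean and standard deviation over a list of equally sized 2D maps."""
    n = len(maps)
    rows, cols = len(maps[0]), len(maps[0][0])
    mean = [[0.0] * cols for _ in range(rows)]
    std = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            values = [m[i][j] for m in maps]
            mu = sum(values) / n
            mean[i][j] = mu
            # Population standard deviation (divides by n); an assumption.
            std[i][j] = math.sqrt(sum((v - mu) ** 2 for v in values) / n)
    return mean, std

def asymmetric_errors(best, mean, std):
    """Assumed reading: distance from the best map to mean - std and to mean + std."""
    rows, cols = len(best), len(best[0])
    lower = [[best[i][j] - (mean[i][j] - std[i][j]) for j in range(cols)]
             for i in range(rows)]
    upper = [[(mean[i][j] + std[i][j]) - best[i][j] for j in range(cols)]
             for i in range(rows)]
    return lower, upper

# Two hypothetical 1x1 sampled maps and a best-fit map.
mean, std = pixel_statistics([[[1.0]], [[3.0]]])
lower, upper = asymmetric_errors([[1.5]], mean, std)
```

Under this reading the error bars are asymmetric whenever the best-fit map differs from the sample mean. Every sampled parameter set needs its own full map, which is why fast map generation matters.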
Grid Gradient 2 benchmark    TTS, in s    Speedup
lenstool 6.8.1               765
lenstool-HPC                 1.3          x567
Single node Grid Gradient benchmark computation: 4200x4200
pixels image, 201 individual lenses.
- Lenstool: Intel(R) Xeon(R) CPU E5-1620 v3 @
3.50GHz
- Lenstool HPC: P100
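For reference, the mass, shear and amplification maps mentioned on this slide follow from the second derivatives of the potential through the standard lensing relations. The sketch below evaluates them for one pixel; the function name `lensing_maps` and its arguments are hypothetical, and the formulas are the textbook ones rather than anything taken from the Lenstool source.

```python
import math

def lensing_maps(psi_xx, psi_yy, psi_xy):
    """Convergence (mass), shear modulus and amplification at one pixel.

    kappa  = (psi_xx + psi_yy) / 2
    gamma1 = (psi_xx - psi_yy) / 2,  gamma2 = psi_xy
    mu     = 1 / ((1 - kappa)^2 - gamma1^2 - gamma2^2)
    """
    kappa = 0.5 * (psi_xx + psi_yy)
    gamma1 = 0.5 * (psi_xx - psi_yy)
    gamma2 = psi_xy
    shear = math.hypot(gamma1, gamma2)
    det = (1.0 - kappa) ** 2 - shear ** 2
    # det == 0 marks a critical line, where the amplification formally diverges.
    amplification = math.inf if det == 0.0 else 1.0 / det
    return kappa, shear, amplification

kappa, shear, amplification = lensing_maps(psi_xx=0.3, psi_yy=0.1, psi_xy=0.0)
```

Each pixel is independent of the others, so a full map is the same kind of regular, data-parallel workload as the gradient computation of Step 0.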

Brownie points
- Thanks to Prof. Jean-Paul Kneib (LASTRO, EPFL), Prof. Jan Hesthaven and Dr. Vittoria
Rezzonico (SCITAS, EPFL)
- Thanks to Colin McMurtrie and Hussein El-Harake from CSCS for their support using the
CSCS test cluster

Questions?
gilles.fourestey@epfl.ch https://scitas.epfl.ch/
christophernstrerne.schaefer@epfl.ch https://lastro.epfl.ch/
markus.rexroth@epfl.ch
