In this deck from the 2018 Swiss HPC Conference, Gilles Fourestey from EPFL presents: Scratch to Supercomputers: Bottoms-up Build of Large-scale Computational Lensing Software.
"LENSTOOL is a gravitational lensing software that models mass distribution of galaxies and clusters. It was developed by Prof. Kneib, head of the LASTRO lab at EPFL, et al., starting from 1996. It is used to obtain sub-percent precision measurements of the total mass in galaxy clusters and constrain the dark matter self-interaction cross-section, a crucial ingredient to understanding its nature.
However, LENSTOOL lacks efficient vectorization and only uses OpenMP, which limits its execution to one node and can lead to execution times that exceed several months. Therefore, the LASTRO and the EPFL HPC group decided to rewrite the code from scratch and in order to minimize risk and maximize performance, a bottom-up approach that focuses on exposing parallelism at hardware and instruction levels was used. The result is a high performance code, fully vectorized on Xeon, Xeon Phis and GPUs that currently scales up to hundreds of nodes on CSCS’ Piz Daint, one of the fastest supercomputers in the world."
Watch the video: https://wp.me/p3RLHQ-ili
Learn more: https://infoscience.epfl.ch/record/234382/files/EPFL_TH8338.pdf?subformat=pdfa
and
http://www.hpcadvisorycouncil.com/events/2018/swiss-workshop/agenda.php
1. Lenstool-HPC
From scratch to supercomputers: building a large-scale
strong lensing computational software bottom-up
HPC Advisory Council, April 2018
Christoph Schäfer and Markus Rexroth (LASTRO)
Gilles Fourestey (SCITAS)
4. Light refraction caused by a distribution
of matter according to Albert Einstein's general
theory of relativity (1916)
- Article on gravitational lensing by a star in 1936
(A. Einstein, Science)
- Fritz Zwicky posited in 1937 that the
effect could allow galaxy clusters to act
like lenses
- First observed in 1979: the "Twin QSO", SBS 0957+561
Gravitational lensing
Twin QSO (center), Credit: ESA/Hubble & NASA
5. Gravitational lensing
Optical artefacts created by dense
mass distributions
- Galaxies
- Dark matter
- Black holes
Parametric Lens Model
- the ellipticity of the projected mass distribution
- ω the finite core radius
- 0 the normalized surface mass density
- (x,y) the lens position...
“Reverse engineer” the lenses:
Recompose far-away objects by computing
the lenses’ mass
Typical search space dimension: >10^10
https://briankoberlein.com/2014/08/01/bend-like-newton/
Using the “vanilla” version of lenstool requires months to find the optimal solution!
6. Lenstool-HPC Motivations
Lenstool-HPC was developed:
- based on Lenstool (Pr. Kneib et al., from 1996 onwards),
- in 6 man-months (FTE),
- by two field scientists and one application expert,
- bottom-up from scratch.
No separation of concerns:
- Field scientists define algorithmic constraints at every step
- Computer scientists provide the most optimized implementation on
specific hardware
- eXtreme(-ish) programming
Performance is scaled bottom-up:
- Focus on algorithms/kernels and data structures
- Performance scaling from core to full machine
8. Strong Lensing Algorithm - Step 0
Ellipticity of the projected mass distribution
Finite core radius
Normalized surface mass density
Position
Given a parametric model for all the lens types:
Step 0: Compute all the gradients (~90% of TTS) - DLP
for each pixel of the image
Mapping algorithms to the hardware:
- High performance data structures (SOA)
- Implicit and Explicit (hand-coded) vectorization
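The SOA layout and the per-pixel gradient loop of Step 0 can be sketched as follows. This is a minimal illustration, not Lenstool-HPC code: the names `LensSoA`, `pixel_gradient` and `grid_gradient` are hypothetical, and the toy potential b*sqrt(r^2 + w^2) is an assumption that stands in for the actual parametric lens models.

```python
import math
from array import array


class LensSoA:
    """Structure of arrays: one contiguous array per lens parameter.

    Hypothetical container for illustration; the field names are assumptions.
    """

    def __init__(self, xs, ys, cores, norms):
        self.x = array("d", xs)        # lens positions (x)
        self.y = array("d", ys)        # lens positions (y)
        self.core = array("d", cores)  # finite core radii
        self.norm = array("d", norms)  # normalization of each lens
        self.n = len(self.x)


def pixel_gradient(lenses, px, py):
    """Sum the gradient of the toy potential b*sqrt(r^2 + w^2) over all lenses."""
    gx = gy = 0.0
    for i in range(lenses.n):  # unit-stride access to every parameter array
        dx = px - lenses.x[i]
        dy = py - lenses.y[i]
        inv = lenses.norm[i] / math.sqrt(dx * dx + dy * dy + lenses.core[i] ** 2)
        gx += dx * inv
        gy += dy * inv
    return gx, gy


def grid_gradient(lenses, x0, y0, step, nx, ny):
    """Step 0: evaluate the gradient at every pixel of an nx-by-ny grid."""
    grad_x = array("d", [0.0]) * (nx * ny)
    grad_y = array("d", [0.0]) * (nx * ny)
    for j in range(ny):
        for i in range(nx):
            gx, gy = pixel_gradient(lenses, x0 + i * step, y0 + j * step)
            grad_x[j * nx + i] = gx
            grad_y[j * nx + i] = gy
    return grad_x, grad_y
```

Keeping each parameter in its own contiguous array is what lets a compiled kernel load several lenses into one vector register; the pure-Python loop above only shows the access pattern.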
9. Strong Lensing Algorithm - Step 1
unlensing
relensing
Given a parametric model for all the lens types
Step 0: Compute all the gradients
Step 1a: unlensing (linear transformation) - TLP
- lensing the green dots (images) to the Source plane (yellow dot)
- Compute the barycenter of the yellow dots
Step 1b: relensing (non-linear transformation) - TLP
- Decompose the Image plane into triangles
- Lens the triangles to the Source plane
- If the lensed triangle includes the barycenter, a predicted image
is found (red triangles in Image plane)
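Steps 1a and 1b can be sketched in a few lines. The sketch assumes the standard lens equation (source position = image position minus the gradient of the lensing potential, with the gradient already scaled); the function names are hypothetical and the point-in-triangle test is a generic sign-of-cross-product check, not necessarily the one used in Lenstool-HPC.

```python
def unlens(image, gradient):
    """Step 1a: map an image-plane point to the source plane.

    `gradient` is the precomputed Step 0 value at `image` (assumed scaled).
    """
    return (image[0] - gradient[0], image[1] - gradient[1])


def barycenter(points):
    """Mean position of the unlensed images of one source."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def _cross(o, a, b):
    """z-component of (a - o) x (b - o); its sign tells on which side of o->a b lies."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def triangle_contains(tri, p):
    """True if p lies inside or on the edge of triangle tri (either orientation)."""
    d1 = _cross(tri[0], tri[1], p)
    d2 = _cross(tri[1], tri[2], p)
    d3 = _cross(tri[2], tri[0], p)
    has_neg = d1 < 0 or d2 < 0 or d3 < 0
    has_pos = d1 > 0 or d2 > 0 or d3 > 0
    return not (has_neg and has_pos)


def predicted_image(tri_image, tri_gradients, source):
    """Step 1b: lens one image-plane triangle to the source plane and test it."""
    lensed = [unlens(v, g) for v, g in zip(tri_image, tri_gradients)]
    return triangle_contains(lensed, source)
```

Each triangle is tested independently of the others, which is why this step maps naturally onto thread-level parallelism.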
10. Strong Lensing Algorithm - Step 2 & 3
Given a parametric model for all the lens types
Step 2: (MPI)
- Compute the Chi2
Step 3: Pass the Chi2 to a Bayesian MCMC code (MPI)
- Restart with a new set of parameters until “close” to reality
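The restart loop of Step 3 can be illustrated with a generic random-walk Metropolis sampler. This is an assumption made for illustration only: the slides say just "a Bayesian MCMC code", and `metropolis`, `chi2_fn`, the Gaussian proposal and the likelihood exp(-Chi2/2) are hypothetical choices.

```python
import math
import random


def metropolis(chi2_fn, start, step, n_iter, seed=0):
    """Random-walk Metropolis sampler with likelihood proportional to exp(-chi2/2).

    chi2_fn stands in for Steps 0 to 2: it maps a parameter list to a Chi2 value.
    """
    rng = random.Random(seed)
    current = list(start)
    current_chi2 = chi2_fn(current)
    best, best_chi2 = list(current), current_chi2
    chain = []
    for _ in range(n_iter):
        # Propose a new set of parameters around the current one.
        proposal = [p + rng.gauss(0.0, step) for p in current]
        proposal_chi2 = chi2_fn(proposal)
        # Accept with probability min(1, exp(-(chi2_new - chi2_old) / 2)).
        delta = proposal_chi2 - current_chi2
        if delta <= 0 or rng.random() < math.exp(-0.5 * delta):
            current, current_chi2 = proposal, proposal_chi2
            if current_chi2 < best_chi2:
                best, best_chi2 = list(current), current_chi2
        chain.append(list(current))
    return best, best_chi2, chain
```

Every iteration calls `chi2_fn` once, so the cost of Steps 0 to 2 directly sets the cost of the whole search.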
13. Distributed Grid Gradient
Grid Gradient computation distribution
(step 1):
- Images split into regular
subdomains with MPI
- Subdomains are handled using
OpenMP/CUDA
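A minimal sketch of such a regular decomposition, assuming the image is cut into horizontal bands of rows with one band per MPI rank (the slides do not state the actual subdomain shape, and `split_rows` is a hypothetical helper):

```python
def split_rows(n_rows, n_ranks):
    """Return one (start, stop) row range per rank, as evenly sized as possible."""
    base, extra = divmod(n_rows, n_ranks)
    ranges = []
    start = 0
    for rank in range(n_ranks):
        # The first `extra` ranks take one additional row each.
        stop = start + base + (1 if rank < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges
```

For example, `split_rows(6000, 4)` gives four bands of 1500 rows; each rank would then process its band with OpenMP threads or a CUDA kernel.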
14. Grid Gradient Benchmark (Step 1)
Single node Grid Gradient benchmark computation: 6000x6000-pixel image, 69 sources, 203 constraints.
- TLP gives the best bang for the buck
- SOA alone gives a nice boost (and is mandatory for efficient DLP)
- DLP improves with wider vector sizes (AVX512 is ~2x AVX2)
- V100 is ~2.8x faster than P100 (0.24 s vs. 0.68 s)
Grid Gradient benchmark (TTS, in s)
                                      AVX2        AVX2        AVX512      AVX512      SIMT        SIMT
                                      2630v4      2695v4      SKL Plat.   KNL         P100        V100
                                                  (PizDaint)  8170 HT     (greina)    (greina)
lenstool 6.8.1 (TLP)                  11.5        9.3         8.6         10.7        NA          NA
lenstool-HPC (SOA + TLP)              5.6         2.1         1.8         5.8         NA          NA
lenstool-HPC (SOA + TLP + DLP/SIMT)   3.0         1.67        0.72        0.84        0.68        0.24
15. Chi2 computation
The blue dots correspond to the same image
in the source plane
- The distances for the same source
(in blue) are reduced to Rank 0 using
MPI_Pack
- The Chi2 is computed on Rank 0
The Chi2 is computed from the distances between the original images and their computed unlensed/relensed projections from steps 1a and 1b
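As a sketch of that computation, assuming the Chi2 is the sum of squared distances between observed and predicted image positions divided by a positional uncertainty `sigma` (the uncertainty and the function name are assumptions, not taken from the slides):

```python
def chi2(observed, predicted, sigma):
    """Sum of squared distances between matching points, in units of sigma^2."""
    total = 0.0
    for (xo, yo), (xp, yp) in zip(observed, predicted):
        total += ((xo - xp) ** 2 + (yo - yp) ** 2) / sigma ** 2
    return total
```

Because the total is a plain sum, each rank can accumulate the terms for its own images and only the partial results need to be gathered on Rank 0.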
16. Daint-GPU: Chi2 (Step 2) Strong Scaling
Num. nodes   Grid Gradient Comp.   Quadrant unlensing   MPI reduction   TTS
1            1.39                  24.8                 0               26.2
2            0.83                  12.4                 0.00005         13.3
4            0.54                  6.21                 0.00006         6.79
8            0.41                  3.12                 0.00011         3.57
16           0.34                  1.57                 0.00034         1.96
32           0.3                   0.81                 0.00065         1.14
64           0.28                  0.4                  0.00133         0.77
128          0.27                  0.33                 0.00275         0.66
256          0.28                  0.17                 0.00567         0.56
512          0.3                   0.12                 0.01251         0.61
Scalability of the Chi2 benchmark using an 8k x 8k image, 69 sources, 203 constraints on Piz Daint multicore, 1 MPI process and 18 threads per socket, in seconds
17. Daint-MC: Chi2 (Step 2) Strong Scaling
Num. nodes   Grid Gradient Comp.   Quadrant unlensing   MPI reduction   TTS
1            10.51                 19.25                0.00            29.83
2            5.24                  10.11                0.06            15.45
4            2.74                  4.87                 0.11            7.75
8            1.41                  2.51                 0.01            3.95
16           0.75                  1.41                 0.03            2.20
32           0.43                  0.72                 0.01            1.17
64           0.24                  0.37                 0.01            0.63
128          0.14                  0.20                 0.02            0.37
256          0.14                  0.12                 0.04            0.31
512          0.45                  0.09                 0.14            0.69
This represents a 50X speedup compared to Lenstool 6.8.1, obtained in 6 months FTE
Scalability of the Chi2 benchmark using an 8k x 8k image, 69 sources, 203 constraints on Piz Daint multicore, 1 MPI process and 18 threads per socket, in seconds
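To read the Daint-MC numbers as strong-scaling speedup and parallel efficiency, a short calculation helps. The TTS values are copied from the table above; the helper name `strong_scaling` is hypothetical.

```python
# Time to solution (s) per node count, copied from the Daint-MC table.
tts = {1: 29.83, 2: 15.45, 4: 7.75, 8: 3.95, 16: 2.20,
       32: 1.17, 64: 0.63, 128: 0.37, 256: 0.31, 512: 0.69}


def strong_scaling(tts_by_nodes):
    """Return {nodes: (speedup, efficiency)} relative to the smallest node count."""
    base_nodes = min(tts_by_nodes)
    base_time = tts_by_nodes[base_nodes]
    result = {}
    for nodes, t in sorted(tts_by_nodes.items()):
        speedup = base_time / t
        result[nodes] = (speedup, speedup * base_nodes / nodes)
    return result


for nodes, (speedup, eff) in strong_scaling(tts).items():
    print(f"{nodes:4d} nodes: speedup {speedup:6.1f}, efficiency {eff:5.1%}")
```

On these numbers the speedup peaks at 256 nodes, at about 96x over a single node, before the 512-node run slows down.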
18. Current Status and Next Steps
Development:
- Code on c4science, with unit tests for each kernel (lensing, unlensing, Chi2…)
- Large development project on CSCS’ Piz Daint
- Aries network tuning
- GPU tuning: lensing, unlensing and Chi2 computation are (very) regular
- Development of a parallel MCMC framework, which could lead to a 500X speedup, e.g.
- Pi4u: http://www.cse-lab.ethz.ch/research/projects/pi4u/ (P. E. Hadjidoukas et al., ETHZ)
Papers:
High Performance Computing for gravitational lens modeling: single vs double precision on GPUs and CPUs
Markus Rexroth, Christoph Schäfer, Gilles Fourestey, Jean-Paul Kneib
To be submitted
High Performance Strong Lensing Map Generation for Lenstool
Christoph Schäfer, Gilles Fourestey, Jean-Paul Kneib
In Preparation
19. Lensing Map Generation
Maps based on second derivative of lensing potentials
(Mass, Amplification, Shear)
● Used for calculation of statistical errors of the MCMC
method
○ Sampling of parameter space
○ Compute average and standard deviation for every pixel
○ Added to best prediction, gives asymmetric error bars
● Fast Map generation crucial
○ The current process takes months
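The per-pixel average and standard deviation over sampled maps can be sketched as below. It assumes each sampled map is stored as a flat list of pixel values of equal length, and `pixel_statistics` is a hypothetical name; the asymmetric error bars mentioned above are not modelled here.

```python
import statistics


def pixel_statistics(maps):
    """Per-pixel mean and sample standard deviation over equally sized maps.

    Needs at least two maps, since the sample standard deviation is undefined
    for a single value.
    """
    means, stdevs = [], []
    for values in zip(*maps):  # values of one pixel across all sampled maps
        means.append(statistics.fmean(values))
        stdevs.append(statistics.stdev(values))
    return means, stdevs
```

One map has to be generated per sampled parameter set, which is why the map generation time dominates this error estimate.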
Grid Gradient 2 benchmark   TTS, in s   Speedup
lenstool 6.8.1              765
lenstool-HPC                1.3         x567
Single node Grid Gradient benchmark computation: 4200x4200-pixel image, 201 individual lenses.
- Lenstool: Intel(R) Xeon(R) CPU E5-1620 v3 @ 3.50GHz
- Lenstool HPC: P100
20. - Thanks to Pr. Jean-Paul Kneib (LASTRO, EPFL), Pr. Jan Hesthaven and Dr. Vittoria
Rezzonico (SCITAS, EPFL)
- Thanks to Colin McMurtrie and Hussein El-Harake from CSCS for their support using the
CSCS’ test cluster
Brownie points