Designing High Performance Computing Architectures for Reliable Space Applications
1. Designing High Performance Computing Architectures for Reliable Space Applications
Fisnik Kraja
PhD Defense
December 6, 2012
Advisors:
1st: Prof. Dr. Arndt Bode
2nd: Prof. Dr. Xavier Martorell
2. Outline
1. Motivation
2. The Proposed Computing Architecture
3. The 2DSSAR Benchmarking Application
4. Optimizations and Benchmarking Results
– Shared memory multiprocessor systems
– Distributed memory multiprocessor systems
– Heterogeneous CPU/GPU systems
5. Conclusions
3. Motivation
• Future space applications will demand:
– Increased on-board computing capabilities
– Preserved system reliability
• Future missions:
– Optical - IR Sounder: 4.3 GMult/s + 5.7 GAdd/s, 2.2 Gbit/s
– Radar/Microwave - HRWS SAR: 1 Tera 16-bit fixed-point operations/s, 603.1 Gbit/s
• Challenges
– Costs (ASICs are very expensive)
– Modularity (component change and reuse)
– Portability (across various spacecraft platforms)
– Scalability (hardware and software)
– Programmability (compatible to various environments)
– Efficiency (power consumption and size)
4. The Proposed Architecture
[Block diagram of the proposed architecture]
Legend:
– RHMU: Radiation-Hardened Management Unit
– PPN: Parallel Processing Node
– Control Bus
– Data Bus
5. The 2DSSAR Application
2-Dimensional Spotlight Synthetic Aperture Radar
[Figure: Illuminated swath in side-looking Spotlight SAR, showing the spacecraft flight path (azimuth), altitude, range, swath, and cross-range.]
Synthetic Data Generation (SDG): synthetic SAR returns from a uniform grid of point reflectors.
SAR Sensor Processing (SSP): Read Generated Data, Image Reconstruction (IR), Write Reconstructed Image.
The reconstructed SAR image is obtained by applying a 2D Fourier Matched Filtering and Interpolation algorithm.
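To make the matched-filtering step concrete, here is a minimal C sketch of the per-pulse Fourier-domain operation; the function name, signature and data layout are illustrative assumptions, not the thesis code:

/* Fourier-domain matched filtering of one pulse: the pulse spectrum is
 * multiplied sample-by-sample by the complex conjugate of the reference
 * chirp spectrum. The interpolation and inverse-FFT stages then turn the
 * filtered spectra into the reconstructed image. */
#include <complex.h>
#include <stddef.h>

void matched_filter_pulse(float complex *pulse_spec,     /* FFT of one received pulse */
                          const float complex *ref_spec, /* FFT of the reference chirp */
                          size_t n)
{
    for (size_t i = 0; i < n; i++)
        pulse_spec[i] *= conjf(ref_spec[i]);             /* S(f) * conj(R(f)) */
}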
6. Profiling SAR Image Reconstruction
Workload per scale factor (coverage, memory, FLOP, runtime):
– Scale=10: 3.8 x 2.5 km, 0.25 GB, 29.54 GFLOP, 23 s
– Scale=30: 11.4 x 7.5 km, 2 GB, 115.03 GFLOP, 230 s
– Scale=60: 22.8 x 15 km, 8 GB, 1302 GFLOP, 926 s
Goal: 30x speedup.
IR profiling (share of runtime):
– Interpolation loop: 69%
– FFTs: 22%
– Compression and decompression loops: 7%
– Transposition and FFT-shifting: 2%
7. IR Optimizations for
Shared Memory Multiprocessing
• OpenMP
– General optimizations:
• Thread Pinning and First Touch Policy
• Static/Dynamic Scheduling
– FFT
• Manual Multithreading of Loops of 1D-FFTs (not the FFT itself)
– Interpolation Loop (Polar to Rectangular Coordinates)
• Atomic Operations
• Replication and Reduction (R&R); both strategies are sketched in the code below
• Other Programming Models
– OmpSs, MPI, MPI+OpenMP
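The two strategies listed for the interpolation loop are sketched below in OpenMP C. The data layout, the one-target-cell-per-sample scatter pattern, and all names are simplifying assumptions rather than the thesis code:

#include <omp.h>
#include <stdlib.h>

typedef struct { float re, im; } cfloat;

/* Variant 1: atomic updates on the shared output grid. Threads would typically
 * be pinned (e.g. via OMP_PROC_BIND) and the grid initialised in parallel so
 * that first-touch placement keeps pages local to the threads that use them. */
void interp_atomic(const cfloat *in, const int *target, int n_in, cfloat *out)
{
    #pragma omp parallel for schedule(dynamic, 1024)
    for (int i = 0; i < n_in; i++) {
        int j = target[i];               /* rectangular-grid cell hit by sample i */
        #pragma omp atomic
        out[j].re += in[i].re;
        #pragma omp atomic
        out[j].im += in[i].im;
    }
}

/* Variant 2: Replication and Reduction (R&R). Every thread accumulates into a
 * private replica of the grid; the replicas are merged afterwards, so the hot
 * loop has no atomics, at the cost of one grid copy per thread. */
void interp_rr(const cfloat *in, const int *target, int n_in,
               cfloat *out, int n_out)
{
    #pragma omp parallel
    {
        cfloat *priv = calloc((size_t)n_out, sizeof *priv);  /* thread-local replica */

        #pragma omp for schedule(static) nowait
        for (int i = 0; i < n_in; i++) {
            int j = target[i];
            priv[j].re += in[i].re;
            priv[j].im += in[i].im;
        }

        for (int j = 0; j < n_out; j++) {                    /* reduce into shared grid */
            #pragma omp atomic
            out[j].re += priv[j].re;
            #pragma omp atomic
            out[j].im += priv[j].im;
        }
        free(priv);
    }
}

R&R trades memory (one grid replica per thread) for contention-free updates in the hot loop, so it remains attractive only as long as the replicated grid fits in the node's memory.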
11. IR Optimizations for
Heterogeneous CPU/GPU Computing
[Figure: CUDA tiling of the interpolation workload into thread blocks, labelled with blockIndex.x (bx), blockIndex.y (by), threadIndex.x (tx), threadIndex.y (ty), tile size tsize, e.g. Block (2,1).]
• ccNUMA Multi-Processor
– Sequential Optimizations
– Minor load-balancing improvements
• Accelerator (GP-GPU)
– CUDA – Tiling Technique
– cuFFT Library
– Transcendental functions
• Such as sine and cosine
– CUDA 3.2 lacks (see the sketch after this list):
• Some complex operations (multiplication and CEXP)
• Atomic operations for complex/float data
– Memory Limitation
• Atomic operations are used in the SFI loop (R&R is not an option)
• Large-scale IR dataset does not fit into GPU memory
• Computing on CPU+GPU
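A minimal CUDA sketch of the device-side helpers these bullets imply: complex multiplication and the complex exponential hand-built from float intrinsics, plus a complex accumulation composed of two float atomicAdds so that scattering threads of the SFI loop can update the same output cell. Names and the kernel shape are illustrative assumptions, not the thesis kernels:

#include <cuda_runtime.h>

__device__ __forceinline__ float2 cmul(float2 a, float2 b)
{
    /* (a.x + i*a.y)(b.x + i*b.y) */
    return make_float2(a.x * b.x - a.y * b.y,
                       a.x * b.y + a.y * b.x);
}

__device__ __forceinline__ float2 cexp_if(float phase)
{
    float s, c;
    sincosf(phase, &s, &c);              /* transcendental sin/cos units */
    return make_float2(c, s);            /* e^{i*phase} = cos + i*sin */
}

__device__ __forceinline__ void atomicAddComplex(float2 *addr, float2 v)
{
    /* No complex atomics are available, so real and imaginary parts are
     * added with two independent float atomicAdds (Fermi, CC >= 2.0). */
    atomicAdd(&addr->x, v.x);
    atomicAdd(&addr->y, v.y);
}

/* Hypothetical inner step of the SFI (interpolation) loop: each thread forms
 * the phase-shifted contribution of its sample and scatters it into the
 * rectangular output grid; conflicting writes are resolved by the atomics. */
__global__ void sfi_scatter(const float2 *in, const float *phase,
                            const int *target, int n, float2 *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAddComplex(&out[target[i]], cmul(in[i], cexp_if(phase[i])));
}

Replicating the output grid per thread block would multiply the footprint far beyond the 6 GB of device memory, which is why R&R is not an option here and the SFI loop relies on atomics instead.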
12. IR on a Heterogeneous Node
The Machine:
– ccNUMA module: 2 x 4 cores, 16 threads, 2.8 – 3.2 GHz, 12 GB RAM, TDP 95 W/CPU, PCIe 2.0 (8 GB/s)
– Accelerator module: 2 GPU cards, NVIDIA Tesla (Fermi), 1.15 GHz, 6 GB GDDR5, 144 GB/s, TDP 238 W
Speedup over the sequential CPU run (Scale=10 / Scale=30 / Scale=60):
– CPU Sequential: 1 / 1 / 1
– CPU Best: 1.82 / 1.89 / 1.97
– CPU 8 Threads: 14.46 / 11.41 / 10.27
– CPU 16 Threads (SMT): 16.06 / 13.26 / 12.55
– GPU: 20.11 / 19.44 / 20.17
– CPU + GPU: 18.88 / 22.10 / 24.68
– 2 GPUs: 4.27 / 16.71 / 22.26
– 2 GPUs Pipelined: 15.86 / 25.40 / 34.46
13. Conclusions
• Shared memory nodes
– Performance is limited by hardware resources
– 1 Node (12 Cores/24 Threads): speedup = 12.4
• Distributed memory systems
– Low efficiency in terms of performance per watt and per unit of size
– 8 Nodes (64 cores): speedup: 38.05
• Heterogeneous CPU/GPU systems
– Perfect compromise:
• Better performance than current shared memory nodes
• Better efficiency than distributed memory systems
• 1 CPU + 2 GPUs: speedup: 34.46
• Final Design Recommendations
– Powerful shared memory PPN
– PPN with ccNUMA CPUs and GPU accelerators
– Distributed memory only if multiple PPNs are needed