Parallelization Techniques for the 2D Fourier Matched Filtering and Interpola...
Using Many-Core Processors to Improve the Performance of Space Computing Platforms
1. Faculty of Informatics
Chair of Computer Architectures
Fisnik Kraja
PhD Candidate
2011 IEEE Aerospace Conference, 5-12 March 2011, Big Sky, Montana
2. • Subject: New computing architecture for future satellites.
• Purpose: To introduce many-core and other COTS
technologies in the design process.
• Main points will be:
– State of the art of space applications and computing platforms
– Proposed system architecture
– Performance Estimations (Benchmarking)
– Discussions and conclusions
3/12/2011 2
3. • On-board computers offer minimal functionality.
• Constraints such as power, size, and heat
• High-reliability requirements, because of radiation effects:
– Total Ionizing Dose (TID)
– Single Event Upset (SEU)
– Single Event Transient (SET)
– Single Event Latch-up (SEL)
• New space applications ask for improved on-board
processing abilities, in terms of
– high processing power and throughput
– without losing the required reliability.
4. • HRWS SAR
(High resolution wide swath synthetic aperture radar).
• Used to reduce the amount of data to be transmitted to ground
• Uses separate apertures to transmit and receive
• Uses multiple phase centers in receive
• Each panel represents an independent phase center
• 7 Panels are used, each consisting of 12 tiles
5. Parallelism of the algorithm:
• 7 panels processed independently
• 12x7=84 tiles processed independently
Requirements:
1 Tera 16-bit fixed-point Ops/s
(complex multiply and add)
Peak sample rate: 8 Gbps
Full antenna average raw data
rate: 603.1 Gbps
It is impossible to fulfill these requirements
with currently available technology for space.
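As a rough illustration of what these figures mean per processing unit, the following sketch divides the stated totals evenly over the 84 tiles. The even split is an assumption made here for illustration only; the slides do not state how the load is distributed, and `per_tile_budget` is a hypothetical helper name.

```python
# Figures taken from the slide; the even split across tiles is an assumption.
PANELS = 7                    # independent phase centers
TILES_PER_PANEL = 12          # tiles per panel
TOTAL_OPS_PER_S = 1e12        # 1 Tera 16-bit fixed-point ops/s
TOTAL_RAW_RATE_GBPS = 603.1   # full-antenna average raw data rate

def per_tile_budget(panels=PANELS, tiles_per_panel=TILES_PER_PANEL):
    """Return (tile count, ops/s per tile, Gbps per tile) for an even split."""
    tiles = panels * tiles_per_panel
    return tiles, TOTAL_OPS_PER_S / tiles, TOTAL_RAW_RATE_GBPS / tiles

tiles, ops_per_tile, gbps_per_tile = per_tile_budget()
print(f"{tiles} tiles: {ops_per_tile / 1e9:.1f} Gops/s "
      f"and {gbps_per_tile:.2f} Gbps per tile")
```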
6. • To efficiently apply the upcoming many-core processors
and other COTS products to improve the on-board
processing power.
• Reliability of the system should be addressed by:
– traditional hardware techniques (TMR)
– software-implemented fault-tolerant techniques
• Thread/process/service replication
• This system should provide other important features:
– flexibility,
– scalability,
– portability.
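The replication idea listed above can be sketched in software as a majority vote over redundant executions. This is only a software analogue of TMR, written under the assumption that results are hashable and directly comparable; `vote` and `run_replicated` are hypothetical names, not part of the proposed system.

```python
from collections import Counter

def vote(results):
    """Return the value produced by a strict majority of replicas."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        # No value was produced by more than half of the replicas.
        raise RuntimeError("no majority among replicas")
    return value

def run_replicated(func, arg, replicas=3):
    """Execute func(arg) several times and vote on the results.

    A real system would run the replicas on different cores or nodes;
    here they run sequentially, which only illustrates the voting step.
    """
    return vote([func(arg) for _ in range(replicas)])

print(vote([4, 4, 5]))   # one replica disagrees, the majority wins
```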
8. [Figure: block diagram. Recoverable labels: I/O, RHPU, Memory (x3), Reliable Local Bus, Bus interfacing]
9. • A possible solution to the tradeoff between performance and reliability is the
rotating consistency check: at any given time only some processes are replicated
and their results checked for consistency, but over a longer period all of
them get verified.
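A minimal sketch of such a rotating check, assuming deterministic tasks whose results can be compared with `!=`. The function name, the `window` parameter, and the round-robin rotation order are illustrative assumptions; the slides do not specify how the rotation is scheduled.

```python
def rotating_check(tasks, inputs, rounds, window):
    """Run every task each round; replicate only a rotating window of them.

    Returns the set of task indices verified so far and a list of
    (round, task index) pairs where the replica disagreed.
    """
    n = len(tasks)
    verified, mismatches = set(), []
    for r in range(rounds):
        # Primary execution of all tasks.
        results = [task(x) for task, x in zip(tasks, inputs)]
        # Redundant execution of `window` tasks, shifted every round.
        start = (r * window) % n
        for k in range(window):
            i = (start + k) % n
            if tasks[i](inputs[i]) != results[i]:
                mismatches.append((r, i))
            verified.add(i)
    return verified, mismatches

tasks = [lambda x, m=m: x * m for m in range(1, 7)]   # six toy tasks
verified, mismatches = rotating_check(tasks, [10] * 6, rounds=3, window=2)
print(sorted(verified), mismatches)
```

With six tasks and a window of two, each round adds roughly one third extra work, and every task has been checked after three rounds.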
10. Why SSCA#3?
• Computationally taxing
• Large block data transfers
• Stressful memory access patterns
• Scalable to mimic different problem sizes
1. The Synthetic Data Generation stage produces approximations of raw SAR
data, similar to what would be obtained from a real SAR system.
2. The SAR Sensor Processing stage reconstructs a SAR image
using a wavefront spotlight SAR reconstruction method known as
2D Fourier Matched Filtering and Interpolation.
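To give a feel for the matched-filtering step, here is a deliberately simplified one-dimensional sketch: the received signal is correlated with a reference in the frequency domain, and the peak of the result marks the echo delay. This is not the SSCA#3 kernel, which works in two dimensions and adds an interpolation step; the chirp parameters, the circular delay model, and the naive O(n²) DFT are assumptions chosen to keep the example dependency-free.

```python
import cmath

def dft(x, inverse=False):
    """Naive discrete Fourier transform (O(n^2)), enough for a small demo."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                for t in range(n))
        out.append(s / n if inverse else s)
    return out

def matched_filter(received, reference):
    """Circular cross-correlation computed in the frequency domain."""
    R, S = dft(received), dft(reference)
    # Multiplying by the conjugate reference spectrum is the matched filter.
    return dft([r * s.conjugate() for r, s in zip(R, S)], inverse=True)

def chirp(n, rate=0.05):
    """Unit-amplitude linear FM pulse (illustrative parameters)."""
    return [cmath.exp(1j * cmath.pi * rate * t * t) for t in range(n)]

def delayed(signal, delay):
    """Circularly shift a signal to mimic an echo arriving `delay` samples late."""
    n = len(signal)
    return [signal[(t - delay) % n] for t in range(n)]

reference = chirp(32)
response = matched_filter(delayed(reference, 5), reference)
peak = max(range(len(response)), key=lambda i: abs(response[i]))
print("estimated delay:", peak)
```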
11. SDG: Synthetic SAR returns from a uniform grid of point reflectors
Kernel 1: Reconstructed SAR image
12. The symmetric SMA (UMA)
– 1 Nehalem CPU: Intel Core i7 CPU 920
– 2.67 GHz processor frequency
– 8 MB L3 Smart Cache
– 4 Cores (8 Threads in Hyper-threading)
– 130 W power consumption
– 24 Gigabytes of DDR3 RAM
– 4.8 Giga Transfers/s QPI
The distributed SMA (NUMA)
– 2 Nehalem CPUs: Intel Xeon CPU X5670
– 2.93 GHz processor frequency
– 12 MB L3 Smart Cache
– 6 Cores/CPU
– 95 W power consumption
– 36 (18x2) Gigabytes of DDR3 RAM
– 6.4 Giga Transfers/s QPI
13. UMA-SMA architectures offer flexibility, but they tend to have memory bottlenecks.
NUMA-SMA architectures avoid memory bottlenecks, but require manual/pinned allocation of memory
for each thread.
14. Sequential FFT Multithreaded FFT
Parallelized Loops with OpenMP Tiling Technique
Threaded FFT using OpenMP
GOMP_CPU_AFFINITY="0-11"
More Private Variables
15. Most important optimizations:
• Thread Pinning (first touch policy of memory)
• Private Data (stack, local) / Shared Data (remote, cached, evicted)
• Scheduling
Static for loops with regular workloads
Dynamic for loops with irregular workloads
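The scheduling rule of thumb above can be illustrated with a small simulation that compares the finishing time of a loop under the two policies. It is a model, not OpenMP itself: static scheduling is approximated as one contiguous block per worker, dynamic scheduling as idle workers taking one iteration at a time with no scheduling overhead. The function names and the cost lists are assumptions for illustration.

```python
import heapq

def static_makespan(costs, workers):
    """Finish time when iterations are split into contiguous equal-sized blocks."""
    n = len(costs)
    chunk = -(-n // workers)   # ceiling division
    return max(sum(costs[i:i + chunk]) for i in range(0, n, chunk))

def dynamic_makespan(costs, workers):
    """Finish time when each idle worker grabs the next single iteration."""
    finish = [0.0] * workers   # current finish time of every worker
    heapq.heapify(finish)
    for c in costs:
        earliest = heapq.heappop(finish)   # worker that becomes idle first
        heapq.heappush(finish, earliest + c)
    return max(finish)

regular = [1] * 12               # every iteration costs the same
irregular = list(range(1, 13))   # later iterations cost more
for name, costs in (("regular", regular), ("irregular", irregular)):
    print(name, static_makespan(costs, 4), dynamic_makespan(costs, 4))
```

In this model both policies finish the regular loop at the same time, so the cheaper static policy is preferable there, while the irregular loop finishes earlier under dynamic assignment.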
Outlook
• The SAR data generation and image formation are scalable to:
– 4 cores in UMA (Uniform Memory Access)
– 12 cores in NUMA-2x[6Cores, 16GB RAM]
• Speedup is almost linear in these SMA architectures
• This code is expected to scale to bigger numbers of cores
• Further parallelization paradigms are planned:
– MPI (Message Passing Interface) for clusters
– CUDA for GPGPUs
16. By combining many-core processors and other COTS
products with radiation-hardened specific components
one can benefit from:
• A speedup by a factor of 10 to 100
• Improved reliability and robustness of the system.
• Efficient and faster application development via already familiar
programming models.
• Ability to port applications directly to the space environment.
• Minimization of the non-recurring development time and costs for
future missions.
• Efficient, flexible and portable software fault-tolerance
techniques that can be applied in the space environment.
• Portability to future advances in technology.
17. Thank you for your attention!
Fisnik Kraja
LRR - Lehrstuhl für Rechnertechnik und Rechnerorganisation
Technische Universität München
kraja@in.tum.de