Device Data Directory and Asynchronous execution: A path to heterogeneous computing with OmpSs-2
The LEGaTO project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 780681.
www.legato-project.eu
Rubén Cano, Carlos Álvarez, Daniel Jiménez-González, Xavier Martorell
Barcelona Supercomputing Center and Universitat Politècnica de Catalunya
Benchmarking
Device: GPU. Algorithm: Matrix Multiply. Configuration: Matrix Size [16384].
Device: FPGA. Algorithm: Matrix Multiply. Configuration: Block Size [256].
Conclusions
• Better performance than the hardware-based CUDA Unified Memory approach.
• Improved support for range dependences and multi-device copies, with the same performance as the OmpSs runtime.
• A framework to easily adapt any new device to the OmpSs-2 runtime.
References
• OmpSs-2 Programming Model: https://pm.bsc.es/ompss-2
• Nanos6 repository: https://github.com/bsc-pm/nanos6
• OmpSs@FPGA: https://pm.bsc.es/ompss-at-fpga
Task
• Depends: in(x[0; N]), out(y[0; N])
• Kind (FPGA, CUDA…)
• Instance (Device Id)
• Function (Saxpy, matmul…)
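The task fields listed above can be pictured as a small record. The following is only an illustrative sketch in Python: the names Dep and Task and the saxpy helper are hypothetical and are not part of the OmpSs-2 or Nanos6 API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Dep:
    """One range dependency, mirroring the in(x[0; N]) notation."""
    mode: str     # "in" or "out"
    symbol: str   # name of the symbol, e.g. "x"
    start: int    # first element of the range
    length: int   # number of elements in the range


@dataclass
class Task:
    """Hypothetical task descriptor with the four fields from the diagram."""
    depends: List[Dep]
    kind: str            # device kind, e.g. "CUDA" or "FPGA"
    instance: int        # device id
    function: Callable   # the kernel to run, e.g. saxpy


def saxpy(a, x, y):
    # Element-wise a*x + y, returned as a new list.
    return [a * xi + yi for xi, yi in zip(x, y)]


N = 4
task = Task(
    depends=[Dep("in", "x", 0, N), Dep("out", "y", 0, N)],
    kind="CUDA",
    instance=0,
    function=saxpy,
)
result = task.function(2.0, [1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0, 1.0])
```

Here the kind and instance fields only select where the function would run; in this sketch the call executes on the host.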
Accelerator
• Allocation Engine: allocates, reallocates and frees device memory.
• Directory (host-device range-based mapping cache): keeps track of the validity status of any given region in any device, and can translate addresses from host to any device for a given region.
• Copy Engine: generates copies between devices.
• Symbol-aware mapping: dependencies that are part of the same symbol, but are non-contiguous in memory, will have the same offsets between them in the device mapping.
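The symbol-aware mapping property can be illustrated with a few lines of Python. This is a minimal sketch under assumed addresses; map_symbol and all the constants are hypothetical and do not come from the runtime.

```python
def map_symbol(host_base, ranges, device_base):
    """Map every host range of one symbol into device memory.

    ranges is a list of (host_address, size) pairs. Each range keeps its
    offset from the symbol base, so gaps between ranges are preserved.
    """
    return {host: device_base + (host - host_base) for host, _size in ranges}


# One symbol starting at host address 0x1000 (assumed), accessed through
# two non-contiguous 256-byte ranges.
host_base = 0x1000
ranges = [(0x1000, 256), (0x1400, 256)]
mapping = map_symbol(host_base, ranges, device_base=0x80000)

gap_host = ranges[1][0] - ranges[0][0]
gap_device = mapping[0x1400] - mapping[0x1000]
```

Because both ranges are placed relative to the same symbol base, pointer arithmetic inside a kernel that walks from one range to the other stays valid on the device.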
Stream: copies, task execution, task finalization
Dependency System
This work has been supported by the Ministry of Science
and Innovation, under the project "Computación de
Altas Prestaciones VIII" (PID2019-107255GB).
Problem
1. Memory management and communication across different devices are challenging and error-prone.
2. Hardware approaches relax the memory model, but not all devices support these mechanisms.
3. These mechanisms are usually page-based, which can incur severe performance degradation due to false sharing.
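The false-sharing cost of a page-based mechanism can be made concrete with some simple arithmetic. The sketch below assumes a 4096-byte page and two 256-byte ranges; both values are illustrative assumptions, not measurements from this work.

```python
PAGE = 4096  # assumed page size in bytes


def pages_touched(start, size, page=PAGE):
    # Number of pages overlapped by the byte range [start, start + size).
    first = start // page
    last = (start + size - 1) // page
    return last - first + 1


def page_based_bytes(start, size, page=PAGE):
    # A page-based mechanism moves whole pages, whatever the range size.
    return pages_touched(start, size, page) * page


# Two devices each update a disjoint 256-byte range of the same array.
range_a = (0, 256)
range_b = (256, 256)

same_page = range_a[0] // PAGE == (range_b[0] + range_b[1] - 1) // PAGE
moved_page_based = page_based_bytes(*range_a)
moved_range_based = range_a[1]
```

With these numbers the two ranges fall in the same page, so every update by one device would move 4096 bytes and invalidate the other device's data, while range-based tracking would move only the 256 bytes actually touched.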
1. The task is ready to be executed.
2. A hardware accelerator is selected.
3. Ensure symbol validity: if a symbol is not already valid, enqueue all the copies from a valid address space into the destination device memory.
4. Set up a symbol-translation table to translate host pointers to the destination device.
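The launch steps above can be sketched as one function. This is a simplified illustration in Python, not the runtime's implementation: launch, its arguments and the address values are all hypothetical, and a symbol is marked valid as soon as its copy is enqueued, which a real runtime would defer until the copy completes.

```python
def launch(task_symbols, device, valid_on, device_base):
    """Prepare a task for execution on `device`.

    task_symbols: name -> (host_address, size)
    valid_on:     name -> set of address spaces holding a valid copy
    device_base:  name -> device address assigned to that symbol
    Returns (translation_table, enqueued_copies).
    """
    translation = {}
    copies = []
    for name, (host_addr, size) in task_symbols.items():
        # Translation table: host pointer -> device pointer.
        translation[host_addr] = device_base[name]
        # Validity check: copy only if the device lacks a valid version.
        if device not in valid_on[name]:
            source = sorted(valid_on[name])[0]  # any valid space will do
            copies.append((name, source, device, size))
            valid_on[name].add(device)
    return translation, copies


symbols = {"x": (0x1000, 4096), "y": (0x3000, 4096)}
valid_on = {"x": {"host"}, "y": {"host", "gpu0"}}
table, copies = launch(symbols, "gpu0", valid_on,
                       {"x": 0x80000, "y": 0x81000})
```

In this example only x needs a copy, because y is already valid on the selected device.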
Research
Can a software Unified Memory model be faster than current hardware-based solutions?
Proof-of-concept Solution
Device Directory
Unifies the memory model by managing device memories explicitly and ensuring that the data is available before execution.
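A range-based directory of this kind can be approximated in a few dozen lines. The Python class below is a minimal sketch under simplifying assumptions (one mapped region per device, half-open byte ranges kept in plain lists); DeviceDirectory and its methods are hypothetical names, not the Nanos6 interface.

```python
class DeviceDirectory:
    """Tracks which host ranges are valid on each address space."""

    def __init__(self):
        self.valid = {}  # space -> list of (start, end) valid host ranges
        self.base = {}   # space -> (host_base, device_base) of its region

    def map_region(self, space, host_base, device_base):
        self.base[space] = (host_base, device_base)
        self.valid.setdefault(space, [])

    def mark_valid(self, space, start, end):
        self.valid.setdefault(space, []).append((start, end))

    def invalidate_others(self, writer, start, end):
        # After `writer` modifies [start, end), trim that range from the
        # valid ranges of every other address space.
        for space in list(self.valid):
            if space == writer:
                continue
            kept = []
            for s, e in self.valid[space]:
                if e <= start or s >= end:
                    kept.append((s, e))
                else:
                    if s < start:
                        kept.append((s, start))
                    if e > end:
                        kept.append((end, e))
            self.valid[space] = kept

    def is_valid(self, space, start, end):
        # True only if [start, end) is fully covered by valid ranges.
        pos = start
        for s, e in sorted(self.valid.get(space, [])):
            if s > pos:
                break
            pos = max(pos, e)
            if pos >= end:
                return True
        return pos >= end

    def translate(self, space, host_addr):
        host_base, device_base = self.base[space]
        return device_base + (host_addr - host_base)


d = DeviceDirectory()
d.map_region("gpu0", 0x1000, 0x80000)
d.mark_valid("host", 0x1000, 0x2000)
d.mark_valid("gpu0", 0x1000, 0x2000)
d.invalidate_others("gpu0", 0x1800, 0x2000)  # gpu0 writes the upper half
host_lower_ok = d.is_valid("host", 0x1000, 0x1800)
host_upper_ok = d.is_valid("host", 0x1800, 0x2000)
device_addr = d.translate("gpu0", 0x1800)
```

Tracking validity per range rather than per page means the lower half stays usable on the host after the device writes only the upper half.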
Stream
Unifies the execution model of any device as an asynchronous queue of sequential operations.
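The stream abstraction can be sketched with a worker thread draining a FIFO queue. This Python example is an illustration only: the Stream class and its method names are hypothetical, and error handling is omitted (an operation that raises would stop the worker).

```python
import queue
import threading


class Stream:
    """Asynchronous queue whose operations run one after another."""

    def __init__(self):
        self._ops = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        while True:
            op = self._ops.get()
            try:
                if op is None:  # sentinel: stop the worker
                    return
                op()
            finally:
                self._ops.task_done()

    def enqueue(self, op):
        # Returns immediately; the operation runs later, in FIFO order.
        self._ops.put(op)

    def synchronize(self):
        # Block until every enqueued operation has finished.
        self._ops.join()

    def close(self):
        self._ops.put(None)
        self._worker.join()


log = []
stream = Stream()
stream.enqueue(lambda: log.append("copies"))
stream.enqueue(lambda: log.append("task execution"))
stream.enqueue(lambda: log.append("task finalization"))
stream.synchronize()
stream.close()
```

The caller never waits on an individual operation; ordering between copies, execution and finalization comes from the queue itself.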
FPGA platform: Zynq UltraScale+ 9EG, 3 matmul accelerators [256x256]
GPU platform: IBM Power9 8335-GTH, 1 NVIDIA V100
We would like to thank the Xilinx University Program for software and board donations.
Legend: OmpSs, OmpSs-2