1. HiPEAC CSW Autumn 2020
The LEGaTO project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 780681.
16.10.2020
LEGaTO: Software Stack Programming Models
HiPEAC 2020 Computer Systems Week
16-10-2020
Pascal Felber
University of Neuchatel
2. HiPEAC CSW Autumn 2020
Outline
• Programming models in LEGaTO’s big picture
• Common programming model for different targets
• Energy efficiency
• High-level dataflow hardware description language
• Kernel identification and dataflow engine mapping
• Fault tolerance and security
3. HiPEAC CSW Autumn 2020
LEGaTO big picture
[Figure: the LEGaTO stack] Use cases (Smart Home, Smart City, Secure IoT Gateway, Machine Learning, Healthcare) sit on top of a layered stack, with energy efficiency, programmability, fault tolerance and security as cross-cutting LEGaTO aspects:
• Programming model: sequential task-based OmpSs programs, OmpSs Eclipse IDE plug-in, fault-tolerance interface
• Compiler & HLS: Mercurium compilation, XiTAO front-end, SCONE compiler, MaxCompiler, AutoAit, DFiant HLS; C and HLS source code / RTL are turned into CPU/GPU binaries and bitstreams by the native compiler and linker and FPGA synthesis
• Runtime: Nanos, XiTAO and SCONE runtimes, HEATS
• Middleware: OpenStack middleware for deployment, monitoring and control; node composition (Redfish API), monitoring and control (REST API)
• Hardware: microserver hardware platform with CPU, GPU and FPGA/DFE
4. HiPEAC CSW Autumn 2020
Main achievements
• Programming model, annotations
• Compiler support for OmpSs-2 with GPUs and FPGAs,
annotated task model, LLVM code generation
• IDE plugin for Eclipse
• Task groups, resource partitioning
• Energy efficiency in task scheduling (XiTAO, HEATS, DiAS)
• DFiant high-productivity HDL
• Mapping of OmpSs tasks onto MaxJ
• Fault tolerance through compiler-based error detection,
co-scheduling, checkpointing, and secure task execution in a TEE
5. HiPEAC CSW Autumn 2020
Towards a single source for any target
• New architectures continue to appear
− Common programming model
− Increase programmers’ productivity
− Develop once → run everywhere
• Performance and energy efficiency
• Key concept behind OmpSs
− Sequential task-based program on a single address/name space + directionality annotations (see the sketch below)
− Executed in parallel: automatic runtime computation of dependences among tasks
− LEGaTO: extend tasks with resource requirements, propagated through the stack to find the most energy-efficient solution at run time
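A minimal sketch of what such an annotated program can look like, assuming OmpSs-style pragmas and array-section syntax ([BS]ptr); the names init_block, scale_block, N and BS are illustrative and not taken from a LEGaTO use case:

    /* Tasks are declared on function definitions with directionality annotations;
       the runtime derives the dependence graph and runs independent tasks in parallel. */
    #include <stdlib.h>

    #define N  1024
    #define BS 256

    #pragma omp task out([BS]block)
    void init_block(float *block) {
        for (int j = 0; j < BS; j++) block[j] = (float)j;
    }

    #pragma omp task inout([BS]block)
    void scale_block(float *block, float factor) {
        for (int j = 0; j < BS; j++) block[j] *= factor;
    }

    int main(void) {
        float *data = malloc(N * sizeof(float));
        for (int i = 0; i < N; i += BS) {
            init_block(&data[i]);         /* producer task for this block */
            scale_block(&data[i], 2.0f);  /* consumer task: inout dependence on the same block */
        }
        #pragma omp taskwait              /* preserves the sequential program's semantics */
        free(data);
        return 0;
    }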
8. HiPEAC CSW Autumn 2020
OmpSs with FPGA experiments
Parameter | Values | Description
IPs configuration | 1*256, 3*128 | Number of instances * size
Frequency (MHz) | 200, 250, 300 | Working frequency of the FPGA
Number of SMP cores | SMP: 1 to 4; FPGA: 3+1 helper, 2+2 helpers | Combination of SMP and helper threads
Number of FPGA helper threads | SMP: 0; FPGA: 1, 2 | Helper threads are used to manage tasks on the FPGA
Number of pending tasks | 4, 8, 16 and 32 | Number of tasks sent to the IP cores before waiting for their finalization
9. HiPEAC CSW Autumn 2020
IDE plug-in
• OpenMP and OmpSs support in Eclipse
− Support for most of the programming models' directives and clauses
− Includes small help descriptions
− Context-based auto-completion
10. HiPEAC CSW Autumn 2020
DFiant HDL
• Aims to bridge the programmability gap by combining constructs
and semantics from software, hardware and dataflow languages
• Programming model accommodates a middle ground between low-level HDLs and high-level sequential programming
• How it compares (from the figure):
− High-level synthesis languages and tools (e.g., C and Vivado HLS): automatic pipelining, but not an HDL and problematic handling of state
− Register-transfer level HDLs (e.g., VHDL): concurrency and fine-grain control, but bound to the clock and requiring explicit pipelining
− DFiant, a dataflow HDL: automatic pipelining, concurrency and fine-grain control, separating timing from functionality
11. HiPEAC CSW Autumn 2020
Task-based kernel identification for DFE mapping
• OmpSs identifies "static" task graphs while running
• Annotations of I/O and compute help to create the DFE task model
• Instantiate static, customized, ultra-deep (>1,000 stages) computing pipelines
13. HiPEAC CSW Autumn 2020
XiTAO data parallel nodes
• Energy Efficient
− Data-parallel nodes hide internal task parallelism and can be scheduled with XiTAO's energy-efficient scheduler
• Programmable
− C++-based interface requiring minimal application code changes
• Task/Data Parallel
− Easy and intuitive nesting of data-parallel nodes in a coarser TAO-DAG
• Granularity/Slackness Control
− User-level control over the granularity of internal parallelism (control of BLOCK_LENGTH for dynamically scheduled TAOs)
14. HiPEAC CSW Autumn 2020
Energy efficiency for large jobs
HEATS: heterogeneity- and
energy-aware task scheduling
• Exploit the requirements of a
given task to identify the most
efficient configuration of nodes
• Monitor tasks and nodes in real time to perform the best-fitting placement and migrations when necessary
• Prototype in Kubernetes
15. HiPEAC CSW Autumn 2020
Energy efficiency for large jobs
• Big data production systems usually implement priority
scheduling
− Job streams with different characteristics, latency
requirements
− Jobs with varying numbers of tasks
− High-priority jobs are promptly served with little queueing
− Low-priority jobs suffer from repetitive evictions
− Pre-emptive priority scheduling = significant resource waste
• DiAS: differentially approximate (data load) and sprint (CPU frequency)
• DiAS improves the latency for all priorities and eliminates the waste of re-executing evicted low-priority jobs
16. HiPEAC CSW Autumn 2020
Fault tolerance and security
• Fault tolerance front-end (compiler annotations)
− Initial work that translates pragma annotations to FTI API calls (see the sketch after the table)
#pragma chk init : Initialize the fault tolerance interface (FTI) library
#pragma chk load(data-expr-list) : Protect variables in expr-list & recover from file
#pragma chk store(data-expr-list) : Protect variables in expr-list & create a checkpoint file
#pragma chk shutdown : Finalize/de-allocate the internal FTI data structures
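A hedged sketch of how these annotations might be used in an application; the variable names (grid, it) and the checkpoint interval are hypothetical, and the front-end translates the pragmas into FTI library calls:

    int main(void) {
        static double grid[1024];
        int it = 0;

        #pragma chk init                    /* initialize the FTI library */
        #pragma chk load(grid, it)          /* protect variables and recover them from file if a checkpoint exists */

        for (; it < 10000; it++) {
            /* ... computation updating grid ... */
            if (it % 100 == 0) {
                #pragma chk store(grid, it) /* protect variables and create/refresh a checkpoint file */
            }
        }

        #pragma chk shutdown                /* finalize/de-allocate the internal FTI data structures */
        return 0;
    }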
• Fault tolerance back-end
− Implemented incremental checkpointing in FTI, used to partially update checkpoint files
− Implemented partial recovery from checkpoint files, used on recovery to extract a task's output data from the checkpoint file
17. HiPEAC CSW Autumn 2020
SCONE platform
• Enables native applications to run inside Intel SGX
enclaves without code changes
• Transparently attests applications
• Supports network and file system shields
• Manages secrets and configuration
• Supports secure multi-stakeholder machine learning
computations:
− Code, data, and models are encrypted
− TensorFlow, PyTorch, OpenVINO, OpenCV, etc.
18. HiPEAC CSW Autumn 2020
Summary
• Heterogeneity
− Integrated programming model around OmpSs
• Energy efficiency
− XiTAO scheduling
HLS: High Level Synthesis
HDL: Hardware Description Language
AutoAit: mapping of OmpSs to Vivado
Overall software toolchain for the cluster runtime.
The LEGaTO programming model front-end is shown on the left-hand side of the figure. The LEGaTO front-end consists of the tools that process the source code and generate the LEGaTO binaries targeting the heterogeneous platforms. These tools include extensions to Mercurium (previously developed by BSC) to analyze OmpSs source code and generate Nanos/XiTAO/FPGA/GPU binaries, and two high-level programming methodologies to generate dataflow kernels: DFiant and MaxJ.
OmpSs provides tasking to SMP cores.
Usual scheduling policies/techniques: FIFO, Cilk, immediate successor.
"Implements" provides different targets for the same task, e.g. a kernel provided in CUDA or OpenCL; data transfers are automatically issued by OmpSs.
Single-source parallel programming with FPGA acceleration: the "implements" technique is available, and "num_instances(N)" allows generating the indicated number of IP accelerators (see the sketch below).
AXIOM board: Xilinx Zynq UltraScale+ chip with 4 ARM Cortex-A53 cores and the ZU9EG FPGA.
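A hedged sketch of the "implements" and "num_instances(N)" clauses, assuming OmpSs@FPGA-style syntax; the task name vadd and the block size BS are illustrative:

    #define BS 256

    /* Base SMP task with directionality annotations. */
    #pragma omp target device(smp) copy_deps
    #pragma omp task in([BS]a, [BS]b) out([BS]c)
    void vadd(const float *a, const float *b, float *c) {
        for (int i = 0; i < BS; i++) c[i] = a[i] + b[i];
    }

    /* Alternative FPGA implementation of the same task; num_instances(3) asks the
       toolchain to generate three IP accelerator instances, and the runtime picks
       a suitable target (and issues data transfers) at run time. */
    #pragma omp target device(fpga) num_instances(3) implements(vadd) copy_deps
    #pragma omp task in([BS]a, [BS]b) out([BS]c)
    void vadd_fpga(const float *a, const float *b, float *c) {
        for (int i = 0; i < BS; i++) c[i] = a[i] + b[i];
    }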
DFE: Data Flow Engine
Task-based kernel identification/DFE mapping
The purpose of Task T4.6 is to identify static sub-graphs in the OmpSs task graph and map them to kernels on a Maxeler FPGA-based Dataflow Engine (DFE). The rationale is that OmpSs tasks appear naturally suitable for FPGA mapping: they have clearly defined inputs and outputs and self-contained state. Maxeler's programming model is based on dataflow, where large dataflow graphs, described in MaxJ, are mapped and optimised to generate FPGA configurations. These dataflow graphs are essentially static, highly customised and ultra-deep pipelines that achieve very high computational throughput. Generating these dataflow graphs is supported by Maxeler's MaxCompiler toolchain, and runtime execution from a host application is enabled through the MaxelerOS runtime. A task-based programming model such as OmpSs is a good match to act as a front end for dataflow graph generation. However, due to the high context-switching overhead of FPGAs, task graphs mapped to FPGAs need to be static.
SLiC: interface into Maxeler hardware that allows spawning tasks.
Granularity of internal parallelism: thread-to-core mapping
We have further improved HEATS by designing an updated version that exploits migrations not only across heterogeneous nodes but also across the three layers of the deployment architecture (edge, fog and cloud). The prototype implementation and testing of this update are currently under development. On top of that, we have also observed performance and energy improvements when tuning the CPU frequency of the nodes. Based on this, we have developed a sprinting approach that runs jobs at a higher frequency for as long as a predefined budget is not used up. In this work we also take into account that jobs might have different priorities. The system was implemented in Go and tested using Spark. The outcome of this work has been submitted to Middleware'19 in a paper called "Differential Approximation and Sprinting for Multi-Priority Big Data Engines", which is currently under revision.
Trick: reduce a fraction of the data load for low-priority jobs; temporarily increase the CPU frequency for high-priority jobs.
Differential approximation
Controllable approximation level that discriminates among priority classes: drop different fractions of data
Better latencies for low-priority jobs at the cost of some accuracy loss
Less latency increase for high-priority jobs
Stochastic models to control approximation and sprinting
Adjust the frequency levels
Accelerate high-priority jobs after temporarily waiting behind low-priority ones
Result: reduce tail latencies (90% for low priority, 60% for high priority) and energy (20%)
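A purely illustrative sketch of the policy described above (the actual system is implemented in Go); the priority classes, the 20% drop fraction, the frequency values and the budget model are hypothetical:

    /* Hypothetical illustration of differential approximation + sprinting. */
    typedef enum { PRIO_LOW, PRIO_HIGH } prio_t;

    /* Differential approximation: low-priority jobs process only a fraction of
       their input data, trading some accuracy for lower latency. */
    static double data_fraction(prio_t p) {
        return (p == PRIO_LOW) ? 0.8 : 1.0;   /* drop 20% of the data for low priority */
    }

    /* Sprinting: run a high-priority job at a boosted CPU frequency while a
       predefined budget is not used up, then fall back to the nominal frequency. */
    static long pick_frequency_khz(prio_t p, double *budget, double cost) {
        if (p == PRIO_HIGH && *budget >= cost) {
            *budget -= cost;
            return 3200000;   /* boosted frequency */
        }
        return 2400000;       /* nominal frequency */
    }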
FTI: fault tolerance interface.
Fault Tolerance Mechanisms
In order to improve the availability of the platform we implemented a secure checkpointing mechanism. Our first version was implemented using the file system shield available in SCONE. Moreover, we added support for the system calls vfork() and fork() in the SCONE toolchain in order to provide additional fault tolerance mechanisms such as rejuvenation, a commonly implemented fault tolerance technique. Providing fork support for applications running in Trusted Execution Environments is a non-trivial problem. First, it requires creating a new enclave and copying the whole application state to the new enclave, including running an attestation. Furthermore, the state must be consistent when using multiple enclave threads, and non-forking threads must be pre-empted. We have recently completed the implementation of this functionality, which is currently being tested at the time of writing.