1. HiPEAC CSW Autumn 2020
The LEGaTO project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 780681.
16.10.2020
LEGaTO: Software Stack Programming Models
HiPEAC 2020 Computer Systems Week
16-10-2020
Pascal Felber
University of Neuchatel
2. HiPEAC CSW Autumn 2020
Outline
• Programming models in LEGaTO’s big picture
• Common programming model for different targets
• Energy efficiency
• High-level dataflow hardware description language
• Kernel identification and dataflow engine mapping
• Fault tolerance and security
3. HiPEAC CSW Autumn 2020
LEGaTO big picture
[Figure: the LEGaTO stack] Use cases (Smart Home, Smart City, Secure IoT Gateway, Machine Learning, Healthcare) sit on top of a layered stack, with energy efficiency, programmability, fault tolerance and security as cross-cutting LEGaTO aspects:
• Programming model: sequential task-based OmpSs programs, OmpSs Eclipse IDE plug-in, fault-tolerance interface
• Compiler & HLS: Mercurium compilation, XiTAO front-end, SCONE compiler, MaxCompiler, AutoAit, DFiant HLS; C and HLS source code / RTL are turned into CPU/GPU binaries and bitstreams by the native compiler and linker and FPGA synthesis
• Runtime: Nanos, XiTAO and SCONE runtimes, HEATS
• Middleware: OpenStack middleware for deployment, monitoring and control; node composition (Redfish API), monitoring and control (REST API)
• Hardware: microserver hardware platform with CPU, GPU and FPGA/DFE
4. HiPEAC CSW Autumn 2020
Main achievements
• Programming model, annotations
• Compiler support for OmpSs-2 with GPUs and FPGAs,
annotated task model, LLVM code generation
• IDE plugin for Eclipse
• Task groups, resource partitioning
• Energy efficiency in task scheduling (XiTAO, HEATS, DiAS)
• DFiant high-productivity HDL
• Mapping of OmpSs tasks onto MaxJ
• Fault tolerance through compiler-based error detection,
co-scheduling, checkpointing, and secure task execution in a TEE
5. HiPEAC CSW Autumn 2020
Towards a single source for any target
• New architectures continue to appear
− Common programming model
− Increase programmers’ productivity
− Develop once → run everywhere
• Performance and energy efficiency
• Key concept behind OmpSs
− Sequential task-based program on a single address/name space + directionality annotations (see the sketch below)
− Executed in parallel: automatic runtime computation of dependences among tasks
− LEGaTO: extend tasks with resource requirements, propagated through the stack to find the most energy-efficient solution at run time
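A minimal sketch of what such an annotated program can look like, assuming OmpSs-style pragmas and array-section syntax ([BS]ptr); the names init_block, scale_block, N and BS are illustrative and not taken from a LEGaTO use case:

    /* Tasks are declared on function definitions with directionality annotations;
       the runtime derives the dependence graph and runs independent tasks in parallel. */
    #include <stdlib.h>

    #define N  1024
    #define BS 256

    #pragma omp task out([BS]block)
    void init_block(float *block) {
        for (int j = 0; j < BS; j++) block[j] = (float)j;
    }

    #pragma omp task inout([BS]block)
    void scale_block(float *block, float factor) {
        for (int j = 0; j < BS; j++) block[j] *= factor;
    }

    int main(void) {
        float *data = malloc(N * sizeof(float));
        for (int i = 0; i < N; i += BS) {
            init_block(&data[i]);         /* producer task for this block */
            scale_block(&data[i], 2.0f);  /* consumer task: inout dependence on the same block */
        }
        #pragma omp taskwait              /* preserves the sequential program's semantics */
        free(data);
        return 0;
    }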
8. HiPEAC CSW Autumn 2020
OmpSs with FPGA experiments
Parameter | Values | Description
IPs configuration | 1*256, 3*128 | Number of instances * size
Frequency (MHz) | 200, 250, 300 | Working frequency of the FPGA
Number of SMP cores | SMP: 1 to 4; FPGA: 3+1 helper, 2+2 helpers | Combination of SMP and helper threads
Number of FPGA helper threads | SMP: 0; FPGA: 1, 2 | Helper threads are used to manage tasks on the FPGA
Number of pending tasks | 4, 8, 16 and 32 | Number of tasks sent to the IP cores before waiting for their finalization
9. HiPEAC CSW Autumn 2020
IDE plug-in
• OpenMP and OmpSs support in Eclipse
− Support for most of the programming models' directives and clauses
− Includes small help descriptions
− Context-based auto-completion
10. HiPEAC CSW Autumn 2020
DFiant HDL
• Aims to bridge the programmability gap by combining constructs
and semantics from software, hardware and dataflow languages
• Programming model accommodates a middle ground between low-level HDLs and high-level sequential programming
• How it compares (from the figure):
− High-level synthesis languages and tools (e.g., C and Vivado HLS): automatic pipelining, but not an HDL and problematic handling of state
− Register-transfer level HDLs (e.g., VHDL): concurrency and fine-grain control, but bound to the clock and requiring explicit pipelining
− DFiant, a dataflow HDL: automatic pipelining, concurrency and fine-grain control, separating timing from functionality
11. HiPEAC CSW Autumn 2020
Task-based kernel identification for DFE mapping
• OmpSs identifies "static" task graphs while running
• Annotations of I/O and compute help to create the DFE task model
• Instantiate static, customized, ultra-deep (>1,000 stages) computing pipelines
13. HiPEAC CSW Autumn 2020
XiTAO data parallel nodes
• Energy Efficient
− Data-parallel nodes hide internal task parallelism and can be scheduled with XiTAO's energy-efficient scheduler
• Programmable
− C++-based interface requiring minimal application code changes
• Task/Data Parallel
− Easy and intuitive nesting of data-parallel nodes in a coarser TAO-DAG
• Granularity/Slackness Control
− User-level control over the granularity of internal parallelism (control of BLOCK_LENGTH for dynamically scheduled TAOs)
14. HiPEAC CSW Autumn 2020
Energy efficiency for large jobs
HEATS: heterogeneity- and
energy-aware task scheduling
• Exploit the requirements of a
given task to identify the most
efficient configuration of nodes
• Monitor tasks and nodes in real time to perform the best-fitting placement and migrations when necessary
• Prototype in Kubernetes
15. HiPEAC CSW Autumn 2020
Energy efficiency for large jobs
• Big data production systems usually implement priority
scheduling
− Job streams with different characteristics, latency
requirements
− Jobs with varying numbers of tasks
− High-priority jobs are promptly served with little queueing
− Low-priority jobs suffer from repetitive evictions
− Pre-emptive priority scheduling = significant resource waste
• DiAS: differentially approximate (data load) and sprint (CPU frequency)
• DiAS improves the latency for all priorities and eliminates the waste of re-executing evicted low-priority jobs
16. HiPEAC CSW Autumn 2020
Fault tolerance and security
• Fault tolerance front-end (compiler annotations)
− Initial work that translates pragma annotations to FTI API calls (see the sketch after the table)
#pragma chk init : Initialize the fault tolerance interface (FTI) library
#pragma chk load(data-expr-list) : Protect variables in expr-list & recover from file
#pragma chk store(data-expr-list) : Protect variables in expr-list & create a checkpoint file
#pragma chk shutdown : Finalize/de-allocate the internal FTI data structures
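A hedged sketch of how these annotations might be used in an application; the variable names (grid, it) and the checkpoint interval are hypothetical, and the front-end translates the pragmas into FTI library calls:

    int main(void) {
        static double grid[1024];
        int it = 0;

        #pragma chk init                    /* initialize the FTI library */
        #pragma chk load(grid, it)          /* protect variables and recover them from file if a checkpoint exists */

        for (; it < 10000; it++) {
            /* ... computation updating grid ... */
            if (it % 100 == 0) {
                #pragma chk store(grid, it) /* protect variables and create/refresh a checkpoint file */
            }
        }

        #pragma chk shutdown                /* finalize/de-allocate the internal FTI data structures */
        return 0;
    }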
• Fault tolerance back-end
− Implemented incremental checkpointing in FTI, used to partially update checkpoint files
− Implemented partial recovery from checkpoint files, used on recovery to extract a task's output data from the checkpoint file
17. HiPEAC CSW Autumn 2020
SCONE platform
• Enables native applications to run inside Intel SGX
enclaves without code changes
• Transparently attests applications
• Supports network and file system shields
• Manages secrets and configuration
• Supports secure multi-stakeholder machine learning
computations:
− Code, data, and models are encrypted
− TensorFlow, PyTorch, OpenVINO, OpenCV, etc.
18. HiPEAC CSW Autumn 2020
Summary
• Heterogeneity
− Integrated programming model around OmpSs
• Energy efficiency
− XiTAO scheduling
HLS: High Level Synthesis
HDL: Hardware Description Language
AutoAit: mapping of OmpSs to Vivado
Overall software toolchain for the cluster runtime.
The LEGaTO programming model front-end is shown on the left-hand side of the figure. The LEGaTO front-end consists of the tools that process the source code and generate the LEGaTO binaries targeting the heterogeneous platforms. These tools include extensions to Mercurium (previously developed by BSC) to analyze OmpSs source code and generate Nanos/XiTAO/FPGA/GPU binaries, and two high-level programming methodologies to generate dataflow kernels: DFiant and MaxJ.
OmpSs provides tasking to SMP cores.
Usual scheduling policies/techniques: FIFO, Cilk, immediate successor.
"Implements" provides different targets for the same task, e.g. a kernel provided in CUDA or OpenCL; data transfers are automatically issued by OmpSs.
Single-source parallel programming with FPGA acceleration: the "implements" technique is available, and "num_instances(N)" allows generating the indicated number of IP accelerators (see the sketch below).
AXIOM board: Xilinx Zynq UltraScale+ chip with 4 ARM Cortex-A53 cores and the ZU9EG FPGA.
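A hedged sketch of the "implements" and "num_instances(N)" clauses, assuming OmpSs@FPGA-style syntax; the task name vadd and the block size BS are illustrative:

    #define BS 256

    /* Base SMP task with directionality annotations. */
    #pragma omp target device(smp) copy_deps
    #pragma omp task in([BS]a, [BS]b) out([BS]c)
    void vadd(const float *a, const float *b, float *c) {
        for (int i = 0; i < BS; i++) c[i] = a[i] + b[i];
    }

    /* Alternative FPGA implementation of the same task; num_instances(3) asks the
       toolchain to generate three IP accelerator instances, and the runtime picks
       a suitable target (and issues data transfers) at run time. */
    #pragma omp target device(fpga) num_instances(3) implements(vadd) copy_deps
    #pragma omp task in([BS]a, [BS]b) out([BS]c)
    void vadd_fpga(const float *a, const float *b, float *c) {
        for (int i = 0; i < BS; i++) c[i] = a[i] + b[i];
    }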
DFE: Data Flow Engine
Task-based kernel identification/DFE mapping
The purpose of Task T4.6 is to identify static sub-graphs in the OmpSs task graph and map them to kernels on a Maxeler FPGA-based Dataflow Engine (DFE). The rationale is that OmpSs tasks appear naturally suitable for FPGA mapping: they have clearly defined inputs and outputs and self-contained state. Maxeler's programming model is based on dataflow, where large dataflow graphs, described in MaxJ, are mapped and optimised to generate FPGA configurations. These dataflow graphs are essentially static, highly customised and ultra-deep pipelines that achieve very high computational throughput. Generating these dataflow graphs is supported by Maxeler's MaxCompiler toolchain, and runtime execution from a host application is enabled through the MaxelerOS runtime. A task-based programming model such as OmpSs is a good match to act as a front end for dataflow graph generation. However, due to the high context-switching overhead of FPGAs, task graphs mapped to FPGAs need to be static.
SLiC: interface into Maxeler hardware that allows spawning tasks.
Granularity of internal parallelism: thread-to-core mapping
We have further improved HEATS by designing an updated version that exploits migrations not only across heterogeneous nodes but also across the three layers of the deployment architecture (edge, fog and cloud). The prototype implementation and testing of this update are currently under development. On top of that, we have also observed performance and energy improvements when tuning the CPU frequency of the nodes. Based on this, we have developed a sprinting approach that runs jobs at a higher frequency for as long as a predefined budget is not used up. In this work we also take into account that jobs might have different priorities. The system was implemented in Go and tested using Spark. The outcome of this work has been submitted to Middleware'19 in a paper called "Differential Approximation and Sprinting for Multi-Priority Big Data Engines", which is currently under revision.
Trick: reduce a fraction of the data load for low-priority jobs; temporarily increase the CPU frequency for high-priority jobs.
Differential approximation
Controllable approximation level that discriminates among priority classes: drop different fractions of data
Better latencies for low-priority jobs at the cost of some accuracy loss
Less latency increase for high-priority jobs
Stochastic models to control approximation and sprinting
Adjust the frequency levels
Accelerate high-priority jobs after temporarily waiting behind low-priority ones
Result: reduce tail latencies (90% for low priority, 60% for high priority) and energy (20%)
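A purely illustrative sketch of the policy described above (the actual system is implemented in Go); the priority classes, the 20% drop fraction, the frequency values and the budget model are hypothetical:

    /* Hypothetical illustration of differential approximation + sprinting. */
    typedef enum { PRIO_LOW, PRIO_HIGH } prio_t;

    /* Differential approximation: low-priority jobs process only a fraction of
       their input data, trading some accuracy for lower latency. */
    static double data_fraction(prio_t p) {
        return (p == PRIO_LOW) ? 0.8 : 1.0;   /* drop 20% of the data for low priority */
    }

    /* Sprinting: run a high-priority job at a boosted CPU frequency while a
       predefined budget is not used up, then fall back to the nominal frequency. */
    static long pick_frequency_khz(prio_t p, double *budget, double cost) {
        if (p == PRIO_HIGH && *budget >= cost) {
            *budget -= cost;
            return 3200000;   /* boosted frequency */
        }
        return 2400000;       /* nominal frequency */
    }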
FTI: fault tolerance interface.
Fault Tolerance Mechanisms
In order to improve the availability of the platform we implemented a secure checkpointing mechanism. Our first version was implemented using the file system shield available in SCONE. Moreover, we added support for the system calls vfork() and fork() in the SCONE toolchain in order to provide additional fault tolerance mechanisms such as rejuvenation, a commonly implemented fault tolerance technique. Providing fork support for applications running in Trusted Execution Environments is a non-trivial problem. First, it requires creating a new enclave and copying the whole application state to the new enclave, including running an attestation. Furthermore, the state must be consistent when using multiple enclave threads, and non-forking threads must be pre-empted. We have recently completed the implementation of this functionality, which is currently being tested at the time of writing.