The LEGaTO project has received funding from the European Union's Horizon 2020 research and
innovation programme under the grant agreement No 780681
10/13/20
LEGaTO:
Software Stack
Runtimes
HiPEAC 2020
Computer Systems Week
16-10-2020
Miquel Pericas
Chalmers University of Technology
HiPEAC CSW Autumn 2020
Outline
• Middleware – SLURM and RedFish
• OmpSs@FPGA (Xavier)
• XiTAO:
−Introduction: XiTAO execution model
−Energy Aware Scheduler
−Software Topologies
−Pipeline parallelism
• FPGA Undervolting
• Fault tolerance - GPU Checkpointing
Slurm and RECS Master
• Integration of Slurm with RECS Master
o Node specification in the Slurm configuration (partitions, limits…)
o Slurm reads the node specification and selects the target nodes
o Allocates, joins and starts the nodes
o Executes the application(s)
o Shuts down the nodes and destroys the allocation
$ sinfo
PART… AVAIL LIMIT NODES STATE NODELIST
debug* up infinite 1 idle* pcxavim5
debug* up infinite 16 idle BB_1_[0,2-15],pcxavim6
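The NODELIST column above uses Slurm's compressed bracket notation. As a minimal sketch of how such an expression maps to individual node names, the helper below expands the simple form seen in this output. The function name `expand_nodelist` is hypothetical, and the sketch deliberately ignores zero-padded and multi-dimensional ranges; on a real cluster `scontrol show hostnames` does this job.

```python
import re

def expand_nodelist(nodelist):
    """Expand a simple Slurm node list such as 'BB_1_[0,2-15],pcxavim6'."""
    names = []
    # Take each comma-separated item, keeping a bracketed range with its prefix.
    for item in re.findall(r"[^,\[]+(?:\[[^\]]*\])?", nodelist):
        match = re.fullmatch(r"(.*)\[([^\]]*)\]", item)
        if match is None:
            names.append(item)  # plain host name, nothing to expand
            continue
        prefix, ranges = match.groups()
        for part in ranges.split(","):
            low, _, high = part.partition("-")
            # A single index such as '0' has no upper bound, so reuse low.
            for index in range(int(low), int(high or low) + 1):
                names.append(f"{prefix}{index}")
    return names

nodes = expand_nodelist("BB_1_[0,2-15],pcxavim6")
print(len(nodes))
```

The expanded list has one entry per node, which is how the NODES count in the same row can be cross-checked.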
Slurm and RECS Master
• Slurm contacts the RECS Master when a job starts executing and when it terminates
#!/bin/bash
#SBATCH -N 10
#SBATCH --constraint=ARM,bigLITTLE,hasGPU
#SBATCH -o test-%j.out
#SBATCH -e test-%j.err
# App invocation
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug* up infinite 1 idle* pcxavim5
debug* up infinite 10 alloc BB_1_[0,2-10]
debug* up infinite 6 idle BB_1_[11-15],pcxavim6
$ sbatch batch-10-bl.sh
Submitted batch job 39
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
39 debug batch-10 xavim R 0:42 10 BB_1_[0,2-10]
Slurm and RECS Master
• Composed nodes are created using the RECS Master web service
• They are started and stopped automatically
10 nodes are
turned on
OmpSs@FPGA
● Offload of matrix multiplication to FPGA
#pragma omp target device(fpga) num_instances(3)
#pragma omp task in([BSIZE*BSIZE]a, [BSIZE*BSIZE]b) inout([BSIZE*BSIZE]c)
void matmulBlock(const elem_t *a, const elem_t *b, elem_t *c)
{
#pragma HLS INLINE // off
#pragma HLS array_partition variable=a cyclic factor=4
#pragma HLS array_partition variable=b cyclic factor=BSIZE/4
#pragma HLS array_partition variable=c cyclic factor=BSIZE/2
for (int k = 0; k < BSIZE; ++k) {
…
}
}
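To show where a block task like `matmulBlock` fits, here is a minimal Python sketch of the blocked decomposition a caller would typically use: one block multiply per (i, j, k) triple, each accumulating into block C[i][j]. This is an illustration under assumptions, not the OmpSs@FPGA code: the names `matmul_block` and `blocked_matmul`, the grid-of-flat-blocks layout and the tiny `BSIZE` are all chosen for the example.

```python
BSIZE = 2  # illustrative block edge; the slides use 256x256 blocks

def matmul_block(a, b, c):
    """c += a * b for one BSIZE x BSIZE block stored row-major in flat lists."""
    for i in range(BSIZE):
        for j in range(BSIZE):
            acc = 0.0
            for k in range(BSIZE):
                acc += a[i * BSIZE + k] * b[k * BSIZE + j]
            c[i * BSIZE + j] += acc

def blocked_matmul(A, B, C, nblocks):
    """A, B, C are nblocks x nblocks grids of flat blocks; computes C += A * B."""
    for i in range(nblocks):
        for j in range(nblocks):
            for k in range(nblocks):
                # Each call reads two blocks and updates one, mirroring the
                # in/in/inout dependences declared on the task above.
                matmul_block(A[i][k], B[k][j], C[i][j])

a = [1.0, 2.0, 3.0, 4.0]
b = [5.0, 6.0, 7.0, 8.0]
c = [0.0] * 4
matmul_block(a, b, c)
print(c)
```

Calls that update different C blocks are independent, which is what lets a task runtime spread them over several accelerator instances.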
OmpSs@FPGA
● Acceleration of matrix multiplication on FPGAs
− 4 ARM cores (OpenBLAS)
− 1 to 3 IP cores
● Block size 256x256
[Figure: Matrix multiply, energy efficiency. GFlops and GFlops/W for 4 ARM cores, 1 IP core, 2 IP cores and 3 IP cores]
● Best performance: 3 IP cores
● Best energy efficiency: 2 IP cores
XiTAO: Energy Aware Scheduler
• Module 1: Power Profiling
− Helps the runtime understand CPU power consumption trends (number/type of cores, different frequencies)
• Module 2: Dynamic Performance Modeling
− Provides accurate performance predictions for future tasks given a set of resources
− Independent of platforms and frequencies
− Achieves the scalability and portability goals
• Module 3: Idleness Tracing
− Reports the real-time status of cores
− Puts cores to "sleep" when they are under-utilized
− Sleep time follows an exponential backoff strategy
− Provides the real-time parallel slackness of active cores => calculation of the shared board static power attributed to each running task
• Module 4: Task Mapping Algorithm (per-task level)
For a given configuration (start core, number of cores):
− Performance Tracer => Execution Time Prediction
− Power Profiles => Dynamic Power Prediction
− Power Profiles + Idleness Tracer => Static Power Prediction
− Energy Prediction = (Static Power + Dynamic Power) x Execution Time
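The energy prediction formula in Module 4 can be sketched in a few lines. The snippet below is an illustration under assumptions, not the XiTAO implementation: the function names, the dictionary layout and all numeric values are hypothetical. It only shows how the three predictions combine and how a lowest-energy configuration could be selected.

```python
def predict_energy(static_power_w, dynamic_power_w, exec_time_s):
    """Energy = (static power + dynamic power) x execution time, in joules."""
    return (static_power_w + dynamic_power_w) * exec_time_s

def pick_lowest_energy(candidates):
    """candidates maps (start_core, num_cores) -> (static_w, dynamic_w, time_s)."""
    return min(candidates, key=lambda cfg: predict_energy(*candidates[cfg]))

# Made-up predictions for one task on three candidate configurations.
candidates = {
    (0, 1): (0.5, 1.0, 4.0),
    (0, 2): (0.5, 1.8, 2.2),
    (0, 4): (0.5, 3.5, 1.5),
}
best = pick_lowest_energy(candidates)
print(best, predict_energy(*candidates[best]))
```

With these numbers the widest configuration is the fastest but not the cheapest in energy, which is the trade-off the per-task mapping is meant to resolve.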
XiTAO: Energy Aware Scheduler
● 31%-74% energy reduction compared to RWS
● 19%-68% energy reduction compared to FCC
● 25%-73% energy reduction compared to LCC
Name | Acronym | Notion
Random Work Stealing (+Sleep) | RWS (+S) | Typical greedy scheduling (enhanced with Sleep)
Fastest Cores with Criticality (+Sleep) | FCC (+S) | Critical tasks are mapped to the set of cores that minimizes execution time and are not allowed work stealing; noncritical tasks follow the parent queue and only search for the number of cores that minimizes the execution time of the task (enhanced with Sleep)
Lowest Cost with Criticality (+Sleep) | LCC (+S) | Same as FCC, except that parallel cost is minimized instead of execution time. Parallel cost means "execution time * number of cores" (enhanced with Sleep)
Lowest Energy without Criticality | LENC | Task scheduling targets lowest energy; no need for criticality awareness
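The difference between the FCC and LCC objectives can be made concrete with a small sketch. The function names and the predicted times below are hypothetical. Only the two criteria, minimum execution time versus minimum "execution time * number of cores", come from the table.

```python
def fastest_width(times):
    """FCC-style choice: the number of cores with the lowest execution time."""
    return min(times, key=lambda width: times[width])

def lowest_cost_width(times):
    """LCC-style choice: the number of cores with the lowest parallel cost."""
    return min(times, key=lambda width: times[width] * width)

# Made-up predicted execution times (seconds) per number of cores.
predicted = {1: 8.0, 2: 4.5, 4: 3.0}
print(fastest_width(predicted), lowest_cost_width(predicted))
```

When a task scales sub-linearly, as in this made-up profile, the two criteria diverge: the fastest choice uses the most cores, while the lowest parallel cost favors fewer.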
XiTAO: Software & Hardware Topologies
• Mapping logical data locations to physical locations (to create a model per locality)
• The Software Topology Address (STA) is a portable key that the XiTAO runtime interprets to map a task to a place.
• Example: a space-filling order is used as an STA, transforming coordinates to an integer for Cartesian inputs. The paper includes other examples of such keys.
• This STA-to-location mapping is leveraged to model the performance per task's data locality
• A performance model is created per (STA, task_type) tuple
• The energy-aware model could potentially be used here.
• Example system's elastic partitions to be used by the model
[Figure: STA, train, Sched]
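As one possible illustration of a space-filling order used as an STA, the sketch below interleaves the bits of 2D Cartesian coordinates into a single integer (Morton, or Z-order, encoding). The slides do not say which curve XiTAO uses, so the choice of Morton order, the function name `morton2d` and the 16-bit coordinate limit are assumptions made for this example.

```python
def morton2d(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into one integer key."""
    key = 0
    for bit in range(bits):
        key |= ((x >> bit) & 1) << (2 * bit)      # x fills the even bit positions
        key |= ((y >> bit) & 1) << (2 * bit + 1)  # y fills the odd bit positions
    return key

# Neighbouring coordinates tend to receive nearby keys.
for point in [(0, 0), (1, 0), (0, 1), (1, 1), (2, 3)]:
    print(point, morton2d(*point))
```

Because nearby coordinates tend to map to nearby keys, such an integer can act as a portable locality hint that a runtime is free to map onto its own memory hierarchy.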
XiTAO: Model Validation on DAG Chain
• Adaptive resource selection (leader, width) for a cache-intensive task. Green is the NUMA node where the task (depicted by its STA) is initialized
• The scheduler mostly chooses widths 1 and 2 (within the shared L2 cache)
• Adaptive resource selection (leader, width) for a memory-intensive task.
• The scheduler mostly chooses width 12 (a socket encapsulating 2 NUMA nodes)
• Random work-stealing behavior for compute-bound tasks, while preferring larger widths
• Scalability of the model running memory-bound DAG chains. Up to 2.5x speedup with larger task counts
• To validate the STA-driven performance modeling, we
− Test on a 4-socket AMD system (2 NUMA nodes each)
− Print a resource selection trace of a chain of tasks
• The scheduler adapts its behavior: it is locality-aware for memory/cache-intensive tasks and acts as a work-stealing scheduler for compute-bound tasks
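A performance model keyed by the (STA, task_type) tuple, from which a (leader, width) pair is selected, could be organized roughly as follows. This is a minimal sketch under assumptions: the class name `LocalityModel`, the running-mean update rule and the sample timings are hypothetical and are not taken from XiTAO.

```python
from collections import defaultdict

class LocalityModel:
    """Tracks mean execution time per (leader, width) for each (STA, task_type)."""

    def __init__(self):
        # (sta, task_type) -> {(leader, width): [sample_count, mean_seconds]}
        self._table = defaultdict(dict)

    def record(self, sta, task_type, leader, width, seconds):
        entry = self._table[(sta, task_type)].setdefault((leader, width), [0, 0.0])
        entry[0] += 1
        entry[1] += (seconds - entry[1]) / entry[0]  # incremental running mean

    def best(self, sta, task_type):
        """Return the (leader, width) with the lowest mean time, or None if unseen."""
        configs = self._table.get((sta, task_type))
        if not configs:
            return None
        return min(configs, key=lambda cfg: configs[cfg][1])

model = LocalityModel()
model.record(5, "copy", leader=0, width=1, seconds=2.0)
model.record(5, "copy", leader=0, width=1, seconds=4.0)
model.record(5, "copy", leader=0, width=12, seconds=1.0)
print(model.best(5, "copy"))
```

Keeping a separate table per (STA, task_type) is what allows the same task type to prefer different resources depending on where its data was initialized.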
XiTAO: Moldable pipelines for CNNs on heterogeneous edge devices
● A simple template tensor language for developing CNNs.
● XiTAO pipelines are generated using the information provided by the language interface.
● An online training phase determines the optimal pipeline configuration:
− Network layer distribution among pipeline stages
− Resource partitioning among pipeline stages
● The training is led by a search algorithm that uses computational hints provided by the language interface.
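To illustrate what "network layer distribution among pipeline stages" involves, the sketch below exhaustively tries every contiguous split of the layers into a fixed number of stages and keeps the one whose slowest stage is cheapest. It is a simplified stand-in, not the online search from the slides: the per-layer costs stand in for the computational hints, resource partitioning is ignored, and the name `split_layers` is hypothetical.

```python
from itertools import combinations

def split_layers(costs, num_stages):
    """Return (stages, bottleneck) minimizing the summed cost of the slowest stage."""
    n = len(costs)
    best = None
    # Choose num_stages - 1 cut points between consecutive layers.
    for cuts in combinations(range(1, n), num_stages - 1):
        bounds = (0, *cuts, n)
        stages = [list(range(bounds[i], bounds[i + 1])) for i in range(num_stages)]
        bottleneck = max(sum(costs[layer] for layer in stage) for stage in stages)
        if best is None or bottleneck < best[1]:
            best = (stages, bottleneck)
    return best

# Made-up relative costs for a four-layer network split into two stages.
print(split_layers([4, 2, 2, 4], 2))
```

Pipeline throughput is limited by the slowest stage, so balancing stage costs is a natural objective. An online search would replace this enumeration and would also weigh how cores are partitioned among the stages.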
XiTAO: Moldable pipelines for CNNs on heterogeneous edge devices
Network description in template language
main(){
…
Conv1 = CONV(ip, op, weights);
Conv2 = CONV(Conv1, op, weights);
…
network.add(Conv1);
network.add(Conv2);
…
network.execute();
}
FPGA Undervolting
Problem: FPGAs are at least 10X less power-efficient than equivalent ASICs
Goal: Bridge the power-efficiency gap between ASICs and FPGAs by undervolting below the nominal voltage level
• Case Study: Power consumption of neural networks is a main concern
✔ Hardware acceleration: GPUs, FPGAs, and ASICs
Evaluation Setup
✔ 5 Image classification workloads
✔ 3 Xilinx UltraScale+ ZCU102 platforms
✔ 2 On-chip voltage rails
Main Results
✔ Large voltage guardband (i.e., 33%)
✔ >3X power-efficiency gain
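For a rough feel of what a 33% guardband means, the sketch below computes the lowest voltage inside the guardband and the corresponding dynamic power ratio under the textbook CMOS model, in which dynamic power scales with the square of the voltage at fixed frequency. The 0.85 V nominal value is an assumed illustrative number, the function names are hypothetical, and the model ignores static power. The >3X gain reported on the slide is a measured result and is not derived from this calculation.

```python
def guardband_floor(nominal_v, guardband_fraction):
    """Lowest voltage that still lies inside the guardband."""
    return nominal_v * (1.0 - guardband_fraction)

def dynamic_power_ratio(nominal_v, reduced_v):
    """Dynamic power at reduced_v relative to nominal_v, assuming P ~ V^2."""
    return (reduced_v / nominal_v) ** 2

NOMINAL_V = 0.85  # assumed nominal rail voltage, for illustration only
floor_v = guardband_floor(NOMINAL_V, 0.33)
print(round(floor_v, 4), round(dynamic_power_ratio(NOMINAL_V, floor_v), 4))
```

Under this simple model, removing the guardband alone would reduce dynamic power to a little under half of its nominal value.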
Overall Voltage Behavior
Slight variation of voltage behavior across platforms and benchmarks
Guardband
❑ No performance or reliability loss
❑ Added by the vendor to cover worst-case conditions
❑ Large guardband, average of 33%
Critical
❑ A narrow voltage region
❑ Neural network accuracy collapses
Crash
❑ FPGA stops operating
GPU Checkpointing with FTI
● Transparent multi-GPU/multi-node checkpointing
● Parallel streams to improve I/O efficiency
● Fast checksum calculation using a GPU MD5 algorithm
GPU Checkpointing with FTI
● Over 100x speedup with the new GPU MD5 algorithm
● Checkpoint takes less than 1 second
● FPGA checkpoint implementation coming
Thank you!

Más contenido relacionado

La actualidad más candente

Automated Design Space Exploration and Roofline Analysis for FPGA-based HLS A...
Automated Design Space Exploration and Roofline Analysis for FPGA-based HLS A...Automated Design Space Exploration and Roofline Analysis for FPGA-based HLS A...
Automated Design Space Exploration and Roofline Analysis for FPGA-based HLS A...NECST Lab @ Politecnico di Milano
 
Programming Languages & Tools for Higher Performance & Productivity
Programming Languages & Tools for Higher Performance & ProductivityProgramming Languages & Tools for Higher Performance & Productivity
Programming Languages & Tools for Higher Performance & ProductivityLinaro
 
"Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM
"Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM"Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM
"Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARMEdge AI and Vision Alliance
 
FPGAs as Components in Heterogeneous HPC Systems (paraFPGA 2015 keynote)
FPGAs as Components in Heterogeneous HPC Systems (paraFPGA 2015 keynote) FPGAs as Components in Heterogeneous HPC Systems (paraFPGA 2015 keynote)
FPGAs as Components in Heterogeneous HPC Systems (paraFPGA 2015 keynote) Wim Vanderbauwhede
 
A High Performance Heterogeneous FPGA-based Accelerator with PyCoRAM (Runner ...
A High Performance Heterogeneous FPGA-based Accelerator with PyCoRAM (Runner ...A High Performance Heterogeneous FPGA-based Accelerator with PyCoRAM (Runner ...
A High Performance Heterogeneous FPGA-based Accelerator with PyCoRAM (Runner ...Shinya Takamaeda-Y
 
Assisting User’s Transition to Titan’s Accelerated Architecture
Assisting User’s Transition to Titan’s Accelerated ArchitectureAssisting User’s Transition to Titan’s Accelerated Architecture
Assisting User’s Transition to Titan’s Accelerated Architectureinside-BigData.com
 
Moldable pipelines for CNNs on heterogeneous edge devices
Moldable pipelines for CNNs on heterogeneous edge devicesMoldable pipelines for CNNs on heterogeneous edge devices
Moldable pipelines for CNNs on heterogeneous edge devicesLEGATO project
 
Klessydra-T: Designing Configurable Vector Co-Processors for Multi-Threaded E...
Klessydra-T: Designing Configurable Vector Co-Processors for Multi-Threaded E...Klessydra-T: Designing Configurable Vector Co-Processors for Multi-Threaded E...
Klessydra-T: Designing Configurable Vector Co-Processors for Multi-Threaded E...RISC-V International
 
Demosaic RTL for ISP workflow
Demosaic RTL for ISP workflowDemosaic RTL for ISP workflow
Demosaic RTL for ISP workflowMaikon
 
Device Data Directory and Asynchronous execution: A path to heterogeneous com...
Device Data Directory and Asynchronous execution: A path to heterogeneous com...Device Data Directory and Asynchronous execution: A path to heterogeneous com...
Device Data Directory and Asynchronous execution: A path to heterogeneous com...LEGATO project
 
PyCoRAM: Yet Another Implementation of CoRAM Memory Architecture for Modern F...
PyCoRAM: Yet Another Implementation of CoRAM Memory Architecture for Modern F...PyCoRAM: Yet Another Implementation of CoRAM Memory Architecture for Modern F...
PyCoRAM: Yet Another Implementation of CoRAM Memory Architecture for Modern F...Shinya Takamaeda-Y
 
Iaetsd multioperand redundant adders on fpg as
Iaetsd multioperand redundant adders on fpg asIaetsd multioperand redundant adders on fpg as
Iaetsd multioperand redundant adders on fpg asIaetsd Iaetsd
 
SX Aurora TSUBASA (Vector Engine) a Brand-new Vector Supercomputing power in...
SX Aurora TSUBASA  (Vector Engine) a Brand-new Vector Supercomputing power in...SX Aurora TSUBASA  (Vector Engine) a Brand-new Vector Supercomputing power in...
SX Aurora TSUBASA (Vector Engine) a Brand-new Vector Supercomputing power in...inside-BigData.com
 
Accelerate Your Python* Code through Profiling, Tuning, and Compilation Part ...
Accelerate Your Python* Code through Profiling, Tuning, and Compilation Part ...Accelerate Your Python* Code through Profiling, Tuning, and Compilation Part ...
Accelerate Your Python* Code through Profiling, Tuning, and Compilation Part ...Intel® Software
 
Extracting a Rails Engine to a separated application
Extracting a Rails Engine to a separated applicationExtracting a Rails Engine to a separated application
Extracting a Rails Engine to a separated applicationJônatas Paganini
 
A CGRA-based Approach for Accelerating Convolutional Neural Networks
A CGRA-based Approachfor Accelerating Convolutional Neural NetworksA CGRA-based Approachfor Accelerating Convolutional Neural Networks
A CGRA-based Approach for Accelerating Convolutional Neural NetworksShinya Takamaeda-Y
 
An open flow for dn ns on ultra low-power RISC-V cores
An open flow for dn ns on ultra low-power RISC-V coresAn open flow for dn ns on ultra low-power RISC-V cores
An open flow for dn ns on ultra low-power RISC-V coresRISC-V International
 

La actualidad más candente (20)

IBM HPC Transformation with AI
IBM HPC Transformation with AI IBM HPC Transformation with AI
IBM HPC Transformation with AI
 
Automated Design Space Exploration and Roofline Analysis for FPGA-based HLS A...
Automated Design Space Exploration and Roofline Analysis for FPGA-based HLS A...Automated Design Space Exploration and Roofline Analysis for FPGA-based HLS A...
Automated Design Space Exploration and Roofline Analysis for FPGA-based HLS A...
 
Programming Languages & Tools for Higher Performance & Productivity
Programming Languages & Tools for Higher Performance & ProductivityProgramming Languages & Tools for Higher Performance & Productivity
Programming Languages & Tools for Higher Performance & Productivity
 
"Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM
"Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM"Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM
"Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM
 
Bs25412419
Bs25412419Bs25412419
Bs25412419
 
FPGAs as Components in Heterogeneous HPC Systems (paraFPGA 2015 keynote)
FPGAs as Components in Heterogeneous HPC Systems (paraFPGA 2015 keynote) FPGAs as Components in Heterogeneous HPC Systems (paraFPGA 2015 keynote)
FPGAs as Components in Heterogeneous HPC Systems (paraFPGA 2015 keynote)
 
OpenPOWER System Marconi100
OpenPOWER System Marconi100OpenPOWER System Marconi100
OpenPOWER System Marconi100
 
A High Performance Heterogeneous FPGA-based Accelerator with PyCoRAM (Runner ...
A High Performance Heterogeneous FPGA-based Accelerator with PyCoRAM (Runner ...A High Performance Heterogeneous FPGA-based Accelerator with PyCoRAM (Runner ...
A High Performance Heterogeneous FPGA-based Accelerator with PyCoRAM (Runner ...
 
Assisting User’s Transition to Titan’s Accelerated Architecture
Assisting User’s Transition to Titan’s Accelerated ArchitectureAssisting User’s Transition to Titan’s Accelerated Architecture
Assisting User’s Transition to Titan’s Accelerated Architecture
 
Moldable pipelines for CNNs on heterogeneous edge devices
Moldable pipelines for CNNs on heterogeneous edge devicesMoldable pipelines for CNNs on heterogeneous edge devices
Moldable pipelines for CNNs on heterogeneous edge devices
 
Klessydra-T: Designing Configurable Vector Co-Processors for Multi-Threaded E...
Klessydra-T: Designing Configurable Vector Co-Processors for Multi-Threaded E...Klessydra-T: Designing Configurable Vector Co-Processors for Multi-Threaded E...
Klessydra-T: Designing Configurable Vector Co-Processors for Multi-Threaded E...
 
Demosaic RTL for ISP workflow
Demosaic RTL for ISP workflowDemosaic RTL for ISP workflow
Demosaic RTL for ISP workflow
 
Device Data Directory and Asynchronous execution: A path to heterogeneous com...
Device Data Directory and Asynchronous execution: A path to heterogeneous com...Device Data Directory and Asynchronous execution: A path to heterogeneous com...
Device Data Directory and Asynchronous execution: A path to heterogeneous com...
 
PyCoRAM: Yet Another Implementation of CoRAM Memory Architecture for Modern F...
PyCoRAM: Yet Another Implementation of CoRAM Memory Architecture for Modern F...PyCoRAM: Yet Another Implementation of CoRAM Memory Architecture for Modern F...
PyCoRAM: Yet Another Implementation of CoRAM Memory Architecture for Modern F...
 
Iaetsd multioperand redundant adders on fpg as
Iaetsd multioperand redundant adders on fpg asIaetsd multioperand redundant adders on fpg as
Iaetsd multioperand redundant adders on fpg as
 
SX Aurora TSUBASA (Vector Engine) a Brand-new Vector Supercomputing power in...
SX Aurora TSUBASA  (Vector Engine) a Brand-new Vector Supercomputing power in...SX Aurora TSUBASA  (Vector Engine) a Brand-new Vector Supercomputing power in...
SX Aurora TSUBASA (Vector Engine) a Brand-new Vector Supercomputing power in...
 
Accelerate Your Python* Code through Profiling, Tuning, and Compilation Part ...
Accelerate Your Python* Code through Profiling, Tuning, and Compilation Part ...Accelerate Your Python* Code through Profiling, Tuning, and Compilation Part ...
Accelerate Your Python* Code through Profiling, Tuning, and Compilation Part ...
 
Extracting a Rails Engine to a separated application
Extracting a Rails Engine to a separated applicationExtracting a Rails Engine to a separated application
Extracting a Rails Engine to a separated application
 
A CGRA-based Approach for Accelerating Convolutional Neural Networks
A CGRA-based Approachfor Accelerating Convolutional Neural NetworksA CGRA-based Approachfor Accelerating Convolutional Neural Networks
A CGRA-based Approach for Accelerating Convolutional Neural Networks
 
An open flow for dn ns on ultra low-power RISC-V cores
An open flow for dn ns on ultra low-power RISC-V coresAn open flow for dn ns on ultra low-power RISC-V cores
An open flow for dn ns on ultra low-power RISC-V cores
 

Similar a LEGaTO: Software Stack Runtimes

byteLAKE's Alveo FPGA Solutions
byteLAKE's Alveo FPGA SolutionsbyteLAKE's Alveo FPGA Solutions
byteLAKE's Alveo FPGA SolutionsbyteLAKE
 
Exploring the Performance Impact of Virtualization on an HPC Cloud
Exploring the Performance Impact of Virtualization on an HPC CloudExploring the Performance Impact of Virtualization on an HPC Cloud
Exploring the Performance Impact of Virtualization on an HPC CloudRyousei Takano
 
DATE 2020: Design, Automation and Test in Europe Conference
DATE 2020: Design, Automation and Test in Europe ConferenceDATE 2020: Design, Automation and Test in Europe Conference
DATE 2020: Design, Automation and Test in Europe ConferenceLEGATO project
 
Introduction to FPGA acceleration
Introduction to FPGA accelerationIntroduction to FPGA acceleration
Introduction to FPGA accelerationMarco77328
 
The CAOS framework: democratize the acceleration of compute intensive applica...
The CAOS framework: democratize the acceleration of compute intensive applica...The CAOS framework: democratize the acceleration of compute intensive applica...
The CAOS framework: democratize the acceleration of compute intensive applica...NECST Lab @ Politecnico di Milano
 
electronics-11-03883.pdf
electronics-11-03883.pdfelectronics-11-03883.pdf
electronics-11-03883.pdfRioCarthiis
 
186 devlin p-poster(2)
186 devlin p-poster(2)186 devlin p-poster(2)
186 devlin p-poster(2)vaidehi87
 
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...Intel® Software
 
Programmable Exascale Supercomputer
Programmable Exascale SupercomputerProgrammable Exascale Supercomputer
Programmable Exascale SupercomputerSagar Dolas
 
SF-TAP: Scalable and Flexible Traffic Analysis Platform (USENIX LISA 2015)
SF-TAP: Scalable and Flexible Traffic Analysis Platform (USENIX LISA 2015)SF-TAP: Scalable and Flexible Traffic Analysis Platform (USENIX LISA 2015)
SF-TAP: Scalable and Flexible Traffic Analysis Platform (USENIX LISA 2015)Yuuki Takano
 
The CAOS framework: Democratize the acceleration of compute intensive applica...
The CAOS framework: Democratize the acceleration of compute intensive applica...The CAOS framework: Democratize the acceleration of compute intensive applica...
The CAOS framework: Democratize the acceleration of compute intensive applica...NECST Lab @ Politecnico di Milano
 
SAMOS 2018: LEGaTO: first steps towards energy-efficient toolset for heteroge...
SAMOS 2018: LEGaTO: first steps towards energy-efficient toolset for heteroge...SAMOS 2018: LEGaTO: first steps towards energy-efficient toolset for heteroge...
SAMOS 2018: LEGaTO: first steps towards energy-efficient toolset for heteroge...LEGATO project
 
Realizing Robust and Scalable Evolutionary Algorithms toward Exascale Era
Realizing Robust and Scalable Evolutionary Algorithms toward Exascale EraRealizing Robust and Scalable Evolutionary Algorithms toward Exascale Era
Realizing Robust and Scalable Evolutionary Algorithms toward Exascale EraMasaharu Munetomo
 
Barcelona Supercomputing Center, Generador de Riqueza
Barcelona Supercomputing Center, Generador de RiquezaBarcelona Supercomputing Center, Generador de Riqueza
Barcelona Supercomputing Center, Generador de RiquezaFacultad de Informática UCM
 
Track A-Compilation guiding and adjusting - IBM
Track A-Compilation guiding and adjusting - IBMTrack A-Compilation guiding and adjusting - IBM
Track A-Compilation guiding and adjusting - IBMchiportal
 
hetshah_resume
hetshah_resumehetshah_resume
hetshah_resumehet shah
 
Performance Optimization of SPH Algorithms for Multi/Many-Core Architectures
Performance Optimization of SPH Algorithms for Multi/Many-Core ArchitecturesPerformance Optimization of SPH Algorithms for Multi/Many-Core Architectures
Performance Optimization of SPH Algorithms for Multi/Many-Core ArchitecturesDr. Fabio Baruffa
 
6Tisch telecom_bretagne_2016
6Tisch telecom_bretagne_20166Tisch telecom_bretagne_2016
6Tisch telecom_bretagne_2016Pascal Thubert
 

Similar a LEGaTO: Software Stack Runtimes (20)

byteLAKE's Alveo FPGA Solutions
byteLAKE's Alveo FPGA SolutionsbyteLAKE's Alveo FPGA Solutions
byteLAKE's Alveo FPGA Solutions
 
Exploring the Performance Impact of Virtualization on an HPC Cloud
Exploring the Performance Impact of Virtualization on an HPC CloudExploring the Performance Impact of Virtualization on an HPC Cloud
Exploring the Performance Impact of Virtualization on an HPC Cloud
 
DATE 2020: Design, Automation and Test in Europe Conference
DATE 2020: Design, Automation and Test in Europe ConferenceDATE 2020: Design, Automation and Test in Europe Conference
DATE 2020: Design, Automation and Test in Europe Conference
 
Introduction to FPGA acceleration
Introduction to FPGA accelerationIntroduction to FPGA acceleration
Introduction to FPGA acceleration
 
The CAOS framework: democratize the acceleration of compute intensive applica...
The CAOS framework: democratize the acceleration of compute intensive applica...The CAOS framework: democratize the acceleration of compute intensive applica...
The CAOS framework: democratize the acceleration of compute intensive applica...
 
electronics-11-03883.pdf
electronics-11-03883.pdfelectronics-11-03883.pdf
electronics-11-03883.pdf
 
186 devlin p-poster(2)
186 devlin p-poster(2)186 devlin p-poster(2)
186 devlin p-poster(2)
 
FPGA In a Nutshell
FPGA In a NutshellFPGA In a Nutshell
FPGA In a Nutshell
 
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...
 
Programmable Exascale Supercomputer
Programmable Exascale SupercomputerProgrammable Exascale Supercomputer
Programmable Exascale Supercomputer
 
SF-TAP: Scalable and Flexible Traffic Analysis Platform (USENIX LISA 2015)
SF-TAP: Scalable and Flexible Traffic Analysis Platform (USENIX LISA 2015)SF-TAP: Scalable and Flexible Traffic Analysis Platform (USENIX LISA 2015)
SF-TAP: Scalable and Flexible Traffic Analysis Platform (USENIX LISA 2015)
 
The CAOS framework: Democratize the acceleration of compute intensive applica...
The CAOS framework: Democratize the acceleration of compute intensive applica...The CAOS framework: Democratize the acceleration of compute intensive applica...
The CAOS framework: Democratize the acceleration of compute intensive applica...
 
SAMOS 2018: LEGaTO: first steps towards energy-efficient toolset for heteroge...
SAMOS 2018: LEGaTO: first steps towards energy-efficient toolset for heteroge...SAMOS 2018: LEGaTO: first steps towards energy-efficient toolset for heteroge...
SAMOS 2018: LEGaTO: first steps towards energy-efficient toolset for heteroge...
 
Realizing Robust and Scalable Evolutionary Algorithms toward Exascale Era
Realizing Robust and Scalable Evolutionary Algorithms toward Exascale EraRealizing Robust and Scalable Evolutionary Algorithms toward Exascale Era
Realizing Robust and Scalable Evolutionary Algorithms toward Exascale Era
 
Barcelona Supercomputing Center, Generador de Riqueza
Barcelona Supercomputing Center, Generador de RiquezaBarcelona Supercomputing Center, Generador de Riqueza
Barcelona Supercomputing Center, Generador de Riqueza
 
Track A-Compilation guiding and adjusting - IBM
Track A-Compilation guiding and adjusting - IBMTrack A-Compilation guiding and adjusting - IBM
Track A-Compilation guiding and adjusting - IBM
 
Exascale Capabl
Exascale CapablExascale Capabl
Exascale Capabl
 
hetshah_resume
hetshah_resumehetshah_resume
hetshah_resume
 
Performance Optimization of SPH Algorithms for Multi/Many-Core Architectures
Performance Optimization of SPH Algorithms for Multi/Many-Core ArchitecturesPerformance Optimization of SPH Algorithms for Multi/Many-Core Architectures
Performance Optimization of SPH Algorithms for Multi/Many-Core Architectures
 
6Tisch telecom_bretagne_2016
6Tisch telecom_bretagne_20166Tisch telecom_bretagne_2016
6Tisch telecom_bretagne_2016
 

Más de LEGATO project

Scrooge Attack: Undervolting ARM Processors for Profit
Scrooge Attack: Undervolting ARM Processors for ProfitScrooge Attack: Undervolting ARM Processors for Profit
Scrooge Attack: Undervolting ARM Processors for ProfitLEGATO project
 
A practical approach for updating an integrity-enforced operating system
A practical approach for updating an integrity-enforced operating systemA practical approach for updating an integrity-enforced operating system
A practical approach for updating an integrity-enforced operating systemLEGATO project
 
TEEMon: A continuous performance monitoring framework for TEEs
TEEMon: A continuous performance monitoring framework for TEEsTEEMon: A continuous performance monitoring framework for TEEs
TEEMon: A continuous performance monitoring framework for TEEsLEGATO project
 
secureTF: A Secure TensorFlow Framework
secureTF: A Secure TensorFlow FrameworksecureTF: A Secure TensorFlow Framework
secureTF: A Secure TensorFlow FrameworkLEGATO project
 
PipeTune: Pipeline Parallelism of Hyper and System Parameters Tuning for Deep...
PipeTune: Pipeline Parallelism of Hyper and System Parameters Tuning for Deep...PipeTune: Pipeline Parallelism of Hyper and System Parameters Tuning for Deep...
PipeTune: Pipeline Parallelism of Hyper and System Parameters Tuning for Deep...LEGATO project
 
LEGaTO: Machine Learning Use Case
LEGaTO: Machine Learning Use CaseLEGaTO: Machine Learning Use Case
LEGaTO: Machine Learning Use CaseLEGATO project
 
Smart Home AI at the edge
Smart Home AI at the edgeSmart Home AI at the edge
Smart Home AI at the edgeLEGATO project
 
LEGaTO: Low-Energy Heterogeneous Computing Use of AI in the project
LEGaTO: Low-Energy Heterogeneous Computing Use of AI in the projectLEGaTO: Low-Energy Heterogeneous Computing Use of AI in the project
LEGaTO: Low-Energy Heterogeneous Computing Use of AI in the projectLEGATO project
 
LEGaTO Heterogeneous Hardware
LEGaTO Heterogeneous HardwareLEGaTO Heterogeneous Hardware
LEGaTO Heterogeneous HardwareLEGATO project
 
LEGaTO: Low-Energy Heterogeneous Computing Workshop
LEGaTO: Low-Energy Heterogeneous Computing WorkshopLEGaTO: Low-Energy Heterogeneous Computing Workshop
LEGaTO: Low-Energy Heterogeneous Computing WorkshopLEGATO project
 
TZ4Fabric: Executing Smart Contracts with ARM TrustZone
TZ4Fabric: Executing Smart Contracts with ARM TrustZoneTZ4Fabric: Executing Smart Contracts with ARM TrustZone
TZ4Fabric: Executing Smart Contracts with ARM TrustZoneLEGATO project
 
Infection Research with Maxeler Dataflow Computing
Infection Research with Maxeler Dataflow ComputingInfection Research with Maxeler Dataflow Computing
Infection Research with Maxeler Dataflow ComputingLEGATO project
 
Smart Home - AI at the edge
Smart Home - AI at the edgeSmart Home - AI at the edge
Smart Home - AI at the edgeLEGATO project
 
FPGA Undervolting and Checkpointing for Energy-Efficiency and Error-Resiliency
FPGA Undervolting and Checkpointing for Energy-Efficiency and Error-ResiliencyFPGA Undervolting and Checkpointing for Energy-Efficiency and Error-Resiliency
FPGA Undervolting and Checkpointing for Energy-Efficiency and Error-ResiliencyLEGATO project
 
Scheduling Task-parallel Applications in Dynamically Asymmetric Environments
Scheduling Task-parallel Applications in Dynamically Asymmetric EnvironmentsScheduling Task-parallel Applications in Dynamically Asymmetric Environments
Scheduling Task-parallel Applications in Dynamically Asymmetric EnvironmentsLEGATO project
 
RECS – Cloud to Edge Microserver Platform for Energy-Efficient Computing
RECS – Cloud to Edge Microserver Platform for Energy-Efficient ComputingRECS – Cloud to Edge Microserver Platform for Energy-Efficient Computing
RECS – Cloud to Edge Microserver Platform for Energy-Efficient ComputingLEGATO project
 
Secure Task-Based Programming with OmpSs and SGX
Secure Task-Based Programming with OmpSs and SGXSecure Task-Based Programming with OmpSs and SGX
Secure Task-Based Programming with OmpSs and SGXLEGATO project
 
HiPerMAb: A statistical tool for judging the potential of short fat data
HiPerMAb: A statistical tool for judging the potential of short fat dataHiPerMAb: A statistical tool for judging the potential of short fat data
HiPerMAb: A statistical tool for judging the potential of short fat dataLEGATO project
 


LEGaTO: Software Stack Runtimes

  • 1. LEGaTO: Software Stack Runtimes. HiPEAC 2020 Computer Systems Week, 16-10-2020. Miquel Pericas, Chalmers University of Technology. Slide date stamp: 10/13/20. The LEGaTO project has received funding from the European Union's Horizon 2020 research and innovation programme under the grant agreement No 780681.
  • 2. Outline (HiPEAC CSW Autumn 2020)
      • Middleware: SLURM and RedFish
      • OmpSs@FPGA (Xavier)
      • XiTAO:
          − Introduction: XiTAO execution model
          − Energy Aware Scheduler
          − Software Topologies
          − Pipeline parallelism
      • FPGA Undervolting
      • Fault tolerance: GPU Checkpointing
  • 3. Slurm and RECS Master
      • Integration of Slurm with RECS Master
          o Node specification in the Slurm configuration (partitions, limits…)
          o Slurm gets the node specification and selects target nodes
          o Allocates, joins and starts nodes
          o Executes the application(s)
          o Shuts down nodes and destroys the allocation
      $ sinfo
      PART… AVAIL LIMIT NODES STATE NODELIST
      debug* up infinite 1 idle* pcxavim5
      debug* up infinite 16 idle BB_1_[0,2-15],pcxavim6
  • 4. Slurm and RECS Master
      • Slurm contacts RECS Master at job execution and termination times
      Batch script:
      #!/bin/bash
      #SBATCH -N 10
      #SBATCH --constraint=ARM,bigLITTLE,hasGPU
      #SBATCH -o test-%j.out
      #SBATCH -e test-%j.err
      # App invocation
      Session:
      $ sinfo
      PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
      debug* up infinite 1 idle* pcxavim5
      debug* up infinite 10 alloc BB_1_[0,2-10]
      debug* up infinite 6 idle BB_1_[11-15],pcxavim6
      $ sbatch batch-10-bl.sh
      Submitted batch job 39
      $ squeue
      JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
      39 debug batch-10 xavim R 0:42 10 BB_1_[0,2-10]
  • 5. Slurm and RECS Master
      • Composed nodes are created using the RECS Master webservice
      • They are started and stopped automatically
      [Figure annotation: 10 nodes are turned on]
  • 6. OmpSs@FPGA
      ● Offload of matrix multiplication to FPGA
      #pragma omp target device(fpga) num_instances(3)
      #pragma omp task in([BSIZE*BSIZE]a, [BSIZE*BSIZE]b) inout([BSIZE*BSIZE]c)
      void matmulBlock(const elem_t *a, const elem_t *b, elem_t *c)
      {
      #pragma HLS INLINE // off
      #pragma HLS array_partition variable=a cyclic factor=4
      #pragma HLS array_partition variable=b cyclic factor=BSIZE/4
      #pragma HLS array_partition variable=c cyclic factor=BSIZE/2
        for (int k = 0; k < BSIZE; ++k) {
          …
        }
      }
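The slide elides the loop body of `matmulBlock`. As a rough reference for what such a block kernel typically computes, the following Python sketch accumulates `c += a * b` on one row-major block, which matches the `in(a, b)` and `inout(c)` annotations. The loop body, the small `BSIZE`, and the function name `matmul_block` are assumptions for illustration, not the code from the slides.

```python
BSIZE = 4  # assumed small block size for illustration; the slides use 256x256 blocks


def matmul_block(a, b, c):
    """Accumulate c += a * b for one BSIZE x BSIZE block.

    a, b and c are flat, row-major lists of length BSIZE * BSIZE.
    c is read and written, mirroring the inout(c) dependence.
    """
    for k in range(BSIZE):          # k is the outermost loop, as on the slide
        for i in range(BSIZE):
            a_ik = a[i * BSIZE + k]
            for j in range(BSIZE):
                c[i * BSIZE + j] += a_ik * b[k * BSIZE + j]


# Usage: multiplying by the identity block leaves b unchanged.
identity = [1 if i // BSIZE == i % BSIZE else 0 for i in range(BSIZE * BSIZE)]
b_block = list(range(BSIZE * BSIZE))
c_block = [0] * (BSIZE * BSIZE)
matmul_block(identity, b_block, c_block)
```

Because `c` is accumulated rather than overwritten, calling the kernel repeatedly over the `k` blocks of a larger matrix sums the partial products into the same output block.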
  • 7. OmpSs@FPGA
      ● Acceleration of matrix multiplication on FPGAs
          − 4 ARM cores (OpenBLAS)
          − 1 to 3 IP cores
      ● Block size 256x256
      [Figure: Matrix multiply, energy efficiency. Configurations: 4 ARM cores, 1 IP core, 2 IP cores, 3 IP cores. Axes: GFlops and GFlops/W.]
      ● Best performance: 3 IP cores
      ● Best energy efficiency: 2 IP cores
  • 8. XiTAO: Energy Aware Scheduler
      • Module 1: Power Profiling
          • helps the runtime understand CPU power consumption trends (number/type of cores, different frequencies)
      • Module 2: Dynamic Performance Modeling
          • provides accurate predictions for future tasks given a set of resources
          • independent of platforms and frequencies
          • achieves scalability and portability goals
      • Module 3: Idleness Tracing
          • reports the real-time status of cores
          • puts cores to "sleep" when they are under-utilized
          • sleep time follows an exponential backoff strategy
          • provides the real-time parallel slackness of active cores, which enables calculating the shared board static power for each running task
      • Module 4: Task Mapping Algorithm (per task). For a given configuration (start core, number of cores):
          • Performance Tracer => Execution Time Prediction
          • Power Profiles => Dynamic Power Prediction
          • Power Profiles + Idleness Tracer => Static Power Prediction
          • Energy Prediction = (Static Power + Dynamic Power) x Execution Time
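The energy formula of Module 4 can be illustrated with a short sketch. It assumes the three predictions (static power, dynamic power, execution time) are already available per candidate configuration, and simply picks the configuration with the lowest predicted energy. The names `Config`, `predict_energy`, `pick_lowest_energy` and all numeric values are hypothetical, not part of the XiTAO API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    start_core: int  # first core of the resource partition
    width: int       # number of cores


def predict_energy(static_power_w, dynamic_power_w, exec_time_s):
    """Energy Prediction = (Static Power + Dynamic Power) x Execution Time."""
    return (static_power_w + dynamic_power_w) * exec_time_s


def pick_lowest_energy(predictions):
    """Return the configuration with the lowest predicted energy.

    predictions maps Config -> (static_power_w, dynamic_power_w, exec_time_s).
    """
    return min(predictions, key=lambda cfg: predict_energy(*predictions[cfg]))


# Hypothetical predictions for one task on three candidate configurations.
predictions = {
    Config(0, 1): (0.5, 1.0, 4.0),  # 6.0 J
    Config(0, 2): (0.5, 1.8, 2.2),  # about 5.06 J
    Config(0, 4): (0.5, 3.9, 1.5),  # about 6.6 J
}
best = pick_lowest_energy(predictions)
```

With these made-up numbers the two-core configuration wins: it is neither the fastest nor the lowest-power option, but it minimizes the product of power and time.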
  • 9. XiTAO: Energy Aware Scheduler
      ● 31%-74% energy reduction compared with RWS
      ● 19%-68% energy reduction compared with FCC
      ● 25%-73% energy reduction compared with LCC
      Name | Acronym | Notion
      Random Work Stealing (+Sleep) | RWS (+S) | Typical greedy scheduling (enhanced with Sleep)
      Fastest Cores with Criticality (+Sleep) | FCC (+S) | Critical tasks are mapped to the set of cores that minimizes execution time and are not allowed to be stolen; non-critical tasks follow the parent queue and only search for the number of cores that minimizes the execution time of the task (enhanced with Sleep)
      Lowest Cost with Criticality (+Sleep) | LCC (+S) | Same as FCC, except that minimizing execution time becomes minimizing parallel cost, where parallel cost means "execution time * number of cores" (enhanced with Sleep)
      Lowest Energy without Criticality | LENC | Task scheduling targets lowest energy, no need for criticality awareness
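The difference between the FCC and LCC selection criteria can be shown on a small example. The sketch below assumes a table of predicted execution times per width (number of cores) and compares the width that minimizes execution time with the width that minimizes parallel cost. The function names and timing values are hypothetical.

```python
def fastest_width(time_by_width):
    """FCC-style choice: the width with the lowest predicted execution time."""
    return min(time_by_width, key=time_by_width.get)


def lowest_cost_width(time_by_width):
    """LCC-style choice: the width with the lowest parallel cost.

    Parallel cost is execution time * number of cores.
    """
    return min(time_by_width, key=lambda width: time_by_width[width] * width)


# Hypothetical predicted execution times (seconds) for widths 1, 2 and 4.
times = {1: 8.0, 2: 4.5, 4: 3.0}
fcc_choice = fastest_width(times)      # width 4 has the shortest time
lcc_choice = lowest_cost_width(times)  # costs are 8.0, 9.0 and 12.0
```

In this example the task does not scale perfectly, so the fastest option uses four cores while the lowest parallel cost is obtained on a single core.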
  • 10. XiTAO: Software & Hardware Topologies
      • Mapping logical data locations to physical locations (to create a model per locality)
      • The Software Topology Address (STA) is a portable key that the XiTAO runtime interprets to map a task to a place.
      • Example: a space-filling order is used as an STA, transforming coordinates to an integer for Cartesian inputs. The paper includes other examples of such keys.
      • This STA-to-location mapping is leveraged to model performance per task's data locality
      • A performance model is created per (STA, task_type) tuple
      • The energy-aware model could potentially be used here.
      [Figure labels: STA, train, Sched. Caption: Example system's elastic partitions to be used by the model]
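As an illustration of turning Cartesian coordinates into an integer key, the sketch below uses a Morton (Z-order) curve. The slides only say "space filling order", so the choice of Morton order is an assumption, as are the function names and the proportional key-to-place mapping.

```python
def morton_key(x, y, bits=16):
    """Interleave the bits of x and y into a single integer (Z-order)."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)       # x bits go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)   # y bits go to odd positions
    return key


def key_to_place(key, num_keys, num_places):
    """Hypothetical mapping of a key range onto equally sized places."""
    return key * num_places // num_keys


# Usage: a 4x4 grid of data blocks (16 keys) mapped onto 4 places.
sta = morton_key(2, 0)             # 4
place = key_to_place(sta, 16, 4)   # 1
```

Neighbouring coordinates tend to receive nearby keys, so contiguous key ranges correspond to spatially compact groups of blocks, which is what makes such a key usable as a locality hint.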
  • 11. XiTAO: Model Validation on DAG Chain
      • To validate the STA-driven performance modeling, we
          − Test on a 4-socket AMD system (2 NUMA nodes each)
          − Print a resource selection trace of a chain of tasks
      • The scheduler adaptively behaves as locality-aware for memory/cache-intensive tasks, and as a work-stealing scheduler for compute-bound tasks
      Figure captions:
      • Adaptive resource selection (leader, width) for a cache-intensive task. Green is the NUMA node where the task (depicted by STA) is initialized. The scheduler mostly chooses widths 1 and 2 (within the shared L2 cache)
      • Adaptive resource selection (leader, width) for a memory-intensive task. The scheduler mostly chooses width 12 (a socket encapsulating 2 NUMA nodes)
      • Random work-stealing behavior for compute-bound tasks while preferring larger widths
      • Scalability of the model running memory-bound DAG chains. Up to 2.5x speedup with larger task count
  • 12. XiTAO: Moldable pipelines for CNNs on heterogeneous edge devices
      ● A simple template tensor language to develop CNNs.
      ● XiTAO pipelines are generated using the information provided by the language interface.
      ● An online training phase determines the optimal pipeline configuration:
          • Network layer distribution among pipeline stages
          • Resource partitioning among pipeline stages
      ● The training is led by a search algorithm that uses computational hints provided by the language interface.
  • 13. XiTAO: Moldable pipelines for CNNs on heterogeneous edge devices
      Network description in template language:
      main(){
        …
        Conv1 = CONV(ip, op, weights);
        Conv2 = CONV(Conv1, op, weights);
        ….
        network.add(Conv1);
        network.add(Conv2);
        …
        network.execute();
      }
  • 14. FPGA Undervolting
      Problem: FPGAs are at least 10X less power-efficient than equivalent ASICs
      Goal: Bridge the power-efficiency gap between ASICs and FPGAs by undervolting below the nominal level
      • Case study: power consumption of neural networks is a main concern
          ✔ Hardware acceleration: GPUs, FPGAs, and ASICs
      Evaluation setup
          ✔ 5 image classification workloads
          ✔ 3 Xilinx UltraScale+ ZCU102 platforms
          ✔ 2 on-chip voltage rails
      Main results
          ✔ Large voltage guardband (i.e., 33%)
          ✔ >3X power-efficiency gain
  • 15. Overall Voltage Behavior
      Slight variation of voltage behavior across platforms and benchmarks
      Guardband
          ❑ No performance or reliability loss
          ❑ Added by the vendor to ensure correct operation under worst-case conditions
          ❑ Large guardband, average of 33%
      Critical
          ❑ A narrow voltage region
          ❑ Neural network accuracy collapses
      Crash
          ❑ FPGA stops operating
  • 16. GPU Checkpointing with FTI
      ● Transparent multi-GPU/multi-node checkpointing
      ● Parallel streams to improve I/O efficiency
      ● Fast checksum calculation using a GPU MD5 algorithm
  • 17. GPU Checkpointing with FTI
      ● Over 100x speedup with the new GPU MD5 algorithm
      ● Checkpoint takes less than 1 second
      ● FPGA checkpoint implementation coming
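The role of the checksum in checkpointing can be sketched on the CPU with Python's standard `hashlib`: an MD5 digest is computed over the checkpoint data in chunks and compared again when the checkpoint is read back. This is only an illustration of the integrity check, not FTI's API or its GPU implementation; the function names and the sample data are hypothetical.

```python
import hashlib
import io


def stream_md5(stream, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a binary stream, one chunk at a time."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def verify_checkpoint(stream, expected_md5):
    """Return True if the stream content matches the stored checksum."""
    return stream_md5(stream) == expected_md5


# Usage with an in-memory stand-in for a checkpoint file.
data = b"abc"
stored = stream_md5(io.BytesIO(data))
ok = verify_checkpoint(io.BytesIO(data), stored)
corrupted = verify_checkpoint(io.BytesIO(b"abd"), stored)
```

Reading in fixed-size chunks keeps memory use bounded for large checkpoints; the same digest results regardless of the chunk size chosen.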