Device Data Directory and Asynchronous execution: A path to heterogeneous computing with OmpSs-2


LEGATO project

Poster presented by Rubén Cano at the LEGaTO Final Event: 'Low-Energy Heterogeneous Computing Workshop'


The LEGaTO project has received funding from the European Union’s
Horizon 2020 research and innovation programme under the grant
agreement No 780681.
www.legato-project.eu
Device Data Directory and Asynchronous execution:
A path to heterogeneous computing with OmpSs-2
Rubén Cano, Carlos Álvarez, Daniel Jiménez-González, Xavier Martorell
Barcelona Supercomputing Center and Universitat Politècnica de Catalunya
Benchmarking
[Figure: GPU benchmark. Algorithm: Matrix Multiply. Configuration: Matrix Size 16384]
[Figure: FPGA benchmark. Algorithm: Matrix Multiply. Configuration: Block Size 256]
Conclusions
• Better performance than the hardware-based CUDA Unified Memory approach.
• Improved support for range dependences and multi-device copies, with the same performance as the OmpSs runtime.
• A framework for easily adapting any new device to the OmpSs-2 runtime.
References
• OmpSs-2 Programming Model: https://pm.bsc.es/ompss-2
• Nanos6 repository: https://github.com/bsc-pm/nanos6
• OmpSs@FPGA: https://pm.bsc.es/ompss-at-fpga
Task
• Depends: in(x[0; N]), out(y[0; N])
• Kind (FPGA, CUDA…)
• Instance (Device Id)
• Function (Saxpy, matmul…)
Accelerator
• Allocation Engine (allocates, reallocates and frees device memory)
• Directory
• Host-Device range-based mapping cache (keeps track of the validity status of any given region in any device; can translate addresses from host to any device, for a given region)
• Copy Engine (generates copies between devices)
• Symbol-aware mapping (dependencies that are part of the same symbol, but are non-contiguous in memory, will have the same offsets between them in the device mapping)
Stream
• Copies
• Task execution
• Task finalization
Dependency System
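The Directory and symbol-aware mapping described above can be illustrated with a small Python model. This is only a sketch under stated assumptions, not Nanos6 code: the class and method names (`Directory`, `map_symbol`, `translate`, `mark_valid`, `is_valid`) are hypothetical, each symbol is mapped with a single device base address, and a region counts as valid only if one recorded range covers it entirely.

```python
from dataclasses import dataclass, field


@dataclass
class Mapping:
    """One symbol mapped onto one device (all names here are hypothetical)."""
    host_base: int
    device_base: int
    size: int
    # Host ranges [start, end) whose copy on the device is up to date.
    valid: list = field(default_factory=list)


class Directory:
    """Toy host-to-device range-based mapping cache."""

    def __init__(self):
        self._maps = {}  # (device, symbol) -> Mapping

    def map_symbol(self, device, symbol, host_base, device_base, size):
        self._maps[(device, symbol)] = Mapping(host_base, device_base, size)

    def translate(self, device, symbol, host_addr):
        """Translate a host address inside `symbol` to a device address."""
        m = self._maps[(device, symbol)]
        if not (m.host_base <= host_addr < m.host_base + m.size):
            raise ValueError("address outside the mapped symbol")
        # One base per symbol, so offsets between regions are preserved.
        return m.device_base + (host_addr - m.host_base)

    def mark_valid(self, device, symbol, start, end):
        self._maps[(device, symbol)].valid.append((start, end))

    def is_valid(self, device, symbol, start, end):
        """True if a single recorded range fully covers [start, end)."""
        return any(s <= start and end <= e
                   for s, e in self._maps[(device, symbol)].valid)
```

Because translation adds the same per-symbol displacement to every address, two non-contiguous regions of one symbol keep the same distance between them on the device as on the host.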
This work has been supported by the Ministry of Science
and Innovation, under the project "Computación de
Altas Prestaciones VIII" (PID2019-107255GB).
Problem
• Managing memory and communication across different devices is challenging and error-prone.
• Hardware approaches relax the memory model, but not all devices support these mechanisms.
• These mechanisms are usually page-based, which can cause severe performance degradation due to false sharing.
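To make the false-sharing cost concrete, here is a deliberately simplified Python model. It assumes memory is tracked in fixed-size blocks, each owned by one device at a time, and that a write from another device migrates the whole block. The function name `count_migrations` and the access pattern are illustrative assumptions, not measurements from the poster.

```python
def count_migrations(writes, granularity):
    """Count ownership changes for a sequence of (device, address) writes.

    Simplified model (an assumption, not a description of any real driver):
    memory is tracked in blocks of `granularity` bytes, each block is owned
    by one device at a time, and a write from another device migrates it.
    """
    owner = {}
    migrations = 0
    for device, address in writes:
        block = address // granularity
        if block in owner and owner[block] != device:
            migrations += 1
        owner[block] = device
    return migrations


# Two devices alternately write disjoint halves of the same 4 KiB page.
writes = [("gpu0", 0), ("gpu1", 2048)] * 4

page_based = count_migrations(writes, granularity=4096)   # one shared page
range_based = count_migrations(writes, granularity=2048)  # one block per region
```

In this model the page-sized tracking migrates the page on every write after the first (7 migrations for 8 writes), while tracking that matches the two regions never migrates anything, even though the devices never touch each other's data.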
Task is ready to be executed.
Hardware accelerator selected.
Ensure the symbol validity.
Set up a symbol-translation table to translate host pointers to the destination device.
If a symbol is not already valid, enqueue all the copies from a valid address space into the destination device memory.
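The steps above can be sketched in Python as follows. Everything here is a hypothetical model rather than the runtime's actual interface: `prepare_task`, its parameters, and the tuple layout of a copy are assumptions. As a simplification, a symbol is marked valid on the destination as soon as its copy is enqueued, and every symbol is treated as an input.

```python
def prepare_task(symbols, device, valid_in, device_base):
    """Return (translation_table, copies) for a task about to run on `device`.

    symbols:     {name: (host_addr, size)} accessed by the task
    valid_in:    {name: set of address spaces that hold a valid copy}
    device_base: {name: device address already allocated for that symbol}
    All three structures are hypothetical stand-ins for runtime state.
    """
    table = {}
    copies = []
    for name, (host_addr, size) in symbols.items():
        # Symbol-translation table: host pointer -> destination device pointer.
        table[host_addr] = device_base[name]
        if device not in valid_in[name]:
            # Any valid address space can be the source; pick one deterministically.
            source = sorted(valid_in[name])[0]
            copies.append((source, device, name, size))
            # Simplification: mark valid at enqueue time, not at completion.
            valid_in[name].add(device)
    return table, copies
```

A symbol that is already valid on the destination produces a table entry but no copy, which is what lets repeated tasks on the same device reuse data without further transfers.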
Research
Can a software Unified Memory model be faster than current hardware-based solutions?
Proof-of-concept Solution
Device Directory
Unifies the memory model by managing device memories explicitly and ensuring that data is available before execution.
Stream
Unifies the execution model of any device as an asynchronous queue of sequential operations.
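The Stream abstraction can be modeled as a FIFO queue drained by a single worker. The Python sketch below is an illustration under assumptions, not the runtime's implementation: the `Stream` class and its `enqueue` and `synchronize` methods are hypothetical names, a host thread stands in for the device, and error handling is omitted.

```python
import queue
import threading


class Stream:
    """Toy stream: operations are enqueued asynchronously and run in FIFO order."""

    def __init__(self):
        self._ops = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def enqueue(self, op):
        """Return immediately; `op` runs later on the stream's worker thread."""
        self._ops.put(op)

    def synchronize(self):
        """Block until every operation enqueued so far has finished."""
        self._ops.join()

    def _run(self):
        while True:
            op = self._ops.get()
            try:
                op()
            finally:
                self._ops.task_done()


# The three stages of a task, issued back to back without waiting in between.
log = []
stream = Stream()
stream.enqueue(lambda: log.append("copies"))
stream.enqueue(lambda: log.append("task execution"))
stream.enqueue(lambda: log.append("task finalization"))
stream.synchronize()
```

Because a single worker drains the queue, the stages run in the order they were enqueued even though the caller never blocks between them.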
Zynq UltraScale+ 9EG, 3 [256x256] matmul accelerators
IBM Power9 8335-GTH, NVIDIA V100 x1
We would like to thank the Xilinx University Program for software and board donations.
Legend: OmpSs, OmpSs-2
