SC13: OpenMP and NVIDIA

Jeff Larkin

This talk was given at the OpenMP booth at Supercomputing 2013.

OpenMP and NVIDIA
Jeff Larkin
NVIDIA Developer Technologies

OpenMP and NVIDIA
OpenMP is the dominant standard for directive-based parallel
programming.
NVIDIA joined OpenMP in 2011 to contribute to discussions around
parallel accelerators.
NVIDIA proposed the TEAMS construct for accelerators in 2012.
OpenMP 4.0, with accelerator support, was released in 2013.

Why does NVIDIA care about OpenMP?

3 Ways to Accelerate Applications
Libraries: “Drop-in” Acceleration
Compiler Directives: Easily Accelerate Applications
Programming Languages: Maximum Flexibility

Reaching a Broader Set of Developers
[Figure: number of developers over time, growing from 100,000’s (Research, Early Adopters) in 2004 to 1,000,000’s at Present. Application areas shown: Universities, Supercomputing Centers, Oil & Gas; CAE, CFD, Finance, Rendering, Data Analytics, Life Sciences, Defense, Weather, Climate, Plasma Physics.]

Introduction to OpenMP Teams/Distribute

OpenMP PARALLEL FOR
Executes the iterations of the next for loop in parallel across a team of threads.

#pragma omp parallel for
for (i=0; i<N; i++)
    p[i] = v1[i] * v2[i];
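
The slide shows only the loop itself. A minimal sketch of how it might sit inside a complete C function follows; the function name `elementwise_product` and the choice of `double` are assumptions for illustration, not part of the talk. If the compiler is run without OpenMP enabled, the pragma is ignored and the loop simply runs serially.

```c
#include <assert.h>

/* Hypothetical wrapper around the slide's loop: p[i] = v1[i] * v2[i].
 * The iterations are divided among the threads of one host team. */
void elementwise_product(const double *v1, const double *v2, double *p, int n)
{
    assert(n >= 0);

    /* Each thread receives a share of the iteration space; the loop
     * variable i is private to each thread. */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        p[i] = v1[i] * v2[i];
}
```

Because every iteration writes a different element of `p`, no synchronization beyond the implicit barrier at the end of the loop is needed.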

OpenMP TARGET PARALLEL FOR
Offloads data and execution to a target device, then executes the iterations of the next for loop in parallel across a team of threads.

#pragma omp target
#pragma omp parallel for
for (i=0; i<N; i++)
    p[i] = v1[i] * v2[i];
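
When the arrays are passed as pointers, the compiler cannot know how much data to move, so the offloaded version usually needs explicit `map` clauses with array sections. The sketch below assumes that situation; the function name and the `map` clauses are additions for illustration and do not appear on the slide. If no device is available, an OpenMP implementation may run the region on the host instead.

```c
#include <assert.h>

/* Hypothetical offloaded version of the element-wise product.
 * The map clauses are an assumption: they copy the n input elements
 * to the device and copy the n results back. */
void elementwise_product_target(const double *v1, const double *v2,
                                double *p, int n)
{
    assert(n >= 0);

    #pragma omp target map(to: v1[0:n], v2[0:n]) map(from: p[0:n])
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        p[i] = v1[i] * v2[i];
}
```

Note that this form still creates only a single team of threads on the device, which is the limitation the TEAMS construct addresses.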

OpenMP 4.0 TEAMS/DISTRIBUTE
Creates a league of teams on the target device, distributes blocks of work among those teams, and executes the remaining work in parallel within each team.

This code is portable whether 1 team or many teams are used.

#pragma omp target teams
#pragma omp distribute parallel for reduction(+:sum)
for (i=0; i<N; i++)
    sum += B[i] * C[i];
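
A self-contained sketch of the same dot product is given below. It departs from the slide in two ways, both assumptions on my part: it uses the single combined `target teams distribute parallel for` directive so that the `reduction` clause also applies at the teams level, and it maps `sum` and the arrays explicitly, since later OpenMP versions do not copy a scalar back from the device by default. The function name `dot_teams` is hypothetical.

```c
#include <assert.h>

/* Hypothetical dot product using a league of teams on the target device.
 * Explicit maps are an assumption: B and C are copied in, and sum is
 * copied in and back out so the reduced value reaches the host. */
double dot_teams(const double *B, const double *C, int n)
{
    double sum = 0.0;
    assert(n >= 0);

    #pragma omp target teams distribute parallel for reduction(+:sum) \
            map(to: B[0:n], C[0:n]) map(tofrom: sum)
    for (int i = 0; i < n; i++)
        sum += B[i] * C[i];

    return sum;
}
```

The number of teams and threads is left to the implementation here, which is what keeps the code portable between one team and many.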

OpenMP 4.0 Teams Distribute Parallel For
#pragma omp target
#pragma omp parallel for reduction(+:sum)
for (i=0; i<N; i++) sum += B[i] * C[i];

OpenMP 4.0 Teams Distribute Parallel For
#pragma omp target teams
#pragma omp distribute parallel for

#pragma omp parallel for reduction(+:sum)
for (i=0; i< ??; i++) sum += B[i] * C[i];

#pragma omp parallel for reduction(+:sum)
for (i=??; i< ??; i++) sum += B[i] * C[i];

#pragma omp parallel for reduction(+:sum)
for (i=??; i< ??; i++) sum += B[i] * C[i];

#pragma omp parallel for reduction(+:sum)
for (i=??; i< N; i++) sum += B[i] * C[i];

The programmer doesn’t have to think about how the loop is decomposed!
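
To make the "??" bounds concrete, here is a host-only sketch of what a programmer would otherwise write by hand. It assumes an even block split of the iterations among teams; the actual decomposition is chosen by the OpenMP implementation and may differ. The names `block_bounds` and `dot_by_blocks` are hypothetical, and the outer loop over teams runs sequentially here purely for illustration.

```c
#include <assert.h>

/* Hypothetical helper: half-open range [*lo, *hi) that team t of nteams
 * would receive under an even block split of n iterations. The first
 * (n % nteams) teams get one extra iteration. */
void block_bounds(int n, int nteams, int t, int *lo, int *hi)
{
    assert(n >= 0 && nteams > 0 && t >= 0 && t < nteams);
    int base = n / nteams;
    int rem = n % nteams;
    *lo = t * base + (t < rem ? t : rem);
    *hi = *lo + base + (t < rem ? 1 : 0);
}

/* Manual version of the decomposition: one block per "team", with the
 * threads of that team sharing the block through parallel for. */
double dot_by_blocks(const double *B, const double *C, int n, int nteams)
{
    double sum = 0.0;
    for (int t = 0; t < nteams; t++) {
        int lo, hi;
        block_bounds(n, nteams, t, &lo, &hi);

        double partial = 0.0;
        #pragma omp parallel for reduction(+:partial)
        for (int i = lo; i < hi; i++)
            partial += B[i] * C[i];

        sum += partial;  /* combine the per-team partial sums */
    }
    return sum;
}
```

With `distribute`, both the bounds computation and the combination of partial sums are handled by the compiler and runtime rather than by code like this.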

OMP + NV: We’re not done yet!
Hardware parallelism is not going away; programmers demand a simple, portable way to use it.
OpenMP 4.0 is just the first step toward a portable standard for directive-based acceleration.
We will continue to work with OpenMP to address the challenges of parallel computing:
Improved Interoperability
Improved Portability
Improved Expressibility
