Jeff Larkin <jlarkin@nvidia.com> Sr. DevTech Software Engineer, OpenACC Technical Committee Chair
Michael Wolfe <mwolfe@nvidia.com> Compiler Engineer
HIGHLY PARALLEL FORTRAN AND OPENACC DIRECTIVES
OPENACC DIRECTIVES
a directive-based parallel programming model designed for usability, performance, and portability
3 OF TOP 5 HPC
18% OF INCITE AT SUMMIT
>200K DOWNLOADS
PLATFORMS SUPPORTED: NVIDIA GPU, X86 CPU, POWER CPU, Sunway, ARM CPU, AMD GPU
[Chart: OPENACC APPS] SC15: 39, SC16: 87, SC17: 107, SC18: 150, ISC19: 200, SC19: 236
[Chart: OPENACC SLACK MEMBERS] ISC17: 150, SC17: 305, ISC18: 361, SC18: 692, ISC19: 1154, SC19: 1724
OpenACC Members
OpenACC Directives
Manage Data Movement
Initiate Parallel Execution
Optimize Loop Mappings
!$acc data copyin(a,b) copyout(c)
...
!$acc parallel
!$acc loop gang vector
do i = 1, n
  c(i) = a(i) + b(i)
  ...
end do
!$acc end parallel
...
!$acc end data
CPU, GPU, and more
Performance portable
Interoperable
Single source
Incremental
THE ROLE OF DIRECTIVES FOR PARALLEL PROGRAMMING
Serial Programming Languages
Parallel Programming Languages
Directives convey additional information to the compiler.
PARALLEL PROGRAMMING CONCERNS
What | Why
Express Parallelism | What can and should be run in parallel?
Optimize Parallelism | How can I execute faster on the parallel hardware I have?
Manage Compute Locality | Executing on the local compute core isn't enough. Threading and offloading are now the norm.
Manage Data Locality | Memory is no longer simple and flat. Memory has hierarchies, which are made even more complex when considering heterogeneous architectures.
Asynchronous Operations | Asynchronicity is increasingly required to cover various overheads.
EXPRESS PARALLELISM IN OPENACC
!$acc parallel   ! or: !$acc kernels
!$acc loop independent
do i=1,N
  C(i) = A(i) + B(i)
end do
!$acc end parallel   ! or: !$acc end kernels
Not shown here:
• Private Clause
• Reduction Clause
• Atomic directive
What can and should be run in parallel?
• Assert data-independence to the compiler (can be run in parallel)
• Identify parallelism-blockers (private, reduction, atomic)
• Hint a desire for the iterations to be run in parallel (should be run in parallel)
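The three parallelism-blockers named on this slide (private, reduction, atomic) can be sketched in one loop. This is an illustrative example, not code from the talk: the program name, the `tmp`, `total`, and `hist` variables, and the histogram binning are all hypothetical. Built without OpenACC support, the directives are ordinary comments and the program runs serially with the same result.

```fortran
program express_parallelism
  implicit none
  integer, parameter :: n = 1000, nbins = 10
  real :: a(n), b(n), c(n)
  real :: tmp, total
  integer :: hist(nbins)
  integer :: i, bin

  do i = 1, n
    a(i) = real(i)
    b(i) = 2.0
  end do
  total = 0.0
  hist = 0

  !$acc parallel loop private(tmp, bin) reduction(+:total) copyin(a, b) copyout(c) copy(hist)
  do i = 1, n
    tmp = a(i) + b(i)           ! private: every iteration works on its own tmp
    c(i) = tmp
    total = total + tmp         ! reduction: partial sums are combined at the end
    bin = mod(i, nbins) + 1
    !$acc atomic update
    hist(bin) = hist(bin) + 1   ! atomic: many iterations update the same element
  end do

  print *, 'total =', total
  print *, 'hist  =', hist
  ! Self-check: every iteration must have been counted exactly once.
  if (sum(hist) /= n) error stop 'histogram count mismatch'
end program express_parallelism
```

Each clause removes one obstacle the compiler could not resolve on its own: a shared scalar, an accumulation, and a colliding array update.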
EXPRESS PARALLELISM IN FORTRAN 2018
do concurrent (i=1:N)
  C(i) = A(i) + B(i)
end do
C(:) = A(:) + B(:)
Not shown here:
• Local Clause
• Other intrinsics
What can and should be run in parallel?
• Assert ability for iterations to be run in any order (possibly concurrently)
• Identify privatization of variables, if necessary
• Currently no loop-level reduction or atomics
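The privatization point can be sketched with the Fortran 2018 `local` locality specifier. This is a minimal illustration with hypothetical names (`tmp`, `d`), and it assumes a compiler that implements Fortran 2018 locality specifiers; older compilers may reject the `local(...)` clause.

```fortran
program express_do_concurrent
  implicit none
  integer, parameter :: n = 1000
  real :: a(n), b(n), c(n), d(n)
  real :: tmp
  integer :: i

  a = 1.0
  b = 2.0

  ! Iterations may run in any order; LOCAL gives each iteration its own tmp.
  do concurrent (i = 1:n) local(tmp)
    tmp = a(i) + b(i)
    c(i) = tmp * tmp
  end do

  ! Array syntax states the same element-wise independence without a loop.
  d(:) = (a(:) + b(:))**2

  ! Self-check: both formulations must agree.
  if (any(c /= d)) error stop 'do concurrent and array syntax disagree'
  print *, 'c(1) =', c(1)
end program express_do_concurrent
```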
OPTIMIZE PARALLELISM IN OPENACC
!$acc loop independent vector(128)
do i=1,N
  C(i) = A(i) + B(i)
end do
Not shown here:
• Collapse
• Tile
• Gang
• Worker
• Device_type Specification
How can it run better in parallel?
• Guide the compiler to better parallel decomposition on a given hardware.
• Potentially transform loops to better expose parallelism (collapse) or locality (tile)
• Only analogue in Fortran is restructuring or changing compiler flags. Good enough?
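The two loop transformations mentioned above, collapse and tile, might look like this on a pair of nested loops. The array names and the 32x32 tile size are assumptions chosen for illustration, not tuned values from the talk. Without OpenACC support the directives are comments and the loops run serially.

```fortran
program optimize_parallelism
  implicit none
  integer, parameter :: nx = 256, ny = 256
  real :: a(nx, ny), b(nx, ny), c(nx, ny), t(ny, nx)
  integer :: i, j

  call random_number(a)
  call random_number(b)

  ! collapse(2): treat both loops as one larger iteration space,
  ! which exposes more parallelism than the outer loop alone.
  !$acc parallel loop collapse(2) copyin(a, b) copyout(c)
  do j = 1, ny
    do i = 1, nx
      c(i, j) = a(i, j) + b(i, j)
    end do
  end do

  ! tile(32, 32): process the loops in blocks so that the strided
  ! accesses of a transpose stay close together in memory.
  !$acc parallel loop tile(32, 32) copyin(c) copyout(t)
  do j = 1, ny
    do i = 1, nx
      t(j, i) = c(i, j)
    end do
  end do

  ! Self-check against the intrinsic.
  if (any(t /= transpose(c))) error stop 'transpose mismatch'
  print *, 'ok'
end program optimize_parallelism
```

Neither clause changes what the loops compute; both only change how iterations are mapped onto the hardware.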
MANAGE COMPUTE LOCALITY IN OPENACC
!$acc set device_num(0) device_type(nvidia)
!$acc parallel loop
do i=1,N
  C(i) = A(i) + B(i)
end do
Not shown here:
• Self clause
• Serial or Kernels construct
Where should it be run in parallel?
• Instruct the compiler where to execute the region
• Or leave it up to the runtime
• Potentially execute on multiple devices (including the host) simultaneously
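The self clause and the serial construct listed under "Not shown here" can be sketched as follows. This assumes an OpenACC 2.7 or later compiler for the `self` clause; the `run_on_host` flag and its threshold are hypothetical. Without OpenACC support the directives are comments and everything runs on the host.

```fortran
program compute_locality
  implicit none
  integer, parameter :: n = 1000
  real :: a(n), b(n), c(n)
  logical :: run_on_host
  integer :: i

  a = 1.0
  b = 2.0
  run_on_host = (n < 10000)   ! hypothetical "too small to offload" threshold

  ! self(condition): execute the region on the local (host) device when true.
  !$acc parallel loop self(run_on_host)
  do i = 1, n
    c(i) = a(i) + b(i)
  end do

  ! serial: execute on the current device, but without parallel execution.
  ! The loop carries a dependence on c(i-1), so it must stay sequential.
  !$acc serial
  do i = 2, n
    c(i) = c(i) + c(i-1)
  end do
  !$acc end serial

  ! Self-check: the running sum of n threes is 3*n.
  if (c(n) /= 3.0 * real(n)) error stop 'unexpected prefix sum'
  print *, 'c(n) =', c(n)
end program compute_locality
```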
MANAGE DATA LOCALITY IN OPENACC
!$acc data copyin(A,B) copyout(C)
!$acc parallel loop
do i=1,N
  C(i) = A(i) + B(i)
end do
!$acc end data
Not shown here:
• Unstructured data management
• Update directive
• Deep-copy
• Device data interoperability
• Cache directive
Where should data live and for how long?
• Identify data structures that are required when distinct memories exist.
• Reduce data motion between distinct memories
• Give compiler context for data reuse
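Unstructured data management and the update directive, both listed as not shown, can be sketched like this. The time-step loop, the step count, and the "every fifth step" output interval are assumptions made for the example. Without OpenACC support the directives are comments and the program runs on the host.

```fortran
program data_locality
  implicit none
  integer, parameter :: n = 1000, nsteps = 10
  real :: a(n), c(n)
  integer :: i, step

  a = 1.0
  c = 0.0

  ! Unstructured lifetime: the arrays stay on the device across all steps,
  ! so they are transferred once instead of once per kernel.
  !$acc enter data copyin(a, c)

  do step = 1, nsteps
    !$acc parallel loop present(a, c)
    do i = 1, n
      c(i) = c(i) + a(i)
    end do

    ! Refresh the host copy only when the host actually needs it.
    if (mod(step, 5) == 0) then
      !$acc update self(c)
      print *, 'step', step, 'c(1) =', c(1)
    end if
  end do

  !$acc exit data copyout(c) delete(a)

  ! Self-check: nsteps additions of 1.0.
  if (any(c /= real(nsteps))) error stop 'unexpected result'
end program data_locality
```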
DATA LOCALITY
Unified Memory and Heterogeneous Memory Management (HMM)
Modern GPUs can handle paging memory to/from the GPU automatically (locality matters)
HMM is a Linux kernel feature for exposing all memory in this way (ubiquity matters)
Can the 99% work without explicitly optimizing memory locality?
[Diagram: Unified Memory spanning System Memory and GPU Memory, linked by PCIe]
ASYNCHRONOUS EXECUTION IN OPENACC
!$acc update device(A) async(1)
!$acc parallel loop async(1)
do i=1,N
  C(i) = A(i) + B(i)
end do
!$acc update self(C) async(1)
!$acc wait(1)
What can be run concurrently?
• Expose dependence (or lack of) between different regions
• Introduce potential for overlapping operations
• Increase overall system utilization
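Overlap only becomes possible once independent work is placed on different queues. The sketch below uses two queues with hypothetical arrays `c` and `d`; whether the queues actually overlap depends on the compiler, runtime, and hardware. Without OpenACC support the directives are comments and the work runs in program order.

```fortran
program async_queues
  implicit none
  integer, parameter :: n = 1000
  real :: a(n), b(n), c(n), d(n)
  integer :: i

  a = 1.0
  b = 2.0
  c = 0.0
  d = 0.0

  !$acc data create(a, b, c, d)

  ! Queue 1: upload a, compute c, download c. Operations on the same
  ! queue execute in the order they were issued.
  !$acc update device(a) async(1)
  !$acc parallel loop async(1)
  do i = 1, n
    c(i) = 2.0 * a(i)
  end do
  !$acc update self(c) async(1)

  ! Queue 2: no dependence on queue 1, so it is free to overlap with it.
  !$acc update device(b) async(2)
  !$acc parallel loop async(2)
  do i = 1, n
    d(i) = 3.0 * b(i)
  end do
  !$acc update self(d) async(2)

  ! Wait for all queues before the host reads c and d.
  !$acc wait
  !$acc end data

  ! Self-check.
  if (any(c /= 2.0) .or. any(d /= 6.0)) error stop 'unexpected result'
  print *, 'ok'
end program async_queues
```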
PARALLEL PROGRAMMING CONCERNS
Fortran | OpenACC
Identify Parallelism
Optimize Parallelism
Manage Compute Locality
Manage Data Locality
Asynchronous Operations
Good Enough?
[Table; cell contents were graphical and are not recoverable]
CLOVERLEAF V1.3
AWE Hydrodynamics mini-app
6500+ lines, !$acc kernels
OpenACC or OpenMP or do concurrent
Source on GitHub
http://uk-mac.github.io/CloverLeaf
FORTRAN WITH OPENACC DIRECTIVES
75 !$ACC KERNELS
76 !$ACC LOOP INDEPENDENT
77 DO k=y_min,y_max
78 !$ACC LOOP INDEPENDENT PRIVATE(right_flux,left_flux,top_flux,bottom_flux,total_flux,min_cell_volume,energy_change,recip_volume)
79 DO j=x_min,x_max
80
81 left_flux= (xarea(j ,k )*(xvel0(j ,k )+xvel0(j ,k+1) &
82 +xvel0(j ,k )+xvel0(j ,k+1)))*0.25_8*dt*0.5
83 right_flux= (xarea(j+1,k )*(xvel0(j+1,k )+xvel0(j+1,k+1) &
84 +xvel0(j+1,k )+xvel0(j+1,k+1)))*0.25_8*dt*0.5
85 bottom_flux=(yarea(j ,k )*(yvel0(j ,k )+yvel0(j+1,k ) &
86 +yvel0(j ,k )+yvel0(j+1,k )))*0.25_8*dt*0.5
87 top_flux= (yarea(j ,k+1)*(yvel0(j ,k+1)+yvel0(j+1,k+1) &
88 +yvel0(j ,k+1)+yvel0(j+1,k+1)))*0.25_8*dt*0.5
89 total_flux=right_flux-left_flux+top_flux-bottom_flux
90
91 volume_change(j,k)=volume(j,k)/(volume(j,k)+total_flux)
92
93 min_cell_volume=MIN(volume(j,k)+right_flux-left_flux+top_flux-bottom_flux &
94 ,volume(j,k)+right_flux-left_flux &
95 ,volume(j,k)+top_flux-bottom_flux)
97 recip_volume=1.0/volume(j,k)
99 energy_change=(pressure(j,k)/density0(j,k)+viscosity(j,k)/density0(j,k))*...
101 energy1(j,k)=energy0(j,k)-energy_change
103 density1(j,k)=density0(j,k)*volume_change(j,k)
105 ENDDO
106 ENDDO
107 !$ACC END KERNELS
% pgfortran -fast -ta=tesla:managed -Minfo -c PdV_kernel.f90
pdv_kernel:
...
77, Loop is parallelizable
79, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
77, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
79, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
...
http://uk-mac.github.io/CloverLeaf
FORTRAN 2018 DO CONCURRENT
75
76
77 DO CONCURRENT (k=y_min:y_max, j=x_min:x_max) &
78 LOCAL (right_flux,left_flux,top_flux,bottom_flux,total_flux,min_cell_volume,energy_change,recip_volume)
79
80
81 left_flux= (xarea(j ,k )*(xvel0(j ,k )+xvel0(j ,k+1) &
82 +xvel0(j ,k )+xvel0(j ,k+1)))*0.25_8*dt*0.5
83 right_flux= (xarea(j+1,k )*(xvel0(j+1,k )+xvel0(j+1,k+1) &
84 +xvel0(j+1,k )+xvel0(j+1,k+1)))*0.25_8*dt*0.5
85 bottom_flux=(yarea(j ,k )*(yvel0(j ,k )+yvel0(j+1,k ) &
86 +yvel0(j ,k )+yvel0(j+1,k )))*0.25_8*dt*0.5
87 top_flux= (yarea(j ,k+1)*(yvel0(j ,k+1)+yvel0(j+1,k+1) &
88 +yvel0(j ,k+1)+yvel0(j+1,k+1)))*0.25_8*dt*0.5
89 total_flux=right_flux-left_flux+top_flux-bottom_flux
90
91 volume_change(j,k)=volume(j,k)/(volume(j,k)+total_flux)
92
93 min_cell_volume=MIN(volume(j,k)+right_flux-left_flux+top_flux-bottom_flux &
94 ,volume(j,k)+right_flux-left_flux &
95 ,volume(j,k)+top_flux-bottom_flux)
97 recip_volume=1.0/volume(j,k)
99 energy_change=(pressure(j,k)/density0(j,k)+viscosity(j,k)/density0(j,k))*...
101 energy1(j,k)=energy0(j,k)-energy_change
103 density1(j,k)=density0(j,k)*volume_change(j,k)
105
106 ENDDO
107
% pgfortran -fast -ta=tesla:managed -Minfo -c PdV_kernel.f90
pdv_kernel:
...
77, Do concurrent is parallelizable
Accelerator kernel generated
Generating Tesla code
77, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
!$acc loop gang, vector(32) ! blockidx%x threadidx%x
...
https://github.com/UoB-HPC/CloverLeaf_doconcurrent
FORTRAN 2018 DO CONCURRENT FOR V100
CloverLeaf AWE hydrodynamics mini-App, bm32 data set
[Chart: Speedup vs 2-socket Haswell node (bigger is better), y-axis 0 to 10]
Platforms: 2-socket Skylake, 2-socket Rome, 2-socket POWER9, Skylake+1xV100
Series: Intel 2019.5 OpenMP, IBM XL 16.1 OpenMP, PGI Dev DO CONCURRENT, PGI 19.10 OpenACC
Bar labels as extracted: 1.8x, 1.9x, 2.0x, 8.3x, 8.5x, 2.0x, 2.0x, 1.9x
Systems: Skylake 2x20 core Xeon Gold server (perf-sky-6) one thread per core, Rome: Two 24 core AMD EPYC 7451 CPUs @ 2.9GHz w/ 256GB memory; POWER9 DD2.1 server (perf-wsn1) two threads per core
Compilers: Intel 2019.5.281, PGI 19.10, IBM XL 16.1.1.3
Benchmark: CloverLeaf v1.3 OpenACC, OpenMP and DoConcurrent versions downloaded from https://github.com/UoB-HPC the week of June 10, 2019
https://github.com/UoB-HPC/CloverLeaf_doconcurrent
THE FUTURE OF PARALLEL PROGRAMMING
Standard Languages | Directives | Specialized Languages
Standard Languages: Drive Base Languages to Better Support Parallelism
do concurrent (i = 1:n)
  y(i) = y(i) + a*x(i)
enddo
std::for_each_n(POL, idx(0), n,
  [&](Index_t i){
    y[i] += a*x[i];
  });
Directives: Augment Base Languages with Directives
!$acc data copy(x,y)
...
!$acc parallel loop async
do concurrent (i = 1:n)
  y(i) = y(i) + a*x(i)
enddo
...
!$acc end data
Specialized Languages: Maximize Performance with Specialized Languages & Intrinsics
attributes(global) subroutine saxpy(n, a, x, y)
  integer, value :: n
  real, value :: a
  real :: x(*), y(*)
  integer :: i
  ! CUDA Fortran block and thread indices start at 1
  i = (blockIdx%x-1)*blockDim%x + threadIdx%x
  if (i <= n) y(i) = y(i) + a*x(i)
end subroutine saxpy
program main
  real, allocatable :: x(:), y(:)
  real, device, allocatable :: d_x(:), d_y(:)
  ...
  d_x = x
  d_y = y
  call saxpy<<<(N+255)/256,256>>>(...)
  y = d_y
end program main
FUTURE DIRECTIONS FOR OPENACC AND FORTRAN
DO CONCURRENT – Is this ready for widespread use? Is it enough?
Fortran reductions – Critical for many OpenACC applications. Significant restructuring otherwise.
Co-Arrays – How do they fit into this picture?
Are there other gaps that need filling? What are our common challenges?
How can the OpenACC and Fortran communities work together more closely?
feedback@openacc.org
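The restructuring that a missing loop-level reduction forces can be sketched with a dot product. This is an illustration under Fortran 2018 rules, not code from the talk; the `prod` temporary and the variable names are hypothetical. The OpenACC loop states the reduction directly, while the `do concurrent` version stores per-iteration results and leaves the summation to an intrinsic.

```fortran
program reduction_workaround
  implicit none
  integer, parameter :: n = 1000
  real :: x(n), y(n), prod(n)
  real :: dot_acc, dot_dc
  integer :: i

  do i = 1, n
    x(i) = 1.0
    y(i) = real(i)
  end do

  ! OpenACC: the reduction is declared on the loop itself.
  dot_acc = 0.0
  !$acc parallel loop reduction(+:dot_acc)
  do i = 1, n
    dot_acc = dot_acc + x(i) * y(i)
  end do

  ! Fortran 2018: accumulating into a shared scalar inside DO CONCURRENT
  ! is not allowed, so each iteration writes its own element and the
  ! SUM intrinsic performs the reduction afterwards.
  do concurrent (i = 1:n)
    prod(i) = x(i) * y(i)
  end do
  dot_dc = sum(prod)

  ! Self-check with a tolerance, since summation order may differ.
  if (abs(dot_acc - dot_dc) > 1.0e-3 * abs(dot_dc)) error stop 'results differ'
  print *, 'dot product =', dot_dc
end program reduction_workaround
```

The workaround costs an extra array of length `n` and a second pass over the data, which is the kind of restructuring the slide refers to.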
BACK-UP
A BRIEF HISTORY OF OPENACC
2011 | Incorporation | ORNL asks CAPS, Cray, & PGI to unify efforts with the help of NVIDIA
Nov. 2011 | OpenACC 1.0 | Basic parallelism, structured data, and async/wait semantics
June 2013 | OpenACC 2.0 | Unstructured Data Lifetimes, Routines, Atomic, Clarifications & Improvements
Oct. 2015 | OpenACC 2.5 | Reference Counting, Profiling Interface, Additional Improvements from User Feedback
Nov. 2016 | OpenACC 2.6 | Serial Construct, Attach/Detach (Manual Deep Copy), Misc. User Feedback
Nov. 2018 | OpenACC 2.7 | Compute on Self, readonly, Array Reductions, Lots of Clarifications, Misc. User Feedback
Nov. 2019 | OpenACC 3.0 | Updated Base Languages, C++ Lambdas, Zero modifier, Improved multi-device support
Más contenido relacionado

La actualidad más candente

Efficient kernel backporting
Efficient kernel backportingEfficient kernel backporting
Efficient kernel backportingLF Events
 
Embedded Graphics Drivers in Mesa (ELCE 2019)
Embedded Graphics Drivers in Mesa (ELCE 2019)Embedded Graphics Drivers in Mesa (ELCE 2019)
Embedded Graphics Drivers in Mesa (ELCE 2019)Igalia
 
An introduction to ROP
An introduction to ROPAn introduction to ROP
An introduction to ROPSaumil Shah
 
An Overview of the IHK/McKernel Multi-kernel Operating System
An Overview of the IHK/McKernel Multi-kernel Operating SystemAn Overview of the IHK/McKernel Multi-kernel Operating System
An Overview of the IHK/McKernel Multi-kernel Operating SystemLinaro
 
ONNC - 0.9.1 release
ONNC - 0.9.1 releaseONNC - 0.9.1 release
ONNC - 0.9.1 releaseLuba Tang
 
JavaOne 2015 : How I Rediscovered My Coding Mojo by Building an IoT/Robotics ...
JavaOne 2015 : How I Rediscovered My Coding Mojo by Building an IoT/Robotics ...JavaOne 2015 : How I Rediscovered My Coding Mojo by Building an IoT/Robotics ...
JavaOne 2015 : How I Rediscovered My Coding Mojo by Building an IoT/Robotics ...Mark West
 
PT-4142, Porting and Optimizing OpenMP applications to APU using CAPS tools, ...
PT-4142, Porting and Optimizing OpenMP applications to APU using CAPS tools, ...PT-4142, Porting and Optimizing OpenMP applications to APU using CAPS tools, ...
PT-4142, Porting and Optimizing OpenMP applications to APU using CAPS tools, ...AMD Developer Central
 
Leverage the Speed of OpenCL™ with AMD Math Libraries
Leverage the Speed of OpenCL™ with AMD Math LibrariesLeverage the Speed of OpenCL™ with AMD Math Libraries
Leverage the Speed of OpenCL™ with AMD Math LibrariesAMD Developer Central
 
Evaluation of cloudera impala 1.1
Evaluation of cloudera impala 1.1Evaluation of cloudera impala 1.1
Evaluation of cloudera impala 1.1Yukinori Suda
 
Osic tech talk presentation on ironic inspector
Osic tech talk presentation on ironic inspectorOsic tech talk presentation on ironic inspector
Osic tech talk presentation on ironic inspectorAnnie Lezil
 
Iurii Antykhovych "Java and performance tools and toys"
Iurii Antykhovych "Java and performance tools and toys"Iurii Antykhovych "Java and performance tools and toys"
Iurii Antykhovych "Java and performance tools and toys"LogeekNightUkraine
 
Kubernetes &amp; the 12 factor cloud apps
Kubernetes &amp; the 12 factor cloud appsKubernetes &amp; the 12 factor cloud apps
Kubernetes &amp; the 12 factor cloud appsAna-Maria Mihalceanu
 
Introduction to Debuggers
Introduction to DebuggersIntroduction to Debuggers
Introduction to DebuggersSaumil Shah
 

La actualidad más candente (20)

Efficient kernel backporting
Efficient kernel backportingEfficient kernel backporting
Efficient kernel backporting
 
Embedded Graphics Drivers in Mesa (ELCE 2019)
Embedded Graphics Drivers in Mesa (ELCE 2019)Embedded Graphics Drivers in Mesa (ELCE 2019)
Embedded Graphics Drivers in Mesa (ELCE 2019)
 
An introduction to ROP
An introduction to ROPAn introduction to ROP
An introduction to ROP
 
An Overview of the IHK/McKernel Multi-kernel Operating System
An Overview of the IHK/McKernel Multi-kernel Operating SystemAn Overview of the IHK/McKernel Multi-kernel Operating System
An Overview of the IHK/McKernel Multi-kernel Operating System
 
ONNC - 0.9.1 release
ONNC - 0.9.1 releaseONNC - 0.9.1 release
ONNC - 0.9.1 release
 
SDAccel Design Contest: Vivado HLS
SDAccel Design Contest: Vivado HLSSDAccel Design Contest: Vivado HLS
SDAccel Design Contest: Vivado HLS
 
JavaOne 2015 : How I Rediscovered My Coding Mojo by Building an IoT/Robotics ...
JavaOne 2015 : How I Rediscovered My Coding Mojo by Building an IoT/Robotics ...JavaOne 2015 : How I Rediscovered My Coding Mojo by Building an IoT/Robotics ...
JavaOne 2015 : How I Rediscovered My Coding Mojo by Building an IoT/Robotics ...
 
PT-4142, Porting and Optimizing OpenMP applications to APU using CAPS tools, ...
PT-4142, Porting and Optimizing OpenMP applications to APU using CAPS tools, ...PT-4142, Porting and Optimizing OpenMP applications to APU using CAPS tools, ...
PT-4142, Porting and Optimizing OpenMP applications to APU using CAPS tools, ...
 
RISC-V Zce Extension
RISC-V Zce ExtensionRISC-V Zce Extension
RISC-V Zce Extension
 
SDAccel Design Contest: Vivado
SDAccel Design Contest: VivadoSDAccel Design Contest: Vivado
SDAccel Design Contest: Vivado
 
SDAccel Design Contest: Intro
SDAccel Design Contest: IntroSDAccel Design Contest: Intro
SDAccel Design Contest: Intro
 
Leverage the Speed of OpenCL™ with AMD Math Libraries
Leverage the Speed of OpenCL™ with AMD Math LibrariesLeverage the Speed of OpenCL™ with AMD Math Libraries
Leverage the Speed of OpenCL™ with AMD Math Libraries
 
Evaluation of cloudera impala 1.1
Evaluation of cloudera impala 1.1Evaluation of cloudera impala 1.1
Evaluation of cloudera impala 1.1
 
Osic tech talk presentation on ironic inspector
Osic tech talk presentation on ironic inspectorOsic tech talk presentation on ironic inspector
Osic tech talk presentation on ironic inspector
 
SDAccel Design Contest: SDAccel and F1 Instances
SDAccel Design Contest: SDAccel and F1 InstancesSDAccel Design Contest: SDAccel and F1 Instances
SDAccel Design Contest: SDAccel and F1 Instances
 
SDAccel Design Contest: Xilinx SDAccel
SDAccel Design Contest: Xilinx SDAccel SDAccel Design Contest: Xilinx SDAccel
SDAccel Design Contest: Xilinx SDAccel
 
Iurii Antykhovych "Java and performance tools and toys"
Iurii Antykhovych "Java and performance tools and toys"Iurii Antykhovych "Java and performance tools and toys"
Iurii Antykhovych "Java and performance tools and toys"
 
Kubernetes &amp; the 12 factor cloud apps
Kubernetes &amp; the 12 factor cloud appsKubernetes &amp; the 12 factor cloud apps
Kubernetes &amp; the 12 factor cloud apps
 
Advance ROP Attacks
Advance ROP AttacksAdvance ROP Attacks
Advance ROP Attacks
 
Introduction to Debuggers
Introduction to DebuggersIntroduction to Debuggers
Introduction to Debuggers
 

Similar a FortranCon2020: Highly Parallel Fortran and OpenACC Directives

00 opencapi acceleration framework yonglu_ver2
00 opencapi acceleration framework yonglu_ver200 opencapi acceleration framework yonglu_ver2
00 opencapi acceleration framework yonglu_ver2Yutaka Kawai
 
Scala & Spark(1.6) in Performance Aspect for Scala Taiwan
Scala & Spark(1.6) in Performance Aspect for Scala TaiwanScala & Spark(1.6) in Performance Aspect for Scala Taiwan
Scala & Spark(1.6) in Performance Aspect for Scala TaiwanJimin Hsieh
 
Why scala is not my ideal language and what I can do with this
Why scala is not my ideal language and what I can do with thisWhy scala is not my ideal language and what I can do with this
Why scala is not my ideal language and what I can do with thisRuslan Shevchenko
 
Apache Flink(tm) - A Next-Generation Stream Processor
Apache Flink(tm) - A Next-Generation Stream ProcessorApache Flink(tm) - A Next-Generation Stream Processor
Apache Flink(tm) - A Next-Generation Stream ProcessorAljoscha Krettek
 
Where Did My CPU Go?
Where Did My CPU Go?Where Did My CPU Go?
Where Did My CPU Go?Enkitec
 
Rmoug13 - where did my CPU go?
Rmoug13 - where did my CPU go?Rmoug13 - where did my CPU go?
Rmoug13 - where did my CPU go?Enkitec
 
RMOUG 2013 - Where did my CPU go?
RMOUG 2013 - Where did my CPU go?RMOUG 2013 - Where did my CPU go?
RMOUG 2013 - Where did my CPU go?Kristofferson A
 
Siggraph 2016 - Vulkan and nvidia : the essentials
Siggraph 2016 - Vulkan and nvidia : the essentialsSiggraph 2016 - Vulkan and nvidia : the essentials
Siggraph 2016 - Vulkan and nvidia : the essentialsTristan Lorach
 
You're Off the Hook: Blinding Security Software
You're Off the Hook: Blinding Security SoftwareYou're Off the Hook: Blinding Security Software
You're Off the Hook: Blinding Security SoftwareCylance
 
What’s New in ScyllaDB Open Source 5.0
What’s New in ScyllaDB Open Source 5.0What’s New in ScyllaDB Open Source 5.0
What’s New in ScyllaDB Open Source 5.0ScyllaDB
 
Cray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best PracticesCray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best PracticesJeff Larkin
 
VMware ESXi - Intel and Qlogic NIC throughput difference v0.6
VMware ESXi - Intel and Qlogic NIC throughput difference v0.6VMware ESXi - Intel and Qlogic NIC throughput difference v0.6
VMware ESXi - Intel and Qlogic NIC throughput difference v0.6David Pasek
 
OOW 2013: Where did my CPU go
OOW 2013: Where did my CPU goOOW 2013: Where did my CPU go
OOW 2013: Where did my CPU goKristofferson A
 
Scylla Summit 2022: ScyllaDB Rust Driver: One Driver to Rule Them All
Scylla Summit 2022: ScyllaDB Rust Driver: One Driver to Rule Them AllScylla Summit 2022: ScyllaDB Rust Driver: One Driver to Rule Them All
Scylla Summit 2022: ScyllaDB Rust Driver: One Driver to Rule Them AllScyllaDB
 
Meetup 23 - 01 - The things I wish I would have known before doing OpenStack ...
Meetup 23 - 01 - The things I wish I would have known before doing OpenStack ...Meetup 23 - 01 - The things I wish I would have known before doing OpenStack ...
Meetup 23 - 01 - The things I wish I would have known before doing OpenStack ...Vietnam Open Infrastructure User Group
 
Under The Hood Of A Shard-Per-Core Database Architecture
Under The Hood Of A Shard-Per-Core Database ArchitectureUnder The Hood Of A Shard-Per-Core Database Architecture
Under The Hood Of A Shard-Per-Core Database ArchitectureScyllaDB
 
What’s Slowing Down Your Kafka Pipeline? With Ruizhe Cheng and Pete Stevenson...
What’s Slowing Down Your Kafka Pipeline? With Ruizhe Cheng and Pete Stevenson...What’s Slowing Down Your Kafka Pipeline? With Ruizhe Cheng and Pete Stevenson...
What’s Slowing Down Your Kafka Pipeline? With Ruizhe Cheng and Pete Stevenson...HostedbyConfluent
 
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...AMD Developer Central
 
Parallelism Processor Design
Parallelism Processor DesignParallelism Processor Design
Parallelism Processor DesignSri Prasanna
 

Similar a FortranCon2020: Highly Parallel Fortran and OpenACC Directives (20)

00 opencapi acceleration framework yonglu_ver2
00 opencapi acceleration framework yonglu_ver200 opencapi acceleration framework yonglu_ver2
00 opencapi acceleration framework yonglu_ver2
 
Scala & Spark(1.6) in Performance Aspect for Scala Taiwan
Scala & Spark(1.6) in Performance Aspect for Scala TaiwanScala & Spark(1.6) in Performance Aspect for Scala Taiwan
Scala & Spark(1.6) in Performance Aspect for Scala Taiwan
 
Why scala is not my ideal language and what I can do with this
Why scala is not my ideal language and what I can do with thisWhy scala is not my ideal language and what I can do with this
Why scala is not my ideal language and what I can do with this
 
Apache Flink(tm) - A Next-Generation Stream Processor
Apache Flink(tm) - A Next-Generation Stream ProcessorApache Flink(tm) - A Next-Generation Stream Processor
Apache Flink(tm) - A Next-Generation Stream Processor
 
Where Did My CPU Go?
Where Did My CPU Go?Where Did My CPU Go?
Where Did My CPU Go?
 
Rmoug13 - where did my CPU go?
Rmoug13 - where did my CPU go?Rmoug13 - where did my CPU go?
Rmoug13 - where did my CPU go?
 
RMOUG 2013 - Where did my CPU go?
RMOUG 2013 - Where did my CPU go?RMOUG 2013 - Where did my CPU go?
RMOUG 2013 - Where did my CPU go?
 
Siggraph 2016 - Vulkan and nvidia : the essentials
Siggraph 2016 - Vulkan and nvidia : the essentialsSiggraph 2016 - Vulkan and nvidia : the essentials
Siggraph 2016 - Vulkan and nvidia : the essentials
 
You're Off the Hook: Blinding Security Software
You're Off the Hook: Blinding Security SoftwareYou're Off the Hook: Blinding Security Software
You're Off the Hook: Blinding Security Software
 
What’s New in ScyllaDB Open Source 5.0
What’s New in ScyllaDB Open Source 5.0What’s New in ScyllaDB Open Source 5.0
What’s New in ScyllaDB Open Source 5.0
 
Cray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best PracticesCray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best Practices
 
VMware ESXi - Intel and Qlogic NIC throughput difference v0.6
VMware ESXi - Intel and Qlogic NIC throughput difference v0.6VMware ESXi - Intel and Qlogic NIC throughput difference v0.6
VMware ESXi - Intel and Qlogic NIC throughput difference v0.6
 
OOW 2013: Where did my CPU go
OOW 2013: Where did my CPU goOOW 2013: Where did my CPU go
OOW 2013: Where did my CPU go
 
Scylla Summit 2022: ScyllaDB Rust Driver: One Driver to Rule Them All
Scylla Summit 2022: ScyllaDB Rust Driver: One Driver to Rule Them AllScylla Summit 2022: ScyllaDB Rust Driver: One Driver to Rule Them All
Scylla Summit 2022: ScyllaDB Rust Driver: One Driver to Rule Them All
 
Meetup 23 - 01 - The things I wish I would have known before doing OpenStack ...
Meetup 23 - 01 - The things I wish I would have known before doing OpenStack ...Meetup 23 - 01 - The things I wish I would have known before doing OpenStack ...
Meetup 23 - 01 - The things I wish I would have known before doing OpenStack ...
 
Under The Hood Of A Shard-Per-Core Database Architecture
Under The Hood Of A Shard-Per-Core Database ArchitectureUnder The Hood Of A Shard-Per-Core Database Architecture
Under The Hood Of A Shard-Per-Core Database Architecture
 
RISC V in Spacer
RISC V in SpacerRISC V in Spacer
RISC V in Spacer
 
What’s Slowing Down Your Kafka Pipeline? With Ruizhe Cheng and Pete Stevenson...
What’s Slowing Down Your Kafka Pipeline? With Ruizhe Cheng and Pete Stevenson...What’s Slowing Down Your Kafka Pipeline? With Ruizhe Cheng and Pete Stevenson...
What’s Slowing Down Your Kafka Pipeline? With Ruizhe Cheng and Pete Stevenson...
 
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
 
Parallelism Processor Design
Parallelism Processor DesignParallelism Processor Design
Parallelism Processor Design
 

Más de Jeff Larkin

Best Practices for OpenMP on GPUs - OpenMP UK Users Group
Best Practices for OpenMP on GPUs - OpenMP UK Users GroupBest Practices for OpenMP on GPUs - OpenMP UK Users Group
Best Practices for OpenMP on GPUs - OpenMP UK Users GroupJeff Larkin
 
Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUs
Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUsEarly Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUs
Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUsJeff Larkin
 
Performance Portability Through Descriptive Parallelism
Performance Portability Through Descriptive ParallelismPerformance Portability Through Descriptive Parallelism
Performance Portability Through Descriptive ParallelismJeff Larkin
 
GTC16 - S6510 - Targeting GPUs with OpenMP 4.5
GTC16 - S6510 - Targeting GPUs with OpenMP 4.5GTC16 - S6510 - Targeting GPUs with OpenMP 4.5
GTC16 - S6510 - Targeting GPUs with OpenMP 4.5Jeff Larkin
 
GTC16 - S6410 - Comparing OpenACC 2.5 and OpenMP 4.5
GTC16 - S6410 - Comparing OpenACC 2.5 and OpenMP 4.5GTC16 - S6410 - Comparing OpenACC 2.5 and OpenMP 4.5
GTC16 - S6410 - Comparing OpenACC 2.5 and OpenMP 4.5Jeff Larkin
 
SC13: OpenMP and NVIDIA
SC13: OpenMP and NVIDIASC13: OpenMP and NVIDIA
SC13: OpenMP and NVIDIAJeff Larkin
 
Refactoring Applications for the XK7 and Future Hybrid Architectures
Refactoring Applications for the XK7 and Future Hybrid ArchitecturesRefactoring Applications for the XK7 and Future Hybrid Architectures
Refactoring Applications for the XK7 and Future Hybrid ArchitecturesJeff Larkin
 
Optimizing GPU to GPU Communication on Cray XK7
Optimizing GPU to GPU Communication on Cray XK7Optimizing GPU to GPU Communication on Cray XK7
Optimizing GPU to GPU Communication on Cray XK7Jeff Larkin
 
Progress Toward Accelerating CAM-SE
Progress Toward Accelerating CAM-SEProgress Toward Accelerating CAM-SE
Progress Toward Accelerating CAM-SEJeff Larkin
 
HPCMPUG2011 cray tutorial
HPCMPUG2011 cray tutorialHPCMPUG2011 cray tutorial
HPCMPUG2011 cray tutorialJeff Larkin
 
CUG2011 Introduction to GPU Computing
CUG2011 Introduction to GPU ComputingCUG2011 Introduction to GPU Computing
CUG2011 Introduction to GPU ComputingJeff Larkin
 
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...Jeff Larkin
 
May2010 hex-core-opt
May2010 hex-core-optMay2010 hex-core-opt
May2010 hex-core-optJeff Larkin
 
A Comparison of Accelerator Programming Models
A Comparison of Accelerator Programming ModelsA Comparison of Accelerator Programming Models
A Comparison of Accelerator Programming ModelsJeff Larkin
 
XT Best Practices
XT Best PracticesXT Best Practices
XT Best PracticesJeff Larkin
 
Practical Examples for Efficient I/O on Cray XT Systems (CUG 2009)
Practical Examples for Efficient I/O on Cray XT Systems (CUG 2009)Practical Examples for Efficient I/O on Cray XT Systems (CUG 2009)
Practical Examples for Efficient I/O on Cray XT Systems (CUG 2009)Jeff Larkin
 

Más de Jeff Larkin (16)

Best Practices for OpenMP on GPUs - OpenMP UK Users Group
Best Practices for OpenMP on GPUs - OpenMP UK Users GroupBest Practices for OpenMP on GPUs - OpenMP UK Users Group
Best Practices for OpenMP on GPUs - OpenMP UK Users Group
 
Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUs



FortranCon2020: Highly Parallel Fortran and OpenACC Directives

  • 1. HIGHLY PARALLEL FORTRAN AND OPENACC DIRECTIVES
    Jeff Larkin <jlarkin@nvidia.com>, Sr. DevTech Software Engineer, OpenACC Technical Committee Chair
    Michael Wolfe <mwolfe@nvidia.com>, Compiler Engineer
  • 2. OPENACC DIRECTIVES
    A directive-based parallel programming model designed for usability, performance, and portability
    3 of top 5 HPC; 18% of INCITE at Summit; >200K downloads
    Platforms supported: NVIDIA GPU, X86 CPU, POWER CPU, Sunway, ARM CPU, AMD GPU
    [Chart: OpenACC apps; x-axis: SC15, SC16, SC17, SC18, ISC19, SC19; values: 3, 39, 87, 107, 150, 200, 236]
    [Chart: OpenACC Slack members; x-axis: ISC17, SC17, ISC18, SC18, ISC19, SC19; values: 150, 305, 361, 692, 1154, 1724]
  • 5. OpenACC Directives
    Manage Data Movement; Initiate Parallel Execution; Optimize Loop Mappings
      !$acc data copyin(a,b) copyout(c)
      ...
      !$acc parallel
      !$acc loop gang vector
      do i = 0, n
        c(i) = a(i) + b(i);
        ...
      end do
      !$acc end parallel
      ...
      !$acc end data
    CPU, GPU, and more; Performance portable; Interoperable; Single source; Incremental
  • 6. THE ROLE OF DIRECTIVES FOR PARALLEL PROGRAMMING
    [Diagram: Serial Programming Languages; Parallel Programming Languages]
    Directives convey additional information to the compiler.
  • 7. PARALLEL PROGRAMMING CONCERNS
    What | Why
    Express Parallelism | What can and should be run in parallel?
    Optimize Parallelism | How can I execute faster on the parallel hardware I have?
    Manage Compute Locality | Executing on the local compute core isn’t enough. Threading and offloading are now the norm.
    Manage Data Locality | Memory is no longer simple and flat. Memory has hierarchies, which are made even more complex when considering heterogeneous architectures.
    Asynchronous Operations | Asynchronicity is increasingly required to cover various overheads.
  • 8. EXPRESS PARALLELISM IN OPENACC
    What can and should be run in parallel?
      !$acc parallel (or kernels)
      !$acc loop independent
      do i=1,N
        C(i) = A(i) + B(i)
      end do
    • Assert data-independence to the compiler (can be run in parallel)
    • Identify parallelism-blockers (private, reduction, atomic)
    • Hint a desire for the iterations to be run in parallel (should be run in parallel)
    Not shown here: Private Clause, Reduction Clause, Atomic directive
  • 9. EXPRESS PARALLELISM IN FORTRAN 2018
    What can and should be run in parallel?
      do concurrent (i=1:N)
        C(i) = A(i) + B(i)
      end do
      C(:) = A(:) + B(:)
    • Assert ability for iterations to be run in any order (possibly concurrently)
    • Identify privatization of variables, if necessary
    • Currently no loop-level reduction or atomics
    Not shown here: Local Clause, Other intrinsics
  • 10. OPTIMIZE PARALLELISM IN OPENACC
    How can it run better in parallel?
      !$acc loop independent vector(128)
      do i=1,N
        C(i) = A(i) + B(i)
      end do
    • Guide the compiler to better parallel decomposition on a given hardware.
    • Potentially transform loops to better expose parallelism (collapse) or locality (tile)
    • Only analogue in Fortran is restructuring or changing compiler flags. Good enough?
    Not shown here: Collapse, Tile, Gang, Worker, Device_type Specification
  • 11. MANAGE COMPUTE LOCALITY IN OPENACC
    Where should it be run in parallel?
      !$acc set device device_num(0)
      !$acc& device_type(nvidia)
      !$acc parallel loop
      do i=1,N
        C(i) = A(i) + B(i)
      end do
    • Instruct the compiler where to execute the region
    • Or leave it up to the runtime
    • Potentially execute on multiple devices (including the host) simultaneously
    Not shown here: Self clause, Serial or Kernels construct
  • 12. MANAGE DATA LOCALITY IN OPENACC
    Where should data live and for how long?
      !$acc data copyin(A,B)
      !$acc& copyout(C)
      !$acc parallel loop
      do i=1,N
        C(i) = A(i) + B(i)
      end do
      !$acc end data
    • Identify data structures that are required when distinct memories exist.
    • Reduce data motion between distinct memories
    • Give compiler context for data reuse
    Not shown here: Unstructured data management, Update directive, Deep-copy, Device data interoperability, Cache directive
  • 13. DATA LOCALITY
    Unified Memory and the Heterogeneous Memory Manager (HMM)
    Modern GPUs can handle paging memory to/from the GPU automatically (locality matters)
    HMM is a Linux kernel feature for exposing all memory in this way (ubiquity matters)
    Can the 99% work without explicitly optimizing memory locality?
    [Diagram: Unified Memory; System Memory; GPU Memory; PCIe]
  • 14. ASYNCHRONOUS EXECUTION IN OPENACC
    What can be run concurrently?
      !$acc update device(A) async(1)
      !$acc parallel loop async(1)
      do i=1,N
        C(i) = A(i) + B(i)
      end do
      !$acc update self(C) async(1)
      !$acc wait(1)
    • Expose dependence (or lack of) between different regions
    • Introduce potential for overlapping operations
    • Increase overall system utilization
  • 15. PARALLEL PROGRAMMING CONCERNS
    [Table: columns Fortran and OpenACC; rows Identify Parallelism, Optimize Parallelism, Manage Compute Locality, Manage Data Locality, Asynchronous Operations]
    Good Enough?
  • 16. CLOVERLEAF V1.3
    AWE Hydrodynamics mini-app
    6500+ lines, !$acc kernels
    OpenACC or OpenMP or do concurrent
    Source on GitHub: http://uk-mac.github.io/CloverLeaf
  • 17. FORTRAN WITH OPENACC DIRECTIVES
       75 !$ACC KERNELS
       76 !$ACC LOOP INDEPENDENT
       77 DO k=y_min,y_max
       78 !$ACC LOOP INDEPENDENT PRIVATE(right_flux,left_flux,top_flux,bottom_flux,total_flux,min_cell_volume,energy_change,recip_volume)
       79   DO j=x_min,x_max
       81     left_flux=  (xarea(j  ,k  )*(xvel0(j  ,k  )+xvel0(j  ,k+1) &
       82                                 +xvel0(j  ,k  )+xvel0(j  ,k+1)))*0.25_8*dt*0.5
       83     right_flux= (xarea(j+1,k  )*(xvel0(j+1,k  )+xvel0(j+1,k+1) &
       84                                 +xvel0(j+1,k  )+xvel0(j+1,k+1)))*0.25_8*dt*0.5
       85     bottom_flux=(yarea(j  ,k  )*(yvel0(j  ,k  )+yvel0(j+1,k  ) &
       86                                 +yvel0(j  ,k  )+yvel0(j+1,k  )))*0.25_8*dt*0.5
       87     top_flux=   (yarea(j  ,k+1)*(yvel0(j  ,k+1)+yvel0(j+1,k+1) &
       88                                 +yvel0(j  ,k+1)+yvel0(j+1,k+1)))*0.25_8*dt*0.5
       89     total_flux=right_flux-left_flux+top_flux-bottom_flux
       91     volume_change(j,k)=volume(j,k)/(volume(j,k)+total_flux)
       93     min_cell_volume=MIN(volume(j,k)+right_flux-left_flux+top_flux-bottom_flux &
       94                        ,volume(j,k)+right_flux-left_flux &
       95                        ,volume(j,k)+top_flux-bottom_flux)
       97     recip_volume=1.0/volume(j,k)
       99     energy_change=(pressure(j,k)/density0(j,k)+viscosity(j,k)/density0(j,k))*...
      101     energy1(j,k)=energy0(j,k)-energy_change
      103     density1(j,k)=density0(j,k)*volume_change(j,k)
      105   ENDDO
      106 ENDDO
      107 !$ACC END KERNELS
    Compiler feedback:
      % pgfortran -fast -ta=tesla:managed -Minfo -c PdV_kernel.f90
      pdv_kernel:
          ...
          77, Loop is parallelizable
          79, Loop is parallelizable
              Accelerator kernel generated
              Generating Tesla code
              77, !$acc loop gang, vector(4) ! blockidx%y ! threadidx%y
              79, !$acc loop gang, vector(32)! blockidx%x ! threadidx%x
          ...
    http://uk-mac.github.io/CloverLeaf
  • 18. FORTRAN 2018 DO CONCURRENT
       77 DO CONCURRENT (k=y_min:y_max, j=x_min:x_max) &
       78    LOCAL (right_flux,left_flux,top_flux,bottom_flux,total_flux, &
                    min_cell_volume,energy_change,recip_volume)
       81     left_flux=  (xarea(j  ,k  )*(xvel0(j  ,k  )+xvel0(j  ,k+1) &
       82                                 +xvel0(j  ,k  )+xvel0(j  ,k+1)))*0.25_8*dt*0.5
       83     right_flux= (xarea(j+1,k  )*(xvel0(j+1,k  )+xvel0(j+1,k+1) &
       84                                 +xvel0(j+1,k  )+xvel0(j+1,k+1)))*0.25_8*dt*0.5
       85     bottom_flux=(yarea(j  ,k  )*(yvel0(j  ,k  )+yvel0(j+1,k  ) &
       86                                 +yvel0(j  ,k  )+yvel0(j+1,k  )))*0.25_8*dt*0.5
       87     top_flux=   (yarea(j  ,k+1)*(yvel0(j  ,k+1)+yvel0(j+1,k+1) &
       88                                 +yvel0(j  ,k+1)+yvel0(j+1,k+1)))*0.25_8*dt*0.5
       89     total_flux=right_flux-left_flux+top_flux-bottom_flux
       91     volume_change(j,k)=volume(j,k)/(volume(j,k)+total_flux)
       93     min_cell_volume=MIN(volume(j,k)+right_flux-left_flux+top_flux-bottom_flux &
       94                        ,volume(j,k)+right_flux-left_flux &
       95                        ,volume(j,k)+top_flux-bottom_flux)
       97     recip_volume=1.0/volume(j,k)
       99     energy_change=(pressure(j,k)/density0(j,k)+viscosity(j,k)/density0(j,k))*...
      101     energy1(j,k)=energy0(j,k)-energy_change
      103     density1(j,k)=density0(j,k)*volume_change(j,k)
      106 ENDDO
    Compiler feedback:
      % pgfortran -fast -ta=tesla:managed -Minfo -c PdV_kernel.f90
      pdv_kernel:
          ...
          77, Do concurrent is parallelizable
              Accelerator kernel generated
              Generating Tesla code
              77, !$acc loop gang, vector(4) ! blockidx%y ! threadidx%y
                  !$acc loop gang, vector(32)! blockidx%x ! threadidx%x
          ...
    https://github.com/UoB-HPC/CloverLeaf_doconcurrent
  • 19. FORTRAN 2018 DO CONCURRENT FOR V100
    CloverLeaf AWE hydrodynamics mini-app, bm32 data set
    [Bar chart: speedup vs. 2-socket Haswell node (bigger is better), y-axis 0 to 10. Systems: 2-socket Skylake, 2-socket Rome, 2-socket POWER9, Skylake + 1x V100. Series: Intel 2019.5 OpenMP, IBM XL 16.1 OpenMP, PGI Dev DO CONCURRENT, PGI 19.10 OpenACC. Bar labels (unordered): 1.8x, 1.9x, 1.9x, 2.0x, 2.0x, 2.0x, 8.3x, 8.5x]
    Systems: Skylake 2x20 core Xeon Gold server (perf-sky-6) one thread per core; Rome: Two 24 core AMD EPYC 7451 CPUs @ 2.9GHz w/ 256GB memory; POWER9 DD2.1 server (perf-wsn1) two threads per core
    Compilers: Intel 2019.5.281, PGI 19.10, IBM XL 16.1.1.3
    Benchmark: CloverLeaf v1.3 OpenACC, OpenMP and DoConcurrent versions downloaded from https://github.com/UoB-HPC the week of June 10, 2019
    https://github.com/UoB-HPC/CloverLeaf_doconcurrent
  • 20. THE FUTURE OF PARALLEL PROGRAMMING
    Standard Languages | Directives | Specialized Languages
    Maximize Performance with Specialized Languages & Intrinsics
    Drive Base Languages to Better Support Parallelism
    Augment Base Languages with Directives
    Fortran do concurrent:
      do concurrent (i = 1:n)
        y(i) = y(i) + a*x(i)
      enddo
    do concurrent inside an OpenACC data region:
      !$acc data copy(x,y)
      ...
      do concurrent (i = 1:n)
        y(i) = y(i) + a*x(i)
      enddo
      ...
      !$acc end data
    Kernel and launch, as written on the slide:
      attribute(global) subroutine saxpy(n, a, x, y) {
        int i = blockIdx%x*blockDim%x + threadIdx%x;
        if (i < n) y(i) += a*x(i)
      }
      program main
        real :: x(:), y(:)
        real,device :: d_x(:), d_y(:)
        d_x = x
        d_y = y
        call saxpy <<<(N+255)/256,256>>>(...)
        y = d_y
    C++ std::for_each_n:
      std::for_each_n(POL, idx(0), n, [&](Index_t i){ y[i] += a*x[i]; });
  • 22. THE FUTURE OF PARALLEL PROGRAMMING
    Standard Languages | Directives | Specialized Languages
    Maximize Performance with Specialized Languages & Intrinsics
    Drive Base Languages to Better Support Parallelism
    Augment Base Languages with Directives
    Fortran do concurrent:
      do concurrent (i = 1:n)
        y(i) = y(i) + a*x(i)
      enddo
    do concurrent with an OpenACC parallel loop directive, inside a data region:
      !$acc data copy(x,y)
      ...
      !$acc parallel loop async
      do concurrent (i = 1:n)
        y(i) = y(i) + a*x(i)
      enddo
      ...
      !$acc end data
    Kernel and launch, as written on the slide:
      attribute(global) subroutine saxpy(n, a, x, y) {
        int i = blockIdx%x*blockDim%x + threadIdx%x;
        if (i < n) y(i) += a*x(i)
      }
      program main
        real :: x(:), y(:)
        real,device :: d_x(:), d_y(:)
        d_x = x
        d_y = y
        call saxpy <<<(N+255)/256,256>>>(...)
        y = d_y
    C++ std::for_each_n:
      std::for_each_n(POL, idx(0), n, [&](Index_t i){ y[i] += a*x[i]; });
  • 23. FUTURE DIRECTIONS FOR OPENACC AND FORTRAN
    DO CONCURRENT: Is this ready for widespread use? Is it enough?
    Fortran reductions: Critical for many OpenACC applications. Significant restructuring otherwise.
    Co-Arrays: How do they fit into this picture?
    Are there other gaps that need filling?
    What are our common challenges?
    How can the OpenACC and Fortran communities work together more closely?
    feedback@openacc.org
  • 25. A BRIEF HISTORY OF OPENACC
    2011: Incorporation. ORNL asks CAPS, Cray, & PGI to unify efforts with the help of NVIDIA
    Nov. 2011: OpenACC 1.0. Basic parallelism, structured data, and async/wait semantics
    June 2013: OpenACC 2.0. Unstructured Data Lifetimes, Routines, Atomic, Clarifications & Improvements
    Oct. 2015: OpenACC 2.5. Reference Counting, Profiling Interface, Additional Improvements from User Feedback
    Nov 2016: OpenACC 2.6. Serial Construct, Attach/Detach (Manual Deep Copy), Misc. User Feedback
    Nov. 2018: OpenACC 2.7. Compute on Self, readonly, Array Reductions, Lots of Clarifications, Misc. User Feedback
    Nov. 2019: OpenACC 3.0. Updated Base Languages, C++ Lambdas, Zero modifier, Improved multi-device support
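
Illustrative sketch (not from the slides): slides 9 and 23 above note that DO CONCURRENT can express an independent loop in the base language but has no loop-level reduction, which is where an OpenACC directive still helps. The small program below is one way to picture that split, assuming a compiler that accepts Fortran 2008 DO CONCURRENT. The program name, array sizes, and values are hypothetical. A compiler without OpenACC enabled treats the !$acc line as a comment, so the program should give the same result either way.

```fortran
! Hypothetical stand-alone demo; all names and sizes are illustrative only.
program saxpy_do_concurrent
  implicit none
  integer, parameter :: n = 1000
  real :: x(n), y(n), a, total
  integer :: i

  a = 2.0
  x = 1.0
  y = 3.0

  ! Parallelism expressed in the base language: the iterations are
  ! independent and may be executed in any order.
  do concurrent (i = 1:n)
    y(i) = y(i) + a*x(i)
  end do

  ! DO CONCURRENT has no reduction here, so the sum is written as an
  ! ordinary loop and the reduction is declared with an OpenACC directive.
  ! Without OpenACC enabled this is a plain serial loop.
  total = 0.0
  !$acc parallel loop reduction(+:total) copyin(y)
  do i = 1, n
    total = total + y(i)
  end do

  ! Every y(i) is 3.0 + 2.0*1.0 = 5.0, so the sum is 5000.0.
  print *, 'y(1) =', y(1), ' sum =', total
end program saxpy_do_concurrent
```

The data clause is kept on the compute construct rather than in an enclosing data region so that the sketch stays correct whether or not the DO CONCURRENT loop itself is offloaded.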