Automated CUDA-to-OpenCL
Translation with CU2CL:
What’s Next?

Wu Feng and Mark Gardner
Virginia Tech
2013-11-12
synergy.cs.vt.edu
AMD Developer Summit, 2013-11-12

Why OpenCL?

Source code lasts longer than platforms

http://www2.pcmag.com/media/images/375584-nvidia-geforce-gtx-titan.jpg?thumb=y
http://www.amd.com/PublishingImages/Public/Photograph_ProductShots/375WPNG/61979.png
http://www.hardwarezone.com.sg/files/img/2012/06/Xeon_Phi_PCIe_Card_M.jpg
http://www.thinkcomputers.org/articles/ces11_amd/main.jpg
http://www.bjorn3d.com/Material/revimages/cpu/Core_I7_965/New_Core_I7.jpg

The Goal
To take advantage of OpenCL's portability...

http://people.emich.edu/akavetsk/424/scribeatdesk_1.jpg

Without sacrificing man-years of existing code
CUDA and OpenCL APIs
CUDA Module → OpenCL Module
Thread → Contexts & Command Queues
Device → Platforms & Devices
Stream → Command Queues
Event → Events
Memory → Memory Objects

CUDA and OpenCL Data
CUDA → OpenCL
Vector types (e.g. float4) → Host: cl_float4; Kernel: float4
dim3 → size_t[3]
cudaStream_t → cl_command_queue
cudaEvent_t → cl_event
Device pointers (e.g. float* created through cudaMalloc) → cl_mem created through clCreateBuffer
cudaChannelFormat → cl_image_format
textureReference → cl_mem created through clCreateImage
cudaDeviceProp → No direct equivalent

CUDA and OpenCL
Execution and Memory Models

The Problem
Manual Translation (weeks, months): CUDA Source Code → OpenCL Source Code
Automatic Translation (seconds): CUDA Source Code → CU2CL → OpenCL Source Code
xkcd.com

Forecast

http://www.weather.com/weather/5-day/San+Jose+CA+USCA0993:1:US

• Observations about Translating
• Examples: CUDA and OpenCL constructs
• CU2CL Architecture
• Current State of CU2CL: Robustness and Performance
• Future Directions

Translation Is Easy ...
…when there is NO ambiguity in the translation between
languages (i.e., there is a direct mapping)
• High-level language → low-level representation, e.g., C → LLVM
x*y+z →
%tmp = mul i32 %x, %y
%tmp2 = add i32 %tmp, %z
• Between languages, e.g., CUDA → OpenCL
__powf(x[threadIdx.x], y[threadIdx.y]) →
native_powr(x[get_local_id(0)], y[get_local_id(1)])

Translation is more difficult
…when there IS ambiguity (or lack of a direct
mapping) in the translation between languages
• Idiomatic Expressions
– “Putting all your eggs in one basket” → ?? in Spanish
– CUDA __threadfence() → OpenCL ??

• Dialects
– Latin American Spanish vs. Castilian Spanish → English
– CUDA Runtime API vs. CUDA Driver API → OpenCL

CUDA and OpenCL

http://www.dragon1.com/images/examples.jpg

CUDA Initialization Code

None
(Implicit)

Dialect: CUDA runtime API
OpenCL Initialization Code
Explicit
//get a platform and device, set up a context and command queue
clGetPlatformIDs(1, &__cu2cl_Platform, NULL);
clGetDeviceIDs(__cu2cl_Platform, CL_DEVICE_TYPE_GPU, 1, &__cu2cl_Device, NULL);
__cu2cl_Context = clCreateContext(NULL, 1, &__cu2cl_Device, NULL, NULL, NULL);
__cu2cl_CommandQueue = clCreateCommandQueue(__cu2cl_Context, __cu2cl_Device,
    CL_QUEUE_PROFILING_ENABLE, NULL);
//read kernel source from disk
FILE *f = fopen("matrixMul_kernel.cu-cl.cl", "r");
fseek(f, 0, SEEK_END);
size_t progLen = (size_t) ftell(f);
const char * progSrc = (const char *) malloc(sizeof(char)*progLen);
rewind(f);
fread((void *) progSrc, progLen, 1, f);
fclose(f);
//build device program and kernel
__cu2cl_Program_matrixMul_kernel_cu = clCreateProgramWithSource(__cu2cl_Context, 1, &progSrc,
    &progLen, NULL);
free((void *) progSrc);
clBuildProgram(__cu2cl_Program_matrixMul_kernel_cu, 1, &__cu2cl_Device, "-I .", NULL, NULL);
__cu2cl_Kernel_matrixMul = clCreateKernel(__cu2cl_Program_matrixMul_kernel_cu, "matrixMul", NULL);
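
The generated initialization code above checks no return values. As a minimal sketch of how the "read kernel source from disk" step could be hardened, the helper below reads a whole file into a NUL-terminated buffer and reports failure instead of continuing. The name read_source is hypothetical and not part of CU2CL; the sketch uses only the C standard library.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of CU2CL): read a whole file into a
 * NUL-terminated heap buffer. Returns NULL on any failure; on success the
 * caller owns the buffer and *len_out holds the byte count. */
char *read_source(const char *path, size_t *len_out) {
    FILE *f = fopen(path, "rb");
    if (f == NULL) return NULL;

    /* Find the file size, then seek back to the start. */
    long size = -1;
    if (fseek(f, 0, SEEK_END) == 0) size = ftell(f);
    if (size < 0 || fseek(f, 0, SEEK_SET) != 0) { fclose(f); return NULL; }

    char *buf = malloc((size_t)size + 1);   /* +1 for the terminator */
    if (buf == NULL) { fclose(f); return NULL; }

    size_t got = fread(buf, 1, (size_t)size, f);
    fclose(f);
    if (got != (size_t)size) { free(buf); return NULL; }

    buf[got] = '\0';
    if (len_out != NULL) *len_out = got;
    return buf;
}
```

The returned pointer and length could then be passed to clCreateProgramWithSource in place of progSrc and progLen, with the buffer freed afterwards as on the slide.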

CUDA Kernel Invocation
// setup execution parameters
dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
dim3 grid(uiWC / threads.x, uiHC / threads.y);
// execute the kernel
int nIter = 30;
for (int j = 0; j < nIter; j++)
{
    matrixMul<<< grid, threads >>>(d_C, d_A, d_B, uiWA, uiWB);
}

OpenCL Kernel Invocation
// setup execution parameters
size_t threads[3] = {BLOCK_SIZE, BLOCK_SIZE, 1};
size_t grid[3] = {uiWC / threads[0], uiHC / threads[1], 1};
// execute the kernel
int nIter = 30;
for (int j = 0; j < nIter; j++)
{
    clSetKernelArg(__cu2cl_Kernel_matrixMul, 0, sizeof(cl_mem), &d_C);
    clSetKernelArg(__cu2cl_Kernel_matrixMul, 1, sizeof(cl_mem), &d_A);
    clSetKernelArg(__cu2cl_Kernel_matrixMul, 2, sizeof(cl_mem), &d_B);
    clSetKernelArg(__cu2cl_Kernel_matrixMul, 3, sizeof(int), &uiWA);
    clSetKernelArg(__cu2cl_Kernel_matrixMul, 4, sizeof(int), &uiWB);
    localWorkSize[0] = threads[0];
    localWorkSize[1] = threads[1];
    localWorkSize[2] = threads[2];
    globalWorkSize[0] = grid[0]*localWorkSize[0];
    globalWorkSize[1] = grid[1]*localWorkSize[1];
    globalWorkSize[2] = grid[2]*localWorkSize[2];
    clEnqueueNDRangeKernel(__cu2cl_CommandQueue, __cu2cl_Kernel_matrixMul, 3, NULL,
                           globalWorkSize, localWorkSize, 0, NULL, NULL);
}
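
The work-size arithmetic in the translated launch can be isolated as a small function. The sketch below assumes a hypothetical helper, to_ndrange, that is not emitted by CU2CL. It shows the one non-obvious step: CUDA's grid counts blocks, while OpenCL's global work size counts individual work-items, so each dimension is multiplied by the block size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: convert a CUDA-style launch configuration
 * (blocks per grid, threads per block) into OpenCL work sizes.
 * OpenCL's global size counts work-items, not work-groups. */
void to_ndrange(const size_t grid[3], const size_t block[3],
                size_t global[3], size_t local[3]) {
    for (int d = 0; d < 3; d++) {
        local[d] = block[d];             /* threads per block  -> local size  */
        global[d] = grid[d] * block[d];  /* blocks * threads   -> global size */
    }
}
```

For a grid of {4, 2, 1} blocks with {16, 16, 1} threads each, this yields a local size of {16, 16, 1} and a global size of {64, 32, 1}.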

Kernel Code for Vector Add
CUDA
// Device code
__global__ void VecAdd(const float* A, const float* B, float* C, int N) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}

OpenCL
// Device code
__kernel void VecAdd(const __global float* A, const __global float* B,
                     __global float* C, int N) {
    int i = get_local_size(0) * get_group_id(0) + get_local_id(0);
    if (i < N)
        C[i] = A[i] + B[i];
}
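
To see why the translated index expression visits the same elements as the CUDA one, the kernel body can be emulated on the host. This is an illustration only, not OpenCL API code: vec_add_emulated is a hypothetical function that loops over work-groups and work-items and computes the index exactly as the kernel does.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side emulation (an illustration, not OpenCL API code): run the
 * VecAdd body once per work-item, computing the index the same way the
 * kernel does: local_size * group_id + local_id. */
void vec_add_emulated(const float *A, const float *B, float *C, int N,
                      size_t local_size, size_t num_groups) {
    for (size_t group_id = 0; group_id < num_groups; group_id++) {
        for (size_t local_id = 0; local_id < local_size; local_id++) {
            int i = (int)(local_size * group_id + local_id);
            if (i < N)          /* guard: the last group may overshoot N */
                C[i] = A[i] + B[i];
        }
    }
}
```

With N = 10, a local size of 4, and 3 groups, 12 work-items run but only indices 0 through 9 are written, which is the purpose of the i < N guard in both kernels.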

CU2CL Architecture

http://dotsconnectedkat.files.wordpress.com/2011/02/agrigento.jpg

Compilation Process
Source Code → Preprocessor → Preprocessed Code → Lexer → Tokenized Code → Parser → Parse Tree → Semantic Analyzer → Intermediate Representation → Code Generator → Binary
Clang: Preprocessor, Lexer, Parser, Semantic Analyzer
LLVM: Code Generator

Martinez, Gardner, and Feng, “CU2CL: A CUDA-to-OpenCL Translator for Multi- and Many-Core Architectures,” IEEE ICPADS 2011

AST-driven, String-based Rewriting
CUDA
__powf(x[threadIdx.x], y[threadIdx.y])

AST nodes: Func (__powf); Arg, Arg (x[ ], y[ ]); Struct, Struct (threadIdx); Field, Field (x, y)
Rewrites: Field x → 0, Field y → 1; Struct threadIdx → get_local_id( ); Func __powf → native_powr
Write Out

OpenCL
native_powr(x[get_local_id(0)], y[get_local_id(1)])

Advantage: formatting remains intact → maintainable
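
A much-simplified sketch of the string-based half of this idea follows. It is not how CU2CL locates rewrite sites: CU2CL uses Clang's AST to find them, whereas this hypothetical rewrite function uses plain substring matching against a small illustrative table. What it does show is the stated advantage: only the matched spans change, and all other characters, including spacing, are copied through untouched.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative CUDA-to-OpenCL built-in table; only these four entries. */
struct rule { const char *cuda; const char *opencl; };
static const struct rule RULES[] = {
    {"threadIdx.x", "get_local_id(0)"},
    {"threadIdx.y", "get_local_id(1)"},
    {"blockIdx.x",  "get_group_id(0)"},
    {"blockDim.x",  "get_local_size(0)"},
};

/* Copy src to dst, replacing each table match and leaving every other
 * character (spacing, layout) untouched.
 * Returns 0 on success, -1 if dst is too small. */
int rewrite(const char *src, char *dst, size_t cap) {
    size_t out = 0;
    while (*src != '\0') {
        const char *repl = NULL;
        size_t skip = 1;
        for (size_t r = 0; r < sizeof RULES / sizeof RULES[0]; r++) {
            size_t n = strlen(RULES[r].cuda);
            if (strncmp(src, RULES[r].cuda, n) == 0) {
                repl = RULES[r].opencl;
                skip = n;
                break;
            }
        }
        size_t add = repl ? strlen(repl) : 1;
        if (out + add >= cap) return -1;   /* keep room for the terminator */
        if (repl) memcpy(dst + out, repl, add);
        else dst[out] = *src;
        out += add;
        src += skip;
    }
    if (cap == 0) return -1;
    dst[out] = '\0';
    return 0;
}
```

Substring matching would also rewrite text inside comments or string literals, which is one reason an AST-driven approach is preferable in a real translator.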
Complex Semantic Conversions
1. Literal Parameters to Kernels
– CUDA pass-by-value invocations vs. OpenCL pass-by-reference
CUDA Kernel Launch
kernel <<<grid, block >>>(foo1, foo2 * 2.0f, 256);

Naive OpenCL Translation
clSetKernelArg(__cu2cl_Kernel_kernel , 0 , sizeof(float), &foo1);
clSetKernelArg(__cu2cl_Kernel_kernel , 1 , sizeof(float), &foo2 * 2.0f);
clSetKernelArg(__cu2cl_Kernel_kernel , 2 , sizeof(int), &256);

Correct OpenCL Translation
clSetKernelArg(__cu2cl_Kernel_kernel , 0 , sizeof(float), &foo1);
float __cu2cl_Kernel_kernel_arg_1 = foo2 * 2.0f;
clSetKernelArg(__cu2cl_Kernel_kernel , 1 , sizeof(float),
&__cu2cl_Kernel_kernel_arg_1);
int __cu2cl_Kernel_kernel_arg_2 = 256;
clSetKernelArg(__cu2cl_Kernel_kernel , 2 , sizeof(int),
&__cu2cl_Kernel_kernel_arg_2);
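
The reason the naive translation fails is that an argument-setting call of this shape copies bytes from an address, and neither foo2 * 2.0f nor 256 has one. The sketch below demonstrates this with a mock: mock_set_kernel_arg and set_args are hypothetical stand-ins written for illustration, not the OpenCL API, although the mock copies arg_size bytes from the pointer as clSetKernelArg does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Mock stand-in for clSetKernelArg (an assumption for illustration):
 * like the real call, it copies arg_size bytes from the address it is
 * given, so every argument must live in addressable storage. */
#define MAX_ARGS 4
#define MAX_ARG_BYTES 16
static unsigned char arg_store[MAX_ARGS][MAX_ARG_BYTES];

int mock_set_kernel_arg(unsigned index, size_t arg_size, const void *arg_value) {
    if (index >= MAX_ARGS || arg_size > MAX_ARG_BYTES || arg_value == NULL)
        return -1;
    memcpy(arg_store[index], arg_value, arg_size);
    return 0;
}

/* Mirrors kernel<<<grid, block>>>(foo1, foo2 * 2.0f, 256): the expression
 * and the literal are first materialized in temporaries. */
int set_args(float foo1, float foo2) {
    float arg_1 = foo2 * 2.0f;  /* temporary for the expression */
    int arg_2 = 256;            /* temporary for the literal */
    if (mock_set_kernel_arg(0, sizeof(float), &foo1) != 0) return -1;
    if (mock_set_kernel_arg(1, sizeof(float), &arg_1) != 0) return -1;
    if (mock_set_kernel_arg(2, sizeof(int), &arg_2) != 0) return -1;
    return 0;
}
```

Because the mock copies the bytes immediately, the temporaries only need to stay alive until the call returns.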
Complex Semantic Conversions
2. Device Identification
– CUDA uses int, OpenCL uses opaque cl_device_id
– To change devices in CUDA, use cudaSetDevice(int id)
– To change devices in OpenCL, use...
//scan all devices
//save old platform, device, context, queue, program, & kernels
myDevice = allDevices[id];
clGetDeviceInfo(...);
//get new device's platform
myContext = clCreateContext(...);
myQueue = clCreateCommandQueue(...);
//load program source
clBuildProgram(...);
myKernel = clCreateKernel(...);

– Implement our own handler to emulate and encapsulate
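
One way such a handler might be shaped is sketched below. Everything here is an assumption made for illustration: the struct, the integer handles, and the choice to build per-device state lazily and cache it are not taken from CU2CL, and the OpenCL calls are replaced by a placeholder so the sketch runs without an OpenCL installation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the per-device OpenCL state; a real handler
 * would hold cl_device_id, cl_context, cl_command_queue, and so on. */
struct device_state {
    int handle;       /* stands in for an opaque device handle */
    int initialized;  /* nonzero once context/queue/program exist */
};

#define NUM_DEVICES 3
static struct device_state all_devices[NUM_DEVICES] = {
    {1001, 0}, {1002, 0}, {1003, 0}
};
static struct device_state *current = NULL;
static int init_count = 0;  /* counts the expensive setups performed */

/* Emulates cudaSetDevice(int id): look up the opaque state by integer id,
 * build it on first use, and make it current. Returns 0 on success. */
int set_device(int id) {
    if (id < 0 || id >= NUM_DEVICES) return -1;
    struct device_state *d = &all_devices[id];
    if (!d->initialized) {
        /* placeholder for: create context, queue, program, kernels */
        d->initialized = 1;
        init_count++;
    }
    current = d;
    return 0;
}

/* Handle of the current device, or -1 if none has been selected. */
int current_handle(void) { return current ? current->handle : -1; }
```

Caching means that switching back to a previously used device costs a table lookup rather than a second program build; an invalid id leaves the current device unchanged.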

CU2CL Evaluation

Image: http://learn.cvuhs.org/file.php/1427/scales_of_justice2.jpg

Test Code
• 79 CUDA SDK Samples
• 17 Rodinia Samples
• Applications
– GEM – Molecular Modeling
– IZ PS – Neural Network
– Fen Zi – Molecular Dynamics

• 100k+ SLOC in total

Translator Coverage
Application | CUDA Lines | OpenCL Lines Changed | Percent Automatically Translated
SDK Samples
asyncAPI | 135 | 5 | 96.3
bandwidthTest | 891 | 5 | 98.9
BlackScholes | 347 | 14 | 96.0
FastWalshTransform | 327 | 30 | 90.8
matrixMul | 351 | 9 | 97.4
scalarProd | 251 | 18 | 92.8
vectorAdd | 147 | 0 | 100
Rodinia
Back Propagation | 313 | 24 | 92.3
Breadth-First Search | 306 | 35 | 88.6
Gaussian | 390 | 26 | 93.3
Hotspot | 328 | 2 | 99.4
Needleman-Wunsch | 430 | 3 | 99.3
Applications
Fen Zi | 17768 | 1786 | 89.9
GEM | 524 | 15 | 97.1
IZ PS | 8402 | 166 | 98.0

Translation Challenges
Profiled
Challenge | CUDA SDK Frequency (%) | Rodinia Frequency (%)
Device Identifiers | 54.4 | 29.4
Literal Parameters | 19.0 | 23.5
Separate Compilation | 54.4 | 29.4
CUDA Libraries | 10.1 | 0
Kernel Templates | 21.5 | 0
Texture Memory | 27.8 | 23.5
Graphics Interoperability | 24.1 | 0
Constant Memory | 17.7 | 29.4
Shared Memory | 46.8 | 70.6

Identified
Kernel Function Pointer Invocations
Preprocessor Effects
Warp-level Synchronization
Device Intrinsic Functions
Device Buffer cl_mem Type Propagation
#defined Function Definitions
Device Buffers as Struct Members
Arrays of Device Buffers
Implicitly-Defined Kernel Functions
Device-side Classes, Constructors, & Destructors
Struct Alignment Attributes
__threadfence()

Sathre, Gardner, Feng: “Lost in Translation: Challenges in Automating CUDA-to-OpenCL Translation”. ICPP Workshops 2012: 89-96
Gardner, Feng, Sathre, Martinez: “Characterizing the Challenges and Evaluating the Efficacy of a CUDA-to-OpenCL Translator”. ParCo Special Issue 2013, to appear

Translator Performance
[Figure: two scatter plots of translation time versus Source Lines, one for Total Translation Time (s) and one for CU2CL Translation Time (microseconds). Series: SDK Samples, Rodinia Samples, Large Applications; fitted lines: Linear (SDK Samples), Linear (Rodinia Samples).]

Experimental Setup: AMD Phenom II X6 1090T (six cores, 3.2 GHz), 16 GB RAM, NVIDIA GeForce GTX 480 (driver version 310.32, CUDA Runtime 5.0), 64-bit Ubuntu 12.04

Translated Application Performance
[Figure: bar chart of Time (s) for the CUDA and OpenCL versions; lower is better. SDK Samples: asyncAPI, bandwidthTest, BlackScholes, FastWalshTransform, matrixMul, scalarProd, vectorAdd. Rodinia Samples: backprop, BFS, Gaussian, Hotspot, Needleman-Wunsch. Also shown: GEM.]

Note: all runs on same NVIDIA GPU for fair comparison purposes

CU2CL Reliability
[Figure: stacked bars (0% to 100%) of translation outcomes (Failed, Partial, Complete) for CUDA SDK Samples and Rodinia Samples, Before Upgrades and After Upgrades.]

Clang 3.2
main() method handling
Template handling
OpenGL #defined function handling
Separately declared and defined function handling
Kernel pointer invocation handling

Increased reliability in translating samples after latest round of improvements

CU2CL Roadmap & Future Work
• CU2CL Alpha (2011)
– Well-designed scaffold

• CU2CL Beta (2013)
– Improved Robustness, CUDA Coverage, and Reliability
– Analysis and profiling of difficult-to-translate CUDA structures

• CU2CL w/ Functional Portability
– Expand CUDA coverage
• Shared, const, texture memory
• Driver API
• OpenGL
– Handling unmapped CUDA structs / behaviors
• Warp sync

• CU2CL w/ Performance Portability
– Automatic de-optimization
– Device-agnostic optimization
– Device-specific optimization

What about CUDA to HSA?

Related Work
Swan
– High-level abstraction API, links to either OpenCL or CUDA implementation

Ocelot & Caracal
– Translate NVIDIA PTX IR to other device IRs

CUDAtoOpenCL
– Source to source translator, based on Cetus

CU2CL Conclusions
• Status
– What used to take months by hand takes seconds
• 90+ successful translations
• Negligible difference in performance

• Challenges
– CUDA functionality missing in OpenCL
• __threadfence()

– Equivalent libraries needed in OpenCL
• cuFFT, MAGMA, cuBLAS

– Implicit semantics
• Implicit synchronization across warps

• What's Next?
– Improved functional portability
– Support for performance portability
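
As a rough illustration of the first challenge above, the sketch below maps a few CUDA kernel constructs to their usual OpenCL C counterparts and flags a construct the slides list as missing in OpenCL. It is a toy, string-based stand-in only: CU2CL is a Clang-based source-to-source translator, not a text-substitution tool, and the names `CUDA_TO_OPENCL`, `UNMAPPED`, and `translate_line` are hypothetical, introduced just for this example.

```python
# Commonly cited one-to-one correspondences between CUDA and OpenCL C.
# Hypothetical table for illustration; this is not CU2CL's actual rule set.
CUDA_TO_OPENCL = {
    "__global__": "__kernel",
    "__shared__": "__local",
    "__syncthreads()": "barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE)",
    "threadIdx.x": "get_local_id(0)",
    "blockIdx.x": "get_group_id(0)",
    "blockDim.x": "get_local_size(0)",
    "gridDim.x": "get_num_groups(0)",
}

# Constructs the slides list as having no OpenCL equivalent.
UNMAPPED = ("__threadfence()",)


def translate_line(line):
    """Return (translated_line, warnings) for one line of CUDA kernel code."""
    # Record unmapped constructs instead of silently passing them through.
    warnings = [name for name in UNMAPPED if name in line]
    # Plain substring replacement; longer keys are applied first.
    for cuda in sorted(CUDA_TO_OPENCL, key=len, reverse=True):
        line = line.replace(cuda, CUDA_TO_OPENCL[cuda])
    return line, warnings


if __name__ == "__main__":
    for src in ("int i = blockIdx.x * blockDim.x + threadIdx.x;",
                "__threadfence();"):
        print(translate_line(src))
```

The point of the warnings list is the one made on the slide: constructs with a direct counterpart can be rewritten mechanically, while something like `__threadfence()` has to be reported to the user rather than translated.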

Acknowledgements
Students: Gabriel Martinez, Paul Sathre
This work was supported in part by NSF I/UCRC IIP-0804155
via the NSF Center for High-Performance Reconfigurable
Computing (CHREC).

A DDvl eSm i
M ee pru mt
o
2 1 11
03 12
//

synergy.cs.vt.edu

More Related Content

What's hot

PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John MelonakosPT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John MelonakosAMD Developer Central
 
PL-4047, Big Data Workload Analysis Using SWAT and Ipython Notebooks, by Moni...
PL-4047, Big Data Workload Analysis Using SWAT and Ipython Notebooks, by Moni...PL-4047, Big Data Workload Analysis Using SWAT and Ipython Notebooks, by Moni...
PL-4047, Big Data Workload Analysis Using SWAT and Ipython Notebooks, by Moni...AMD Developer Central
 
Keynote (Mike Muller) - Is There Anything New in Heterogeneous Computing - by...
Keynote (Mike Muller) - Is There Anything New in Heterogeneous Computing - by...Keynote (Mike Muller) - Is There Anything New in Heterogeneous Computing - by...
Keynote (Mike Muller) - Is There Anything New in Heterogeneous Computing - by...AMD Developer Central
 
PL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor Miller
PL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor MillerPL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor Miller
PL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor MillerAMD Developer Central
 
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...AMD Developer Central
 
PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...
PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...
PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...AMD Developer Central
 
Introduction to OpenCL, 2010
Introduction to OpenCL, 2010Introduction to OpenCL, 2010
Introduction to OpenCL, 2010Tomasz Bednarz
 
GS-4108, Direct Compute in Gaming, by Bill Bilodeau
GS-4108, Direct Compute in Gaming, by Bill BilodeauGS-4108, Direct Compute in Gaming, by Bill Bilodeau
GS-4108, Direct Compute in Gaming, by Bill BilodeauAMD Developer Central
 
WT-4071, GPU accelerated 3D graphics for Java, by Kevin Rushforth, Chien Yang...
WT-4071, GPU accelerated 3D graphics for Java, by Kevin Rushforth, Chien Yang...WT-4071, GPU accelerated 3D graphics for Java, by Kevin Rushforth, Chien Yang...
WT-4071, GPU accelerated 3D graphics for Java, by Kevin Rushforth, Chien Yang...AMD Developer Central
 
OpenCL - The Open Standard for Heterogeneous Parallel Programming
OpenCL - The Open Standard for Heterogeneous Parallel ProgrammingOpenCL - The Open Standard for Heterogeneous Parallel Programming
OpenCL - The Open Standard for Heterogeneous Parallel ProgrammingAndreas Schreiber
 
HSA-4123, HSA Memory Model, by Ben Gaster
HSA-4123, HSA Memory Model, by Ben GasterHSA-4123, HSA Memory Model, by Ben Gaster
HSA-4123, HSA Memory Model, by Ben GasterAMD Developer Central
 
Leverage the Speed of OpenCL™ with AMD Math Libraries
Leverage the Speed of OpenCL™ with AMD Math LibrariesLeverage the Speed of OpenCL™ with AMD Math Libraries
Leverage the Speed of OpenCL™ with AMD Math LibrariesAMD Developer Central
 
Webinar: Whats New in Java 8 with Develop Intelligence
Webinar: Whats New in Java 8 with Develop IntelligenceWebinar: Whats New in Java 8 with Develop Intelligence
Webinar: Whats New in Java 8 with Develop IntelligenceAMD Developer Central
 
Easy and High Performance GPU Programming for Java Programmers
Easy and High Performance GPU Programming for Java ProgrammersEasy and High Performance GPU Programming for Java Programmers
Easy and High Performance GPU Programming for Java ProgrammersKazuaki Ishizaki
 
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware WebinarAn Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware WebinarAMD Developer Central
 
Using GPUs to handle Big Data with Java by Adam Roberts.
Using GPUs to handle Big Data with Java by Adam Roberts.Using GPUs to handle Big Data with Java by Adam Roberts.
Using GPUs to handle Big Data with Java by Adam Roberts.J On The Beach
 
Using Docker for GPU Accelerated Applications
Using Docker for GPU Accelerated ApplicationsUsing Docker for GPU Accelerated Applications
Using Docker for GPU Accelerated ApplicationsNVIDIA
 

What's hot (20)

Hands on OpenCL
Hands on OpenCLHands on OpenCL
Hands on OpenCL
 
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John MelonakosPT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
 
PL-4047, Big Data Workload Analysis Using SWAT and Ipython Notebooks, by Moni...
PL-4047, Big Data Workload Analysis Using SWAT and Ipython Notebooks, by Moni...PL-4047, Big Data Workload Analysis Using SWAT and Ipython Notebooks, by Moni...
PL-4047, Big Data Workload Analysis Using SWAT and Ipython Notebooks, by Moni...
 
Introduction to OpenCL
Introduction to OpenCLIntroduction to OpenCL
Introduction to OpenCL
 
Keynote (Mike Muller) - Is There Anything New in Heterogeneous Computing - by...
Keynote (Mike Muller) - Is There Anything New in Heterogeneous Computing - by...Keynote (Mike Muller) - Is There Anything New in Heterogeneous Computing - by...
Keynote (Mike Muller) - Is There Anything New in Heterogeneous Computing - by...
 
PL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor Miller
PL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor MillerPL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor Miller
PL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor Miller
 
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
 
PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...
PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...
PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...
 
Introduction to OpenCL, 2010
Introduction to OpenCL, 2010Introduction to OpenCL, 2010
Introduction to OpenCL, 2010
 
GS-4108, Direct Compute in Gaming, by Bill Bilodeau
GS-4108, Direct Compute in Gaming, by Bill BilodeauGS-4108, Direct Compute in Gaming, by Bill Bilodeau
GS-4108, Direct Compute in Gaming, by Bill Bilodeau
 
WT-4071, GPU accelerated 3D graphics for Java, by Kevin Rushforth, Chien Yang...
WT-4071, GPU accelerated 3D graphics for Java, by Kevin Rushforth, Chien Yang...WT-4071, GPU accelerated 3D graphics for Java, by Kevin Rushforth, Chien Yang...
WT-4071, GPU accelerated 3D graphics for Java, by Kevin Rushforth, Chien Yang...
 
OpenCL - The Open Standard for Heterogeneous Parallel Programming
OpenCL - The Open Standard for Heterogeneous Parallel ProgrammingOpenCL - The Open Standard for Heterogeneous Parallel Programming
OpenCL - The Open Standard for Heterogeneous Parallel Programming
 
HSA-4123, HSA Memory Model, by Ben Gaster
HSA-4123, HSA Memory Model, by Ben GasterHSA-4123, HSA Memory Model, by Ben Gaster
HSA-4123, HSA Memory Model, by Ben Gaster
 
Leverage the Speed of OpenCL™ with AMD Math Libraries
Leverage the Speed of OpenCL™ with AMD Math LibrariesLeverage the Speed of OpenCL™ with AMD Math Libraries
Leverage the Speed of OpenCL™ with AMD Math Libraries
 
Webinar: Whats New in Java 8 with Develop Intelligence
Webinar: Whats New in Java 8 with Develop IntelligenceWebinar: Whats New in Java 8 with Develop Intelligence
Webinar: Whats New in Java 8 with Develop Intelligence
 
Easy and High Performance GPU Programming for Java Programmers
Easy and High Performance GPU Programming for Java ProgrammersEasy and High Performance GPU Programming for Java Programmers
Easy and High Performance GPU Programming for Java Programmers
 
Gcn performance ftw by stephan hodes
Gcn performance ftw by stephan hodesGcn performance ftw by stephan hodes
Gcn performance ftw by stephan hodes
 
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware WebinarAn Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar
 
Using GPUs to handle Big Data with Java by Adam Roberts.
Using GPUs to handle Big Data with Java by Adam Roberts.Using GPUs to handle Big Data with Java by Adam Roberts.
Using GPUs to handle Big Data with Java by Adam Roberts.
 
Using Docker for GPU Accelerated Applications
Using Docker for GPU Accelerated ApplicationsUsing Docker for GPU Accelerated Applications
Using Docker for GPU Accelerated Applications
 

Viewers also liked

Newbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universeNewbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universeOfer Rosenberg
 
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...npinto
 
CSTalks - GPGPU - 19 Jan
CSTalks  -  GPGPU - 19 JanCSTalks  -  GPGPU - 19 Jan
CSTalks - GPGPU - 19 Jancstalks
 
General Programming on the GPU - Confoo
General Programming on the GPU - ConfooGeneral Programming on the GPU - Confoo
General Programming on the GPU - ConfooSirKetchup
 
Advances in the Solution of Navier-Stokes Eqs. in GPGPU Hardware. Modelling F...
Advances in the Solution of Navier-Stokes Eqs. in GPGPU Hardware. Modelling F...Advances in the Solution of Navier-Stokes Eqs. in GPGPU Hardware. Modelling F...
Advances in the Solution of Navier-Stokes Eqs. in GPGPU Hardware. Modelling F...Storti Mario
 
LCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience ReportLCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience ReportLinaro
 
Open CL For Haifa Linux Club
Open CL For Haifa Linux ClubOpen CL For Haifa Linux Club
Open CL For Haifa Linux ClubOfer Rosenberg
 
GPU Technology Conference 2014 Keynote
GPU Technology Conference 2014 KeynoteGPU Technology Conference 2014 Keynote
GPU Technology Conference 2014 KeynoteNVIDIA
 
Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)Rob Gillen
 
E-Learning: Introduction to GPGPU
E-Learning: Introduction to GPGPUE-Learning: Introduction to GPGPU
E-Learning: Introduction to GPGPUNur Ahmadi
 
Nvidia cuda tutorial_no_nda_apr08
Nvidia cuda tutorial_no_nda_apr08Nvidia cuda tutorial_no_nda_apr08
Nvidia cuda tutorial_no_nda_apr08Angela Mendoza M.
 
Introduction to gpu architecture
Introduction to gpu architectureIntroduction to gpu architecture
Introduction to gpu architectureCHIHTE LU
 
GPUDirect RDMA and Green Multi-GPU Architectures
GPUDirect RDMA and Green Multi-GPU ArchitecturesGPUDirect RDMA and Green Multi-GPU Architectures
GPUDirect RDMA and Green Multi-GPU Architecturesinside-BigData.com
 
CS 354 GPU Architecture
CS 354 GPU ArchitectureCS 354 GPU Architecture
CS 354 GPU ArchitectureMark Kilgard
 
GPU, CUDA, OpenCL and OpenACC for Parallel Applications
GPU, CUDA, OpenCL and OpenACC for Parallel ApplicationsGPU, CUDA, OpenCL and OpenACC for Parallel Applications
GPU, CUDA, OpenCL and OpenACC for Parallel ApplicationsMarcos Gonzalez
 

Viewers also liked (20)

Newbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universeNewbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universe
 
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
 
CSTalks - GPGPU - 19 Jan
CSTalks  -  GPGPU - 19 JanCSTalks  -  GPGPU - 19 Jan
CSTalks - GPGPU - 19 Jan
 
Gpgpu intro
Gpgpu introGpgpu intro
Gpgpu intro
 
General Programming on the GPU - Confoo
General Programming on the GPU - ConfooGeneral Programming on the GPU - Confoo
General Programming on the GPU - Confoo
 
Cliff sugerman
Cliff sugermanCliff sugerman
Cliff sugerman
 
Advances in the Solution of Navier-Stokes Eqs. in GPGPU Hardware. Modelling F...
Advances in the Solution of Navier-Stokes Eqs. in GPGPU Hardware. Modelling F...Advances in the Solution of Navier-Stokes Eqs. in GPGPU Hardware. Modelling F...
Advances in the Solution of Navier-Stokes Eqs. in GPGPU Hardware. Modelling F...
 
LCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience ReportLCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience Report
 
Gpgpu
GpgpuGpgpu
Gpgpu
 
Open CL For Haifa Linux Club
Open CL For Haifa Linux ClubOpen CL For Haifa Linux Club
Open CL For Haifa Linux Club
 
GPU Technology Conference 2014 Keynote
GPU Technology Conference 2014 KeynoteGPU Technology Conference 2014 Keynote
GPU Technology Conference 2014 Keynote
 
Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)
 
E-Learning: Introduction to GPGPU
E-Learning: Introduction to GPGPUE-Learning: Introduction to GPGPU
E-Learning: Introduction to GPGPU
 
Nvidia cuda tutorial_no_nda_apr08
Nvidia cuda tutorial_no_nda_apr08Nvidia cuda tutorial_no_nda_apr08
Nvidia cuda tutorial_no_nda_apr08
 
Introduction to gpu architecture
Introduction to gpu architectureIntroduction to gpu architecture
Introduction to gpu architecture
 
GPUDirect RDMA and Green Multi-GPU Architectures
GPUDirect RDMA and Green Multi-GPU ArchitecturesGPUDirect RDMA and Green Multi-GPU Architectures
GPUDirect RDMA and Green Multi-GPU Architectures
 
GPU Programming with Java
GPU Programming with JavaGPU Programming with Java
GPU Programming with Java
 
CS 354 GPU Architecture
CS 354 GPU ArchitectureCS 354 GPU Architecture
CS 354 GPU Architecture
 
Lec04 gpu architecture
Lec04 gpu architectureLec04 gpu architecture
Lec04 gpu architecture
 
GPU, CUDA, OpenCL and OpenACC for Parallel Applications
GPU, CUDA, OpenCL and OpenACC for Parallel ApplicationsGPU, CUDA, OpenCL and OpenACC for Parallel Applications
GPU, CUDA, OpenCL and OpenACC for Parallel Applications
 

Similar to PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by Wu Feng and Mark Gardner

Putting a Fork in Fork (Linux Process and Memory Management)
Putting a Fork in Fork (Linux Process and Memory Management)Putting a Fork in Fork (Linux Process and Memory Management)
Putting a Fork in Fork (Linux Process and Memory Management)David Evans
 
Samsung WebCL Prototype API
Samsung WebCL Prototype APISamsung WebCL Prototype API
Samsung WebCL Prototype APIRyo Jin
 
CUDA Tutorial 01 : Say Hello to CUDA : Notes
CUDA Tutorial 01 : Say Hello to CUDA : NotesCUDA Tutorial 01 : Say Hello to CUDA : Notes
CUDA Tutorial 01 : Say Hello to CUDA : NotesSubhajit Sahu
 
Gdc09 Minigames
Gdc09 MinigamesGdc09 Minigames
Gdc09 MinigamesSusan Gold
 
MattsonTutorialSC14.pptx
MattsonTutorialSC14.pptxMattsonTutorialSC14.pptx
MattsonTutorialSC14.pptxgopikahari7
 
Delivering Go.CD with Terraform and Docker
Delivering Go.CD with Terraform and DockerDelivering Go.CD with Terraform and Docker
Delivering Go.CD with Terraform and DockerJorrit Salverda
 
KDD 2016 Streaming Analytics Tutorial
KDD 2016 Streaming Analytics TutorialKDD 2016 Streaming Analytics Tutorial
KDD 2016 Streaming Analytics TutorialNeera Agarwal
 
Machine Learning for Big Data Analytics: Scaling In with Containers while Sc...
Machine Learning for Big Data Analytics:  Scaling In with Containers while Sc...Machine Learning for Big Data Analytics:  Scaling In with Containers while Sc...
Machine Learning for Big Data Analytics: Scaling In with Containers while Sc...Ian Lumb
 
CONFidence 2015: DTrace + OSX = Fun - Andrzej Dyjak
CONFidence 2015: DTrace + OSX = Fun - Andrzej Dyjak   CONFidence 2015: DTrace + OSX = Fun - Andrzej Dyjak
CONFidence 2015: DTrace + OSX = Fun - Andrzej Dyjak PROIDEA
 
Disksim with SSD_extension
Disksim with SSD_extensionDisksim with SSD_extension
Disksim with SSD_extensioncucufrog
 
DevSecCon London 2017 - MacOS security, hardening and forensics 101 by Ben Hu...
DevSecCon London 2017 - MacOS security, hardening and forensics 101 by Ben Hu...DevSecCon London 2017 - MacOS security, hardening and forensics 101 by Ben Hu...
DevSecCon London 2017 - MacOS security, hardening and forensics 101 by Ben Hu...DevSecCon
 
The OpenCL C++ Wrapper 1.2 Reference Card
The OpenCL C++ Wrapper 1.2 Reference CardThe OpenCL C++ Wrapper 1.2 Reference Card
The OpenCL C++ Wrapper 1.2 Reference CardThe Khronos Group Inc.
 
Defcon CTF quals
Defcon CTF qualsDefcon CTF quals
Defcon CTF qualssnyff
 
DevOpSec_DockerNPodMan-20230220.pdf
DevOpSec_DockerNPodMan-20230220.pdfDevOpSec_DockerNPodMan-20230220.pdf
DevOpSec_DockerNPodMan-20230220.pdfkanedafromparis
 
MITRE ATT&CKcon 2018: From Technique to Detection, Paul Ewing and Ross Wolf, ...
MITRE ATT&CKcon 2018: From Technique to Detection, Paul Ewing and Ross Wolf, ...MITRE ATT&CKcon 2018: From Technique to Detection, Paul Ewing and Ross Wolf, ...
MITRE ATT&CKcon 2018: From Technique to Detection, Paul Ewing and Ross Wolf, ...MITRE - ATT&CKcon
 
Summer of Fuzz: macOS
Summer of Fuzz: macOSSummer of Fuzz: macOS
Summer of Fuzz: macOSJeremy Brown
 
Native Java with GraalVM
Native Java with GraalVMNative Java with GraalVM
Native Java with GraalVMSylvain Wallez
 
New Jersey Red Hat Users Group Presentation: Provisioning anywhere
New Jersey Red Hat Users Group Presentation: Provisioning anywhereNew Jersey Red Hat Users Group Presentation: Provisioning anywhere
New Jersey Red Hat Users Group Presentation: Provisioning anywhereRodrique Heron
 

Similar to PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by Wu Feng and Mark Gardner (20)

Putting a Fork in Fork (Linux Process and Memory Management)
Putting a Fork in Fork (Linux Process and Memory Management)Putting a Fork in Fork (Linux Process and Memory Management)
Putting a Fork in Fork (Linux Process and Memory Management)
 
Samsung WebCL Prototype API
Samsung WebCL Prototype APISamsung WebCL Prototype API
Samsung WebCL Prototype API
 
CUDA Tutorial 01 : Say Hello to CUDA : Notes
CUDA Tutorial 01 : Say Hello to CUDA : NotesCUDA Tutorial 01 : Say Hello to CUDA : Notes
CUDA Tutorial 01 : Say Hello to CUDA : Notes
 
Gdc09 Minigames
Gdc09 MinigamesGdc09 Minigames
Gdc09 Minigames
 
MattsonTutorialSC14.pptx
MattsonTutorialSC14.pptxMattsonTutorialSC14.pptx
MattsonTutorialSC14.pptx
 
Delivering Go.CD with Terraform and Docker
Delivering Go.CD with Terraform and DockerDelivering Go.CD with Terraform and Docker
Delivering Go.CD with Terraform and Docker
 
KDD 2016 Streaming Analytics Tutorial
KDD 2016 Streaming Analytics TutorialKDD 2016 Streaming Analytics Tutorial
KDD 2016 Streaming Analytics Tutorial
 
Machine Learning for Big Data Analytics: Scaling In with Containers while Sc...
Machine Learning for Big Data Analytics:  Scaling In with Containers while Sc...Machine Learning for Big Data Analytics:  Scaling In with Containers while Sc...
Machine Learning for Big Data Analytics: Scaling In with Containers while Sc...
 
CONFidence 2015: DTrace + OSX = Fun - Andrzej Dyjak
CONFidence 2015: DTrace + OSX = Fun - Andrzej Dyjak   CONFidence 2015: DTrace + OSX = Fun - Andrzej Dyjak
CONFidence 2015: DTrace + OSX = Fun - Andrzej Dyjak
 
OpenCL C++ Wrapper 1.2 Reference Card
OpenCL C++ Wrapper 1.2 Reference CardOpenCL C++ Wrapper 1.2 Reference Card
OpenCL C++ Wrapper 1.2 Reference Card
 
Disksim with SSD_extension
Disksim with SSD_extensionDisksim with SSD_extension
Disksim with SSD_extension
 
DevSecCon London 2017 - MacOS security, hardening and forensics 101 by Ben Hu...
DevSecCon London 2017 - MacOS security, hardening and forensics 101 by Ben Hu...DevSecCon London 2017 - MacOS security, hardening and forensics 101 by Ben Hu...
DevSecCon London 2017 - MacOS security, hardening and forensics 101 by Ben Hu...
 
Linux kernel modules
Linux kernel modulesLinux kernel modules
Linux kernel modules
 
The OpenCL C++ Wrapper 1.2 Reference Card
The OpenCL C++ Wrapper 1.2 Reference CardThe OpenCL C++ Wrapper 1.2 Reference Card
The OpenCL C++ Wrapper 1.2 Reference Card
 
Defcon CTF quals
Defcon CTF qualsDefcon CTF quals
Defcon CTF quals
 
DevOpSec_DockerNPodMan-20230220.pdf
DevOpSec_DockerNPodMan-20230220.pdfDevOpSec_DockerNPodMan-20230220.pdf
DevOpSec_DockerNPodMan-20230220.pdf
 
MITRE ATT&CKcon 2018: From Technique to Detection, Paul Ewing and Ross Wolf, ...
MITRE ATT&CKcon 2018: From Technique to Detection, Paul Ewing and Ross Wolf, ...MITRE ATT&CKcon 2018: From Technique to Detection, Paul Ewing and Ross Wolf, ...
MITRE ATT&CKcon 2018: From Technique to Detection, Paul Ewing and Ross Wolf, ...
 
Summer of Fuzz: macOS
Summer of Fuzz: macOSSummer of Fuzz: macOS
Summer of Fuzz: macOS
 
Native Java with GraalVM
Native Java with GraalVMNative Java with GraalVM
Native Java with GraalVM
 
New Jersey Red Hat Users Group Presentation: Provisioning anywhere
New Jersey Red Hat Users Group Presentation: Provisioning anywhereNew Jersey Red Hat Users Group Presentation: Provisioning anywhere
New Jersey Red Hat Users Group Presentation: Provisioning anywhere
 

More from AMD Developer Central

DX12 & Vulkan: Dawn of a New Generation of Graphics APIs
DX12 & Vulkan: Dawn of a New Generation of Graphics APIsDX12 & Vulkan: Dawn of a New Generation of Graphics APIs
DX12 & Vulkan: Dawn of a New Generation of Graphics APIsAMD Developer Central
 
The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...
The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...
The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...AMD Developer Central
 
TressFX The Fast and The Furry by Nicolas Thibieroz
TressFX The Fast and The Furry by Nicolas ThibierozTressFX The Fast and The Furry by Nicolas Thibieroz
TressFX The Fast and The Furry by Nicolas ThibierozAMD Developer Central
 
Rendering Battlefield 4 with Mantle by Yuriy ODonnell
Rendering Battlefield 4 with Mantle by Yuriy ODonnellRendering Battlefield 4 with Mantle by Yuriy ODonnell
Rendering Battlefield 4 with Mantle by Yuriy ODonnellAMD Developer Central
 
Low-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil PerssonLow-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil PerssonAMD Developer Central
 
Direct3D12 and the Future of Graphics APIs by Dave Oldcorn
Direct3D12 and the Future of Graphics APIs by Dave OldcornDirect3D12 and the Future of Graphics APIs by Dave Oldcorn
Direct3D12 and the Future of Graphics APIs by Dave OldcornAMD Developer Central
 
Introduction to Direct 3D 12 by Ivan Nevraev
Introduction to Direct 3D 12 by Ivan NevraevIntroduction to Direct 3D 12 by Ivan Nevraev
Introduction to Direct 3D 12 by Ivan NevraevAMD Developer Central
 
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth ThomasHoly smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth ThomasAMD Developer Central
 
Computer Vision Powered by Heterogeneous System Architecture (HSA) by Dr. Ha...
Computer Vision Powered by Heterogeneous System Architecture (HSA) by  Dr. Ha...Computer Vision Powered by Heterogeneous System Architecture (HSA) by  Dr. Ha...
Computer Vision Powered by Heterogeneous System Architecture (HSA) by Dr. Ha...AMD Developer Central
 
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...AMD Developer Central
 
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14AMD Developer Central
 
RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14
RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14
RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14AMD Developer Central
 
Mantle and Nitrous - Combining Efficient Engine Design with a modern API - AM...
Mantle and Nitrous - Combining Efficient Engine Design with a modern API - AM...Mantle and Nitrous - Combining Efficient Engine Design with a modern API - AM...
Mantle and Nitrous - Combining Efficient Engine Design with a modern API - AM...AMD Developer Central
 
Mantle - Introducing a new API for Graphics - AMD at GDC14
Mantle - Introducing a new API for Graphics - AMD at GDC14Mantle - Introducing a new API for Graphics - AMD at GDC14
Mantle - Introducing a new API for Graphics - AMD at GDC14AMD Developer Central
 
Direct3D and the Future of Graphics APIs - AMD at GDC14
Direct3D and the Future of Graphics APIs - AMD at GDC14Direct3D and the Future of Graphics APIs - AMD at GDC14
Direct3D and the Future of Graphics APIs - AMD at GDC14AMD Developer Central
 

More from AMD Developer Central (20)

DX12 & Vulkan: Dawn of a New Generation of Graphics APIs
DX12 & Vulkan: Dawn of a New Generation of Graphics APIsDX12 & Vulkan: Dawn of a New Generation of Graphics APIs
DX12 & Vulkan: Dawn of a New Generation of Graphics APIs
 
Introduction to Node.js
Introduction to Node.jsIntroduction to Node.js
Introduction to Node.js
 
Media SDK Webinar 2014
Media SDK Webinar 2014Media SDK Webinar 2014
Media SDK Webinar 2014
 
DirectGMA on AMD’S FirePro™ GPUS
DirectGMA on AMD’S  FirePro™ GPUSDirectGMA on AMD’S  FirePro™ GPUS
DirectGMA on AMD’S FirePro™ GPUS
 
The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...
The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...
The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...
 
Inside XBox- One, by Martin Fuller
Inside XBox- One, by Martin FullerInside XBox- One, by Martin Fuller
Inside XBox- One, by Martin Fuller
 
TressFX The Fast and The Furry by Nicolas Thibieroz
TressFX The Fast and The Furry by Nicolas ThibierozTressFX The Fast and The Furry by Nicolas Thibieroz
TressFX The Fast and The Furry by Nicolas Thibieroz
 
Rendering Battlefield 4 with Mantle by Yuriy ODonnell
Rendering Battlefield 4 with Mantle by Yuriy ODonnellRendering Battlefield 4 with Mantle by Yuriy ODonnell
Rendering Battlefield 4 with Mantle by Yuriy ODonnell
 
Low-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil PerssonLow-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil Persson
 
Inside XBOX ONE by Martin Fuller
Inside XBOX ONE by Martin FullerInside XBOX ONE by Martin Fuller
Inside XBOX ONE by Martin Fuller
 
Direct3D12 and the Future of Graphics APIs by Dave Oldcorn
Direct3D12 and the Future of Graphics APIs by Dave OldcornDirect3D12 and the Future of Graphics APIs by Dave Oldcorn
Direct3D12 and the Future of Graphics APIs by Dave Oldcorn
 
Introduction to Direct 3D 12 by Ivan Nevraev
Introduction to Direct 3D 12 by Ivan NevraevIntroduction to Direct 3D 12 by Ivan Nevraev
Introduction to Direct 3D 12 by Ivan Nevraev
 
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth ThomasHoly smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas

PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by Wu Feng and Mark Gardner

  • 1. Automated CUDA-to-OpenCL Translation with CU2CL: What’s Next? Wu Feng and Mark Gardner Virginia Tech 2013-11-12 synergy.cs.vt.edu
  • 5. Why OpenCL? Source code lasts longer than platforms
      Image credits:
      http://www2.pcmag.com/media/images/375584-nvidia-geforce-gtx-titan.jpg?thumb=y
      http://www.amd.com/PublishingImages/Public/Photograph_ProductShots/375WPNG/61979.png
      http://www.hardwarezone.com.sg/files/img/2012/06/Xeon_Phi_PCIe_Card_M.jpg
      http://www.thinkcomputers.org/articles/ces11_amd/main.jpg
      http://www.bjorn3d.com/Material/revimages/cpu/Core_I7_965/New_Core_I7.jpg
      AMD Developer Summit, 2013/11/12
  • 6. The Goal
      To take advantage of OpenCL's portability...
      Without sacrificing man-years of existing code
      Image: http://people.emich.edu/akavetsk/424/scribeatdesk_1.jpg
  • 9. CUDA and OpenCL APIs
      CUDA Module → OpenCL Module
      Thread → Contexts & Command Queues
      Device → Platforms & Devices
      Stream → Command Queues
      Event → Events
      Memory → Memory Objects
  • 13. CUDA and OpenCL Data
      CUDA → OpenCL
      Vector types (e.g. float4) → Host: cl_float4; Kernel: float4
      dim3 → size_t[3]
      cudaStream_t → cl_command_queue
      cudaEvent_t → cl_event
      Device pointers (e.g. float* created through cudaMalloc) → cl_mem created through clCreateBuffer
      cudaChannelFormat → cl_image_format
      textureReference → cl_mem created through clCreateImage
      cudaDeviceProp → No direct equivalent
  • 14. CUDA and OpenCL Execution and Memory Models
  • 17. The Problem
      CUDA Source Code → OpenCL Source Code
      • Manual Translation (weeks, months)
      • Automatic Translation (seconds): CU2CL
      Image: xkcd.com
  • 18. Forecast
      • Observations about Translating
      • Examples: CUDA and OpenCL constructs
      • CU2CL Architecture
      • Current State of CU2CL: Robustness and Performance
      • Future Directions
      Image: http://www.weather.com/weather/5-day/San+Jose+CA+USCA0993:1:US
  • 22. Translation Is Easy ...
      …when there is NO ambiguity in the translation between languages (i.e., there is a direct mapping)
      • High-level language → low-level representation, e.g., C → LLVM
            x * y + z →
            %tmp = mul i32 %x, %y
            %tmp2 = add i32 %tmp, %z
      • Between languages, e.g., CUDA → OpenCL
            __powf(x[threadIdx.x], y[threadIdx.y]) →
            native_pow(x[get_local_id(0)], y[get_local_id(1)])
  • 26. Translation is more difficult
      …when there IS ambiguity (or lack of a direct mapping) in the translation between languages
      • Idiomatic Expressions
          – “Putting all your eggs in one basket” → ?? in Spanish
          – CUDA __threadfence() → OpenCL ??
      • Dialects
          – Latin American Spanish vs. Castilian Spanish → English
          – CUDA Runtime API vs. CUDA Driver API → OpenCL
  • 27. CUDA and OpenCL
      Image: http://www.dragon1.com/images/examples.jpg
  • 28. CUDA Initialization Code
      None (Implicit)
      Dialect: CUDA runtime API
  • 31. OpenCL Initialization Code
      Explicit
        // get a platform and device, set up a context and command queue
        clGetPlatformIDs(1, &__cu2cl_Platform, NULL);
        clGetDeviceIDs(__cu2cl_Platform, CL_DEVICE_TYPE_GPU, 1, &__cu2cl_Device, NULL);
        __cu2cl_Context = clCreateContext(NULL, 1, &__cu2cl_Device, NULL, NULL, NULL);
        __cu2cl_CommandQueue = clCreateCommandQueue(__cu2cl_Context, __cu2cl_Device, CL_QUEUE_PROFILING_ENABLE, NULL);
        // read kernel source from disk
        FILE *f = fopen("matrixMul_kernel.cu-cl.cl", "r");
        fseek(f, 0, SEEK_END);
        size_t progLen = (size_t) ftell(f);
        const char *progSrc = (const char *) malloc(sizeof(char) * progLen);
        rewind(f);
        fread((void *) progSrc, progLen, 1, f);
        fclose(f);
        // build device program and kernel
        __cu2cl_Program_matrixMul_kernel_cu = clCreateProgramWithSource(__cu2cl_Context, 1, &progSrc, &progLen, NULL);
        free((void *) progSrc);
        clBuildProgram(__cu2cl_Program_matrixMul_kernel_cu, 1, &__cu2cl_Device, "-I .", NULL, NULL);
        __cu2cl_Kernel_matrixMul = clCreateKernel(__cu2cl_Program_matrixMul_kernel_cu, "matrixMul", NULL);
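The "read kernel source from disk" step on this slide does not check return values or terminate the buffer. The sketch below shows one more defensive way to load the source using only the C standard library. `read_source` is a hypothetical helper name chosen for illustration; it is not part of CU2CL's generated code.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (not CU2CL output): read a whole file into a
 * NUL-terminated heap buffer. Returns NULL on failure; on success the
 * byte count is stored in *len_out. The caller frees the buffer. */
char *read_source(const char *path, size_t *len_out)
{
    assert(path != NULL && len_out != NULL);
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return NULL;
    if (fseek(f, 0, SEEK_END) != 0) { fclose(f); return NULL; }
    long end = ftell(f);
    if (end < 0) { fclose(f); return NULL; }
    rewind(f);

    size_t len = (size_t) end;
    char *src = malloc(len + 1);          /* +1 for the terminator */
    if (src == NULL) { fclose(f); return NULL; }

    size_t got = fread(src, 1, len, f);   /* count bytes actually read */
    fclose(f);
    if (got != len) { free(src); return NULL; }

    src[len] = '\0';
    *len_out = len;
    return src;
}
```

The returned pointer and length could then be passed to clCreateProgramWithSource in place of `progSrc` and `progLen`.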
  • 34. CUDA Kernel Invocation
        // setup execution parameters
        dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
        dim3 grid(uiWC / threads.x, uiHC / threads.y);
        // execute the kernel
        int nIter = 30;
        for (int j = 0; j < nIter; j++) {
            matrixMul<<< grid, threads >>>(d_C, d_A, d_B, uiWA, uiWB);
        }
  • 37. OpenCL Kernel Invocation
        // setup execution parameters
        size_t threads[3] = {BLOCK_SIZE, BLOCK_SIZE, 1};
        size_t grid[3] = {uiWC / threads[0], uiHC / threads[1], 1};
        // execute the kernel
        int nIter = 30;
        for (int j = 0; j < nIter; j++) {
            clSetKernelArg(__cu2cl_Kernel_matrixMul, 0, sizeof(cl_mem), &d_C);
            clSetKernelArg(__cu2cl_Kernel_matrixMul, 1, sizeof(cl_mem), &d_A);
            clSetKernelArg(__cu2cl_Kernel_matrixMul, 2, sizeof(cl_mem), &d_B);
            clSetKernelArg(__cu2cl_Kernel_matrixMul, 3, sizeof(int), &uiWA);
            clSetKernelArg(__cu2cl_Kernel_matrixMul, 4, sizeof(int), &uiWB);
            localWorkSize[0] = threads[0];
            localWorkSize[1] = threads[1];
            localWorkSize[2] = threads[2];
            globalWorkSize[0] = grid[0] * localWorkSize[0];
            globalWorkSize[1] = grid[1] * localWorkSize[1];
            globalWorkSize[2] = grid[2] * localWorkSize[2];
            clEnqueueNDRangeKernel(__cu2cl_CommandQueue, __cu2cl_Kernel_matrixMul, 3, NULL, globalWorkSize, localWorkSize, 0, NULL, NULL);
        }
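The multiplication by `localWorkSize` on this slide exists because a CUDA grid is counted in blocks, while an OpenCL global work size is counted in work-items. A minimal sketch of that conversion follows. The helper names `blocks_for` and `to_ndrange` are hypothetical, and the round-up division is an addition not shown on the slide, which assumes the matrix dimensions are multiples of the block size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of blocks needed to cover n items,
 * rounding up so that a partial final block is not dropped. */
size_t blocks_for(size_t n, size_t block)
{
    assert(block > 0);
    return (n + block - 1) / block;
}

/* Convert a CUDA-style launch (grid counted in blocks, block counted in
 * threads) into OpenCL-style sizes (global counted in work-items). */
void to_ndrange(const size_t grid[3], const size_t block[3],
                size_t global[3], size_t local[3])
{
    for (int d = 0; d < 3; d++) {
        local[d]  = block[d];
        global[d] = grid[d] * block[d];
    }
}
```

When the grid is rounded up, the kernel needs a bounds check so that the extra work-items in the last group do nothing.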
  • 41. Kernel Code for Vector Add
      CUDA
        // Device code
        __global__ void VecAdd(const float* A, const float* B, float* C, int N)
        {
            int i = blockDim.x * blockIdx.x + threadIdx.x;
            if (i < N)
                C[i] = A[i] + B[i];
        }
      OpenCL
        // Device code
        __kernel void VecAdd(const __global float* A, const __global float* B, __global float* C, int N)
        {
            int i = get_local_size(0) * get_group_id(0) + get_local_id(0);
            if (i < N)
                C[i] = A[i] + B[i];
        }
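Both kernels compute the same flat index from a group size, a group number, and a position within the group. The host-side emulation below is a sketch only: it runs every work-item sequentially in plain C, and the function names are hypothetical. It shows that the index expression covers each element once and that the `i < N` guard handles a partial last group.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side emulation of the VecAdd kernel body for one work-item.
 * The first three arguments stand in for get_local_size(0),
 * get_group_id(0) and get_local_id(0) (blockDim.x, blockIdx.x and
 * threadIdx.x in CUDA). */
void vec_add_item(size_t local_size, size_t group_id, size_t local_id,
                  const float *A, const float *B, float *C, size_t N)
{
    size_t i = local_size * group_id + local_id;
    if (i < N)                    /* guard for the partial last group */
        C[i] = A[i] + B[i];
}

/* Run every work-item of a 1-D launch sequentially on the host. */
void vec_add_emulated(size_t groups, size_t local_size,
                      const float *A, const float *B, float *C, size_t N)
{
    assert(local_size > 0);
    for (size_t g = 0; g < groups; g++)
        for (size_t l = 0; l < local_size; l++)
            vec_add_item(local_size, g, l, A, B, C, N);
}
```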
  • 45. Compilation Process
      Source Code → Preprocessor → Preprocessed Code → Lexer → Tokenized Code → Parser → Parse Tree → Semantic Analyzer → Intermediate Representation → Code Generator → Binary
      (diagram labels: Clang, LLVM)
      Martinez, Gardner, and Feng, “CU2CL: A CUDA-to-OpenCL Translator for Multi- and Many-Core Architectures,” IEEE ICPADS 2011
  • 57. AST-driven, String-based Rewriting
      CUDA
            __powf(x[threadIdx.x], y[threadIdx.y])
      AST
            Func (__powf → native_pow)
              Arg x[ ]: Struct threadIdx, Field x → 0, rewritten as get_local_id( )
              Arg y[ ]: Struct threadIdx, Field y → 1, rewritten as get_local_id( )
            Write Out
      OpenCL
            native_pow(x[get_local_id(0)], y[get_local_id(1)])
      Advantage: formatting remains intact → maintainable
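The string-based half of this approach can be sketched in isolation. The example below is a plain table-driven text substitution and is not AST-driven: CU2CL uses the Clang AST to decide which source ranges to rewrite, whereas this sketch matches substrings anywhere, including inside comments or longer identifiers. The table entries come from the slide's example; the names `RULES` and `rewrite_text` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative mapping table; entries taken from the slide's example. */
struct rewrite { const char *from; const char *to; };
static const struct rewrite RULES[] = {
    { "threadIdx.x", "get_local_id(0)" },
    { "threadIdx.y", "get_local_id(1)" },
    { "__powf",      "native_pow"      },
};

/* Copy src to dst, substituting each rule's text wherever it occurs.
 * Unmatched characters are copied unchanged, so spacing is preserved.
 * Returns 0 on success, -1 if dst (of size cap) is too small. */
int rewrite_text(const char *src, char *dst, size_t cap)
{
    assert(src != NULL && dst != NULL);
    if (cap == 0)
        return -1;
    size_t out = 0;
    while (*src != '\0') {
        const char *piece = NULL;      /* replacement text, if any */
        size_t consumed = 1, piece_len = 1;
        for (size_t r = 0; r < sizeof RULES / sizeof RULES[0]; r++) {
            size_t n = strlen(RULES[r].from);
            if (strncmp(src, RULES[r].from, n) == 0) {
                piece = RULES[r].to;
                piece_len = strlen(piece);
                consumed = n;
                break;
            }
        }
        if (out + piece_len + 1 > cap)  /* keep room for the terminator */
            return -1;
        memcpy(dst + out, piece ? piece : src, piece_len);
        out += piece_len;
        src += consumed;
    }
    dst[out] = '\0';
    return 0;
}
```

Because only the matched spans change, the surrounding whitespace and layout of the input survive, which is the maintainability advantage the slide points to.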
  • 62. Complex Semantic Conversions
      1. Literal Parameters to Kernels
          – CUDA pass-by-value invocations vs. OpenCL pass-by-reference
      CUDA Kernel Launch
        kernel<<<grid, block>>>(foo1, foo2 * 2.0f, 256);
      Naive OpenCL Translation
        clSetKernelArg(__cu2cl_Kernel_kernel, 0, sizeof(float), &foo1);
        clSetKernelArg(__cu2cl_Kernel_kernel, 1, sizeof(float), &foo2 * 2.0f);
        clSetKernelArg(__cu2cl_Kernel_kernel, 2, sizeof(int), &256);
      Correct OpenCL Translation
        clSetKernelArg(__cu2cl_Kernel_kernel, 0, sizeof(float), &foo1);
        float __cu2cl_Kernel_kernel_arg_1 = foo2 * 2.0f;
        clSetKernelArg(__cu2cl_Kernel_kernel, 1, sizeof(float), &__cu2cl_Kernel_kernel_arg_1);
        int __cu2cl_Kernel_kernel_arg_2 = 256;
        clSetKernelArg(__cu2cl_Kernel_kernel, 2, sizeof(int), &__cu2cl_Kernel_kernel_arg_2);
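The temporaries in the correct translation are needed because an argument-setting call reads the value through a pointer, and neither an expression result nor a literal has an address. The sketch below demonstrates this with a mock: `mock_set_kernel_arg` is a hypothetical stand-in that, like clSetKernelArg, copies `size` bytes from the address it is given. It is not the OpenCL API, and the struct and limits are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_ARGS 8
#define MAX_ARG_BYTES 16

/* Mock kernel object that just records the raw argument bytes. */
struct mock_kernel { unsigned char args[MAX_ARGS][MAX_ARG_BYTES]; };

/* Mock stand-in for clSetKernelArg: copies `size` bytes from `value`,
 * so every argument must live in addressable storage.
 * Returns 0 on success, -1 on a bad index or size. */
int mock_set_kernel_arg(struct mock_kernel *k, unsigned index,
                        size_t size, const void *value)
{
    assert(k != NULL && value != NULL);
    if (index >= MAX_ARGS || size > MAX_ARG_BYTES)
        return -1;
    memcpy(k->args[index], value, size);
    return 0;
}

/* Equivalent of kernel<<<grid, block>>>(foo1, foo2 * 2.0f, 256):
 * the expression and the literal are stored in temporaries first.
 * Returns 0 on success, nonzero if any argument could not be set. */
int set_example_args(struct mock_kernel *k, float foo1, float foo2)
{
    float arg_1 = foo2 * 2.0f;   /* temporary for the expression */
    int   arg_2 = 256;           /* temporary for the literal */
    return mock_set_kernel_arg(k, 0, sizeof(float), &foo1)
        || mock_set_kernel_arg(k, 1, sizeof(float), &arg_1)
        || mock_set_kernel_arg(k, 2, sizeof(int),   &arg_2);
}
```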
  • 63. Complex Semantic Conversions
  • 64. Complex Semantic Conversions
    2. Device Identification
      – CUDA uses int, OpenCL uses opaque cl_device_id
      – To change devices in CUDA, use cudaSetDevice(int id)
      – To change devices in OpenCL, use...
  • 65. Complex Semantic Conversions
    2. Device Identification
      – CUDA uses int, OpenCL uses opaque cl_device_id
      – To change devices in CUDA, use cudaSetDevice(int id)
      – To change devices in OpenCL, use...
          //scan all devices
          //save old platform, device, context, queue, program, & kernels
          myDevice = allDevices[id];
          clGetDeviceInfo(...); //get new device's platform
          myContext = clCreateContext(...);
          myQueue = clCreateCommandQueue(...);
          //load program source
          clBuildProgram(...);
          myKernel = clCreateKernel(...);
  • 66. Complex Semantic Conversions
    2. Device Identification
      – CUDA uses int, OpenCL uses opaque cl_device_id
      – To change devices in CUDA, use cudaSetDevice(int id)
      – To change devices in OpenCL, use...
          //scan all devices
          //save old platform, device, context, queue, program, & kernels
          myDevice = allDevices[id];
          clGetDeviceInfo(...); //get new device's platform
          myContext = clCreateContext(...);
          myQueue = clCreateCommandQueue(...);
          //load program source
          clBuildProgram(...);
          myKernel = clCreateKernel(...);
      – Implement our own handler to emulate and encapsulate
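One way such a handler might be organized is sketched below. It is not CU2CL's actual implementation: `DeviceManager` and `DeviceState` are hypothetical names, plain fields stand in for the real context, queue, program, and kernel handles, and caching per-device state is an assumed design choice. The sketch maps CUDA's integer device index onto a scanned device list and builds each device's state only on first use.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Placeholder for the per-device objects listed on the slide
// (context, queue, program, kernels). Assumption: real handles omitted.
struct DeviceState {
    std::string deviceName;
    int buildCount = 0;  // how many times this device's state was built
};

class DeviceManager {
public:
    // `allDevices` plays the role of the result of scanning all devices.
    explicit DeviceManager(std::vector<std::string> allDevices)
        : devices_(std::move(allDevices)) {}

    // Emulates cudaSetDevice(int id): select a device by integer index,
    // building its state on first use and reusing the cached state afterwards.
    void setDevice(int id) {
        if (id < 0 || static_cast<std::size_t>(id) >= devices_.size()) {
            throw std::out_of_range("no such device");
        }
        if (states_.find(id) == states_.end()) {
            DeviceState s;
            s.deviceName = devices_[static_cast<std::size_t>(id)];
            s.buildCount = 1;  // real code would create context/queue/program here
            states_.emplace(id, s);
        }
        current_ = id;
    }

    // State of the currently selected device; throws if none was selected.
    const DeviceState& current() const { return states_.at(current_); }
    int currentId() const { return current_; }

private:
    std::vector<std::string> devices_;
    std::map<int, DeviceState> states_;
    int current_ = -1;
};
```

Keeping the old state in the map corresponds to the slide's "save old platform, device, context, queue, program, & kernels" step, so switching back to a device does not rebuild its program.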
  • 67. CU2CL Evaluation
    Image: http://learn.cvuhs.org/file.php/1427/scales_of_justice2.jpg
  • 68. Test Code
  • 69. Test Code
    • 79 CUDA SDK Samples
    • 17 Rodinia Samples
    • Applications
      – GEM – Molecular Modeling
      – IZ PS – Neural Network
      – Fen Zi – Molecular Dynamics
    • 100k+ SLOC in total
  • 72. Translator Coverage
    Source | Application | CUDA Lines | OpenCL Lines Changed | Percent Automatically Translated
    CUDA SDK Samples | asyncAPI | 135 | 5 | 96.3
    CUDA SDK Samples | bandwidthTest | 891 | 5 | 98.9
    CUDA SDK Samples | BlackScholes | 347 | 14 | 96.0
    CUDA SDK Samples | FastWalshTransform | 327 | 30 | 90.8
    CUDA SDK Samples | matrixMul | 351 | 9 | 97.4
    CUDA SDK Samples | scalarProd | 251 | 18 | 92.8
    CUDA SDK Samples | vectorAdd | 147 | 0 | 100
    Rodinia | Back Propagation | 313 | 24 | 92.3
    Rodinia | Breadth-First Search | 306 | 35 | 88.6
    Rodinia | Gaussian | 390 | 26 | 93.3
    Rodinia | Hotspot | 328 | 2 | 99.4
    Rodinia | Needleman-Wunsch | 430 | 3 | 99.3
    Applications | Fen Zi | 17768 | 1786 | 89.9
    Applications | GEM | 524 | 15 | 97.1
    Applications | IZ PS | 8402 | 166 | 98.0
  • 74. Translation Challenges Identified
    Profiled Challenge | CUDA SDK Frequency (%) | Rodinia Frequency (%)
    Device Identifiers | 54.4 | 29.4
    Literal Parameters | 19.0 | 23.5
    Separate Compilation | 54.4 | 29.4
    CUDA Libraries | 10.1 | 0
    Kernel Templates | 21.5 | 0
    Texture Memory | 27.8 | 23.5
    Graphics Interoperability | 24.1 | 0
    Constant Memory | 17.7 | 29.4
    Shared Memory | 46.8 | 70.6
    Further challenges:
      – Kernel Function Pointer Invocations
      – Preprocessor Effects
      – Warp-level Synchronization
      – Device Intrinsic Functions
      – Device Buffer cl_mem Type Propagation
      – #defined Function Definitions
      – Device Buffers as Struct Members
      – Arrays of Device Buffers
      – Implicitly-Defined Kernel Functions
      – Device-side Classes, Constructors, & Destructors
      – Struct Alignment Attributes
      – __threadfence()
    Sathre, Gardner, Feng: “Lost in Translation: Challenges in Automating CUDA-to-OpenCL Translation”. ICPP Workshops 2012: 89-96
    Gardner, Feng, Sathre, Martinez: “Characterizing the Challenges and Evaluating the Efficacy of a CUDA-to-OpenCL Translator”. ParCo Special Issue 2013, to appear
  • 75. Translator Performance
    [Figure: two plots against Source Lines, "Total Translation Time (s)" and "CU2CL Translation Time (microseconds)", with series for SDK Samples, Rodinia Samples, and Large Applications, plus linear fits for the SDK Samples and Rodinia Samples]
    Experimental Setup: AMD Phenom II X6 1090T (six cores, 3.2 GHz), 16 GB RAM, NVIDIA GeForce GTX 480 (driver version 310.32, CUDA Runtime 5.0), 64-bit Ubuntu 12.04
  • 76. Translated Application Performance
    [Figure: bar chart of Time (s), CUDA vs. OpenCL, lower is better. SDK Samples: asyncAPI, bandwidthTest, BlackScholes, FastWalshTransform, matrixMul, scalarProd, vectorAdd. Rodinia Samples: backprop, BFS, Gaussian, Hotspot, Needleman-Wunsch. Application: GEM]
    Note: all runs on the same NVIDIA GPU for fair comparison
  • 77. CU2CL Reliability
    [Figure: stacked bars from 0% to 100% showing Failed, Partial, and Complete translations of the CUDA SDK Samples and Rodinia Samples, before and after upgrades. Upgrades shown: Clang 3.2, main() method handling, Template handling, OpenGL, #defined function handling, Separately declared and defined function handling, Kernel pointer invocation handling]
    Increased reliability in translating samples after the latest round of improvements
  • 78. CU2CL Roadmap & Future Work
    CU2CL Alpha (2011)
      – Well-designed scaffold
    CU2CL Beta (2013)
      – Improved Robustness, CUDA Coverage, and Reliability
      – Analysis and profiling of difficult-to-translate CUDA structures
    CU2CL w/ Functional Portability
      – Expand CUDA coverage: Shared, const, texture memory; Driver API; OpenGL
      – Handling unmapped CUDA structs / behaviors: Warp sync
    CU2CL w/ Performance Portability
      – Automatic de-optimization
      – Device-agnostic optimization
      – Device-specific optimization
  • 79. CU2CL Roadmap & Future Work
    CU2CL Alpha (2011)
      – Well-designed scaffold
    CU2CL Beta (2013)
      – Improved Robustness, CUDA Coverage, and Reliability
      – Analysis and profiling of difficult-to-translate CUDA structures
    CU2CL w/ Functional Portability
      – Expand CUDA coverage: Shared, const, texture memory; Driver API; OpenGL
      – Handling unmapped CUDA structs / behaviors: Warp sync
    CU2CL w/ Performance Portability
      – Automatic de-optimization
      – Device-agnostic optimization
      – Device-specific optimization
    What about CUDA to HSA?
  • 80. Related Work
    Swan
      – High-level abstraction API, links to either OpenCL or CUDA implementation
    Ocelot & Caracal
      – Translate NVIDIA PTX IR to other device IRs
    CUDAtoOpenCL
      – Source-to-source translator, based on Cetus
  • 81. CU2CL Conclusions
  • 82. CU2CL Conclusions
    • Status
      – What used to take months by hand takes seconds
        • 90+ successful translations
        • Negligible difference in performance
  • 83. CU2CL Conclusions
    • Status
      – What used to take months by hand takes seconds
        • 90+ successful translations
        • Negligible difference in performance
    • Challenges
      – CUDA functionality missing in OpenCL
        • __threadfence()
      – Equivalent libraries needed in OpenCL
        • cuFFT, MAGMA, cuBLAS
      – Implicit semantics
        • Implicit synchronization across warps
  • 84. CU2CL Conclusions
    • Status
      – What used to take months by hand takes seconds
        • 90+ successful translations
        • Negligible difference in performance
    • Challenges
      – CUDA functionality missing in OpenCL
        • __threadfence()
      – Equivalent libraries needed in OpenCL
        • cuFFT, MAGMA, cuBLAS
      – Implicit semantics
        • Implicit synchronization across warps
    • What's Next?
      – Improved functional portability
      – Support for performance portability
  • 85. Acknowledgements
    Students: Gabriel Martinez, Paul Sathre
    This work was supported in part by NSF I/UCRC IIP-0804155 via the NSF Center for High-Performance Reconfigurable Computing (CHREC).