4. Introduction – who am I?
● Five years in the industry
● Spent all of that using SPUs, GPUs, vector units & DSPs
● Last two years focused on open standards (mostly OpenCL)
● Passionate about making compute easy
Neil Henning
neil@codeplay.com
5. Introduction – who are we?
● GPU Compiler Experts based out of Edinburgh, Scotland
● 35 employees working on contracts, R&D and internal tech
7. Current Landscape
● Languages – CUDA, RenderScript, C++AMP & OpenCL
● Targets – GPU (mobile & desktop), CPU (scalar & vector), DSPs, FPGAs
● Concerns – performance, power, precision, parallelism & portability
8. Current Landscape - CUDA
__global__ void kernel(char * a, char * b)
{
a[blockIdx.x] = b[blockIdx.x];
}
char in[SIZE], out[SIZE];
char * cIn, * cOut;
cudaMalloc((void **)&cIn, SIZE);
cudaMalloc((void **)&cOut, SIZE);
cudaMemcpy(cIn, in, SIZE,
cudaMemcpyHostToDevice);
kernel<<<SIZE, 1>>>(cOut, cIn);
cudaMemcpy(out, cOut, SIZE,
cudaMemcpyDeviceToHost);
cudaFree(cIn);
cudaFree(cOut);
● CUDA incredibly established
● First major GPU compute approach to market
● Huge bank of tools, libraries and knowledge
● Really only had uptake in offline processing
● Used in banking, medical imaging, game asset creation, and many many more uses!
● Standard isn’t open, little room (or enthusiasm) for other vendors to implement
● Using CUDA means abandoning compute on majority of devices
9. Current Landscape - RenderScript
#pragma version(1)
#pragma rs java_package_name(foo)
rs_allocation gIn; rs_allocation gOut;
rs_script gScript;
void root(const char * in, char * out,
const void * usr, uint32_t x, uint32_t y) {
*out = *in;
}
void filter() {
rsForEach(gScript, gIn, gOut, NULL);
}
Context ctxt = /* … */;
RenderScript rs = RenderScript.create(ctxt);
ScriptC_foo script = new ScriptC_foo(rs,
getResources(), R.raw.foo);
Allocation in = Allocation.createSized(rs,
Element.I8(rs), SIZE);
Allocation out = Allocation.createSized(rs,
Element.I8(rs), SIZE);
script.set_gIn(in); script.set_gOut(out);
script.set_gScript(script);
script.invoke_filter();
● Intelligent runtime load balances kernels
● Only on Android
● Creates Java classes to interface with kernels
● Limited documentation & shortage of examples
● Focused on performance portability
● No real idea of feature roadmap
10. Current Landscape – C++AMP
int in[SIZE], out[SIZE];
array_view<const int, 1> aIn(SIZE, in);
array_view<int, 1> aOut(SIZE, out);
aOut.discard_data();
parallel_for_each(aOut.extent,
[=](index<1> idx) restrict(amp)
{
aOut[idx] = aIn[idx];
}
);
// can access aOut[…] like normal
● Very well thought out single source approach
● Lovely use of C++ templates to capture type information, array dimensions
● Great use of C++11 lambdas for capturing kernel intent
● Part of target community is really C++11 averse, need convincing
● Limited low-level support
● Initial interest by community faded fast
● Xbox One will support C++AMP – watch this space
11. Current Landscape - OpenCL
void kernel foo(global int * a, global int * b)
{
int idx = get_global_id(0);
a[idx] = b[idx];
}
// device, context, queue, in, out already created
cl_program program =
clCreateProgramWithSource(context, 1,
&fooAsStr, NULL, NULL);
clBuildProgram(program, 1, &device,
NULL, NULL, NULL);
cl_kernel kernel = clCreateKernel(program,
"foo", NULL);
// set kernel arguments
clEnqueueNDRangeKernel(queue, kernel, 1,
NULL, &size, NULL, 0, NULL, NULL);
● Open standard with many contributors
● API is verbose, very very verbose!
● API puts control in developer hands
● Steep learning curve for new developers
● Support on lots of heterogeneous platforms – not just GPUs!
● Have to support diverse range of application types
12. Current Landscape
Modern systems have many compute-capable devices in them
Not unlike the fictitious system shown above!
16. Current Landscape
Scalar CPUs are the ‘normal’ target for programmers, easy to target, easy to use. Mostly a fallback target for compute currently.
Vector units are supported if kernel has vector types. Can auto-vectorize user kernels, as vector units are harder for ‘normal’ programmers to target.
Digital Signal Processors (DSPs) are a future target for the compute market. Can make no assumptions as to what they ‘look’ like.
GPUs are the reason we have compute in the first place. GPUs do not forgive poor code like a CPU or even a DSP could, and require large arrays of work to utilize.
17. Current Landscape
● Have to weigh up many competing concerns for languages
● Platform, operating system, device type, battery life, use case
18. What is wrong with the current landscape
19. What is wrong with the current landscape
● Compute approaches are not on all device and OS combinations
● No CUDA on AMD, RenderScript on iOS or C++AMP on Linux
● Have to support offline precise compute & time-bound online compute
● Very divergent targets/use cases/device types are problematic!
20. What is wrong with the current landscape
● What if loop count is always multiple of four?
void foo(int * a, int * b, int * count)
{
for(int idx = 0; idx < *(count); ++idx)
{
a[idx] = 42 * b[idx];
}
}
21. What is wrong with the current landscape
● What if loop count is always multiple of four?
● Can unroll the loop four times!
void foo(int * a, int * b, int * count)
{
for(int idx = 0; idx < *(count); idx += 4)
{
a[idx + 0] = 42 * b[idx + 0];
a[idx + 1] = 42 * b[idx + 1];
a[idx + 2] = 42 * b[idx + 2];
a[idx + 3] = 42 * b[idx + 3];
}
}
25. What is wrong with the current landscape
● What if loop count is always multiple of four?
● Can unroll the loop four times!
● What if pointers a & b are sixteen byte aligned?
● Can vectorize the loop body!
void foo(int * a, int * b, int * count)
{
int vecCount = *(count) / 4;
int4 * vA = (int4 * )a;
int4 * vB = (int4 * )b;
for(int idx = 0; idx < vecCount; ++idx)
{
vA[idx] = vB[idx] * (int4 )42;
}
}
● Why does my code look so radically different now?
● Current languages force drastic developer interventions
26. What is wrong with the current landscape
void foo(int * a, int * b, int * count)
{
int vecCount = *(count) / 4;
int4 * vA = (int4 * )a;
int4 * vB = (int4 * )b;
for(int idx = 0; idx < vecCount; ++idx)
{
vA[idx] = vB[idx] * (int4 )42;
}
}
● Existing languages (mostly) force developers to do coding wizardry that is unnecessary
● Also no real feedback to developer as ‘main’ compute target has highly secretive ISAs
● Don’t want to force vendors to reveal secrets, but do want ability to influence kernel code generation
27. What is wrong with the current landscape
● Rely on vendors to provide tools to aid development
● Debuggers, profilers, static analysis all increasingly required
● Libraries can vastly decrease development time
● Rely solely on vendors to provide all these complicated pieces
28. What is wrong with the current landscape
● Vendors already have lots of targets to support
● Every generation of devices needs to test conformance
● Need to support compilers, graphics, compute, tools, list goes on!
● Why should the vendor be the only one taking the burden?
29. What is wrong with the current landscape
● No one can agree on what is the ‘best’ approach
● Personal preference of developer/organization sways opinions
● Why not allow Lisp on a GPU? Lua on a DSP?
● Vendor doesn’t need extra headache of supporting these niche use cases
30. What is wrong with the current landscape
● My pitch – let community support compute standards
● Take the approach of LLVM & Clang
● Vendor has to support lower standard on their hardware
● But allows community to support & innovate
31. How to enable your language on GPUs
32. How to enable your language on GPUs
● First step – be able to compile language to a binary
● Can’t output real binary though
● Vendor doesn’t want to expose ISA
● Developer wants portability of compiled kernels
33. How to enable your language on GPUs
● Need to use an Intermediate Representation (IR)
● Two approaches in development for this!
● HSA Intermediate Language (HSAIL)
● OpenCL Standard Portable Intermediate Representation (SPIR)
34. How to enable your language on GPUs
Our
Language
Our
Language
●
Language -> LLVM IR -> HSAIL
●
Language -> LLVM IR -> SPIR
●
Low level mapping onto hardware, more of a virtual ISA
●
Then pass SPIR to OpenCL runtime as binary
●
Execute like normal OpenCL C Language kernel
●
Provisional specification available!
than an IR
●
HSAIL heavily in development
Neil Henning
neil@codeplay.com
35. How to enable your language on GPUs
HSA:
● HSA will provide a low-level runtime to interface between HSA compiled binaries and OS
● HSAIL is being standardized and ratified
● Existing JIT’ed languages potential targets
SPIR:
● OpenCL SPIR will require a SPIR compliant OpenCL implementation as target
● Can compile using LLVM, then use clCreateProgramWithBinary, passing SPIR options
36. How to enable your language on GPUs
● At present, SPIR is only target we can investigate
● Intel has OpenCL drivers with provisional SPIR support
● Can use Clang -> LLVM -> SPIR, then use Intel’s OpenCL to consume SPIR
● Can take code that compiles to LLVM and run it on OpenCL
37. How to enable your language on GPUs
● Various steps to getting your language working on GPUs with SPIR
● We’ll use Intel’s OpenCL SDK with provisional SPIR support:
1. Create a test harness to load a SPIR binary
2. Create a simple kernel using Intel’s SPIR compiler on host
3. Create a simple kernel using tip Clang (language OpenCL) targeting SPIR
4. Try other languages that compile to LLVM with SPIR target
38. How to enable your language on GPUs
// some SPIR bitcode file
const size_t spir_bc_length = ... ;
const unsigned char * spir_bc = ... ;
// already initialized platform, device & context for a SPIR compliant device
cl_platform_id platform = ... ;
cl_device_id device = ... ;
cl_context context = ... ;
// create our program with our SPIR bitcode file
cl_program program = clCreateProgramWithBinary(
context, 1, &device, &spir_bc_length, &spir_bc, NULL, NULL);
// build, passing arguments telling the compiler language is SPIR, and the SPIR standard we are using
clBuildProgram(program, 1, &device, "-x spir -spir-std=1.2", NULL, NULL);
39. How to enable your language on GPUs
// already initialized memory buffers for our context
cl_mem in_mem = ... ;
cl_mem out_mem = ... ;
// assume our kernel function from the spir kernel was called foo
cl_kernel kernel = clCreateKernel(program, "foo", NULL);
// assume our kernel has one read buffer as first argument, and one write buffer as second
clSetKernelArg(kernel, 0, sizeof(cl_mem), (void * )&in_mem);
clSetKernelArg(kernel, 1, sizeof(cl_mem), (void * )&out_mem);
40. How to enable your language on GPUs
// already initialized command queue
cl_command_queue queue = ... ;
cl_event write_event, run_event;
clEnqueueWriteBuffer(queue, in_mem, CL_FALSE, 0, BUFFER_SIZE,
&read_payload, 0, NULL, &write_event);
const size_t size = BUFFER_SIZE / sizeof(cl_int);
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &size, NULL, 1, &write_event, &run_event);
clEnqueueReadBuffer(queue, out_mem, CL_TRUE, 0, BUFFER_SIZE,
&result_payload, 1, &run_event, NULL);
41. How to enable your language on GPUs
● Now, create a simple OpenCL kernel
void kernel foo(global int * in, global int * out)
{
out[get_global_id(0)] = in[get_global_id(0)];
}
● And use Intel’s command line (or GUI!) tool to build
ioc32 -cmd=build -input foo.cl -spir32=foo.bc
42. How to enable your language on GPUs
● Next we point the buffer for our SPIR kernel at the generated SPIR kernel
● And it fails…?
● Turns out Intel’s OpenCL runtime doesn’t like us telling them they are building SPIR!
● Simply remove "-x spir -spir-std=1.2" from the build options and voila!
43. How to enable your language on GPUs
● Next step – use tip Clang to build our foo.cl kernel
clang -cc1 -triple spir-unknown-unknown -emit-llvm-bc foo.cl -o foo.bc
● Compiles ok, but when we run it fails…?
● So the Clang generated SPIR bitcode file could very well not work
● We’ll take a look at the readable IR for the Intel & Clang compiled kernels
47. How to enable your language on GPUs
● So the metadata is different!
● We could fix Clang to produce the right metadata…?
● Or just hack around!
● Let’s use Intel’s compiler to generate a stub function
● Then we can use an extern function defined in our Clang module!
48. How to enable your language on GPUs
extern int doSomething(int a);
void kernel foo(global int * in, global int * out)
{
int id = get_global_id(0);
out[id] = doSomething(in[id]);
}
int doSomething(int a)
{
return a;
}
49. How to enable your language on GPUs
● And it fails…?
● Intel’s compiler doesn’t like extern functions!
● We’ve already bodged it thus far…
● So let’s continue!
int __attribute__((weak)) doSomething(int a) { return 0; }
void kernel foo(global int * in, global int * out)
{
int id = get_global_id(0);
out[id] = doSomething(in[id]);
}
50. How to enable your language on GPUs
● More than a little nasty…
● Relies on Clang extension to declare function weak within OpenCL
● Relies on Intel using Clang and allowing extension
● But it works!
● Can build both the Intel stub code & the Clang actual code
● Then use llvm-link to pull them together!
51. How to enable your language on GPUs
● So now we can compile two OpenCL kernels, link them together, and run the result
● What is next? Want to enable your language!
● What about using Clang, but using a different language?
● C & C++ come to mind!
52. How to enable your language on GPUs
● Use a simple C file
int doSomething(int a)
{
return a;
}
● And use Clang to compile it
clang -cc1 -triple spir-unknown-unknown -emit-llvm-bc foo.c -o foo.bc
53. How to enable your language on GPUs
● Or a simple C++ file!
extern "C" int doSomething(int a);
template<typename T> T templatedSomething(const T t)
{
return t;
}
int doSomething(int a)
{
return templatedSomething(a);
}
54. How to enable your language on GPUs
● Let’s have some real C++ code
● Use features that OpenCL doesn’t provide us
● We’ll do a matrix multiplication in C++
● Use classes, constructors, templates
55. How to enable your language on GPUs
typedef float __attribute__((ext_vector_type(4))) float4;
typedef float __attribute__((ext_vector_type(16))) float16;
float __attribute__((overloadable)) dot(float4 a, float4 b);
template<typename T, unsigned int WIDTH, unsigned int HEIGHT> class Matrix
{
typedef T __attribute__((ext_vector_type(WIDTH))) RowType;
RowType rows[HEIGHT];
public:
Matrix() {}
template<typename U> Matrix(const U & u) { __builtin_memcpy(&rows, &u, sizeof(U)); }
RowType & operator[](const unsigned int index) { return rows[index]; }
const RowType & operator[](const unsigned int index) const { return rows[index]; }
};
56. How to enable your language on GPUs
template<typename T, unsigned int WIDTH, unsigned int HEIGHT>
Matrix<T, WIDTH, HEIGHT> operator *(const Matrix<T, WIDTH, HEIGHT> & a, const Matrix<T,
WIDTH, HEIGHT> & b)
{
Matrix<T, HEIGHT, WIDTH> bShuffled;
for(unsigned int h = 0; h < HEIGHT; h++)
for(unsigned int w = 0; w < WIDTH; w++)
bShuffled[w][h] = b[h][w];
Matrix<T, WIDTH, HEIGHT> result;
for(unsigned int h = 0; h < HEIGHT; h++)
for(unsigned int w = 0; w < WIDTH; w++)
result[h][w] = dot(a[h], bShuffled[w]);
return result;
}
57. How to enable your language on GPUs
extern "C" float16 doSomething(float16 a, float16 b);
float16 doSomething(float16 a, float16 b)
{
Matrix<float, 4, 4> matA(a);
Matrix<float, 4, 4> matB(b);
Matrix<float, 4, 4> mul = matA * matB;
float16 result = (float16 )0;
result.s0123 = mul[0];
result.s4567 = mul[1];
result.s89ab = mul[2];
result.scdef = mul[3];
return result;
}
58. How to enable your language on GPUs
● And when we run it…
ex5.vcxproj -> E:\AMDDeveloperSummit2013\build\Example5\Debug\ex5.exe
Found 2 platforms!
Choosing vendor 'Intel(R) Corporation'!
Found 1 devices!
SPIR file length '3948' bytes!
[ 0.0, 1.0, 2.0, 3.0] * [ 16.0, 15.0, 14.0, 13.0] = [ 40.0, 34.0, 28.0, 22.0]
[ 4.0, 5.0, 6.0, 7.0] * [ 12.0, 11.0, 10.0, 9.0] = [200.0, 178.0, 156.0, 134.0]
[ 8.0, 9.0, 10.0, 11.0] * [ 8.0, 7.0, 6.0, 5.0] = [360.0, 322.0, 284.0, 246.0]
[ 12.0, 13.0, 14.0, 15.0] * [ 4.0, 3.0, 2.0, 1.0] = [520.0, 466.0, 412.0, 358.0]
● Success!
59. How to enable your language on GPUs
● The least you need to target a GPU:
● Generate correct LLVM IR with SPIR metadata
● Or at least generate LLVM IR and use the approach we used to combine Clang and IOC generated kernels
!opencl.kernels = !{!0}
!opencl.enable.FP_CONTRACT = !{}
!opencl.spir.version = !{!6}
!opencl.ocl.version = !{!7}
!opencl.used.extensions = !{!8}
!opencl.used.optional.core.features = !{!8}
!opencl.compiler.options = !{!8}
!0 = metadata !{void (i32 addrspace(1)*, i32 addrspace(1)*)* @foo, metadata !1, metadata !2, metadata !3, metadata !4, metadata !5}
!1 = metadata !{metadata !"kernel_arg_addr_space", i32 1, i32 1}
!2 = metadata !{metadata !"kernel_arg_access_qual", metadata !"none", metadata !"none"}
!3 = metadata !{metadata !"kernel_arg_type", metadata !"int*", metadata !"int*"}
!4 = metadata !{metadata !"kernel_arg_type_qual", metadata !"", metadata !""}
!5 = metadata !{metadata !"kernel_arg_name", metadata !"a", metadata !"b"}
!6 = metadata !{i32 1, i32 0}
!7 = metadata !{i32 0, i32 0}
!8 = metadata !{}
60. How to enable your language on GPUs
● Porting C/C++ libraries to SPIR requires a little more work
int foo(int * a)
{
return *a;
}
● The data pointed to by ‘a’ will by default be put in the private address space
● But a straight conversion to SPIR needs all data in global address space
● Means that any porting of existing code could be quite intrusive
61. How to enable your language on GPUs
● To target your language at GPUs:
● Need to be able to segregate work into parallel chunks
● Have to ban certain features that don’t work with compute
● Need to deal with distinct address spaces
● Language could also provide an API onto OpenCL SPIR builtins
● But with OpenCL SPIR it is now possible to make any language work on a GPU!
63. Developing tools for GPUs
● Tools increasingly required to support development
● Even having printf (which OpenCL 1.2 added) is novel!
● But with increasingly complex code better tools needed
● Main three are debuggers, profilers and compiler-tools
64. Developing tools for GPUs
● Debuggers for compute are difficult for non-vendor to develop
● Codeplay has developed such tools on top of compute standards
● Problem is the bedrock for these tools can change at any time
● Hard to beat vendor-owned approach that has lower-level access
65. Developing tools for GPUs
Our
Language
●
Codeplay are pushing hard for HSA to have features
that aid tool development
●
Debuggers are much easier with instruction
support, debug info, change registers, call stacks
Our
Language
●
OpenCL SPIR harder to create debugger for without
vendor support
●
Can we standardize a way to debug OpenCL SPIR,
or allow debugging via emulation of SPIR?
Neil Henning
neil@codeplay.com
66. Developing tools for GPUs
● Profilers require superset of debugger feature-set
● Need to be able to trap kernels at defined points
● Accurate timings only other requirement beyond debugger support
● More fun when we go beyond performance, and measure power
67. Developing tools for GPUs
● HSA and OpenCL SPIR both good profiler targets
● Could split SPIR kernels into profiling sections
● Then use existing timing information in OpenCL
● HSA will only require debugger features we are pushing for
68. Developing tools for GPUs
● Compiler tools consist of optimizers and analysis
● Both HSA and OpenCL SPIR being based on LLVM enables this!
● We as compiler experts can aid existing runtimes
● You as developers can add optimizations & analyse your kernels!
70. Conclusion
● With the rise of open standards, compute is increasingly easy
● With HSA & OpenCL SPIR hardware is finally open to us!
● Just need standards to ratify, mature & be available on hardware!
● Next big push into compute is upon us