Java and GPU: where are we
now?
And why?
2
Dmitry Alexandrov
T-Systems | @bercut2000
3
4
5
What is a video card?
A video card (also called a display card, graphics card, display
adapter or graphics adapter) is an expansion card which generates a
feed of output images to a display (such as a computer monitor).
Frequently, these are advertised as discrete or dedicated graphics
cards, emphasizing the distinction between these and integrated
graphics.
6
What is a video card?
But as for today:
Video cards are no longer limited to simple image output: they have a built-in
graphics processor that can perform additional processing, offloading
this task from the computer's central processor.
7
So what does it do?
8
9
What is a GPU?
• Graphics Processing Unit
10
What is a GPU?
• Graphics Processing Unit
• First used by Nvidia in 1999
11
What is a GPU?
• Graphics Processing Unit
• First used by Nvidia in 1999
• GeForce 256 was marketed as «The world’s first GPU»
12
What is a GPU?
• Defined as “single-chip processor with integrated transform, lighting,
triangle setup/clipping, and rendering engines that is capable of processing
a minimum of 10 million polygons per second”
13
What is a GPU?
• Defined as “single-chip processor with integrated transform, lighting,
triangle setup/clipping, and rendering engines that is capable of processing
a minimum of 10 million polygons per second”
• ATI called them VPUs…
14
Conceptually, it looks like this
15
GPGPU
• General-purpose computing on graphics processing units
16
GPGPU
• General-purpose computing on graphics processing units
• Performs not only graphics calculations…
17
GPGPU
• General-purpose computing on graphics processing units
• Performs not only graphics calculations…
• … but also those usually performed on the CPU
18
So cool! We have to use
them!
19
Let’s look at the hardware!
20
Based on
“From Shader Code to a Teraflop: How GPU Shader Cores Work”, By Kayvon Fatahalian, Stanford University
The CPU in general looks like this
21
How to convert?
22
Let’s simplify!
23
Then let’s just clone them
24
To make a lot of them!
25
But we are doing the same
calculation just with different
data
26
So we come to SIMD paradigm
27
So we use this paradigm
28
And here we start to talk about vectors..
29
… and in the end we are here:
30
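The SIMD idea — one operation applied to many data elements at once — can be sketched in plain Java with parallel streams (class and method names here are mine, purely illustrative; a JVM runs this on CPU threads, but the shape of the computation is what a GPU lane executes):

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class SimdSketch {
    // Scalar version: one multiply at a time, one instruction per element.
    static float[] scaleScalar(float[] a, float k) {
        float[] out = new float[a.length];
        for (int i = 0; i < a.length; i++) out[i] = a[i] * k;
        return out;
    }

    // Data-parallel version: the same operation applied to every element
    // independently -- the SIMD mindset.
    static float[] scaleParallel(float[] a, float k) {
        float[] out = new float[a.length];
        IntStream.range(0, a.length).parallel().forEach(i -> out[i] = a[i] * k);
        return out;
    }

    public static void main(String[] args) {
        float[] a = {1f, 2f, 3f, 4f};
        // Both versions compute the same result.
        System.out.println(Arrays.equals(scaleScalar(a, 2f), scaleParallel(a, 2f))); // true
    }
}
```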
Nice! But how on earth can we
code here?!
31
It all started with a shader
• Cool video cards were able to offload some of the tasks from the CPU
32
It all started with a shader
• Cool video cards were able to offload some of the tasks from the CPU
• But most of the algorithms were just “hardcoded”
33
It all started with a shader
• Cool video cards were able to offload some of the tasks from the CPU
• But most of the algorithms were just “hardcoded”
• They were considered “standard”
34
It all started with a shader
• Cool video cards were able to offload some of the tasks from the CPU
• But most of the algorithms were just “hardcoded”
• They were considered “standard”
• Developers could just call them
35
It all started with a shader
• But it’s obvious that not everything can be done with “hardcoded” algorithms
36
It all started with a shader
• But it’s obvious that not everything can be done with “hardcoded” algorithms
• That’s why some vendors “opened access” for developers to run their own
algorithms in their own programs
37
It all started with a shader
• But it’s obvious that not everything can be done with “hardcoded” algorithms
• That’s why some vendors “opened access” for developers to run their own
algorithms in their own programs
• These programs are called Shaders
38
It all started with a shader
• But it’s obvious that not everything can be done with “hardcoded” algorithms
• That’s why some vendors “opened access” for developers to run their own
algorithms in their own programs
• These programs are called Shaders
• From that moment the video card could process transformations, geometry and
textures the way the developers wanted!
39
It all started with a shader
• The first shaders were specialized:
• Vertex
• Geometry
• Pixel
• Later they were unified into the Common Shader Architecture
40
There are several shader languages
• RenderMan
• OSL
• GLSL
• Cg
• DirectX ASM
• HLSL
• …
41
As an example:
42
With or without them
43
But they are so low level..
44
Keeping in mind that it all started with
gaming…
45
Several abstractions were created:
• OpenGL
• is a cross-language, cross-platform application programming interface (API) for
rendering 2D and 3D vector graphics. The API is typically used to interact with
a graphics processing unit (GPU), to achieve hardware-accelerated rendering.
• Silicon Graphics Inc., (SGI) started developing OpenGL in 1991 and released it in
January 1992;
• DirectX
• is a collection of application programming interfaces (APIs) for handling tasks related
to multimedia, especially game programming and video, on Microsoft platforms.
Originally, the names of these APIs all began with Direct, such
as Direct3D, DirectDraw, DirectMusic, DirectPlay, DirectSound, and so forth. The
name DirectX was coined as a shorthand term for all of these APIs (the X standing in
for the particular API names) and soon became the name of the collection.
46
By the way, what about Java?
47
OpenGL in Java
• JSR 231
• Started in 2003
• Latest release in 2008
• Supports OpenGL 2.0
48
OpenGL
• Now an independent project: JOGL
• Supports OpenGL up to 4.5
• Provides support for GLU and GLUT
• Access to the low-level C API via JNI
49
50
But somewhere around 2005 it was
finally realized that this can be used
for general computations as well
51
BrookGPU
• An early effort at GPGPU
• Its own subset of ANSI C
• The Brook streaming language
• Developed at Stanford University
52
GPGPU
• CUDA — Nvidia’s proprietary platform with its own C subset.
• DirectCompute — Microsoft proprietary shader language, part of
Direct3d, starting from DirectX 10.
• AMD FireStream — ATI proprietary technology.
• OpenACC – multivendor consortium
• C++ AMP – Microsoft proprietary language
• OpenCL – common standard controlled by the Khronos Group.
53
Why should we ever use GPU on Java
• Why Java
• Safe and secure
• Portability (“write once, run everywhere”)
• Used on 3 000 000 000 devices
54
Why should we ever use GPU on Java
• Why Java
• Safe and secure
• Portability (“write once, run everywhere”)
• Used on 3 000 000 000 devices
• Where can we apply GPU
• Data Analytics and Data Science (Hadoop, Spark …)
• Security analytics (log processing)
• Finance/Banking
55
For this we have:
56
But Java runs on the JVM…
and GPU programming is low level…
57
For low level we use:
• JNI (Java Native Interface)
• JNA (Java Native Access)
58
But we can go crazy there..
59
Someone actually did this…
60
But maybe there is something
done already?
61
For OpenCL:
• JOCL
• JogAmp
• JavaCL (not supported anymore)
62
… and for CUDA
• JCuda
• JCublas
• JCufft
• JCurand
• JCusparse
• JCusolver
• JNvgraph
• JCudpp
• JNpp
• JCudnn
63
Disclaimer: it’s hard to work with GPUs!
• It’s not just running a program
• You need to know your hardware!
• It’s low level…
64
Let’s start with:
65
What’s that?
• Short for Open Computing Language
• Consortium of Apple, nVidia, AMD, IBM, Intel, ARM, Motorola and
many more
• Very abstract model
• Works both on GPU and CPU
66
Should work on everything
67
All in all it works like this:
HOST DEVICE
Data
Program/Kernel
68
All in all it works like this:
HOST
69
All in all it works like this:
HOST DEVICE
Result
70
Typical lifecycle of an OpenCL app
• Create context
• Create command queue
• Create memory buffers/fill with data
• Create program from sources/load binaries
• Compile (if required)
• Create kernel from the program
• Supply kernel arguments
• Define ND range
• Execute
• Return resulting data
• Release resources
71
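The steps above map onto the OpenCL host API roughly like this (a pseudocode sketch using the real `clXxx` function names; argument lists abbreviated and error handling omitted):

```
// Pseudocode sketch of the OpenCL host-side lifecycle
context = clCreateContext(device)
queue   = clCreateCommandQueue(context, device)

bufA    = clCreateBuffer(context, CL_MEM_READ_ONLY, size)     // memory buffers
clEnqueueWriteBuffer(queue, bufA, hostA)                      // fill with data

program = clCreateProgramWithSource(context, kernelSource)    // or load binaries
clBuildProgram(program)                                       // compile (if required)
kernel  = clCreateKernel(program, "my_kernel")

clSetKernelArg(kernel, 0, bufA)                               // supply arguments
clEnqueueNDRangeKernel(queue, kernel, ndRange)                // define ND range + execute
clEnqueueReadBuffer(queue, bufResult, hostResult)             // return resulting data

clReleaseKernel / clReleaseProgram / clReleaseContext ...     // release resources
```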
Better take a look
72
73
1. There is the host code. It’s in
Java.
74
2. There is the device code: a
specific subset of C.
75
3. Communication between the
host and the device is done via
memory buffers.
76
So what can we actually transfer?
77
The data is not quite the same..
78
Datatypes: scalars
79
Datatypes: vectors
80
Datatypes: vectors
float f = 4.0f;
float3 f3 = (float3)(1.0f, 2.0f, 3.0f);
float4 f4 = (float4)(f3, f);
//f4.x = 1.0f,
//f4.y = 2.0f,
//f4.z = 3.0f,
//f4.w = 4.0f
81
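To make the component access concrete, here is a tiny plain-Java stand-in for `float4` (the `Float4` class is my own illustration; real OpenCL vector types are built into the kernel language and the hardware):

```java
public class Float4Demo {
    // Minimal stand-in for OpenCL's float4 with its .x/.y/.z/.w components.
    static class Float4 {
        final float x, y, z, w;
        Float4(float x, float y, float z, float w) {
            this.x = x; this.y = y; this.z = z; this.w = w;
        }
    }

    public static void main(String[] args) {
        float f = 4.0f;
        // Same construction as the slide: (float3)(1,2,3) extended by f.
        Float4 f4 = new Float4(1.0f, 2.0f, 3.0f, f);
        System.out.println(f4.x + " " + f4.y + " " + f4.z + " " + f4.w); // 1.0 2.0 3.0 4.0
    }
}
```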
So how are they saved there?
82
So how are they saved there?
In a hard way..
83
Memory Model
• __global
• __constant
• __local
• __private
84
Memory Model
85
But that’s not all
86
Remember SIMD?
87
Execution model
• We’ve got a lot of data
• We need to perform the same computations over all of it
• So we can just shard it
• OpenCL is here to help us
88
Execution model
89
ND Range – what is that?
90
For example: matrix multiplication
• We would write it like this:
void MatrixMul_sequential(int dim, float *A, float *B, float *C) {
    for (int iRow = 0; iRow < dim; ++iRow) {
        for (int iCol = 0; iCol < dim; ++iCol) {
            float result = 0.f;
            for (int i = 0; i < dim; ++i) {
                result += A[iRow*dim + i] * B[i*dim + iCol];
            }
            C[iRow*dim + iCol] = result;
        }
    }
}
91
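The same sequential multiplication ports directly to Java, with the matrices flattened into row-major 1-D arrays exactly as in the C version (class name is mine):

```java
public class MatMulSeq {
    // C = A * B for dim x dim matrices stored row-major in flat arrays.
    static void matrixMul(int dim, float[] A, float[] B, float[] C) {
        for (int iRow = 0; iRow < dim; ++iRow) {
            for (int iCol = 0; iCol < dim; ++iCol) {
                float result = 0f;
                for (int i = 0; i < dim; ++i) {
                    result += A[iRow * dim + i] * B[i * dim + iCol];
                }
                C[iRow * dim + iCol] = result;
            }
        }
    }

    public static void main(String[] args) {
        float[] A = {1, 2, 3, 4};   // 2x2: [[1,2],[3,4]]
        float[] B = {5, 6, 7, 8};   // 2x2: [[5,6],[7,8]]
        float[] C = new float[4];
        matrixMul(2, A, B, C);
        System.out.println(C[0] + " " + C[3]); // 19.0 50.0
    }
}
```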
For example: matrix multiplication
92
For example: matrix multiplication
• So on GPU:
__kernel void MatrixMul_kernel_basic(int dim,
    __global float *A, __global float *B, __global float *C) {
    // Get the index of the work-item
    int iCol = get_global_id(0);
    int iRow = get_global_id(1);
    float result = 0.0f;
    for (int i = 0; i < dim; ++i) {
        result += A[iRow*dim + i] * B[i*dim + iCol];
    }
    C[iRow*dim + iCol] = result;
}
93
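What the ND range does can be mimicked on the CPU: each work-item `(iCol, iRow)` computes one cell of C independently, with the loops over rows and columns replaced by the grid of work-items. A plain-Java simulation of that mapping (class name mine; on a stock JVM this runs on CPU threads, not the GPU):

```java
import java.util.stream.IntStream;

public class MatMulWorkItems {
    // Each "work-item" (one index of the 2-D ND range) computes exactly one
    // cell of C, like the kernel body with get_global_id(0)/get_global_id(1).
    static void matrixMul(int dim, float[] A, float[] B, float[] C) {
        IntStream.range(0, dim * dim).parallel().forEach(gid -> {
            int iCol = gid % dim;   // plays the role of get_global_id(0)
            int iRow = gid / dim;   // plays the role of get_global_id(1)
            float result = 0f;
            for (int i = 0; i < dim; ++i) {
                result += A[iRow * dim + i] * B[i * dim + iCol];
            }
            C[iRow * dim + iCol] = result;
        });
    }

    public static void main(String[] args) {
        float[] A = {1, 2, 3, 4}, B = {5, 6, 7, 8};
        float[] C = new float[4];
        matrixMul(2, A, B, C);
        System.out.println(C[0] + " " + C[1] + " " + C[2] + " " + C[3]); // 19.0 22.0 43.0 50.0
    }
}
```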
Typical GPU
--- Info for device GeForce GT 650M: ---
CL_DEVICE_NAME: GeForce GT 650M
CL_DEVICE_VENDOR: NVIDIA
CL_DRIVER_VERSION: 10.14.20 355.10.05.15f03
CL_DEVICE_TYPE: CL_DEVICE_TYPE_GPU
CL_DEVICE_MAX_COMPUTE_UNITS: 2
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
CL_DEVICE_MAX_WORK_ITEM_SIZES: 1024 / 1024 / 64
CL_DEVICE_MAX_WORK_GROUP_SIZE: 1024
CL_DEVICE_MAX_CLOCK_FREQUENCY: 900 MHz
CL_DEVICE_ADDRESS_BITS: 64
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 256 MByte
CL_DEVICE_GLOBAL_MEM_SIZE: 1024 MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT: no
CL_DEVICE_LOCAL_MEM_TYPE: local
CL_DEVICE_LOCAL_MEM_SIZE: 48 KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte
CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE
CL_DEVICE_IMAGE_SUPPORT: 1
CL_DEVICE_MAX_READ_IMAGE_ARGS: 256
CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 16
CL_DEVICE_SINGLE_FP_CONFIG: CL_FP_DENORM CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST CL_FP_ROUND_TO_ZERO CL_FP_ROUND_TO_INF
CL_FP_CORRECTLY_ROUNDED_DIVIDE_SQRT
CL_DEVICE_2D_MAX_WIDTH 16384
CL_DEVICE_2D_MAX_HEIGHT 16384
CL_DEVICE_3D_MAX_WIDTH 2048
CL_DEVICE_3D_MAX_HEIGHT 2048
CL_DEVICE_3D_MAX_DEPTH 2048
CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t> CHAR 1, SHORT 1, INT 1, LONG 1, FLOAT 1, DOUBLE 1
95
Typical CPU
--- Info for device Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz: ---
CL_DEVICE_NAME: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
CL_DEVICE_VENDOR: Intel
CL_DRIVER_VERSION: 1.1
CL_DEVICE_TYPE: CL_DEVICE_TYPE_CPU
CL_DEVICE_MAX_COMPUTE_UNITS: 8
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
CL_DEVICE_MAX_WORK_ITEM_SIZES: 1024 / 1 / 1
CL_DEVICE_MAX_WORK_GROUP_SIZE: 1024
CL_DEVICE_MAX_CLOCK_FREQUENCY: 2600 MHz
CL_DEVICE_ADDRESS_BITS: 64
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 2048 MByte
CL_DEVICE_GLOBAL_MEM_SIZE: 8192 MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT: no
CL_DEVICE_LOCAL_MEM_TYPE: global
CL_DEVICE_LOCAL_MEM_SIZE: 32 KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte
CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE
CL_DEVICE_IMAGE_SUPPORT: 1
CL_DEVICE_MAX_READ_IMAGE_ARGS: 128
CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 8
CL_DEVICE_SINGLE_FP_CONFIG: CL_FP_DENORM CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST CL_FP_ROUND_TO_ZERO CL_FP_ROUND_TO_INF
CL_FP_FMA CL_FP_CORRECTLY_ROUNDED_DIVIDE_SQRT
CL_DEVICE_2D_MAX_WIDTH 8192
CL_DEVICE_2D_MAX_HEIGHT 8192
CL_DEVICE_3D_MAX_WIDTH 2048
CL_DEVICE_3D_MAX_HEIGHT 2048
CL_DEVICE_3D_MAX_DEPTH 2048
CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t> CHAR 16, SHORT 8, INT 4, LONG 2, FLOAT 4, DOUBLE 2
96
And what about CUDA?
97
And what about CUDA?
Well.. It looks to be easier
98
And what about CUDA?
Well.. It looks to be easier
for C developers…
99
CUDA kernel
#define N 10
__global__ void add( int *a, int *b, int *c ) {
int tid = blockIdx.x; // this thread handles the data at its thread id
if (tid < N)
c[tid] = a[tid] + b[tid];
}
100
CUDA setup
int a[N], b[N], c[N];
int *dev_a, *dev_b, *dev_c;
// allocate the memory on the GPU
cudaMalloc( (void**)&dev_a, N * sizeof(int) );
cudaMalloc( (void**)&dev_b, N * sizeof(int) );
cudaMalloc( (void**)&dev_c, N * sizeof(int) );
// fill the arrays 'a' and 'b' on the CPU
for (int i=0; i<N; i++) {
a[i] = -i;
b[i] = i * i;
}
101
CUDA copy to memory and run
// copy the arrays 'a' and 'b' to the GPU
cudaMemcpy(dev_a, a, N *sizeof(int),
cudaMemcpyHostToDevice);
cudaMemcpy(dev_b,b,N*sizeof(int),
cudaMemcpyHostToDevice);
add<<<N,1>>>(dev_a,dev_b,dev_c);
// copy the array 'c' back from the GPU to the CPU
cudaMemcpy(c,dev_c,N*sizeof(int),
cudaMemcpyDeviceToHost);
102
CUDA get results
// display the results
for (int i=0; i<N; i++) {
printf( "%d + %d = %d\n", a[i], b[i], c[i] );
}
// free the memory allocated on the GPU
cudaFree( dev_a );
cudaFree( dev_b );
cudaFree( dev_c );
103
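For comparison, everything the CUDA round-trip above computes boils down to this plain-Java loop — no device allocation, no host-to-device copies (class name is mine):

```java
public class VectorAdd {
    // Same computation as the CUDA example:
    // a[i] = -i, b[i] = i*i, c[i] = a[i] + b[i].
    static int[] add(int n) {
        int[] a = new int[n], b = new int[n], c = new int[n];
        for (int i = 0; i < n; i++) { a[i] = -i; b[i] = i * i; }
        // The kernel body, one "thread" per element, run sequentially here.
        for (int tid = 0; tid < n; tid++) c[tid] = a[tid] + b[tid];
        return c;
    }

    public static void main(String[] args) {
        int[] c = add(10);
        for (int i = 0; i < c.length; i++) {
            System.out.printf("%d + %d = %d%n", -i, i * i, c[i]);
        }
    }
}
```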
But CUDA has some other superpowers
• JCublas – all about matrices
• JCufft – Fast Fourier Transform
• JCurand – all about random numbers
• JCusparse – sparse matrices
• JCusolver – factorization and some other crazy stuff
• JNvgraph – all about graphs
• JCudpp – CUDA Data Parallel Primitives Library, and some sorting
• JNpp – image processing on GPU
• JCudnn – Deep Neural Network library (that’s scary)
104
For example we need a good rand
int n = 100;
curandGenerator generator = new curandGenerator();
float hostData[] = new float[n];
Pointer deviceData = new Pointer();
cudaMalloc(deviceData, n * Sizeof.FLOAT);
curandCreateGenerator(generator, CURAND_RNG_PSEUDO_DEFAULT);
curandSetPseudoRandomGeneratorSeed(generator, 1234);
curandGenerateUniform(generator, deviceData, n);
cudaMemcpy(Pointer.to(hostData), deviceData,
n * Sizeof.FLOAT, cudaMemcpyDeviceToHost);
System.out.println(Arrays.toString(hostData));
curandDestroyGenerator(generator);
cudaFree(deviceData);
105
For example we need a good rand
• With a strong theory underneath
• Developed by Russian mathematician Ilya M. Sobol back in 1967
• https://en.wikipedia.org/wiki/Sobol_sequence
106
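The first dimension of a Sobol sequence reduces to the van der Corput sequence in base 2 — the bits of the index mirrored around the binary point. A minimal plain-Java sketch of that 1-D case (class name mine; the real cuRAND generator handles many dimensions using direction numbers):

```java
public class Sobol1D {
    // i-th point of the 1-D Sobol / van der Corput sequence in base 2:
    // reverse the bits of i around the binary point.
    static double point(long i) {
        double result = 0.0, f = 0.5;
        for (; i > 0; i >>= 1, f *= 0.5) {
            if ((i & 1) == 1) result += f;
        }
        return result;
    }

    public static void main(String[] args) {
        // The points fill the unit interval far more evenly than plain random.
        for (long i = 1; i <= 4; i++) System.out.print(point(i) + " "); // 0.5 0.25 0.75 0.125
        System.out.println();
    }
}
```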
nVidia memory looks like this
107
Btw.. Talking about memory
108
©Wikipedia
Optimizations…
__kernel void MatrixMul_kernel_basic(int dim,
__global float *A,
__global float *B,
__global float *C){
int iCol = get_global_id(0);
int iRow = get_global_id(1);
float result = 0.0;
for(int i=0;i< dim;++i)
{
result +=
A[iRow*dim + i]*B[i*dim + iCol];
}
C[iRow*dim + iCol] = result;
}
109
<—Optimizations
#define VECTOR_SIZE 4
__kernel void MatrixMul_kernel_basic_vector4(int dim,
    __global float4 *A,
    __global float4 *B,
    __global float *C) {
    int localIdx = get_global_id(0);
    int localIdy = get_global_id(1);
    float4 Bvector[4];
    float4 Avector, temp;
    float4 resultVector[4] = {0,0,0,0};
    int rowElements = dim/VECTOR_SIZE;
    for (int i = 0; i < rowElements; ++i) {
        Avector = A[localIdy*rowElements + i];
        Bvector[0] = B[dim*i + localIdx];
        Bvector[1] = B[dim*i + rowElements + localIdx];
        Bvector[2] = B[dim*i + 2*rowElements + localIdx];
        Bvector[3] = B[dim*i + 3*rowElements + localIdx];
        temp = (float4)(Bvector[0].x, Bvector[1].x, Bvector[2].x, Bvector[3].x);
        resultVector[0] += Avector * temp;
        temp = (float4)(Bvector[0].y, Bvector[1].y, Bvector[2].y, Bvector[3].y);
        resultVector[1] += Avector * temp;
        temp = (float4)(Bvector[0].z, Bvector[1].z, Bvector[2].z, Bvector[3].z);
        resultVector[2] += Avector * temp;
        temp = (float4)(Bvector[0].w, Bvector[1].w, Bvector[2].w, Bvector[3].w);
        resultVector[3] += Avector * temp;
    }
    C[localIdy*dim + localIdx*VECTOR_SIZE]     = resultVector[0].x + resultVector[0].y + resultVector[0].z + resultVector[0].w;
    C[localIdy*dim + localIdx*VECTOR_SIZE + 1] = resultVector[1].x + resultVector[1].y + resultVector[1].z + resultVector[1].w;
    C[localIdy*dim + localIdx*VECTOR_SIZE + 2] = resultVector[2].x + resultVector[2].y + resultVector[2].z + resultVector[2].w;
    C[localIdy*dim + localIdx*VECTOR_SIZE + 3] = resultVector[3].x + resultVector[3].y + resultVector[3].z + resultVector[3].w;
}
110
But we don’t want to have C at
all…
112
We don’t want to think about
those hosts and devices…
113
We can use GPU partially..
114
Project Sumatra
• Research project
115
Project Sumatra
• Research project
• Focused on Java 8
116
Project Sumatra
• Research project
• Focused on Java 8
• … to be more precise, on streams
117
Project Sumatra
• Research project
• Focused on Java 8
• … to be more precise, on streams
• … and even more precisely, on lambdas and .forEach()
118
AMD HSAIL
119
AMD HSAIL
120
AMD HSAIL
• Detects the forEach() block
• Gets HSAIL code via Graal
• At a low level, supplies the
kernel generated from the
lambda to the GPU
121
AMD APU tries to solve the main issue..
122
©Wikipedia
But if we want some more
general solution..
123
IBM patched JVM for GPU
• Focused on CUDA (for now)
• Focused on Stream API
• Created their own .parallel()
124
IBM patched JVM for GPU
Imagine:
void fooJava(float[] a, float[] b, int n) {
// similar to for (int i = 0; i < n; i++)
IntStream.range(0, n).parallel().forEach(i -> { b[i] = a[i] * 2.0f; });
}
125
IBM patched JVM for GPU
Imagine:
void fooJava(float[] a, float[] b, int n) {
// similar to for (int i = 0; i < n; i++)
IntStream.range(0, n).parallel().forEach(i -> { b[i] = a[i] * 2.0f; });
}
… we would like the lambda to be automatically converted to GPU code…
126
IBM patched JVM for GPU
When n is big, the lambda code is executed on the GPU:
class Par {
void foo(float[] a, float[] b, float[] c, int n) {
IntStream.range(0, n).parallel()
.forEach(i -> {
b[i] = a[i] * 2.0f;
c[i] = a[i] * 3.0f;
});
}
}
*only lambdas over primitive types in one-dimensional arrays.
127
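On a stock JVM this pattern simply runs on CPU threads; only IBM's modified JIT offloads it to CUDA. A runnable sketch of the same shape (class name is mine):

```java
import java.util.stream.IntStream;

public class ParDemo {
    // Same shape as the IBM example: one parallel lambda over primitive arrays.
    static void foo(float[] a, float[] b, float[] c, int n) {
        IntStream.range(0, n).parallel().forEach(i -> {
            b[i] = a[i] * 2.0f;
            c[i] = a[i] * 3.0f;
        });
    }

    public static void main(String[] args) {
        int n = 4;
        float[] a = {1, 2, 3, 4}, b = new float[n], c = new float[n];
        foo(a, b, c, n);
        System.out.println(b[3] + " " + c[3]); // 8.0 12.0
    }
}
```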
IBM patched JVM for GPU
Optimized IBM JIT compiler:
• Uses the read-only cache
• Fewer writes to global GPU memory
• Optimized host-to-device data copy rate
• Less data to be copied
• Eliminates exceptions as much as possible
• In the GPU kernel
128
IBM patched JVM for GPU
• Success story:
129
IBM patched JVM for GPU
• Officially:
130
IBM patched JVM for GPU
• More info:
https://github.com/IBMSparkGPU/GPUEnabler
131
But can we just write plain Java
and have it converted to
OpenCL/CUDA automatically?
132
Yes, you can!
133
Aparapi is there for you!
134
Aparapi
• Short for «A PARallel API»
135
Aparapi
• Short for «A PARallel API»
• Works like Hibernate for databases
136
Aparapi
• Short for «A PARallel API»
• Works like Hibernate for databases
• Dynamically converts JVM Bytecode to code for Host and Device
137
Aparapi
• Short for «A PARallel API»
• Works like Hibernate for databases
• Dynamically converts JVM Bytecode to code for Host and Device
• OpenCL under the cover
138
Aparapi
• Started by AMD
139
Aparapi
• Started by AMD
• Then abandoned…
140
Aparapi
• Started by AMD
• Then abandoned…
• Five years later, open-sourced under the Apache 2.0 license
141
Aparapi
• Started by AMD
• Then abandoned…
• Five years later, open-sourced under the Apache 2.0 license
• Back to life!!!
142
Aparapi – now it’s so much simpler!
public static void main(String[] _args) {
final int size = 512;
final float[] a = new float[size];
final float[] b = new float[size];
for (int i = 0; i < size; i++) {
a[i] = (float) (Math.random() * 100);
b[i] = (float) (Math.random() * 100);
}
final float[] sum = new float[size];
Kernel kernel = new Kernel(){
@Override public void run() {
int gid = getGlobalId();
sum[gid] = a[gid] + b[gid];
}
};
kernel.execute(Range.create(size));
for (int i = 0; i < size; i++) {
System.out.printf("%6.2f + %6.2f = %8.2f\n", a[i], b[i], sum[i]);
}
kernel.dispose();
}
143
But what about the clouds?
144
We can’t sell our product if it’s
not cloud-native!
145
nVidia is your friend!
146
nVidia GRID
• Announced in 2012
• Already in production
• Works on most of the
hypervisors
• .. And in the clouds!
147
nVidia GRID
148
nVidia GRID
149
… AMD is a bit behind…
150
Anyway, it’s here!
151
It’s here: Nvidia GPU
152
It’s here: ATI Radeon
153
It’s here: AMD APU
154
It’s here: Intel Skylake
155
It’s here: Nvidia Tegra Parker
156
Intel with VEGA??
157
But first read:
158
So use it!
159
So use it!
If the task is suitable
160
…it’s hard,
but worth it!
161
You will rule’em’all!
162
Thanks!
Dank je!
Merci beaucoup!
163
164
 
HPC DAY 2017 | NVIDIA Volta Architecture. Performance. Efficiency. Availability
HPC DAY 2017 | NVIDIA Volta Architecture. Performance. Efficiency. AvailabilityHPC DAY 2017 | NVIDIA Volta Architecture. Performance. Efficiency. Availability
HPC DAY 2017 | NVIDIA Volta Architecture. Performance. Efficiency. Availability
 
Latency tracing in distributed Java applications
Latency tracing in distributed Java applicationsLatency tracing in distributed Java applications
Latency tracing in distributed Java applications
 
Libnetwork updates
Libnetwork updatesLibnetwork updates
Libnetwork updates
 
Model Simulation, Graphical Animation, and Omniscient Debugging with EcoreToo...
Model Simulation, Graphical Animation, and Omniscient Debugging with EcoreToo...Model Simulation, Graphical Animation, and Omniscient Debugging with EcoreToo...
Model Simulation, Graphical Animation, and Omniscient Debugging with EcoreToo...
 
HPC DAY 2017 | Altair's PBS Pro: Your Gateway to HPC Computing
HPC DAY 2017 | Altair's PBS Pro: Your Gateway to HPC ComputingHPC DAY 2017 | Altair's PBS Pro: Your Gateway to HPC Computing
HPC DAY 2017 | Altair's PBS Pro: Your Gateway to HPC Computing
 
HPC DAY 2017 | Accelerating tomorrow's HPC and AI workflows with Intel Archit...
HPC DAY 2017 | Accelerating tomorrow's HPC and AI workflows with Intel Archit...HPC DAY 2017 | Accelerating tomorrow's HPC and AI workflows with Intel Archit...
HPC DAY 2017 | Accelerating tomorrow's HPC and AI workflows with Intel Archit...
 
Raspberry home server
Raspberry home serverRaspberry home server
Raspberry home server
 
Database Security Threats - MariaDB Security Best Practices
Database Security Threats - MariaDB Security Best PracticesDatabase Security Threats - MariaDB Security Best Practices
Database Security Threats - MariaDB Security Best Practices
 
HPC DAY 2017 | Prometheus - energy efficient supercomputing
HPC DAY 2017 | Prometheus - energy efficient supercomputingHPC DAY 2017 | Prometheus - energy efficient supercomputing
HPC DAY 2017 | Prometheus - energy efficient supercomputing
 
LinuxKit and OpenOverlay
LinuxKit and OpenOverlayLinuxKit and OpenOverlay
LinuxKit and OpenOverlay
 
HPC DAY 2017 | HPE Storage and Data Management for Big Data
HPC DAY 2017 | HPE Storage and Data Management for Big DataHPC DAY 2017 | HPE Storage and Data Management for Big Data
HPC DAY 2017 | HPE Storage and Data Management for Big Data
 
GPU databases - How to use them and what the future holds
GPU databases - How to use them and what the future holdsGPU databases - How to use them and what the future holds
GPU databases - How to use them and what the future holds
 
Design patterns in Java - Monitis 2017
Design patterns in Java - Monitis 2017Design patterns in Java - Monitis 2017
Design patterns in Java - Monitis 2017
 
Getting Started with Embedded Python: MicroPython and CircuitPython
Getting Started with Embedded Python: MicroPython and CircuitPythonGetting Started with Embedded Python: MicroPython and CircuitPython
Getting Started with Embedded Python: MicroPython and CircuitPython
 
An Introduction to OMNeT++ 5.1
An Introduction to OMNeT++ 5.1An Introduction to OMNeT++ 5.1
An Introduction to OMNeT++ 5.1
 
Drive into calico architecture
Drive into calico architectureDrive into calico architecture
Drive into calico architecture
 
Vertx
VertxVertx
Vertx
 
세션1. block chain as a platform
세션1. block chain as a platform세션1. block chain as a platform
세션1. block chain as a platform
 
Scylla Summit 2017: Repair, Backup, Restore: Last Thing Before You Go to Prod...
Scylla Summit 2017: Repair, Backup, Restore: Last Thing Before You Go to Prod...Scylla Summit 2017: Repair, Backup, Restore: Last Thing Before You Go to Prod...
Scylla Summit 2017: Repair, Backup, Restore: Last Thing Before You Go to Prod...
 

Similar to Java on the GPU: Where are we now?

Add sale davinci
Add sale davinciAdd sale davinci
Add sale davinciAkash Sahoo
 
Using GPUs to handle Big Data with Java by Adam Roberts.
Using GPUs to handle Big Data with Java by Adam Roberts.Using GPUs to handle Big Data with Java by Adam Roberts.
Using GPUs to handle Big Data with Java by Adam Roberts.J On The Beach
 
Oh the compilers you'll build
Oh the compilers you'll buildOh the compilers you'll build
Oh the compilers you'll buildMark Stoodley
 
Introduction to Computing on GPU
Introduction to Computing on GPUIntroduction to Computing on GPU
Introduction to Computing on GPUIlya Kuzovkin
 
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese..."Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...Edge AI and Vision Alliance
 
Quick prototyping using Gadgeteer, Raspberry Pi + Fez Cream
Quick prototyping using Gadgeteer, Raspberry Pi + Fez CreamQuick prototyping using Gadgeteer, Raspberry Pi + Fez Cream
Quick prototyping using Gadgeteer, Raspberry Pi + Fez CreamMif Masterz
 
Lets have a look at Apple's Metal Framework
Lets have a look at Apple's Metal FrameworkLets have a look at Apple's Metal Framework
Lets have a look at Apple's Metal FrameworkLINE Corporation
 
Vulnerabilities of machine learning infrastructure
Vulnerabilities of machine learning infrastructureVulnerabilities of machine learning infrastructure
Vulnerabilities of machine learning infrastructureSergey Gordeychik
 
GPU Programming: CocoaConf Atlanta
GPU Programming: CocoaConf AtlantaGPU Programming: CocoaConf Atlanta
GPU Programming: CocoaConf AtlantaJanie Clayton
 
10. GPU - Video Card (Display, Graphics, VGA)
10. GPU - Video Card (Display, Graphics, VGA)10. GPU - Video Card (Display, Graphics, VGA)
10. GPU - Video Card (Display, Graphics, VGA)Akhila Dakshina
 
Ice Age melting down: Intel features considered usefull!
Ice Age melting down: Intel features considered usefull!Ice Age melting down: Intel features considered usefull!
Ice Age melting down: Intel features considered usefull!Peter Hlavaty
 
Linxu conj2016 96boards
Linxu conj2016 96boardsLinxu conj2016 96boards
Linxu conj2016 96boardsLF Events
 
Developing Next-Generation Games with Stage3D (Molehill)
Developing Next-Generation Games with Stage3D (Molehill) Developing Next-Generation Games with Stage3D (Molehill)
Developing Next-Generation Games with Stage3D (Molehill) Jean-Philippe Doiron
 
Build an Open Hardware GNU/Linux PowerPC Notebook
Build an Open Hardware GNU/Linux PowerPC NotebookBuild an Open Hardware GNU/Linux PowerPC Notebook
Build an Open Hardware GNU/Linux PowerPC NotebookRoberto Innocenti
 
Utilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmapUtilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmapGeorge Markomanolis
 
SMP implementation for OpenBSD/sgi
SMP implementation for OpenBSD/sgiSMP implementation for OpenBSD/sgi
SMP implementation for OpenBSD/sgiTakuya ASADA
 
“A New, Open-standards-based, Open-source Programming Model for All Accelerat...
“A New, Open-standards-based, Open-source Programming Model for All Accelerat...“A New, Open-standards-based, Open-source Programming Model for All Accelerat...
“A New, Open-standards-based, Open-source Programming Model for All Accelerat...Edge AI and Vision Alliance
 

Similar to Java on the GPU: Where are we now? (20)

Add sale davinci
Add sale davinciAdd sale davinci
Add sale davinci
 
Using GPUs to handle Big Data with Java by Adam Roberts.
Using GPUs to handle Big Data with Java by Adam Roberts.Using GPUs to handle Big Data with Java by Adam Roberts.
Using GPUs to handle Big Data with Java by Adam Roberts.
 
Oh the compilers you'll build
Oh the compilers you'll buildOh the compilers you'll build
Oh the compilers you'll build
 
Introduction to Computing on GPU
Introduction to Computing on GPUIntroduction to Computing on GPU
Introduction to Computing on GPU
 
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese..."Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
 
Quick prototyping using Gadgeteer, Raspberry Pi + Fez Cream
Quick prototyping using Gadgeteer, Raspberry Pi + Fez CreamQuick prototyping using Gadgeteer, Raspberry Pi + Fez Cream
Quick prototyping using Gadgeteer, Raspberry Pi + Fez Cream
 
Getting started with AMD GPUs
Getting started with AMD GPUsGetting started with AMD GPUs
Getting started with AMD GPUs
 
Lets have a look at Apple's Metal Framework
Lets have a look at Apple's Metal FrameworkLets have a look at Apple's Metal Framework
Lets have a look at Apple's Metal Framework
 
Vulnerabilities of machine learning infrastructure
Vulnerabilities of machine learning infrastructureVulnerabilities of machine learning infrastructure
Vulnerabilities of machine learning infrastructure
 
GPU Programming: CocoaConf Atlanta
GPU Programming: CocoaConf AtlantaGPU Programming: CocoaConf Atlanta
GPU Programming: CocoaConf Atlanta
 
What is OpenGL ?
What is OpenGL ?What is OpenGL ?
What is OpenGL ?
 
10. GPU - Video Card (Display, Graphics, VGA)
10. GPU - Video Card (Display, Graphics, VGA)10. GPU - Video Card (Display, Graphics, VGA)
10. GPU - Video Card (Display, Graphics, VGA)
 
Ice Age melting down: Intel features considered usefull!
Ice Age melting down: Intel features considered usefull!Ice Age melting down: Intel features considered usefull!
Ice Age melting down: Intel features considered usefull!
 
Linxu conj2016 96boards
Linxu conj2016 96boardsLinxu conj2016 96boards
Linxu conj2016 96boards
 
Developing Next-Generation Games with Stage3D (Molehill)
Developing Next-Generation Games with Stage3D (Molehill) Developing Next-Generation Games with Stage3D (Molehill)
Developing Next-Generation Games with Stage3D (Molehill)
 
Build an Open Hardware GNU/Linux PowerPC Notebook
Build an Open Hardware GNU/Linux PowerPC NotebookBuild an Open Hardware GNU/Linux PowerPC Notebook
Build an Open Hardware GNU/Linux PowerPC Notebook
 
Utilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmapUtilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmap
 
SMP implementation for OpenBSD/sgi
SMP implementation for OpenBSD/sgiSMP implementation for OpenBSD/sgi
SMP implementation for OpenBSD/sgi
 
“A New, Open-standards-based, Open-source Programming Model for All Accelerat...
“A New, Open-standards-based, Open-source Programming Model for All Accelerat...“A New, Open-standards-based, Open-source Programming Model for All Accelerat...
“A New, Open-standards-based, Open-source Programming Model for All Accelerat...
 
Mesa and Its Debugging
Mesa and Its DebuggingMesa and Its Debugging
Mesa and Its Debugging
 

Recently uploaded

SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?Alexandre Beguel
 
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...Bert Jan Schrijver
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfDrew Moseley
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Developmentvyaparkranti
 
Large Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLarge Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLionel Briand
 
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...OnePlan Solutions
 
VictoriaMetrics Anomaly Detection Updates: Q1 2024
VictoriaMetrics Anomaly Detection Updates: Q1 2024VictoriaMetrics Anomaly Detection Updates: Q1 2024
VictoriaMetrics Anomaly Detection Updates: Q1 2024VictoriaMetrics
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfMarharyta Nedzelska
 
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full RecordingOpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full RecordingShane Coughlan
 
What’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesWhat’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesVictoriaMetrics
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationBradBedford3
 
Not a Kubernetes fan? The state of PaaS in 2024
Not a Kubernetes fan? The state of PaaS in 2024Not a Kubernetes fan? The state of PaaS in 2024
Not a Kubernetes fan? The state of PaaS in 2024Anthony Dahanne
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Rob Geurden
 
Best Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITBest Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITmanoharjgpsolutions
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Angel Borroy López
 
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxThe Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxRTS corp
 
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfExploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfkalichargn70th171
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtimeandrehoraa
 
Understanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM ArchitectureUnderstanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM Architecturerahul_net
 
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingOpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingShane Coughlan
 

Recently uploaded (20)

SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?
 
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdf
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Development
 
Large Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLarge Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and Repair
 
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
 
VictoriaMetrics Anomaly Detection Updates: Q1 2024
VictoriaMetrics Anomaly Detection Updates: Q1 2024VictoriaMetrics Anomaly Detection Updates: Q1 2024
VictoriaMetrics Anomaly Detection Updates: Q1 2024
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdf
 
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full RecordingOpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
 
What’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesWhat’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 Updates
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 
Not a Kubernetes fan? The state of PaaS in 2024
Not a Kubernetes fan? The state of PaaS in 2024Not a Kubernetes fan? The state of PaaS in 2024
Not a Kubernetes fan? The state of PaaS in 2024
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...
 
Best Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITBest Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh IT
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxThe Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
 
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfExploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
Understanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM ArchitectureUnderstanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM Architecture
 
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingOpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
 

Then let’s just clone them
24
To make a lot of them!
25
But we are doing the same calculation, just with different data
26
So we come to the SIMD paradigm
27
So we use this paradigm
28
And here we start to talk about vectors..
29
… and in the end we are here:
30
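The SIMD idea (one operation stream applied to many data elements at once) can be sketched in plain Java. This is a host-side analogy only, not GPU code; the class and method names are made up for illustration:

```java
import java.util.stream.IntStream;

public class SimdSketch {
    // The same multiply-add applied to every element:
    // conceptually one "instruction", many data lanes.
    static float[] mulAdd(float[] a, float[] b) {
        float[] c = new float[a.length];
        IntStream.range(0, a.length)
                 .parallel()
                 .forEach(i -> c[i] = a[i] * b[i] + 1f);
        return c;
    }

    public static void main(String[] args) {
        float[] c = mulAdd(new float[]{1f, 2f, 3f, 4f},
                           new float[]{10f, 20f, 30f, 40f});
        System.out.println(java.util.Arrays.toString(c));
    }
}
```

A real GPU would run all the lanes in hardware; here the parallel stream merely hints at that structure.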
Nice! But how on earth can we code here?!
31
It all started with a shader
• Cool video cards were able to offload some of the tasks from the CPU
32
It all started with a shader
• Cool video cards were able to offload some of the tasks from the CPU
• But most of the algorithms were just “hardcoded”
33
It all started with a shader
• Cool video cards were able to offload some of the tasks from the CPU
• But most of the algorithms were just “hardcoded”
• They were considered “standard”
34
It all started with a shader
• Cool video cards were able to offload some of the tasks from the CPU
• But most of the algorithms were just “hardcoded”
• They were considered “standard”
• Developers were only able to call them
35
It all started with a shader
• But it’s obvious not everything can be done with “hardcoded” algorithms
36
It all started with a shader
• But it’s obvious not everything can be done with “hardcoded” algorithms
• That’s why some of the vendors “opened access” for developers to use their own algorithms in their own programs
37
It all started with a shader
• But it’s obvious not everything can be done with “hardcoded” algorithms
• That’s why some of the vendors “opened access” for developers to use their own algorithms in their own programs
• These programs are called Shaders
38
It all started with a shader
• But it’s obvious not everything can be done with “hardcoded” algorithms
• That’s why some of the vendors “opened access” for developers to use their own algorithms in their own programs
• These programs are called Shaders
• From this moment the video card could work on transformations, geometry and textures as the developers want!
39
It all started with a shader
• First shaders were different:
• Vertex
• Geometry
• Pixel
• Then they were united into the Common Shader Architecture
40
There are several shader languages
• RenderMan
• OSL
• GLSL
• Cg
• DirectX ASM
• HLSL
• …
41
With or without them
43
But they are so low level..
44
Having in mind it all started with gaming…
45
Several abstractions were created:
• OpenGL
• is a cross-language, cross-platform application programming interface (API) for rendering 2D and 3D vector graphics. The API is typically used to interact with a graphics processing unit (GPU), to achieve hardware-accelerated rendering.
• Silicon Graphics Inc. (SGI) started developing OpenGL in 1991 and released it in January 1992;
• DirectX
• is a collection of application programming interfaces (APIs) for handling tasks related to multimedia, especially game programming and video, on Microsoft platforms. Originally, the names of these APIs all began with Direct, such as Direct3D, DirectDraw, DirectMusic, DirectPlay, DirectSound, and so forth. The name DirectX was coined as a shorthand term for all of these APIs (the X standing in for the particular API names) and soon became the name of the collection.
46
By the way, what about Java?
47
OpenGL in Java
• JSR-231
• Started in 2003
• Latest release in 2008
• Supports OpenGL 2.0
48
OpenGL
• Now an independent project: JOGL (JogAmp)
• Supports OpenGL up to 4.5
• Provides support for GLU and GLUT
• Access to the low-level C API via JNI
49
50
But somewhere in 2005 it was finally realized this can be used for general computations as well
51
BrookGPU
• Early effort to use GPGPU
• Own subset of ANSI C
• Brook Streaming Language
• Made at Stanford University
52
GPGPU
• CUDA — Nvidia’s proprietary platform with its own C subset.
• DirectCompute — Microsoft’s proprietary shader language, part of Direct3D starting from DirectX 10.
• AMD FireStream — ATI’s proprietary technology.
• OpenACC — a multi-vendor consortium standard.
• C++ AMP — a Microsoft proprietary extension.
• OpenCL — a common standard controlled by the Khronos Group.
53
Why should we ever use the GPU from Java?
• Why Java
• Safe and secure
• Portability (“write once, run everywhere”)
• Used on 3 000 000 000 devices
54
Why should we ever use the GPU from Java?
• Why Java
• Safe and secure
• Portability (“write once, run everywhere”)
• Used on 3 000 000 000 devices
• Where can we apply the GPU
• Data Analytics and Data Science (Hadoop, Spark …)
• Security analytics (log processing)
• Finance/Banking
55
For this we have:
56
But Java works on the JVM.. but there we have some low level..
57
For the low level we use:
• JNI (Java Native Interface)
• JNA (Java Native Access)
58
But we can go crazy there..
59
Someone actually did this…
60
But maybe there is something done already?
61
For OpenCL:
• JOCL
• JogAmp
• JavaCL (not supported anymore)
62
.. and for CUDA:
• JCuda
• JCublas
• JCufft
• JCurand
• JCusparse
• JCusolver
• JNvgraph
• JCudpp
• JNpp
• JCudnn
63
Disclaimer: it’s hard to work with the GPU!
• It’s not just running a program
• You need to know your hardware!
• It’s low level..
64
What’s that?
• Short for Open Computing Language
• Consortium of Apple, Nvidia, AMD, IBM, Intel, ARM, Motorola and many more
• Very abstract model
• Works both on GPU and CPU
66
Should work on everything
67
All in all it works like this:
HOST → DEVICE: Data, Program/Kernel
68
All in all it works like this:
HOST
69
All in all it works like this:
HOST ← DEVICE: Result
70
Typical lifecycle of an OpenCL app
• Create context
• Create command queue
• Create memory buffers/fill with data
• Create program from sources/load binaries
• Compile (if required)
• Create kernel from the program
• Supply kernel arguments
• Define ND range
• Execute
• Return resulting data
• Release resources
71
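To make the lifecycle steps concrete without a GPU at hand, the same flow can be mimicked in plain Java: buffers become arrays, the kernel becomes a lambda, and the ND range becomes a loop over work-item ids. A didactic sketch only, with no OpenCL involved and with made-up names:

```java
import java.util.Arrays;
import java.util.function.IntConsumer;

public class LifecycleSketch {
    static float[] run(float[] in) {
        // 1-3: "context", "command queue" and memory buffers -- here just host arrays
        float[] out = new float[in.length];

        // 4-7: "program"/"kernel" with its arguments -- here a lambda closing over the buffers
        IntConsumer kernel = gid -> out[gid] = in[gid] * 2f;

        // 8-9: define the ND range and execute one work-item per element
        for (int gid = 0; gid < in.length; gid++) {
            kernel.accept(gid);
        }

        // 10-11: "read back" the result; nothing to release on the JVM side
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run(new float[]{1f, 2f, 3f, 4f})));
    }
}
```

In real OpenCL each of these comments corresponds to one or more API calls, and the work-items of the ND range run concurrently on the device rather than in a loop.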
Better take a look
72
73
1. There is the host code. It’s in Java.
74
2. There is the device code. A specific subset of C.
75
3. Communication between the host and the device is done via memory buffers.
76
So what can we actually transfer?
77
The data is not quite the same..
78
Datatypes: vectors

float f = 4.0f;
float3 f3 = (float3)(1.0f, 2.0f, 3.0f);
float4 f4 = (float4)(f3, f);
//f4.x = 1.0f,
//f4.y = 2.0f,
//f4.z = 3.0f,
//f4.w = 4.0f

81
  • 82. So how are they saved there? 82
  • 83. So how are they saved there? In a hard way.. 83
  • 84. Memory Model • __global • __constant • __local • __private 84
• 88. Execution model • We’ve got a lot of data • We need to perform the same computations over them • So we can just shard them • OpenCL is here to help us 88
  • 90. ND Range – what is that? 90
• 91. For example: matrix multiplication • We would write it like this:
  void MatrixMul_sequential(int dim, float *A, float *B, float *C) {
      for (int iRow = 0; iRow < dim; ++iRow) {
          for (int iCol = 0; iCol < dim; ++iCol) {
              float result = 0.f;
              for (int i = 0; i < dim; ++i) {
                  result += A[iRow*dim + i] * B[i*dim + iCol];
              }
              C[iRow*dim + iCol] = result;
          }
      }
  }
  91
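The sequential version ports one-to-one to Java — this is what the host would run without a GPU. A minimal sketch (class and method names are illustrative), using row-major 1-D arrays exactly as the C code does:

```java
import java.util.Arrays;

public class MatrixMulSequential {
    // C = A * B for dim x dim matrices stored row-major in 1-D arrays
    static void matrixMul(int dim, float[] a, float[] b, float[] c) {
        for (int iRow = 0; iRow < dim; ++iRow) {
            for (int iCol = 0; iCol < dim; ++iCol) {
                float result = 0.0f;
                for (int i = 0; i < dim; ++i) {
                    result += a[iRow * dim + i] * b[i * dim + iCol];
                }
                c[iRow * dim + iCol] = result;
            }
        }
    }

    public static void main(String[] args) {
        float[] a = {1, 2, 3, 4}; // [[1, 2], [3, 4]]
        float[] b = {5, 6, 7, 8}; // [[5, 6], [7, 8]]
        float[] c = new float[4];
        matrixMul(2, a, b, c);
        System.out.println(Arrays.toString(c)); // [19.0, 22.0, 43.0, 50.0]
    }
}
```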
  • 92. For example: matrix multiplication 92
• 93. For example: matrix multiplication • So on GPU:
  __kernel void MatrixMul_kernel_basic(int dim,
                                       __global float *A,
                                       __global float *B,
                                       __global float *C) {
      // Get the index of the work-item
      int iCol = get_global_id(0);
      int iRow = get_global_id(1);
      float result = 0.0;
      for (int i = 0; i < dim; ++i) {
          result += A[iRow*dim + i] * B[i*dim + iCol];
      }
      C[iRow*dim + iCol] = result;
  }
  93
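The kernel drops the two outer loops: the ND range spawns one work-item per output element, and get_global_id recovers the row and column. The effect can be simulated on the CPU with a parallel stream — a sketch with illustrative names, not actual GPU execution:

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class NdRangeSimulation {
    // One "work-item" per output element, as a 2-D ND range would do on a GPU
    static void matrixMul(int dim, float[] a, float[] b, float[] c) {
        IntStream.range(0, dim * dim).parallel().forEach(gid -> {
            int iCol = gid % dim; // plays the role of get_global_id(0)
            int iRow = gid / dim; // plays the role of get_global_id(1)
            float result = 0.0f;
            for (int i = 0; i < dim; ++i) {
                result += a[iRow * dim + i] * b[i * dim + iCol];
            }
            c[iRow * dim + iCol] = result;
        });
    }

    public static void main(String[] args) {
        float[] a = {1, 2, 3, 4};
        float[] b = {5, 6, 7, 8};
        float[] c = new float[4];
        matrixMul(2, a, b, c);
        System.out.println(Arrays.toString(c)); // [19.0, 22.0, 43.0, 50.0]
    }
}
```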
• 95. Typical GPU
  --- Info for device GeForce GT 650M: ---
  CL_DEVICE_NAME: GeForce GT 650M
  CL_DEVICE_VENDOR: NVIDIA
  CL_DRIVER_VERSION: 10.14.20 355.10.05.15f03
  CL_DEVICE_TYPE: CL_DEVICE_TYPE_GPU
  CL_DEVICE_MAX_COMPUTE_UNITS: 2
  CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
  CL_DEVICE_MAX_WORK_ITEM_SIZES: 1024 / 1024 / 64
  CL_DEVICE_MAX_WORK_GROUP_SIZE: 1024
  CL_DEVICE_MAX_CLOCK_FREQUENCY: 900 MHz
  CL_DEVICE_ADDRESS_BITS: 64
  CL_DEVICE_MAX_MEM_ALLOC_SIZE: 256 MByte
  CL_DEVICE_GLOBAL_MEM_SIZE: 1024 MByte
  CL_DEVICE_ERROR_CORRECTION_SUPPORT: no
  CL_DEVICE_LOCAL_MEM_TYPE: local
  CL_DEVICE_LOCAL_MEM_SIZE: 48 KByte
  CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte
  CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE
  CL_DEVICE_IMAGE_SUPPORT: 1
  CL_DEVICE_MAX_READ_IMAGE_ARGS: 256
  CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 16
  CL_DEVICE_SINGLE_FP_CONFIG: CL_FP_DENORM CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST CL_FP_ROUND_TO_ZERO CL_FP_ROUND_TO_INF CL_FP_CORRECTLY_ROUNDED_DIVIDE_SQRT
  CL_DEVICE_2D_MAX_WIDTH: 16384
  CL_DEVICE_2D_MAX_HEIGHT: 16384
  CL_DEVICE_3D_MAX_WIDTH: 2048
  CL_DEVICE_3D_MAX_HEIGHT: 2048
  CL_DEVICE_3D_MAX_DEPTH: 2048
  CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t>: CHAR 1, SHORT 1, INT 1, LONG 1, FLOAT 1, DOUBLE 1
  95
• 96. Typical CPU
  --- Info for device Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz: ---
  CL_DEVICE_NAME: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
  CL_DEVICE_VENDOR: Intel
  CL_DRIVER_VERSION: 1.1
  CL_DEVICE_TYPE: CL_DEVICE_TYPE_CPU
  CL_DEVICE_MAX_COMPUTE_UNITS: 8
  CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
  CL_DEVICE_MAX_WORK_ITEM_SIZES: 1024 / 1 / 1
  CL_DEVICE_MAX_WORK_GROUP_SIZE: 1024
  CL_DEVICE_MAX_CLOCK_FREQUENCY: 2600 MHz
  CL_DEVICE_ADDRESS_BITS: 64
  CL_DEVICE_MAX_MEM_ALLOC_SIZE: 2048 MByte
  CL_DEVICE_GLOBAL_MEM_SIZE: 8192 MByte
  CL_DEVICE_ERROR_CORRECTION_SUPPORT: no
  CL_DEVICE_LOCAL_MEM_TYPE: global
  CL_DEVICE_LOCAL_MEM_SIZE: 32 KByte
  CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte
  CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE
  CL_DEVICE_IMAGE_SUPPORT: 1
  CL_DEVICE_MAX_READ_IMAGE_ARGS: 128
  CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 8
  CL_DEVICE_SINGLE_FP_CONFIG: CL_FP_DENORM CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST CL_FP_ROUND_TO_ZERO CL_FP_ROUND_TO_INF CL_FP_FMA CL_FP_CORRECTLY_ROUNDED_DIVIDE_SQRT
  CL_DEVICE_2D_MAX_WIDTH: 8192
  CL_DEVICE_2D_MAX_HEIGHT: 8192
  CL_DEVICE_3D_MAX_WIDTH: 2048
  CL_DEVICE_3D_MAX_HEIGHT: 2048
  CL_DEVICE_3D_MAX_DEPTH: 2048
  CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t>: CHAR 16, SHORT 8, INT 4, LONG 2, FLOAT 4, DOUBLE 2
  96
  • 97. And what about CUDA? 97
  • 98. And what about CUDA? Well.. It looks to be easier 98
  • 99. And what about CUDA? Well.. It looks to be easier for C developers… 99
• 100. CUDA kernel
  #define N 10
  __global__ void add(int *a, int *b, int *c) {
      int tid = blockIdx.x; // this thread handles the data at its thread id
      if (tid < N)
          c[tid] = a[tid] + b[tid];
  }
  100
• 101. CUDA setup
  int a[N], b[N], c[N];
  int *dev_a, *dev_b, *dev_c;
  // allocate the memory on the GPU
  cudaMalloc((void**)&dev_a, N * sizeof(int));
  cudaMalloc((void**)&dev_b, N * sizeof(int));
  cudaMalloc((void**)&dev_c, N * sizeof(int));
  // fill the arrays 'a' and 'b' on the CPU
  for (int i = 0; i < N; i++) {
      a[i] = -i;
      b[i] = i * i;
  }
  101
• 102. CUDA copy to memory and run
  // copy the arrays 'a' and 'b' to the GPU
  cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
  cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);
  add<<<N,1>>>(dev_a, dev_b, dev_c);
  // copy the array 'c' back from the GPU to the CPU
  cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);
  102
• 103. CUDA get results
  // display the results
  for (int i = 0; i < N; i++) {
      printf("%d + %d = %d\n", a[i], b[i], c[i]);
  }
  // free the memory allocated on the GPU
  cudaFree(dev_a);
  cudaFree(dev_b);
  cudaFree(dev_c);
  103
• 104. But CUDA has some other superpowers • JCublas – all about matrices • JCufft – Fast Fourier Transform • JCurand – all about random • JCusparse – sparse matrices • JCusolver – factorization and some other crazy stuff • JNvgraph – all about graphs • JCudpp – CUDA Data Parallel Primitives Library, and some sorting • JNpp – image processing on GPU • JCudnn – Deep Neural Network library (that’s scary) 104
• 105. For example we need a good rand
  int n = 100;
  curandGenerator generator = new curandGenerator();
  float hostData[] = new float[n];
  Pointer deviceData = new Pointer();
  cudaMalloc(deviceData, n * Sizeof.FLOAT);
  curandCreateGenerator(generator, CURAND_RNG_PSEUDO_DEFAULT);
  curandSetPseudoRandomGeneratorSeed(generator, 1234);
  curandGenerateUniform(generator, deviceData, n);
  cudaMemcpy(Pointer.to(hostData), deviceData, n * Sizeof.FLOAT, cudaMemcpyDeviceToHost);
  System.out.println(Arrays.toString(hostData));
  curandDestroyGenerator(generator);
  cudaFree(deviceData);
  105
• 106. For example we need a good rand • With a strong theory underneath • Developed by the Russian mathematician Ilya M. Sobol' back in 1967 • https://en.wikipedia.org/wiki/Sobol_sequence 106
  • 107. nVidia memory looks like this 107
  • 108. Btw.. Talking about memory 108 ©Wikipedia
• 109. Optimizations…
  __kernel void MatrixMul_kernel_basic(int dim,
                                       __global float *A,
                                       __global float *B,
                                       __global float *C) {
      int iCol = get_global_id(0);
      int iRow = get_global_id(1);
      float result = 0.0;
      for (int i = 0; i < dim; ++i) {
          result += A[iRow*dim + i] * B[i*dim + iCol];
      }
      C[iRow*dim + iCol] = result;
  }
  109
• 110. <—Optimizations
  #define VECTOR_SIZE 4
  __kernel void MatrixMul_kernel_basic_vector4(int dim,
                                               __global float4 *A,
                                               __global float4 *B,
                                               __global float *C) {
      int localIdx = get_global_id(0);
      int localIdy = get_global_id(1);
      float4 Bvector[4];
      float4 Avector, temp;
      float4 resultVector[4] = {0, 0, 0, 0};
      int rowElements = dim / VECTOR_SIZE;
      for (int i = 0; i < rowElements; ++i) {
          Avector = A[localIdy*rowElements + i];
          Bvector[0] = B[dim*i + localIdx];
          Bvector[1] = B[dim*i + rowElements + localIdx];
          Bvector[2] = B[dim*i + 2*rowElements + localIdx];
          Bvector[3] = B[dim*i + 3*rowElements + localIdx];
          temp = (float4)(Bvector[0].x, Bvector[1].x, Bvector[2].x, Bvector[3].x);
          resultVector[0] += Avector * temp;
          temp = (float4)(Bvector[0].y, Bvector[1].y, Bvector[2].y, Bvector[3].y);
          resultVector[1] += Avector * temp;
          temp = (float4)(Bvector[0].z, Bvector[1].z, Bvector[2].z, Bvector[3].z);
          resultVector[2] += Avector * temp;
          temp = (float4)(Bvector[0].w, Bvector[1].w, Bvector[2].w, Bvector[3].w);
          resultVector[3] += Avector * temp;
      }
      C[localIdy*dim + localIdx*VECTOR_SIZE]     = resultVector[0].x + resultVector[0].y + resultVector[0].z + resultVector[0].w;
      C[localIdy*dim + localIdx*VECTOR_SIZE + 1] = resultVector[1].x + resultVector[1].y + resultVector[1].z + resultVector[1].w;
      C[localIdy*dim + localIdx*VECTOR_SIZE + 2] = resultVector[2].x + resultVector[2].y + resultVector[2].z + resultVector[2].w;
      C[localIdy*dim + localIdx*VECTOR_SIZE + 3] = resultVector[3].x + resultVector[3].y + resultVector[3].z + resultVector[3].w;
  }
  110
  • 112. But we don’t want to have C at all… 112
  • 113. We don’t want to think about those hosts and devices… 113
  • 114. We can use GPU partially.. 114
  • 116. Project Sumatra • Research project • Focused on Java 8 116
  • 117. Project Sumatra • Research project • Focused on Java 8 • … to be more precise on streams 117
  • 118. Project Sumatra • Research project • Focused on Java 8 • … to be more precise on streams • … and even more precise lambdas and .forEach() 118
  • 121. AMD HSAIL • Detects forEach() block • Gets HSAIL code with Graal • On low level supply the generated from lambda kernel to the GPU 121
  • 122. AMD APU tries to solve the main issue.. 122 ©Wikipedia
  • 123. But if we want some more general solution.. 123
  • 124. IBM patched JVM for GPU • Focused on CUDA (for now) • Focused on Stream API • Created their own .parallel() 124
• 125. IBM patched JVM for GPU Imagine:
  void fooJava(float[] a, float[] b, int n) {
      // similar to: for (int i = 0; i < n; i++)
      IntStream.range(0, n).parallel().forEach(i -> {
          b[i] = a[i] * 2.0f;
      });
  }
  125
• 126. IBM patched JVM for GPU Imagine:
  void fooJava(float[] a, float[] b, int n) {
      // similar to: for (int i = 0; i < n; i++)
      IntStream.range(0, n).parallel().forEach(i -> {
          b[i] = a[i] * 2.0f;
      });
  }
  … we would like the lambda to be automatically converted to GPU code… 126
• 127. IBM patched JVM for GPU When n is big, the lambda code is executed on the GPU:
  class Par {
      void foo(float[] a, float[] b, float[] c, int n) {
          IntStream.range(0, n).parallel().forEach(i -> {
              b[i] = a[i] * 2.0f;
              c[i] = a[i] * 3.0f;
          });
      }
  }
  *only lambdas over primitive types in one-dimensional arrays
  127
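On a stock JVM the very same shape runs on CPU threads via the common fork/join pool — IBM's JIT changes where the lambda executes, not how it is written. A self-contained sketch (names are illustrative):

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class ParSketch {
    // The pattern IBM's JIT can offload: a parallel range over primitive arrays
    static void foo(float[] a, float[] b, float[] c, int n) {
        IntStream.range(0, n).parallel().forEach(i -> {
            b[i] = a[i] * 2.0f;
            c[i] = a[i] * 3.0f;
        });
    }

    public static void main(String[] args) {
        float[] a = {1f, 2f, 3f, 4f};
        float[] b = new float[a.length];
        float[] c = new float[a.length];
        foo(a, b, c, a.length);
        System.out.println(Arrays.toString(b)); // [2.0, 4.0, 6.0, 8.0]
        System.out.println(Arrays.toString(c)); // [3.0, 6.0, 9.0, 12.0]
    }
}
```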
• 128. IBM patched JVM for GPU Optimized IBM JIT compiler: • Use the read-only cache • Fewer writes to global GPU memory • Optimized host-to-device data copy rate • Less data to be copied • Eliminate exceptions as much as possible in the GPU kernel 128
  • 129. IBM patched JVM for GPU • Success story: + + 129
  • 130. IBM patched JVM for GPU • Officially: 130
  • 131. IBM patched JVM for GPU • More info: https://github.com/IBMSparkGPU/GPUEnabler 131
• 132. But can we just write in Java, and have it converted to OpenCL/CUDA automatically? 132
  • 134. Aparapi is there for you! 134
  • 135. Aparapi • Short for «A PARallel API» 135
  • 136. Aparapi • Short for «A PARallel API» • Works like Hibernate for databases 136
  • 137. Aparapi • Short for «A PARallel API» • Works like Hibernate for databases • Dynamically converts JVM Bytecode to code for Host and Device 137
  • 138. Aparapi • Short for «A PARallel API» • Works like Hibernate for databases • Dynamically converts JVM Bytecode to code for Host and Device • OpenCL under the cover 138
  • 140. Aparapi • Started by AMD • Then abandoned… 140
• 141. Aparapi • Started by AMD • Then abandoned… • Five years later, open-sourced under the Apache 2.0 license 141
• 142. Aparapi • Started by AMD • Then abandoned… • Five years later, open-sourced under the Apache 2.0 license • Back to life!!! 142
• 143. Aparapi – now it’s so much simpler!
  public static void main(String[] _args) {
      final int size = 512;
      final float[] a = new float[size];
      final float[] b = new float[size];
      for (int i = 0; i < size; i++) {
          a[i] = (float) (Math.random() * 100);
          b[i] = (float) (Math.random() * 100);
      }
      final float[] sum = new float[size];
      Kernel kernel = new Kernel() {
          @Override
          public void run() {
              int gid = getGlobalId();
              sum[gid] = a[gid] + b[gid];
          }
      };
      kernel.execute(Range.create(size));
      for (int i = 0; i < size; i++) {
          System.out.printf("%6.2f + %6.2f = %8.2f%n", a[i], b[i], sum[i]);
      }
      kernel.dispose();
  }
  143
  • 144. But what about the clouds? 144
• 145. We can’t sell our product if it’s not cloud-native! 145
  • 146. nVidia is your friend! 146
  • 147. nVidia GRID • Announced in 2012 • Already in production • Works on the most of the hypervisors • .. And in the clouds! 147
  • 150. … AMD is a bit behind… 150
• 152. It’s here: Nvidia GPU 152
• 153. It’s here: ATI Radeon 153
• 154. It’s here: AMD APU 154
• 155. It’s here: Intel Skylake 155
• 156. It’s here: Nvidia Tegra Parker 156
  • 160. So use it! If the task is suitable 160
  • 164. 164