6. What is a video card?
A video card (also called a display card, graphics card, display
adapter or graphics adapter) is an expansion card which generates a
feed of output images to a display (such as a computer monitor).
Frequently, these are advertised as discrete or dedicated graphics
cards, emphasizing the distinction between these and integrated
graphics.
7. What is a video card?
But as of today:
Video cards are not limited to simple image output; they have a built-in
graphics processor that can perform additional processing, offloading that
work from the computer's central processor.
11. What is a GPU?
• Graphics Processing Unit
• The term was first used by Nvidia in 1999
• The GeForce 256 was marketed as «the world's first GPU»
• Defined as a "single-chip processor with integrated transform, lighting,
triangle setup/clipping, and rendering engines that is capable of processing
a minimum of 10 million polygons per second"
• ATI called theirs a VPU…
18. GPGPU
• General-purpose computing on graphics processing units
• The GPU performs not only graphics calculations…
• …but also computations usually performed on the CPU
32. It all started with a shader
• Cool video cards were able to offload some of the tasks from the CPU
• But most of the algorithms were simply "hardcoded" in the hardware
• They were considered "standard"
• Developers could only call them
36. It all started with a shader
• But obviously, not everything can be done with "hardcoded" algorithms
• That's why some vendors "opened access", letting developers run their own
algorithms as their own programs
• These programs are called shaders
• From that moment on, the video card could handle transformations, geometry and
textures exactly as the developers wanted!
40. It all started with a shader
• The first shaders came in different kinds:
• Vertex
• Geometry
• Pixel
• Later they were unified into a common shader architecture
41. There are several shader languages
• RenderMan
• OSL
• GLSL
• Cg
• DirectX ASM
• HLSL
• …
46. Several abstractions were created:
• OpenGL
• is a cross-language, cross-platform application programming interface (API) for
rendering 2D and 3D vector graphics. The API is typically used to interact with
a graphics processing unit (GPU), to achieve hardware-accelerated rendering.
• Silicon Graphics, Inc. (SGI) started developing OpenGL in 1991 and released it in
January 1992
• DirectX
• is a collection of application programming interfaces (APIs) for handling tasks related
to multimedia, especially game programming and video, on Microsoft platforms.
Originally, the names of these APIs all began with Direct, such
as Direct3D, DirectDraw, DirectMusic, DirectPlay, DirectSound, and so forth. The
name DirectX was coined as a shorthand term for all of these APIs (the X standing in
for the particular API names) and soon became the name of the collection.
51. But somewhere around 2005 it was
finally realized that GPUs can be used
for general computations as well
52. BrookGPU
• An early effort at GPGPU
• Its own subset of ANSI C
• The Brook streaming language
• Made at Stanford University
53. GPGPU
• CUDA — Nvidia's proprietary platform built on a C subset
• DirectCompute — Microsoft's proprietary compute shader technology, part of
Direct3D, starting from DirectX 10
• AMD FireStream — ATI's proprietary technology
• OpenACC — a multi-vendor consortium standard
• C++ AMP — a Microsoft proprietary language extension
• OpenCL — a common standard controlled by the Khronos Group
54. Why should we ever use GPU on Java
• Why Java
• Safe and secure
• Portability ("write once, run anywhere")
• Used on 3 000 000 000 devices
• Where can we apply GPU
• Data Analytics and Data Science (Hadoop, Spark …)
• Security analytics (log processing)
• Finance/Banking
66. What's that?
• Short for Open Computing Language
• A consortium of Apple, Nvidia, AMD, IBM, Intel, ARM, Motorola and
many more
• Very abstract model
• Works on both GPUs and CPUs
70. All in all it works like this:
[Diagram: the HOST sends work to the DEVICE; the DEVICE computes and returns the Result]
71. Typical lifecycle of an OpenCL app
• Create a context
• Create a command queue
• Create memory buffers / fill them with data
• Create a program from sources / load binaries
• Compile (if required)
• Create a kernel from the program
• Supply kernel arguments
• Define the ND range
• Execute
• Read back the resulting data
• Release resources
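In Java, the same lifecycle can be walked through with the JOCL bindings. A minimal sketch, assuming JOCL is on the classpath; the «twice» kernel and all names here are illustrative, not from the original deck:

import static org.jocl.CL.*;
import org.jocl.*;

public class OpenCLLifecycle {
    public static void main(String[] args) {
        String src =
            "__kernel void twice(__global float *a) {" +
            "  int gid = get_global_id(0);" +
            "  a[gid] = a[gid] * 2.0f;" +
            "}";
        float[] data = {1, 2, 3, 4};
        CL.setExceptionsEnabled(true);

        // Pick the first platform and device
        cl_platform_id[] platforms = new cl_platform_id[1];
        clGetPlatformIDs(1, platforms, null);
        cl_device_id[] devices = new cl_device_id[1];
        clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_ALL, 1, devices, null);

        // Create the context and the command queue
        cl_context context = clCreateContext(null, 1, devices, null, null, null);
        cl_command_queue queue = clCreateCommandQueue(context, devices[0], 0, null);

        // Create a memory buffer filled with the host data
        cl_mem buf = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                Sizeof.cl_float * data.length, Pointer.to(data), null);

        // Create the program from source, compile it, create the kernel
        cl_program program = clCreateProgramWithSource(context, 1, new String[]{src}, null, null);
        clBuildProgram(program, 0, null, null, null, null);
        cl_kernel kernel = clCreateKernel(program, "twice", null);

        // Supply kernel arguments, define the ND range, execute
        clSetKernelArg(kernel, 0, Sizeof.cl_mem, Pointer.to(buf));
        clEnqueueNDRangeKernel(queue, kernel, 1, null,
                new long[]{data.length}, null, 0, null, null);

        // Read back the resulting data, then release everything
        clEnqueueReadBuffer(queue, buf, CL_TRUE, 0,
                Sizeof.cl_float * data.length, Pointer.to(data), 0, null, null);
        System.out.println(java.util.Arrays.toString(data));
        clReleaseMemObject(buf);
        clReleaseKernel(kernel);
        clReleaseProgram(program);
        clReleaseCommandQueue(queue);
        clReleaseContext(context);
    }
}

Each call maps one-to-one onto a step of the list above: context, queue, buffer, program, kernel, arguments, ND range, execution, read-back, release.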
88. Execution model
• We've got a lot of data
• We need to perform the same computation over all of it
• So we can just shard the data
• OpenCL is here to help us
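The same "one work item per element" idea, as a toy CPU-only Java sketch (hypothetical, just to show why sharding is safe when every index is independent):

import java.util.stream.IntStream;

public class ShardedAdd {
    public static void main(String[] args) {
        int n = 1_000_000;
        float[] a = new float[n], b = new float[n], c = new float[n];
        for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2f * i; }

        // One "work item" per element: item i touches only index i,
        // so all items are independent and can run in parallel shards
        IntStream.range(0, n).parallel().forEach(i -> c[i] = a[i] + b[i]);

        System.out.println(c[42]); // 126.0
    }
}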
99. And what about CUDA?
Well… it looks easier
for C developers…
100. CUDA kernel
#define N 10
__global__ void add( int *a, int *b, int *c ) {
    int tid = blockIdx.x; // this block handles the data at its index
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}
101. CUDA setup
int a[N], b[N], c[N];
int *dev_a, *dev_b, *dev_c;
// allocate the memory on the GPU
cudaMalloc( (void**)&dev_a, N * sizeof(int) );
cudaMalloc( (void**)&dev_b, N * sizeof(int) );
cudaMalloc( (void**)&dev_c, N * sizeof(int) );
// fill the arrays 'a' and 'b' on the CPU
for (int i = 0; i < N; i++) {
    a[i] = -i;
    b[i] = i * i;
}
102. CUDA copy to memory and run
// copy the arrays 'a' and 'b' to the GPU
cudaMemcpy( dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice );
cudaMemcpy( dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice );
// launch the kernel: N blocks of 1 thread each
add<<<N,1>>>( dev_a, dev_b, dev_c );
// copy the array 'c' back from the GPU to the CPU
cudaMemcpy( c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost );
103. CUDA get results
// display the results
for (int i = 0; i < N; i++) {
    printf( "%d + %d = %d\n", a[i], b[i], c[i] );
}
// free the memory allocated on the GPU
cudaFree( dev_a );
cudaFree( dev_b );
cudaFree( dev_c );
104. But CUDA has some other superpowers
• JCublas – all about matrices
• JCufft – Fast Fourier Transform
• JCurand – all about random numbers
• JCusparse – sparse matrices
• JCusolver – factorization and some other crazy stuff
• JNvgraph – all about graphs
• JCudpp – CUDA Data Parallel Primitives Library, including sorting
• JNpp – image processing on the GPU
• JCudnn – Deep Neural Network library (that's scary)
105. For example, we need a good rand
int n = 100;
curandGenerator generator = new curandGenerator();
float[] hostData = new float[n];
Pointer deviceData = new Pointer();
// allocate device memory and set up the generator
cudaMalloc(deviceData, n * Sizeof.FLOAT);
curandCreateGenerator(generator, CURAND_RNG_PSEUDO_DEFAULT);
curandSetPseudoRandomGeneratorSeed(generator, 1234);
// generate n uniform floats on the GPU and copy them back
curandGenerateUniform(generator, deviceData, n);
cudaMemcpy(Pointer.to(hostData), deviceData,
    n * Sizeof.FLOAT, cudaMemcpyDeviceToHost);
System.out.println(Arrays.toString(hostData));
curandDestroyGenerator(generator);
cudaFree(deviceData);
106. For example, we need a good rand
• With a strong theory underneath
• Developed by Russian mathematician Ilya M. Sobol' back in 1967
• https://en.wikipedia.org/wiki/Sobol_sequence
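cuRAND exposes Sobol' as a quasi-random generator type, so the previous JCuda snippet only needs a different generator. A sketch, assuming the same deviceData/n setup as above (the single dimension is illustrative):

// Same setup as before, but with a quasi-random Sobol' generator
curandGenerator generator = new curandGenerator();
curandCreateGenerator(generator, CURAND_RNG_QUASI_SOBOL32);
// each dimension gets its own Sobol' sequence; one dimension here
curandSetQuasiRandomGeneratorDimensions(generator, 1);
curandGenerateUniform(generator, deviceData, n);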
123. But what if we want a more
general solution…
124. IBM patched JVM for GPU
• Focused on CUDA (for now)
• Focused on Stream API
• Created their own .parallel()
125. IBM patched JVM for GPU
Imagine:
void fooJava(float[] a, float[] b, int n) {
    // similar to: for (int i = 0; i < n; i++)
    IntStream.range(0, n).parallel().forEach(i -> { b[i] = a[i] * 2.0f; });
}
… we would like the lambda to be automatically converted to GPU code…
127. IBM patched JVM for GPU
When n is big, the lambda code is executed on the GPU:
class Par {
    void foo(float[] a, float[] b, float[] c, int n) {
        IntStream.range(0, n).parallel()
            .forEach(i -> {
                b[i] = a[i] * 2.0f;
                c[i] = a[i] * 3.0f;
            });
    }
}
*only lambdas over one-dimensional arrays of primitive types.
128. IBM patched JVM for GPU
The optimized IBM JIT compiler:
• Uses the read-only cache
• Fewer writes to global GPU memory
• Optimizes the Host-to-Device copy rate
• Less data to be copied
• Eliminates exceptions as much as possible
• In the GPU kernel
136. Aparapi
• Short for «A PARallel API»
• Works like Hibernate for databases
• Dynamically converts JVM bytecode to code for the Host and the Device
• OpenCL under the cover
141. Aparapi
• Started by AMD
• Then abandoned…
• Five years later, open-sourced under the Apache 2.0 license
• Back to life!!!
143. Aparapi – now it's so much simpler!
public static void main(String[] _args) {
    final int size = 512;
    final float[] a = new float[size];
    final float[] b = new float[size];
    for (int i = 0; i < size; i++) {
        a[i] = (float) (Math.random() * 100);
        b[i] = (float) (Math.random() * 100);
    }
    final float[] sum = new float[size];
    // the run() body is translated to OpenCL and executed once per work item
    Kernel kernel = new Kernel() {
        @Override public void run() {
            int gid = getGlobalId();
            sum[gid] = a[gid] + b[gid];
        }
    };
    kernel.execute(Range.create(size));
    for (int i = 0; i < size; i++) {
        System.out.printf("%6.2f + %6.2f = %8.2f%n", a[i], b[i], sum[i]);
    }
    kernel.dispose();
}