The document summarizes Kazuaki Ishizaki's talk on making hardware accelerators easier to use. Some key points:
- Programs are becoming simpler while hardware is becoming more complicated, with commodity processors including hardware accelerators like GPUs.
- The speaker's recent work focuses on generating hardware accelerator code from high-level programs without needing specific hardware knowledge.
- An approach using a Java JIT compiler was presented that can generate optimized GPU code from parallel Java streams, requiring programmers to only express parallelism.
- The JIT compiler performs optimizations like aligning arrays, using read-only caches, reducing data transfer, and eliminating exception checks.
- Benchmarks show the generated GPU
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Making Hardware Accelerator Easier to Use
1. Invited talk at 14th Asian Symposium on Programming Languages and
Systems (APLAS 2016)
Kazuaki Ishizaki
IBM Research – Tokyo
Making Hardware Accelerator Easier to Use
1
2. Hanoi in 1996
My first visit to Hanoi
– I joined our research project “Java just-in-time compiler” on 1996
– I worked for Parallel Fortran compiler by 1995
Making Hardware Accelerator Easier to Use / Kazuaki Ishizaki2
3. Hanoi in 1996 and 2016
Drastically changed over twenty years
Making Hardware Accelerator Easier to Use / Kazuaki Ishizaki3
1996 2016
4. Making Hardware Accelerator Easier to Use / Kazuaki Ishizaki
What has Happened in Computation-Intensive Area for 20 Years
Program is becoming simpler
Hardware is becoming complicated
1996 2016
Hardware Fast scalar processors Commodity processors with
hardware accelerators
Applications Weather, wind, fluid, and physics
simulations, chemical synthesis
Machine learning and
deep learning with big data
Program Complicated and
hardware-dependent code
Simple and clean code
(e.g. mapreduce)
Users Limited to programmers
who are well-educated for HPC
Data scientists
who are non-familiar with hardware
Hardware
Examples4
GPUPowerPC
5. What has Happened in Computation-Intensive Area for 20 Years
Program is becoming simpler
Hardware is becoming complicated
Making Hardware Accelerator Easier to Use5
1996 2016
Hardware Fast scalar processors Commodity processors with
hardware accelerators
Applications Weather, wind, fluid, and physics
simulations, chemical synthesis
Machine learning and
deep learning with big data
Program Complicated and
hardware- dependent code
Simple and clean code
(MapReduce)
Users Limited to programmers
who are well-educated for HPC
Data scientists
who are non-familiar with hardware
Hardware
Examples
Bad news:
Gap between hardware and software is bigger
Good news:
Program can be easily analyzed
6. My Recent Interest
How system generates hardware accelerator code from
program with high-level abstraction
– Expected (practical) result
People execute program without knowing usage of hardware accelerator
– Challenge
How to optimize code for a certain hardware accelerator without specific
information
–On-going research
GPU exploitation from Java program
GPU exploitation in Apache Spark
work with Akihiro Hayashi *, Alon Shalev Housfater -, Hiroshi Inoue +,
Madhusudanan Kandasamy , Gita Koblents -, Moriyoshi Ohara +,
Vivek Sarkar *, and Jan Wroblewski (intern) +
+ IBM Research – Tokyo, - IBM Canada, IBM India, * Rice University
Making Hardware Accelerator Easier to Use / Kazuaki Ishizaki6
8. Why Java for GPU Programming?
High productivity
– Safety and flexibility (compare to C/C++)
– Good program portability among different machines
“write once, run anywhere”
– One of the most popular programming languages
Hard to use CUDA and OpenCL for non-expert programmers
Many computation-intensive applications in non-HPC area
– Data analytics and data science (Hadoop, Spark, etc.)
– Security analysis
– Natural language processing
8 Making Hardware Accelerator Easier to Use / Kazuaki Ishizaki
From https://www.flickr.com/photos/dlato/5530553658
CUDA is a programming language
for GPU offered by NVIDIA
9. Making Hardware Accelerator Easier to Use / Kazuaki Ishizaki
How We Write GPU Program
Five steps
1. Allocate GPU device memory
2. Copy data on CPU main memory
to GPU device memory
3. Launch a GPU kernel to be executed
in parallel on cores
4. Copy back data on GPU device
memory to CPU main memory
5. Free GPU device memory
device memory
(up to 16GB)
main memory
(up to 1TB/socket)
CPU GPU
Data copy over
PCIe or NVLink
dozen cores/socket thousands cores
9
10. Making Hardware Accelerator Easier to Use / Kazuaki Ishizaki
How We Optimize GPU Program
device memory
(up to 16GB)
main memory
(up to 1TB/socket)
CPU GPUdozen cores/socket thousands cores
10
Exploit faster memory
• Read-only cache (Read only)
• Shared memory (SMEM)
Data copy over
PCIe or NVLink
From GTC presentation by NVIDIA
Reduce data copy
Five steps
1. Allocate GPU device memory
2. Copy data on CPU main memory
to GPU device memory
3. Launch a GPU kernel to be executed
in parallel on cores
4. Copy back data on GPU device
memory to CPU main memory
5. Free GPU device memory
11. Fewer Code Makes GPU Programming Easy
Current programming model requires programmers to
explicitly write operations for
– managing device memories
– copying data
between CPU and GPU
– expressing parallelism
– exploiting faster memory
Java 8 enables programmers
to just focus on
– expressing parallelism
11 Making Hardware Accelerator Easier to Use / Kazuaki Ishizaki
void fooCUDA(N, float *A, float *B, int N) {
int sizeN = N * sizeof(float);
cudaMalloc(&d_A, sizeN); cudaMalloc(&d_B, sizeN);
cudaMemcpy(d_A, A, sizeN, Host2Device);
GPU<<<N, 1>>>(d_A, d_B, N);
cudaMemcpy(B, d_B, sizeN, Device2Host);
cudaFree(d_B); cudaFree(d_A);
}
// code for GPU
__global__ void GPU(float* d_A, float* d_B, int N) {
int i = threadIdx.x;
if (N <= i) return;
d_B[i] = __ldg(&d_A[i]) * 2.0; //__ldg() for read-only cache
}
void fooJava(float A[], float B[], int N) {
// similar to for (idx = 0; i < N; i++)
IntStream.range(0, N).parallel().forEach(i -> {
B[i] = A[i] * 2.0;
});
}
12. Goal
Build a Java just-in-time (JIT) compiler to generate high
performance GPU code from a parallel loop construct
Implementing four performance optimizations
Offering performance evaluations on POWER8 with a GPU
Supporting Java language feature (See [PACT2015])
Predicting Performance on CPU and GPU [PPPJ2015]
Available in IBM Java 8 ppc64le and x86_64
– https://www.ibm.com/developerworks/java/jdk/java8/
12 Making Hardware Accelerator Easier to Use / Kazuaki Ishizaki
Accomplishments
13. Parallel Programming in Java 8
Express parallelism by using Parallel Stream API among
iterations of a lambda expression (index variable: i)
13 Making Hardware Accelerator Easier to Use / Kazuaki Ishizaki
class Example {
void foo(float[] a, float[] b, float[] c, int n) {
java.util.Stream.IntStream.range(0, n).parallel().forEach(i -> {
b[i] = a[i] * 2.0;
c[i] = a[i] * 3.0;
});
}
}
Note: Current version supports one-dimensional arrays with primitive types in a lambda expression
14. Overview of Our JIT Compiler
Java bytecode
sequence is divided
into two intermediate
presentation (IR) parts
– Lambda expression:
generate GPU code
using NVIDIA tool chain
(right hand side)
– Others:
generate CPU code
using conventional JIT
compiler (left hand side)
14 Making Hardware Accelerator Easier to Use / Kazuaki Ishizaki
NVIDIA GPU binary
for lambda expression
CPU binary for
- managing device memory
- copying data
- launching GPU binary
Conventional Java JIT compiler
Parallel stream APIs detection
// Parallel stream code
IntStream.range(0, n).parallel()
.forEach(i -> { ...c[i] = a[i]...});
IR for GPUs
...
c[i] = a[i]...
IR for CPUs
Java bytecode
CPU native code
generator GPU native code
generator (by NVIDIA)
Additional modules for GPU
GPUs optimizations
15. Optimizations for GPU in Our JIT Compiler
Optimizing alignment of Java arrays on GPUs
– Reduce # of memory transactions to a GPU global memory
Using read-only cache
– Reduce # of memory transactions to a GPU global memory
Optimizing data copy between CPU and GPU
– Reduce amount of data copy
Eliminating redundant exception checks
– Reduce # of instructions in GPU binary
15 Making Hardware Accelerator Easier to Use / Kazuaki Ishizaki
16. Reducing # of Memory Transactions to GPU Global Memory
Aligning the starting address of an array body in GPU global
memory with memory transaction boundary
16 Making Hardware Accelerator Easier to Use / Kazuaki Ishizaki
0 128
a[0]-a[31]
Object header
Memory address
a[32]-a[63]
Naive
alignment
strategy
a[0]-a[31] a[32]-a[63]
256 384
Our
alignment
strategy
One memory transaction for a[0:31]
Two memory transactions for a[0:31]
IntStream.range(0,n).parallel().
forEach(i->{
...= a[i]...; // a[] : float
...;
});
a[64]-a[95]
a[64]-a[95]
A 128-byte memory
transaction boundary
17. Reducing # of Memory Transactions to GPU Global Memory
Must keep only a read-only array in a read-only cache
– Lexically different variables (e.g. a[] and b[]) may point to the same array
that may be updated
Perform alias analysis to identify a read-only array by
– Static analysis in JIT compiler
identifies lexically read-only arrays and lexically written arrays
– Dynamic alias analysis in generated code
checks a lexically read-only array that may alias with any lexically written arrays
executes code with a read-only cache if not aliased
17 Compiling and Optimizing Java 8 Programs for GPU Execution
if (!(a[] aliases with b[]) && !(a[] aliases with c[])) {
IntStream.range(0, n).parallel().forEach( i -> {
b[i] = ROa[i] * 2.0; // use RO cache for a[]
c[i] = ROa[i] * 3.0; // use RO cache for a[]
});
} else {
// execute code w/o a read-only cache
}
IntStream.range(0,n).parallel().
forEach(i->{
b[i] = a[i] * 2.0;
c[i] = a[i] * 3.0;
});
18. Reducing Amount of Data Copy between CPU and GPU
Eliminate data copy from GPU if an array (e.g. a[]) is not
updated in GPU binary [Jablin11][Pai12]
Copy only a read or write set if an array index form is
‘i + constant’ (the set is contiguous)
18 Making Hardware Accelerator Easier to Use / Kazuaki Ishizaki
sz = (n – 0) * sizeof(float)
cudaMemCopy(&a[0], d_a, sz, H2D); // copy only a read set
cudaMemCopy(&b[0], d_b, sz, H2D);
cudaMemCopy(&c[0], d_c, sz, H2D);
IntStream.range(0, n).parallel().forEach( i -> {
b[i] = a[i]...;
c[i] = a[i]...;
});
cudaMemcpy(a, d_a, sz, D2H);
cudaMemcpy(&b[0], d_b, sz, D2H); // copy only a write set
cudaMemcpy(&c[0], c_b, sz, D2H); // copy only a write set
19. Eliminating Redundant Exception Checks
Generate GPU code without exception checks by using
– loop versioning [Artigas00] that guarantees safe region by using pre-
condition checks on CPU
19 Making Hardware Accelerator Easier to Use / Kazuaki Ishizaki
if (
// check cond. for NullPointerException
a != null && b != null && c != null &&
// check cond. for ArrayIndexOutOfBoundsException
a.length <l n && b.length <l n && c.length <l n) {
...
<<<...>>> GPUbinary(...)
...
} else {
// execute this construct on CPU
// to produce an exception
// under the original exception semantics
}
GPU binary for {
// safe region:
// no exception
// check is required
i = ...;
b[i] = a[i] * 2.0;
c[i] = a[i] * 3.0;
}
IntStream.range(0,n).parallel().
forEach(i->{
b[i] = a[i]...;
c[i] = a[i]...;
});
20. Automatically Optimized for CPU and GPU
CPU code
– handles GPU device memory management and data copying
– checks whether optimized CPU and GPU code can be executed
GPU code
is optimized
– Using
read-only
cache
– Eliminating
exception
checks
20 Making Hardware Accelerator Easier to Use / Kazuaki Ishizaki
if (a != null && b != null && c != null &&
a.length < n && b.length < n && c.length < n &&
!(a[] aliases with b[]) && !(a[] aliases with c[])) {
cudaMalloc(d_a, a.length*sizeof(float)+128);
if (b!=a) cudaMalloc(d_b, b.length*sizeof(float)+128);
if (c!=a && c!=b) cudaMalloc(d_c, c.length*sizeof(float)+128);
int sz = (n – 0) * sizeof(float), szh = sz + Jhdrsz;
cudaMemCopy(a, d_a + align - Jhdrsz, szh, H2D);
<<...>> GPU(d_a, d_b, d_c, n) // launch GPU
cudaMemcpy(b + Jhdrsz, d_b + align, sz, D2H);
cudaMemcpy(c + Jhdrsz, d_c + align, sz, D2H);
cudaFree(d_a);
if (b!=a) cudaFree(d_b);
if (c=!a && c!=b) cudaFree(d_c);
} else {
// execute CPU binary
}
__global__ void GPU(float *a,
float *b, float *c, int n) {
// no exception checks
i = ...
b[i] = ROa[i] * 2.0;
c[i] = ROa[i] * 3.0;
}
CPU
GPU
21. Benchmark Programs
Prepare sequential and parallel stream API versions in Java
21 Making Hardware Accelerator Easier to Use / Kazuaki Ishizaki
Name Summary Data size Type
Blackscholes Financial application that calculates the price of put and call
options
4,194,304 virtual
options
double
MM A standard dense matrix multiplication: C = A.B 1,024 x 1,024 double
Crypt Cryptographic application [Java Grande Benchmarks] N = 50,000,000 byte
Series the first N fourier coefficients of the function [Java Grande
Benchamark]
N = 1,000,000 double
SpMM Sparse matrix multiplication [Java Grande Benchmarks] N = 500,000 double
MRIQ 3D image benchmark for MRI [Parboil benchmarks] 64x64x64 float
Gemm Matrix multiplication: C = α.A.B + β.C [PolyBench] 1,024 x 1,024 int
Gesummv Scalar, vector, and Matrix multiplication [PolyBench] 4,096 x 4,096 int
22. Performance Improvements of GPU Version
Over Sequential and Parallel CPU Versions
Achieve 127.9x on geomean and 2067.7x for Series over 1 CPU thread
Achieve 3.3x on geomean and 32.8x for Series over 160 CPU threads
Degrade performance for SpMM and Gesummv against 160 CPU threads
Making Hardware Accelerator Easier to Use / Kazuaki Ishizaki22
Two 10-core 8-SMT IBM POWER8 CPUs at 3.69 GHz with 256GB memory
with one NVIDIA Kepler K40m GPU at 876 MHz with 12-GB global memory (ECC off)
Ubuntu 14.10, CUDA 5.5
Modified IBM Java 8 runtime for PowerPC
23. 0.85
0.45
1.51
0.92
0.74
0.11
1.19
3.47
0
0.5
1
1.5
2
2.5
3
3.5
4
BlackScholes MM Crypt Series SpMM MRIQ Gemm Gesummv
SpeeduprelativetoCUDA
Performance Comparison with Hand-Coded CUDA
Achieve 0.83x on geomean over CUDA
Crypt, Gemm, and Gesummv: usage of a read-only cache
BlackScholes: usage of larger CUDA threads per block (1024 vs. 128)
SpMM: overhead of exception checks
MRIQ: miss of ‘-use-fast-math’ compile option
MM: lack of usage of shared memory with loop tiling
23 Making Hardware Accelerator Easier to Use / Kazuaki Ishizaki
Higher is better
24. GPU Version is slower Than Parallel CPU Version
Making Hardware Accelerator Easier to Use / Kazuaki Ishizaki24
Can we choose an appropriate device (CPU or GPU) to avoid
performance degradation?
– Want to make sure to achieve equal or better performance
25. Machine-learning-based Performance Heuristics
Construct a binary prediction model offline by supervised
machine learning with support vector machines (SVMs)
– Features
Loop range
Dynamic number of instructions (memory access, arithmetic operation, …)
Dynamic number of array accesses (a[i], a[i + c], a[c * i], a[idx[i]])
Data transfer size (CPU to GPU, GPU to CPU)
25 Making Hardware Accelerator Easier to Use / Kazuaki Ishizaki
data1
Bytecode
App A feature 1
Features
extraction
LIBSVM Java
Runtime
Prediction
Model
data2
Bytecode
App A feature 2
Features
extraction
data3
Bytecode
App B feature 3
Features
extraction
CPU GPU
26. Most Predictions are Correct
Use 291 cases to build model
Succeeded in predicting cases of performance degradations on GPU
Failed to predict BlackScholes
Making Hardware Accelerator Easier to Use / Kazuaki Ishizaki26
Prediction
1.8->1.0 0.8->1.0 0.4->1.0
27. Related Work
Our research enables memory and communication
optimizations with machine-learning-based device selection
27 Making Hardware Accelerator Easier to Use / Kazuaki Ishizaki
Work Language Exception
support
JIT
compiler
How to write GPU kernel Data copy
optimization
GPU memory
optimization
Device selection
JCUDA Java × × CUDA Manual Manual GPU only
JaBEE Java × √ Override run method × × GPU only
Aparapi Java × √
Override run
method/Lambda
× × Static
Hadoop-CL Java × √
Override map/reduce
method
× × Static
Rootbeer Java × √ Override run method Not described × Not described
[PPPJ09] Java √ √ Java for-loop Not described ×
Dynamic with
regression
HJ-OpenCL
Habanero-
Java
√ √ Forall constructs √ × Static
Our work Java √ √
Standard parallel
stream API
√
ROCache /
alignment
Dynamic with
machine learning
28. Future work
Exploiting shared memory
– Like private memory shared by 64 - 192 cores
Supporting additional Java operations
28 Making Hardware Accelerator Easier to Use / Kazuaki Ishizaki
30. What is Apache Spark?
Framework that processes distributed computing by transforming
distributed immutable memory structure using set of parallel operations
e.g. map(), filter(), reduce(), …
– Distributed immutable in-memory structures
RDD (Resilient Distributed Dataset), DataFrame, Dataset
– Scala is primary language for programming on Spark
Provide domain specific libraries
Making Hardware Accelerator Easier to Use / Kazuaki Ishizaki
Spark Runtime (written in Java and Scala)
Spark
Streaming
(real-time)
GraphX
(graph)
SparkSQL
(SQL)
MLlib
(machine
learning)
Java Virtual Machine
tasks Executor
Driver
Executor
results
Executor
Data
Data
Data
Open source: http://spark.apache.org/
Data Source (HDFS, DB, File, etc.)
Latest version is 2.0.3 released in 2016/11
30
31. How Program Works on Apache Spark
Parallel operations can be executed among partitions
In a partition, data can be processed sequentially
Making Hardware Accelerator Easier to Use / Kazuaki Ishizaki
case class Pt(x: Int, y: Int)
val ds1: Dataset[Pt] = sc.parallelize(Seq(Pt(1, 5), Pt(2, 6), Pt(3, 7), Pt(4, 8)), 2).toDS
val ds2: Dataset[Pt] = ds1.map(p => Pt(p.x+1, p.y*2))
val cnt: Int = ds2.reduce((p1, p2) => p1.x + p2.x)
ds1 ds2
p.x+1
p.y*2
p1.x + p2.x
9
5
14
partition
partition
cnt
54
32
+ =
+ =1 5
2 6
partition
pt
partition
31
2 10
3 12
3 7
4 8
4 14
5 16
32. How We Can Run Program Faster on GPU
Assign many parallel computations into cores
Make memory accesses coalesce
– Column-oriented layout results in better performance
[Che2011] reports on about 3x performance improvement of GPU kernel execution of
kmeans with column-oriented layout over row-oriented layout
1 52 61 5 3 7
Assumption: 4 consecutive data elements
can be coalesced using GPU hardware
2 v.s. 4
memory accesses to
GPU device memory
Row-oriented layoutColumn-oriented layout
Pt(x: Int, y: Int)
Pt(1,5), Pt(2,6), Pt(3,7), Pt(4,8)
Load four Pt.x
Load four Pt.y
2 6 4 843 87
cores
x1 x2 x3 x4
cores
Load Pt.x Load Pt.y Load Pt.x Load Pt.y
1 2 31 2 4
y1 y2 y3 y4 x1 x2 x3 x4 y1 y2 y3 y4
Making Hardware Accelerator Easier to Use / Kazuaki Ishizaki32
33. Idea to Transparently Exploit GPUs on Apache Spark
Generate GPU code from a set of parallel operations
– Made it in another research already
Physically put distributed immutable in-memory structures
(e.g. Dataset) in column-oriented representation
– Dataset is statically typed, but physical layout is not specified in program
Making Hardware Accelerator Easier to Use / Kazuaki Ishizaki33
34. Making Hardware Accelerator Easier to Use / Kazuaki Ishizaki
Overview of GPU Exploitation on Apache Spark
User’s Spark Program
case class Pt(x: Int, y: Int)
ds1 = sc.parallelize(Seq(
Pt(1, 5), Pt(2, 6), Pt(3, 7), Pt(4, 8)), 2)
.toDS
ds2 = ds1.map(p => Pt(p.x+1, p.y*2))
cnt = ds2.reduce((p1, p2) => p1.x + p2.x)
Nativecode
GPU
10
12
14
+1=
*2=
ds1
Data
transfer
x y x y
ds2
partition
GPU
kernel
CPU
16
2
3
4
5
10
12
14
16
2
3
4
5
5
6
1
2
7
8
3
4
5
6
1
2
7
8
3
4
34
35. Making Hardware Accelerator Easier to Use / Kazuaki Ishizaki
Overview of GPU Exploitation on Apache Spark
Efficient
– Reduce data copy overhead between CPU and GPU
– Make memory accesses efficient on GPU
Transparent
– Map parallelism in program
into GPU native code
User’s Spark Program
case class Pt(x: Int, y: Int)
ds1 = sc.parallelize(Seq(
Pt(1, 4), Pt(2, 5), Pt(3, 6), Pt(4, 7)), 2)
.toDS
ds2 = ds1.map(p => Pt(p.x+1, p.y*2))
cnt = ds2.reduce((p1, p2) => p1.x + p2.x)
Drive
GPU native
code
Nativecode
GPU
+1=
*2=
ds1
Data
transfer
x y
GPU manager
Columnar storage
x y
GPU can exploit parallelism both
among partitions in Dataset and
within a partition of Dataset
ds2
partition
GPU
kernel
CPU
Memoryaddress
35
10
12
14
16
2
3
4
5
10
12
14
16
2
3
4
5
5
6
1
2
7
8
3
4
5
6
1
2
7
8
3
4
36. Making Hardware Accelerator Easier to Use / Kazuaki Ishizaki
How We Write Program And What is Executed
Write an operation using a lambda expression for RDD. Then, the corresponding Java
bytecode for the expression is executed.
Write a program using a relational operation for DataFrame or a lambda expression for
Dataset. Catalyst performs optimization and code generation for the program. Then, the
corresponding Java bytecode for the generated Java code is executed.
ds1 = data.toDS()
ds2 = ds2.map(p => p.x+1)
ds2.reduce((a,b) => a+b)
rdd1 = sc.parallelize(data)
rdd2 = rdd1.map(p => p.x+1)
rdd2.reduce((a,b) => a+b)
df1 = data.toDF(…)
df2 = df2.selectExpr("x+1")
df2.agg(sum())
Frontend
API
DataFrame (v1.3-) Dataset (v1.6-) RDD (v0.5-)
Backend
computationCatalyst-generated Java bytecode Scalac-generated Java bytecode
Java code
Catalyst
1 5 2 6
Java heap
2 61 5
Java heap
Row-oriented Row-oriented Data
data =
Seq(Pt(1, 5),Pt(2, 6))
36
37. Our Two Implementations for GPU Exploitation
GPUEnabler is designed for writing domain specific libraries by a Ninja
programmer
– Transparent exploitation by calling a method in the library
Enhanced Catalyst is designed for writing application by general
programmer
– Transparent exploitation by automatic code generation
Code / Columnar Storage DataFrame/Dataset RDD
Hand-written code GPUEnabler
Automatic code generation Enhanced Catalyst
Making Hardware Accelerator Easier to Use / Kazuaki Ishizaki
GPU manager
Generated
GPU/SIMD
Pre-compiled
GPU/SIMD code
Spark user program
Columnar storage
Spark runtime
37
38. Making Hardware Accelerator Easier to Use / Kazuaki Ishizaki
How Program is Executed on GPU
For RDD, a programmer provides a pre-compiled GPU code. GPUEnabler handles
data transfer between GPU and CPU and launches the GPU code on GPU
For DataFrame and Dataset, enhanced Catalyst generates Java code optimized for
GPU. A just-in-time compiler in Java virtual machine can generate GPU code.
ds1 = data.toDS()
ds2 = ds2.map(p => p.x+1)
ds2.reduce((a,b) => a+b)
rdd1 = sc.parallelize(data)
rdd2 = rdd1.map(p => p.x+1, gpu)
rdd2.reduce((a,b) => a+b, gpu)
df1 = data.toDF(…)
df2 = df2.selectExpr("x+1")
df2.agg(sum())
Frontend
API
DataFrame (v1.3-) Dataset (v1.6-) RDD (v0.5-)
Backend
computationAutomatically generated GPU code Pre-compiled GPU code
Optimized Java code
Enhanced Catalyst
Data2 61 5
GPU device memory
Column-oriented 2 61 5
GPU device memory
Column-oriented
data =
Seq(Pt(1, 5),Pt(2, 6))
GPU Enabler
38
39. GPUEnabler (https://github.com/IBMSparkGPU/GPUEnabler)
Use columnar storage for RDD
Support map & reduce operations to drive GPU code
Pass GPU code provided by programmer
to an argument of map()/reduce()
Implemented as a plug-in
# bin/spark-shell --class your.gpu.application yours.jar
--packages com.ibm:gpu-enabler_2.10:1.0.0
// Load a kernel function from the GPU kernel binary
val ptxURL = SparkGPULR.getClass.getResource("/GpuEnablerExamples.ptx")
val mapFunction = new CUDAFunction("multiply2", Array("this"), Array("this"),ptxURL)
val reduceFunction = new CUDAFunction(“sum”, Array(“this”), Array(“this”), ptxURL)
val rdd = sc.parallelize(1 to n)
val output = rdd
.mapExtFunc((x: Int) => x * 2, mapFunction)
.reduceExtFunc((x: Int, y: Int) => x + y, reduceFunction)
Making Hardware Accelerator Easier to Use / Kazuaki Ishizaki
// GPU code
__global__ void multiply2(int *inX, int *outX, long *size) {
long ix = threadIdx.x + blockIdx.x * blockDim.x;
if (*size <= ix) return;
outX[ix] = inX[ix] * 2;
}
PTX is a common instruction set among
different NVIDIA GPUs defined by NVIDIA
39
40. Pseudo Java Code by Current Catalyst
Perform optimization that merges multiple parallel operations
(selectExpr() and agg(sum()) into one loop
int sum = 0
while (rowIterator.hasNext()) {
Row row = rowIterator.next(); // for df1
int x = row.getInteger(0);
// selectExpr(x + 1)
int x_new = x + 1; // for df2
sum += x_new;
}
val df1 = (-1 to 1).toDF("x")
val df2 = df1.selectExpr("x + 1")
df2.agg(sum())
Generated code corresponds to selectExpr() and local sum()
1
3
1
0
-1
-1 0
DataFrame program for Spark
Making Hardware Accelerator Easier to Use / Kazuaki Ishizaki
20 1
Read sequentially
40
df1
x
x_new
sum
Row-oriented
Catalyst
Generated pseudo Java code
41. Pseudo Java Code by Enhanced Catalyst
Get column0 from column-oriented storage
For-loop can be executed in a reduction manner
Column column0 = df1.getColumn(0); // df1
int sum = 0;
for (int i = 0; i < column0.numRows; i++) {
int x = column0.getInteger(i);
// selectExpr(x + 1)
int x_new = x + 1; // for df2
sum += x_new;
}
1
10-1
-1 0
Generated pseudo Java code
Making Hardware Accelerator Easier to Use / Kazuaki Ishizaki
3
20 1
41
df1
x
x_new
sum
Column-orientedEnhanced Catalyst
42. Generate GPU Code Transparently from Spark Program
Copy column-oriented storage into GPU
Execute add and reduction in one GPU kernel
Column column0 = df1.getColumn(0);
int nRows = column0.numRows;
cudaMalloc(&d_c0, nRows*4);
cudaMemcpy(d_c0, column0, nRows, H2D);
int sum = 0;
cudaMalloc(&d_sum, 4);
cudaMemcpy(d_c0, &sum, 4, H2D);
<<...>> GPU(d_c0, d_sum, nRows) // launch GPU
cudaMemcpy(d_c0, &sum, 4, D2H);
cudaFree(d_sum); cudaFree(d_c0);
Making Hardware Accelerator Easier to Use / Kazuaki Ishizaki
val df1 = (-1 to 1).toDF("x")
val df2 = df1.selectExpr("x + 1")
df2.agg(sum())
// GPU code
__global__ void GPU(
int *d_c0, int *d_sum, long size) {
long ix = … // 0, 1, 2
if (size <= ix) return;
int x = d_c0[ix];
int x_new = x + 1;
reduction(d_sum, x_new);
}
42
1
10-1
-1 0
3
20 1
x
x_new
d_sum
d_c0
Execute in parallel
Generated CPU code
43. Many Engineering Efforts are Required
Make DataFrame and Dataset column-oriented storage
Generate simpler optimized code in the while-loop
Making Hardware Accelerator Easier to Use / Kazuaki Ishizaki43
44. Making Hardware Accelerator Easier to Use / Kazuaki Ishizaki
Very Complicated Java Code by Current Catalyst
Overhead exists in Java code
– Data representation
– Data conversions
– Complicated code
// source program
val x = Array(1.0, 2.0), y = Array(3.0, 4.0)
val ds = sparkContext.parallelize(Seq(x, y), 1).toDS
ds.map(a => a)
44
a. sparse array to
java.lang.Double[]
b. java.lang.Double[] to double[]
c. double[] to java.lang.Double[]
d. java.lang.Double[] to sparse array
45. Pretty Simple Java Code by Enhanced Catalyst
Eliminated most of data conversions are eliminated
Use data representations suitable for GPU
Dense array to double[]
double[] to dense array
// source program
val x = Array(1.0, 2.0), y = Array(3.0, 4.0)
val ds = sparkContext.parallelize(Seq(x, y), 1).toDS
ds.map(a => a)
Making Hardware Accelerator Easier to Use / Kazuaki Ishizaki45
46. Related Work
SparkJNI (https://github.com/tudorv91/SparkJNI)
– Call native method from map() or reduce()
– Very similar to GPUEnabler. However, use no columnar storage
Spark With Accelerated Tasks [Grossman2016]
– Generate GPU code from lambda function in map() in RDD
– Very similar to enhanced Catalyst using columnar storage to transparently
exploit GPUs. However, work for RDD with map()
GPU Columnar (proposed by Kiran Lonikar)
– Generate GPU code from program using select() method in DataFrame
– Very similar to enhanced Catalyst using columnar storage to transparently
exploit GPUs
Making Hardware Accelerator Easier to Use / Kazuaki Ishizaki
val inputRDD = cl(sc.objectFile[Int]( hdfsPath ))
val doubledRDD = inputRDD.map(i => 2 * i)
JavaRDD<VectorBean> vectorsRdd = getSparkContext().parallelize(generateVectors(2, 4));
JavaRDD<VectorBean> mulResults = vectorsRdd.map(new VectorMulJni(libPath, "mapVectorMul"));
VectorBean results = mulResults.reduce(new VectorAddJni(libPath, "reduceVectorAdd"));
46
47. Conclusion
We generated hardware accelerator code from program with
high-level abstraction
It is not easy to make them in systematic way
– How can we easily generate optimized code from different types of
domain specific languages?
Program is cleaner and simpler than twenty-years ago.
How can we integrate good results in theory into practical system?
What can we do similar things for deep learning?
– Current deep learning frameworks use GPU by calling libraries (e.g.
cnDNN/cuRNN)
– What are future programming models for deep learning?
Making Hardware Accelerator Easier to Use / Kazuaki Ishizaki47