Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection

1. Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
   Akihiro Hayashi (Rice University), Kazuaki Ishizaki (IBM Research - Tokyo), Gita Koblents (IBM Canada), Vivek Sarkar (Rice University)
   ACM International Conference on Principles and Practices of Programming on the Java Platform: Virtual Machines, Languages, and Tools (PPPJ'15)
3. Background: Explicit Parallelism with Java
   High-level parallel programming with Java offers opportunities for:
   - preserving portability
   - enabling the compiler to perform parallel-aware optimizations and code generation
   [Figure: Java 8 programs using the Java 8 Parallel Stream API (SW) mapped onto multi-core CPUs, many-core GPUs, and FPGAs (HW)]
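The parallel-stream style referred to above can be illustrated with a minimal sketch (the axpy kernel shown here, and its method and variable names, are our own example, not taken from the deck):

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class ParallelAxpy {
    // y[i] = a * x[i] + y[i], written as a parallel stream over the index range;
    // the same loop shape can be mapped to multi-core CPUs or to a GPU kernel.
    static void axpy(float a, float[] x, float[] y) {
        IntStream.range(0, y.length)
                 .parallel()
                 .forEach(i -> y[i] = a * x[i] + y[i]);
    }

    public static void main(String[] args) {
        float[] x = {1f, 2f, 3f};
        float[] y = {1f, 1f, 1f};
        axpy(2f, x, y);
        System.out.println(Arrays.toString(y)); // prints [3.0, 5.0, 7.0]
    }
}
```

Because the loop body is expressed as a lambda over an index range, the runtime is free to choose where (and with how many workers) it executes.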
4. Background: JIT Compilation for GPU Execution
   IBM Java 8 Compiler: built on top of the production version of the IBM Java 8 runtime environment.
   [Figure: method A is interpreted on the JVM for its 1st, 2nd, …, Nth invocations; at the (N+1)th invocation, the JIT generates native code for multi-core CPUs and many-core GPUs]
5. Background: The Compilation Flow of the IBM Java 8 Compiler
   [Figure: compilation flow. Java bytecode → bytecode translation → our IR for parallel streams → identification of parallel streams in the IR → analysis and optimizations. Our new GPU modules (feature extraction, NVVM IR generation) sit alongside the existing optimizations. The generated NVVM IR is compiled by libnvvm (a JIT compiler module by NVIDIA) to PTX, then by the PTX2binary module to NVIDIA GPU native code, supported by GPU runtime helpers; target machine code generation produces PowerPC native code.]
Optimizations for GPUs:
- To improve performance: read-only cache utilization / buffer alignment; data transfer optimizations
- To support Java's language features: loop versioning for eliminating redundant exception checking on GPUs; virtual method invocation support with de-virtualization and loop versioning
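Loop versioning, as named in the list above, can be sketched at the source level (a conceptual illustration only, not the compiler's actual IR transformation; the method and names are ours):

```java
public class LoopVersioning {
    // Conceptual, source-level illustration of loop versioning: hoist the null
    // and bounds checks out of the loop so that the hot version needs no
    // per-iteration exception checks and is therefore safe to run as a GPU kernel.
    static void scale(float[] a, float[] b, int n) {
        if (a != null && b != null && n <= a.length && n <= b.length) {
            // Check-free version: the guard proves no NullPointerException or
            // ArrayIndexOutOfBoundsException can occur inside the loop.
            for (int i = 0; i < n; i++) b[i] = 2f * a[i];
        } else {
            // Fallback version with normal Java exception semantics.
            for (int i = 0; i < n; i++) b[i] = 2f * a[i];
        }
    }
}
```

The guarded version is what the compiler can hand to the GPU; the fallback preserves Java semantics for inputs that would raise an exception.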
6. Motivation: Runtime CPU/GPU Selection
   Selecting the faster hardware device is a challenging problem.
   [Figure: at the (N+1)th invocation of method A, native code can be generated for either multi-core CPUs or many-core GPUs. (Problem) Which one is faster?]
7. Related Work: Linear Regression
   Regression-based cost estimation [1,2] is specific to an application.
   [Figure: kernel execution time (msec) vs. the dynamic number of IR instructions (0e+00 to 8e+07), for an NVIDIA Tesla K40 GPU and an IBM POWER8 CPU. In App 1 (BlackScholes) and App 2 (Vector Addition) the regression lines and the CPU-vs-GPU tradeoff differ, so a model fitted to one application does not transfer to the other.]
   [1] Leung et al. Automatic Parallelization for Graphics Processing Units (PPPJ'09)
   [2] Kerr et al. Modeling GPU-CPU Workloads and Systems (GPGPU-3)
8. Open Question: Is an Accurate Cost Model Required?
   Constructing an accurate cost model would be too costly: considerable effort would be needed to update performance models for future generations of hardware.
   [Figure: multi-core CPUs vs. many-core GPUs, which one is faster?]
   Our answer: machine-learning-based performance heuristics.
9. Our Approach: ML-based Performance Heuristics
   A binary prediction model is constructed by supervised machine learning with support vector machines (SVMs).
   [Figure: training runs with the JIT compiler perform feature extraction (features 1-3) on the bytecode of App A with data 1 and data 2 and of App B with data 3, executing on the Java runtime (CPU and GPU); offline, LIBSVM constructs the prediction model from the extracted features.]
10. Features of Program That May Affect Performance
    - Loop range (parallel loop size)
    - The dynamic number of instructions: memory accesses, arithmetic operations, math methods, branch instructions, other instructions
11. Features of Program That May Affect Performance (Cont'd)
    - The dynamic number of array accesses: coalesced access a[i] (aligned access), offset access a[i+c], stride access a[c*i], other access a[b[i]]
    - Data transfer size: H2D transfer size and D2H transfer size
    [Figure: memory bandwidth (GB/s, 0-200) as a function of c (0-30), the offset or stride size, for offset accesses array[i+c] and stride accesses array[c*i]]
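To make the feature lists above concrete, the snippet below serializes one sample into LIBSVM's sparse `<label> index:value` training format; the helper class, its name, and the choice of feature ordering are our own assumptions for illustration, not part of the deck:

```java
public class FeatureLine {
    // Emits one training sample in LIBSVM's sparse format:
    // "<label> 1:<v1> 2:<v2> ...", with e.g. label +1 = CPU faster, -1 = GPU faster.
    // The ordering of features (loop range, instruction counts, array-access
    // counts, transfer sizes) is assumed here for illustration.
    static String toLibsvm(int label, double[] features) {
        StringBuilder sb = new StringBuilder(Integer.toString(label));
        for (int i = 0; i < features.length; i++) {
            sb.append(' ').append(i + 1).append(':').append(features[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toLibsvm(1, new double[]{4194304, 2.0}));
        // prints: 1 1:4194304.0 2:2.0
    }
}
```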
13. Applications

Application           | Source    | Field               | Max Size        | Data Type
BlackScholes          |           | Finance             | 4,194,304       | double
Crypt                 | JGF       | Cryptography        | Size C (N=50M)  | byte
SpMM                  | JGF       | Numerical Computing | Size C (N=500K) | double
MRIQ                  | Parboil   | Medical             | Large (64^3)    | float
Gemm                  | Polybench | Numerical Computing | 2K x 2K         | int
Gesummv               | Polybench |                     | 2K x 2K         | int
Doitgen               | Polybench |                     | 256x256x256     | int
Jacobi-1D             | Polybench |                     | N=4M, T=1       | int
Matrix Multiplication |           |                     | 2K x 2K         | double
Matrix Transpose      |           |                     | 2K x 2K         | double
VecAdd                |           |                     | 4M              | double
14. Platform
    CPU: IBM POWER8 @ 3.69 GHz, 20 cores, 8 SMT threads per core (up to 160 threads), 256 GB of RAM
    GPU: NVIDIA Tesla K40m, 12 GB of global memory
15. Prediction Model Construction
    We obtained 291 samples by running the 11 applications with different data sets.
    The choice is either the GPU or 160 worker threads on the CPU.
    [Figure: the same training-run / offline model construction flow as in slide 9, using LIBSVM 3.2.]
16. Speedups and the Accuracy with the Max Data Size: 160 Worker Threads vs. GPU
    [Figure: speedup relative to sequential Java (log scale, higher is better) for 160 worker threads (fork/join) vs. GPU across the 11 applications.
    160 worker threads: 40.6, 37.4, 82.0, 64.2, 27.6, 1.4, 1.0, 4.4, 36.7, 7.4, 5.7.
    GPU: 42.7, 34.6, 58.1, 844.7, 772.3, 1.0, 0.1, 1.9, 1164.8, 9.0, 1.2.
    Prediction: correct for 10 of the 11 applications (one misprediction).]
17. How to Evaluate Prediction Models: TRUE/FALSE POSITIVE/NEGATIVE
    Four types of binary prediction results (let POSITIVE be CPU and NEGATIVE be GPU):
    - TRUE POSITIVE: correctly predicted that the CPU is faster
    - TRUE NEGATIVE: correctly predicted that the GPU is faster
    - FALSE POSITIVE: predicted that the CPU is faster, but the GPU is actually faster
    - FALSE NEGATIVE: predicted that the GPU is faster, but the CPU is actually faster
18. How to Evaluate Prediction Models: Accuracy, Precision, and Recall Metrics
    - Accuracy: the percentage of selections predicted correctly: (TP + TN) / (TP + TN + FP + FN)
    - Precision X (X = CPU or GPU): how precise the model is when it predicts that X is faster.
      Precision CPU: TP / (TP + FP) = # of samples correctly predicted CPU is faster / total # of samples predicted CPU is faster.
      Precision GPU: TN / (TN + FN)
    - Recall X (X = CPU or GPU): how often the prediction is correct when X is actually faster.
      Recall CPU: TP / (TP + FN) = # of samples correctly predicted CPU is faster / total # of samples where CPU is actually faster.
      Recall GPU: TN / (TN + FP)
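The metric definitions above transcribe directly into code (a straightforward sketch; the class and method names are ours):

```java
public class Metrics {
    // Direct transcription of the slide's formulas
    // (TP/TN/FP/FN counted with POSITIVE = CPU, NEGATIVE = GPU).
    static double accuracy(int tp, int tn, int fp, int fn) {
        return (double) (tp + tn) / (tp + tn + fp + fn);
    }
    static double precisionCpu(int tp, int fp) { return (double) tp / (tp + fp); }
    static double precisionGpu(int tn, int fn) { return (double) tn / (tn + fn); }
    static double recallCpu(int tp, int fn)    { return (double) tp / (tp + fn); }
    static double recallGpu(int tn, int fp)    { return (double) tn / (tn + fp); }
}
```

Note the asymmetry: precision divides by the model's predictions, recall divides by the ground-truth labels.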
19. How to Evaluate Prediction Models: 5-Fold Cross Validation
    Overfitting problem: the prediction model may be tailored to the eleven applications if training data = testing data.
    To avoid overfitting, we calculate the accuracy of the prediction model using 5-fold cross validation.
    [Figure: the 291 training samples are split into Subsets 1-5. In each round one subset is held out for testing (e.g., a model trained on Subsets 2-5 is tested on Subset 1; a model trained on Subsets 1 and 3-5 is tested on Subset 2), yielding per-round accuracy and precision figures.]
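The 5-fold split described above can be sketched as follows (the shuffle seed and helper names are our own assumptions; the deck does not specify how samples were assigned to subsets):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class CrossValidation {
    // Splits sample indices into k folds; in round f, fold f is the test set
    // and the remaining k-1 folds together form the training set.
    static List<List<Integer>> folds(int nSamples, int k) {
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < nSamples; i++) idx.add(i);
        Collections.shuffle(idx, new Random(42)); // fixed seed, for reproducibility
        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) folds.add(new ArrayList<>());
        for (int i = 0; i < nSamples; i++) folds.get(i % k).add(idx.get(i));
        return folds;
    }
}
```

With 291 samples and k = 5, each fold holds 58 or 59 samples, and every sample is used for testing exactly once across the five rounds.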
20. Accuracies with Cross-Validation: 160 Worker Threads or GPU
    [Figure: accuracy (%) over the 291 samples, higher is better, as features are added cumulatively:
    Range: 79.0%; +=nIRs: 97.6%; +=dIRs: 99.0%; +=Array: 99.0%; ALL (+=DT): 97.2%]
21. Precisions and Recalls with Cross-Validation

Features | Precision CPU | Recall CPU | Precision GPU | Recall GPU
Range    | 79.0%         | 100%       | 0%            | 0%
+=nIRs   | 97.8%         | 99.1%      | 96.5%         | 91.8%
+=dIRs   | 98.7%         | 100%       | 100%          | 95.0%
+=Arrays | 98.7%         | 100%       | 100%          | 95.0%
ALL      | 96.7%         | 100%       | 100%          | 86.9%

Higher is better. All prediction models except Range rarely make a bad decision.
22. Discussion
    Based on the results with 291 samples, the feature set (range, # of detailed instructions, # of array accesses) shows the best accuracy. DT does not contribute to improving the accuracy, since the DT optimizations do not make GPUs faster.
    Pros and cons:
    (+) Future generations of hardware can be supported easily by re-running the applications: just add more training data and rebuild the prediction model.
    (-) Collecting training data takes time.
    Loop range, the number of arithmetic operations, and the number of coalesced accesses affect the decision.
23. Related Work: Java + GPU

System        | Lang  | JIT | GPU Kernel          | Device Selection
JCUDA         | Java  | -   | CUDA                | GPU only
Lime          | Lime  | ✔   | Override map/reduce | Static
Firepile      | Scala | ✔   | reduce              | Static
JaBEE         | Java  | ✔   | Override run        | GPU only
Aparapi       | Java  | ✔   | map                 | Static
Hadoop-CL     | Java  | ✔   | Override map/reduce | Static
RootBeer      | Java  | ✔   | Override run        | Not described
HJ-OpenCL     | HJ    | -   | forall              | Static
PPPJ09 (auto) | Java  | ✔   | For-loop            | Dynamic with regression
Our work      | Java  | ✔   | Parallel Stream     | Dynamic with machine learning

None of these approaches considers Java 8 Parallel Stream APIs together with dynamic device selection based on machine learning.
24. Conclusions
    Machine-learning-based performance heuristics achieve up to 99% accuracy and are a promising way to build performance heuristics.
    Future work:
    - Exploration of program features (e.g., the CFG)
    - Selection of the best configuration from 1 worker, 2 workers, …, 160 workers, or the GPU
    - Parallelizing the training phase
    For more details on GPU code generation, see "Compiling and Optimizing Java 8 Programs for GPU Execution", PACT'15, October 2015.