Machine-Learning-based
Performance Heuristics
for Runtime CPU/GPU Selection
Akihiro Hayashi (Rice University)
Kazuaki Ishizaki (IBM Research - Tokyo)
Gita Koblents (IBM Canada)
Vivek Sarkar (Rice University)
ACM International Conference on Principles and Practices of
Programming on the Java Platform: virtual machines, languages, and tools (PPPJ’15)
Background:
Java 8 Parallel Streams APIs
Explicit Parallelism with lambda
expressions
IntStream.range(0, N)
    .parallel()
    .forEach(i -> <lambda>)
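As a concrete instance of the pattern above, the following minimal sketch fills in the lambda with an element-wise vector addition. The class name `ParallelVecAdd` and the method `add` are hypothetical names chosen for illustration; only `IntStream.range`, `parallel`, and `forEach` come from the slide.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class ParallelVecAdd {
    // Computes c[i] = a[i] + b[i]. Each iteration writes a distinct element,
    // so the iterations are independent and may run in any order.
    static double[] add(double[] a, double[] b) {
        double[] c = new double[a.length];
        IntStream.range(0, a.length)
                 .parallel()
                 .forEach(i -> c[i] = a[i] + b[i]);
        return c;
    }

    public static void main(String[] args) {
        double[] c = add(new double[]{1, 2, 3}, new double[]{10, 20, 30});
        System.out.println(Arrays.toString(c)); // prints [11.0, 22.0, 33.0]
    }
}
```

Because the loop body has no cross-iteration dependence, a runtime is free to execute it on CPU worker threads or, as in this work, to offload it to a GPU.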
Background:
Explicit Parallelism with Java
High-level parallel programming with Java offers opportunities for
- preserving portability
- enabling the compiler to perform parallel-aware optimizations and code generation
[Figure: Java 8 programs (SW) use the Java 8 Parallel Stream API to target multi-core CPUs, many-core GPUs, and FPGAs (HW)]
Background:
JIT Compilation for GPU Execution
- IBM Java 8 Compiler
  Built on top of the production version of the IBM Java 8 runtime environment
[Figure: method A is interpreted on the JVM from its 1st to its Nth invocation; at the (N+1)th invocation, native code is generated for multi-core CPUs or many-core GPUs]
Background:
The compilation flow of IBM Java 8 Compiler
[Figure: compilation flow. JIT compiler: Java bytecode, bytecode translation, our IR, parallel streams identification in IR, analysis and optimizations (existing optimizations), target machine code generation, PowerPC native code. Our new modules for GPU: IR for parallel streams, feature extraction, NVVM IR generation, NVVM IR. Module by NVIDIA: libnvvm, PTX. PTX2binary module: NVIDIA GPU native code. Runtime helpers.]
Optimizations for GPUs
To improve performance
- Read-only cache utilization / buffer alignment
- Data transfer optimizations
To support Java’s language features
- Loop versioning for eliminating redundant exception checking on GPUs
- Virtual method invocation support with de-virtualization and loop versioning
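Loop versioning can be pictured at the source level as follows. This is only an illustrative sketch, not the compiler's actual transformation (which operates on its IR): a guard evaluated once before the loop decides whether any iteration could throw, and only the exception-free version is a candidate for offloading. The class `LoopVersioning` and the method `scale` are hypothetical.

```java
import java.util.stream.IntStream;

public class LoopVersioning {
    // Writes b[i] = 2 * a[i] for 0 <= i < n.
    // Returns true when the guarded (exception-free) version was chosen.
    static boolean scale(float[] a, float[] b, int n) {
        // Guard evaluated once: no null dereference or out-of-bounds access is possible.
        boolean safe = a != null && b != null
                && n >= 0 && n <= a.length && n <= b.length;
        if (safe) {
            // Fast version: no iteration can throw, so per-iteration checks are redundant.
            IntStream.range(0, n).parallel().forEach(i -> b[i] = 2.0f * a[i]);
        } else {
            // Fallback version: the original sequential loop, which preserves
            // Java's exception semantics (the exception is thrown at the first bad iteration).
            for (int i = 0; i < n; i++) {
                b[i] = 2.0f * a[i];
            }
        }
        return safe;
    }

    public static void main(String[] args) {
        float[] a = {1f, 2f, 3f};
        float[] b = new float[3];
        System.out.println(scale(a, b, 3)); // prints true
    }
}
```

The design point is that the check cost is paid once per loop rather than once per iteration, while programs that would throw still behave exactly as before.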
Motivation:
Runtime CPU/GPU Selection
Selecting the faster hardware device is a challenging problem
[Figure: at the (N+1)th invocation of method A, native code can be generated for multi-core CPUs or many-core GPUs. (Problem) Which one is faster?]
Related Work:
Linear regression
Regression-based cost estimation [1,2] is specific to an application
[Figure: kernel execution time (msec) vs. the dynamic number of IR instructions on the NVIDIA Tesla K40 GPU and the IBM POWER8 CPU, for App 1) BlackScholes and App 2) Vector Addition]
[1] Leung et al. Automatic Parallelization for Graphics Processing Units (PPPJ’09)
[2] Kerr et al. Modeling GPU-CPU Workloads and Systems (GPGPU-3)
Open Question:
Is an accurate cost model required?
Constructing an accurate cost model would take too much effort
Considerable effort would be needed to update performance models for future generations of hardware
[Figure: multi-core CPUs vs. many-core GPUs. Which one is faster?]
Machine-learning-based performance heuristics
Our Approach:
ML-based Performance Heuristics
A binary prediction model is constructed by
supervised machine learning with support
vector machines (SVMs)
[Figure: training runs with the JIT compiler perform feature extraction on bytecode and data sets (App A with data 1, App A with data 2, App B with data 3), producing features 1-3. Offline model construction with LIBSVM builds the prediction model, which the Java runtime uses to choose CPU or GPU.]
Features of program that may affect performance
- Loop range (parallel loop size)
- The dynamic number of instructions
  - Memory access
  - Arithmetic operations
  - Math methods
  - Branch instructions
  - Other instructions
Features of program that may affect performance (Cont’d)
- The dynamic number of array accesses
  - Coalesced access (a[i]) (aligned access)
  - Offset access (a[i+c])
  - Stride access (a[c*i])
  - Other access (a[b[i]])
- Data transfer size
  - H2D transfer size
  - D2H transfer size
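The four array access categories can be seen side by side in one loop body. The sketch below is only an illustration of the index expressions named above; the class `AccessPatterns`, the method `kernel`, and the input values are hypothetical.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class AccessPatterns {
    // One parallel loop whose body contains each access category once.
    // The caller must size 'a' so that i + c, c * i, and b[i] are all in bounds.
    static int[] kernel(int[] a, int[] b, int c, int n) {
        int[] out = new int[n];
        IntStream.range(0, n).parallel().forEach(i -> {
            int coalesced = a[i];     // coalesced access: neighbouring i read neighbouring elements
            int offset    = a[i + c]; // offset access: shifted by a constant c
            int stride    = a[c * i]; // stride access: neighbouring i are c elements apart
            int other     = a[b[i]];  // other access: index loaded from another array
            out[i] = coalesced + offset + stride + other;
        });
        return out;
    }

    public static void main(String[] args) {
        int[] a = IntStream.range(0, 16).toArray(); // a[k] = k
        int[] b = {3, 1, 4, 1, 5, 9, 2, 6};
        System.out.println(Arrays.toString(kernel(a, b, 2, 8)));
    }
}
```

Counting how often each category executes gives the per-category array access features listed above.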
[Figure: bandwidth (GB/s) vs. c (offset or stride size) for offset access array[i+c] and stride access array[c*i]]
An example of feature vector
{
  "title": "MRIQ.runGPULambda()V",
  "lineNo": 114,
  "features": {
    "range": 32768,
    "IRs": {
      "Memory": 89128, "Arithmetic": 61447, "Math": 6144,
      "Branch": 3074, "Other": 58384
    },
    "Array Accesses": {
      "Coalesced": 9218, "Offset": 0,
      "Stride": 0, "Other": 12288
    },
    "H2D Transfer": [131088, 131088, 131088, 12304, 12304, 12304, 12304, 16, 16],
    "D2H Transfer": [131072, 131072, 0, 0, 0, 0, 0, 0, 0, 0]
  }
}
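One way such a feature vector could be handed to LIBSVM is as a line in its sparse text format, `<label> <index>:<value> ...` with indices starting at 1. The sketch below makes several assumptions that the slides do not state: the feature order, summing each transfer list into a single value, and the label encoding (+1 for CPU, -1 for GPU). The class `FeatureLine` and its methods are hypothetical, and the label used in `main` is arbitrary.

```java
public class FeatureLine {
    // Sums a list of per-array transfer sizes (an assumption of this sketch).
    static long sum(long[] values) {
        long total = 0;
        for (long v : values) total += v;
        return total;
    }

    // Formats one sample as "<label> 1:<f1> 2:<f2> ...".
    static String toLibsvmLine(int label, long[] features) {
        StringBuilder sb = new StringBuilder(label > 0 ? "+1" : "-1");
        for (int k = 0; k < features.length; k++) {
            sb.append(' ').append(k + 1).append(':').append(features[k]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        long[] h2d = {131088, 131088, 131088, 12304, 12304, 12304, 12304, 16, 16};
        long[] d2h = {131072, 131072, 0, 0, 0, 0, 0, 0, 0, 0};
        long[] features = {
            32768,                       // range
            89128, 61447, 6144, 3074, 58384, // IRs: Memory, Arithmetic, Math, Branch, Other
            9218, 0, 0, 12288,           // array accesses: Coalesced, Offset, Stride, Other
            sum(h2d), sum(d2h)           // transfer sizes, summed (assumption)
        };
        // The label here is arbitrary; a real label comes from measuring which device was faster.
        System.out.println(toLibsvmLine(-1, features));
    }
}
```

Feature values span several orders of magnitude, so scaling them before training (for example with LIBSVM's `svm-scale` tool) is usually worthwhile.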
Applications
Application            Source     Field                Max Size         Data Type
BlackScholes                      Finance              4,194,304        double
Crypt                  JGF        Cryptography         Size C (N=50M)   byte
SpMM                   JGF        Numerical Computing  Size C (N=500K)  double
MRIQ                   Parboil    Medical              Large (64^3)     float
Gemm                   Polybench  Numerical Computing  2K x 2K          int
Gesummv                Polybench                       2K x 2K          int
Doitgen                Polybench                       256x256x256      int
Jacobi-1D              Polybench                       N=4M, T=1        int
Matrix Multiplication                                  2K x 2K          double
Matrix Transpose                                       2K x 2K          double
VecAdd                                                 4M               double
Platform
CPU
- IBM POWER8 @ 3.69GHz
- 20 cores
- 8 SMT threads per core = up to 160 threads
- 256 GB of RAM
GPU
- NVIDIA Tesla K40m
- 12GB of Global Memory
Prediction Model Construction
Obtained 291 samples by running 11
applications with different data sets
Choice is either GPU or 160 worker
threads on CPU
[Figure: training runs with the JIT compiler perform feature extraction on bytecode and data sets (App A with data 1, App A with data 2, App B with data 3); offline model construction with LIBSVM 3.2 builds the prediction model for the Java runtime]
Speedups and the accuracy with the max
data size: 160 worker threads vs. GPU
[Figure: speedup relative to sequential Java (log scale, higher is better) for each application; two bars per application: 160 worker threads (Fork/join) and GPU]
Bar labels in extraction order: 40.6, 37.4, 82.0, 64.2, 27.6, 1.4, 1.0, 4.4, 36.7, 7.4, 5.7, 42.7, 34.6, 58.1, 844.7, 772.3, 1.0, 0.1, 1.9, 1164.8, 9.0, 1.2
Prediction: x ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ (10 of the 11 applications predicted correctly)
How to evaluate prediction models:
TRUE/FALSE POSITIVE/NEGATIVE
4 types of binary prediction results
(Let POSITIVE be CPU and NEGATIVE be GPU)
- TRUE POSITIVE: correctly predicted that CPU is faster
- TRUE NEGATIVE: correctly predicted that GPU is faster
- FALSE POSITIVE: predicted that CPU is faster, but GPU is actually faster
- FALSE NEGATIVE: predicted that GPU is faster, but CPU is actually faster
How to evaluate prediction models:
Accuracy, Precision, and Recall metrics
Accuracy
- the percentage of selections predicted correctly:
- (TP + TN) / (TP + TN + FP + FN)
Precision X (X = CPU or GPU): how often the model is right when it predicts that X is faster
- Precision CPU : TP / (TP + FP)
  = # of samples correctly predicted that CPU is faster / total # of samples predicted that CPU is faster
- Precision GPU : TN / (TN + FN)
Recall X (X = CPU or GPU): how often the model predicts X when X is actually faster
- Recall CPU : TP / (TP + FN)
  = # of samples correctly predicted that CPU is faster / total # of samples labeled that CPU is actually faster
- Recall GPU : TN / (TN + FP)
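The metric definitions above translate directly into code. In this minimal sketch the class `Metrics`, its method names, and the confusion counts in `main` are all made up for illustration; they are not results from the study. Returning 0.0 for an empty denominator is an assumption of the sketch, relevant for a model that never predicts one of the two devices.

```java
public class Metrics {
    // Returns num / den, or 0.0 when den is 0 (assumption: report 0 for an undefined ratio).
    static double ratio(int num, int den) {
        return den == 0 ? 0.0 : (double) num / den;
    }

    // POSITIVE = CPU, NEGATIVE = GPU, as defined on the slide.
    static double accuracy(int tp, int tn, int fp, int fn) { return ratio(tp + tn, tp + tn + fp + fn); }
    static double precisionCpu(int tp, int fp) { return ratio(tp, tp + fp); }
    static double recallCpu(int tp, int fn)    { return ratio(tp, tp + fn); }
    static double precisionGpu(int tn, int fn) { return ratio(tn, tn + fn); }
    static double recallGpu(int tn, int fp)    { return ratio(tn, tn + fp); }

    public static void main(String[] args) {
        int tp = 8, tn = 6, fp = 2, fn = 4; // illustrative counts only
        System.out.printf("accuracy=%.3f%n", accuracy(tp, tn, fp, fn));
        System.out.printf("precision CPU=%.3f recall CPU=%.3f%n", precisionCpu(tp, fp), recallCpu(tp, fn));
        System.out.printf("precision GPU=%.3f recall GPU=%.3f%n", precisionGpu(tn, fn), recallGpu(tn, fp));
    }
}
```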
How to evaluate prediction models:
5-fold cross validation
Overfitting problem:
- The prediction model may be tailored to the eleven applications if training data = testing data
To avoid the overfitting problem:
- Calculate the accuracy of the prediction model using 5-fold cross validation
[Figure: the TRAINING DATA (291 samples) are split into Subsets 1-5. Build a prediction model trained on Subsets 2-5 and use Subset 1 for TESTING (Accuracy: X%, Precision: Y%, ...). Build a prediction model trained on Subsets 1 and 3-5 and use Subset 2 for TESTING (Accuracy: P%, Precision: Q%, ...).]
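The splitting step of k-fold cross validation can be sketched as below. The class `KFold`, the method `split`, and the fixed shuffle seed are hypothetical; the slides do not say how samples were assigned to subsets, so the random round-robin assignment here is an assumption.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class KFold {
    // Shuffles sample indices 0..nSamples-1 and deals them into k folds.
    // In round f, fold f is the test set and the other k-1 folds are the training set.
    static List<List<Integer>> split(int nSamples, int k, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < nSamples; i++) indices.add(i);
        Collections.shuffle(indices, new Random(seed));

        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) folds.add(new ArrayList<>());
        for (int i = 0; i < nSamples; i++) folds.get(i % k).add(indices.get(i));
        return folds;
    }

    public static void main(String[] args) {
        List<List<Integer>> folds = split(291, 5, 42L);
        for (int f = 0; f < folds.size(); f++) {
            int test = folds.get(f).size();
            System.out.println("round " + (f + 1) + ": test=" + test + ", train=" + (291 - test));
        }
    }
}
```

With 291 samples and 5 folds, one fold holds 59 samples and the other four hold 58. LIBSVM's `svm-train` tool also offers n-fold cross validation through its `-v` option, which may be simpler than splitting by hand.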
Accuracies with cross-validation:
160 worker threads or GPU
Accuracy (%), total number of samples = 291 (higher is better)
Feature set  Accuracy
Range        79.0%
+=nIRs       97.6%
+=dIRs       99.0%
+=Array      99.0%
ALL (+=DT)   97.2%
Precisions and Recalls
with cross-validation
          Precision CPU  Recall CPU  Precision GPU  Recall GPU
Range     79.0%          100%        0%             0%
+=nIRs    97.8%          99.1%       96.5%          91.8%
+=dIRs    98.7%          100%        100%           95.0%
+=Arrays  98.7%          100%        100%           95.0%
ALL       96.7%          100%        100%           86.9%
Higher is better
All prediction models except Range rarely make a bad decision
Discussion
- Based on results with 291 samples, (range, # of detailed instructions, # of array accesses) shows the best accuracy
  - Data transfer size (DT) does not contribute to improving the accuracy since the DT optimizations do not make GPUs faster
- Pros and cons
  - (+) Future generations of hardware can be supported easily by re-running applications
  - (+) Just add more training data to rebuild a prediction model
  - (-) Takes time to collect training data
- Loop range, # of arithmetic operations, and # of coalesced accesses affect the decision
Related Work:
Java + GPU
Approach       Lang   JIT  GPU Kernel           Device Selection
JCUDA          Java   -    CUDA                 GPU only
Lime           Lime   ✔    Override map/reduce  Static
Firepile       Scala  ✔    reduce               Static
JaBEE          Java   ✔    Override run         GPU only
Aparapi        Java   ✔    map                  Static
Hadoop-CL      Java   ✔    Override map/reduce  Static
RootBeer       Java   ✔    Override run         Not Described
HJ-OpenCL      HJ     -    forall               Static
PPPJ09 (auto)  Java   ✔    For-loop             Dynamic with Regression
Our Work       Java   ✔    Parallel Stream      Dynamic with Machine Learning
None of the prior approaches considers the Java 8 Parallel Stream APIs together with dynamic device selection based on machine learning
Conclusions
Machine-learning-based Performance Heuristics
- Up to 99% accuracy
- Promising way to build performance heuristics
Future Work
- Exploration of features of program (e.g. CFG)
- Selection of the best configuration from 1 worker, 2 workers, ..., 160 workers, GPU
- Parallelizing the training phase
For more details on GPU code generation
- “Compiling and Optimizing Java 8 Programs for GPU Execution”, PACT’15, October 2015
Acknowledgements
Special thanks to
IBM CAS
Marcel Mitran (IBM Canada)
Jimmy Kwa (IBM Canada)
Habanero Group at Rice
Yoichi Matsuyama (CMU)

Más contenido relacionado

La actualidad más candente

Ehsan parallel accelerator-dec2015
Ehsan parallel accelerator-dec2015Ehsan parallel accelerator-dec2015
Ehsan parallel accelerator-dec2015Christian Peel
 
HPAT presentation at JuliaCon 2016
HPAT presentation at JuliaCon 2016HPAT presentation at JuliaCon 2016
HPAT presentation at JuliaCon 2016Ehsan Totoni
 
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...NVIDIA Taiwan
 
Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUs
Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUsEarly Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUs
Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUsJeff Larkin
 
Tokyo Webmining Talk1
Tokyo Webmining Talk1Tokyo Webmining Talk1
Tokyo Webmining Talk1Kenta Oono
 
Using GPUs to handle Big Data with Java by Adam Roberts.
Using GPUs to handle Big Data with Java by Adam Roberts.Using GPUs to handle Big Data with Java by Adam Roberts.
Using GPUs to handle Big Data with Java by Adam Roberts.J On The Beach
 
PythonとAutoML at PyConJP 2019
PythonとAutoML at PyConJP 2019PythonとAutoML at PyConJP 2019
PythonとAutoML at PyConJP 2019Masashi Shibata
 
GTC16 - S6510 - Targeting GPUs with OpenMP 4.5
GTC16 - S6510 - Targeting GPUs with OpenMP 4.5GTC16 - S6510 - Targeting GPUs with OpenMP 4.5
GTC16 - S6510 - Targeting GPUs with OpenMP 4.5Jeff Larkin
 
Intro to Machine Learning for GPUs
Intro to Machine Learning for GPUsIntro to Machine Learning for GPUs
Intro to Machine Learning for GPUsSri Ambati
 
GPU and Deep learning best practices
GPU and Deep learning best practicesGPU and Deep learning best practices
GPU and Deep learning best practicesLior Sidi
 
Understand and Harness the Capabilities of Intel® Xeon Phi™ Processors
Understand and Harness the Capabilities of Intel® Xeon Phi™ ProcessorsUnderstand and Harness the Capabilities of Intel® Xeon Phi™ Processors
Understand and Harness the Capabilities of Intel® Xeon Phi™ ProcessorsIntel® Software
 
Easy and High Performance GPU Programming for Java Programmers
Easy and High Performance GPU Programming for Java ProgrammersEasy and High Performance GPU Programming for Java Programmers
Easy and High Performance GPU Programming for Java ProgrammersKazuaki Ishizaki
 
Profiling deep learning network using NVIDIA nsight systems
Profiling deep learning network using NVIDIA nsight systemsProfiling deep learning network using NVIDIA nsight systems
Profiling deep learning network using NVIDIA nsight systemsJack (Jaegeun) Han
 
GPUIterator: Bridging the Gap between Chapel and GPU Platforms
GPUIterator: Bridging the Gap between Chapel and GPU PlatformsGPUIterator: Bridging the Gap between Chapel and GPU Platforms
GPUIterator: Bridging the Gap between Chapel and GPU PlatformsAkihiro Hayashi
 
GTC16 - S6410 - Comparing OpenACC 2.5 and OpenMP 4.5
GTC16 - S6410 - Comparing OpenACC 2.5 and OpenMP 4.5GTC16 - S6410 - Comparing OpenACC 2.5 and OpenMP 4.5
GTC16 - S6410 - Comparing OpenACC 2.5 and OpenMP 4.5Jeff Larkin
 
Hardware Acceleration of SVM Training for Real-time Embedded Systems: An Over...
Hardware Acceleration of SVM Training for Real-time Embedded Systems: An Over...Hardware Acceleration of SVM Training for Real-time Embedded Systems: An Over...
Hardware Acceleration of SVM Training for Real-time Embedded Systems: An Over...Ilham Amezzane
 
Chainer ui v0.3 and imagereport
Chainer ui v0.3 and imagereportChainer ui v0.3 and imagereport
Chainer ui v0.3 and imagereportPreferred Networks
 
NVIDIA 深度學習教育機構 (DLI): Image segmentation with tensorflow
NVIDIA 深度學習教育機構 (DLI): Image segmentation with tensorflowNVIDIA 深度學習教育機構 (DLI): Image segmentation with tensorflow
NVIDIA 深度學習教育機構 (DLI): Image segmentation with tensorflowNVIDIA Taiwan
 

La actualidad más candente (19)

Ehsan parallel accelerator-dec2015
Ehsan parallel accelerator-dec2015Ehsan parallel accelerator-dec2015
Ehsan parallel accelerator-dec2015
 
HPAT presentation at JuliaCon 2016
HPAT presentation at JuliaCon 2016HPAT presentation at JuliaCon 2016
HPAT presentation at JuliaCon 2016
 
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
 
Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUs
Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUsEarly Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUs
Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUs
 
Tokyo Webmining Talk1
Tokyo Webmining Talk1Tokyo Webmining Talk1
Tokyo Webmining Talk1
 
Using GPUs to handle Big Data with Java by Adam Roberts.
Using GPUs to handle Big Data with Java by Adam Roberts.Using GPUs to handle Big Data with Java by Adam Roberts.
Using GPUs to handle Big Data with Java by Adam Roberts.
 
PythonとAutoML at PyConJP 2019
PythonとAutoML at PyConJP 2019PythonとAutoML at PyConJP 2019
PythonとAutoML at PyConJP 2019
 
GTC16 - S6510 - Targeting GPUs with OpenMP 4.5
GTC16 - S6510 - Targeting GPUs with OpenMP 4.5GTC16 - S6510 - Targeting GPUs with OpenMP 4.5
GTC16 - S6510 - Targeting GPUs with OpenMP 4.5
 
Intro to Machine Learning for GPUs
Intro to Machine Learning for GPUsIntro to Machine Learning for GPUs
Intro to Machine Learning for GPUs
 
GPU and Deep learning best practices
GPU and Deep learning best practicesGPU and Deep learning best practices
GPU and Deep learning best practices
 
Understand and Harness the Capabilities of Intel® Xeon Phi™ Processors
Understand and Harness the Capabilities of Intel® Xeon Phi™ ProcessorsUnderstand and Harness the Capabilities of Intel® Xeon Phi™ Processors
Understand and Harness the Capabilities of Intel® Xeon Phi™ Processors
 
Easy and High Performance GPU Programming for Java Programmers
Easy and High Performance GPU Programming for Java ProgrammersEasy and High Performance GPU Programming for Java Programmers
Easy and High Performance GPU Programming for Java Programmers
 
Profiling deep learning network using NVIDIA nsight systems
Profiling deep learning network using NVIDIA nsight systemsProfiling deep learning network using NVIDIA nsight systems
Profiling deep learning network using NVIDIA nsight systems
 
GPUIterator: Bridging the Gap between Chapel and GPU Platforms
GPUIterator: Bridging the Gap between Chapel and GPU PlatformsGPUIterator: Bridging the Gap between Chapel and GPU Platforms
GPUIterator: Bridging the Gap between Chapel and GPU Platforms
 
GTC16 - S6410 - Comparing OpenACC 2.5 and OpenMP 4.5
GTC16 - S6410 - Comparing OpenACC 2.5 and OpenMP 4.5GTC16 - S6410 - Comparing OpenACC 2.5 and OpenMP 4.5
GTC16 - S6410 - Comparing OpenACC 2.5 and OpenMP 4.5
 
Hardware Acceleration of SVM Training for Real-time Embedded Systems: An Over...
Hardware Acceleration of SVM Training for Real-time Embedded Systems: An Over...Hardware Acceleration of SVM Training for Real-time Embedded Systems: An Over...
Hardware Acceleration of SVM Training for Real-time Embedded Systems: An Over...
 
Chainer ui v0.3 and imagereport
Chainer ui v0.3 and imagereportChainer ui v0.3 and imagereport
Chainer ui v0.3 and imagereport
 
Hadoop + GPU
Hadoop + GPUHadoop + GPU
Hadoop + GPU
 
NVIDIA 深度學習教育機構 (DLI): Image segmentation with tensorflow
NVIDIA 深度學習教育機構 (DLI): Image segmentation with tensorflowNVIDIA 深度學習教育機構 (DLI): Image segmentation with tensorflow
NVIDIA 深度學習教育機構 (DLI): Image segmentation with tensorflow
 

Destacado

2016 04-19 machine learning
2016 04-19 machine learning2016 04-19 machine learning
2016 04-19 machine learningMark Reynolds
 
Making Hardware Accelerator Easier to Use
Making Hardware Accelerator Easier to UseMaking Hardware Accelerator Easier to Use
Making Hardware Accelerator Easier to UseKazuaki Ishizaki
 
Machine Learning under Attack: Vulnerability Exploitation and Security Measures
Machine Learning under Attack: Vulnerability Exploitation and Security MeasuresMachine Learning under Attack: Vulnerability Exploitation and Security Measures
Machine Learning under Attack: Vulnerability Exploitation and Security MeasuresPluribus One
 
Machine Learning with Applications in Categorization, Popularity and Sequence...
Machine Learning with Applications in Categorization, Popularity and Sequence...Machine Learning with Applications in Categorization, Popularity and Sequence...
Machine Learning with Applications in Categorization, Popularity and Sequence...Nicolas Nicolov
 
Application of machine learning in industrial applications
Application of machine learning in industrial applicationsApplication of machine learning in industrial applications
Application of machine learning in industrial applicationsAnish Das
 
Artificial Intelligence, Machine Learning and Deep Learning
Artificial Intelligence, Machine Learning and Deep LearningArtificial Intelligence, Machine Learning and Deep Learning
Artificial Intelligence, Machine Learning and Deep LearningSujit Pal
 

Destacado (6)

2016 04-19 machine learning
2016 04-19 machine learning2016 04-19 machine learning
2016 04-19 machine learning
 
Making Hardware Accelerator Easier to Use
Making Hardware Accelerator Easier to UseMaking Hardware Accelerator Easier to Use
Making Hardware Accelerator Easier to Use
 
Machine Learning under Attack: Vulnerability Exploitation and Security Measures
Machine Learning under Attack: Vulnerability Exploitation and Security MeasuresMachine Learning under Attack: Vulnerability Exploitation and Security Measures
Machine Learning under Attack: Vulnerability Exploitation and Security Measures
 
Machine Learning with Applications in Categorization, Popularity and Sequence...
Machine Learning with Applications in Categorization, Popularity and Sequence...Machine Learning with Applications in Categorization, Popularity and Sequence...
Machine Learning with Applications in Categorization, Popularity and Sequence...
 
Application of machine learning in industrial applications
Application of machine learning in industrial applicationsApplication of machine learning in industrial applications
Application of machine learning in industrial applications
 
Artificial Intelligence, Machine Learning and Deep Learning
Artificial Intelligence, Machine Learning and Deep LearningArtificial Intelligence, Machine Learning and Deep Learning
Artificial Intelligence, Machine Learning and Deep Learning
 

Similar a Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection

Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...Akihiro Hayashi
 
Monte Carlo on GPUs
Monte Carlo on GPUsMonte Carlo on GPUs
Monte Carlo on GPUsfcassier
 
lecture_GPUArchCUDA04-OpenMPHOMP.pdf
lecture_GPUArchCUDA04-OpenMPHOMP.pdflecture_GPUArchCUDA04-OpenMPHOMP.pdf
lecture_GPUArchCUDA04-OpenMPHOMP.pdfTigabu Yaya
 
How to use Apache TVM to optimize your ML models
How to use Apache TVM to optimize your ML modelsHow to use Apache TVM to optimize your ML models
How to use Apache TVM to optimize your ML modelsDatabricks
 
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDSAccelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDSDatabricks
 
Application Optimisation using OpenPOWER and Power 9 systems
Application Optimisation using OpenPOWER and Power 9 systemsApplication Optimisation using OpenPOWER and Power 9 systems
Application Optimisation using OpenPOWER and Power 9 systemsGanesan Narayanasamy
 
Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLinside-BigData.com
 
GPGPU Accelerates PostgreSQL (English)
GPGPU Accelerates PostgreSQL (English)GPGPU Accelerates PostgreSQL (English)
GPGPU Accelerates PostgreSQL (English)Kohei KaiGai
 
Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Backend.AI Technical Introduction (19.09 / 2019 Autumn)Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Backend.AI Technical Introduction (19.09 / 2019 Autumn)Lablup Inc.
 
Labview1_ Computer Applications in Control_ACRRL
Labview1_ Computer Applications in Control_ACRRLLabview1_ Computer Applications in Control_ACRRL
Labview1_ Computer Applications in Control_ACRRLMohammad Sabouri
 
PGI Compilers & Tools Update- March 2018
PGI Compilers & Tools Update- March 2018PGI Compilers & Tools Update- March 2018
PGI Compilers & Tools Update- March 2018NVIDIA
 
Parallel Application Performance Prediction of Using Analysis Based Modeling
Parallel Application Performance Prediction of Using Analysis Based ModelingParallel Application Performance Prediction of Using Analysis Based Modeling
Parallel Application Performance Prediction of Using Analysis Based ModelingJason Liu
 
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...AMD Developer Central
 
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~Kohei KaiGai
 
Parallelism in a NumPy-based program
Parallelism in a NumPy-based programParallelism in a NumPy-based program
Parallelism in a NumPy-based programRalf Gommers
 
20150318-SFPUG-Meetup-PGStrom
20150318-SFPUG-Meetup-PGStrom20150318-SFPUG-Meetup-PGStrom
20150318-SFPUG-Meetup-PGStromKohei KaiGai
 
JVM and OS Tuning for accelerating Spark application
JVM and OS Tuning for accelerating Spark applicationJVM and OS Tuning for accelerating Spark application
JVM and OS Tuning for accelerating Spark applicationTatsuhiro Chiba
 
Fugaku, the Successes and the Lessons Learned
Fugaku, the Successes and the Lessons LearnedFugaku, the Successes and the Lessons Learned
Fugaku, the Successes and the Lessons LearnedRCCSRENKEI
 

Similar a Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection (20)

Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...
 
Monte Carlo on GPUs
Monte Carlo on GPUsMonte Carlo on GPUs
Monte Carlo on GPUs
 
lecture_GPUArchCUDA04-OpenMPHOMP.pdf
lecture_GPUArchCUDA04-OpenMPHOMP.pdflecture_GPUArchCUDA04-OpenMPHOMP.pdf
lecture_GPUArchCUDA04-OpenMPHOMP.pdf
 
Balancing Power & Performance Webinar
Balancing Power & Performance WebinarBalancing Power & Performance Webinar
Balancing Power & Performance Webinar
 
How to use Apache TVM to optimize your ML models
How to use Apache TVM to optimize your ML modelsHow to use Apache TVM to optimize your ML models
How to use Apache TVM to optimize your ML models
 
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDSAccelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
 
Application Optimisation using OpenPOWER and Power 9 systems
Application Optimisation using OpenPOWER and Power 9 systemsApplication Optimisation using OpenPOWER and Power 9 systems
Application Optimisation using OpenPOWER and Power 9 systems
 
Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and ML
 
GPGPU Accelerates PostgreSQL (English)
GPGPU Accelerates PostgreSQL (English)GPGPU Accelerates PostgreSQL (English)
GPGPU Accelerates PostgreSQL (English)
 
SNAP MACHINE LEARNING
SNAP MACHINE LEARNINGSNAP MACHINE LEARNING
SNAP MACHINE LEARNING
 
Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Backend.AI Technical Introduction (19.09 / 2019 Autumn)Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Backend.AI Technical Introduction (19.09 / 2019 Autumn)
 
Labview1_ Computer Applications in Control_ACRRL
Labview1_ Computer Applications in Control_ACRRLLabview1_ Computer Applications in Control_ACRRL
Labview1_ Computer Applications in Control_ACRRL
 
PGI Compilers & Tools Update- March 2018
PGI Compilers & Tools Update- March 2018PGI Compilers & Tools Update- March 2018
PGI Compilers & Tools Update- March 2018
 
Parallel Application Performance Prediction of Using Analysis Based Modeling
Parallel Application Performance Prediction of Using Analysis Based ModelingParallel Application Performance Prediction of Using Analysis Based Modeling
Parallel Application Performance Prediction of Using Analysis Based Modeling
 
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
 
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~
 
Parallelism in a NumPy-based program
Parallelism in a NumPy-based programParallelism in a NumPy-based program
Parallelism in a NumPy-based program
 
20150318-SFPUG-Meetup-PGStrom
20150318-SFPUG-Meetup-PGStrom20150318-SFPUG-Meetup-PGStrom
20150318-SFPUG-Meetup-PGStrom
 
JVM and OS Tuning for accelerating Spark application
JVM and OS Tuning for accelerating Spark applicationJVM and OS Tuning for accelerating Spark application
JVM and OS Tuning for accelerating Spark application
 
Fugaku, the Successes and the Lessons Learned
Fugaku, the Successes and the Lessons LearnedFugaku, the Successes and the Lessons Learned
Fugaku, the Successes and the Lessons Learned
 

Más de Akihiro Hayashi

Chapel-on-X: Exploring Tasking Runtimes for PGAS Languages
Chapel-on-X: Exploring Tasking Runtimes for PGAS LanguagesChapel-on-X: Exploring Tasking Runtimes for PGAS Languages
Chapel-on-X: Exploring Tasking Runtimes for PGAS LanguagesAkihiro Hayashi
 
Introduction to Polyhedral Compilation
Introduction to Polyhedral CompilationIntroduction to Polyhedral Compilation
Introduction to Polyhedral CompilationAkihiro Hayashi
 
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...Akihiro Hayashi
 
LLVM-based Communication Optimizations for PGAS Programs
LLVM-based Communication Optimizations for PGAS ProgramsLLVM-based Communication Optimizations for PGAS Programs
LLVM-based Communication Optimizations for PGAS ProgramsAkihiro Hayashi
 
Studies on Automatic Parallelization for Heterogeneous and Homogeneous Multi...

Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection

  • 1. Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection. Akihiro Hayashi (Rice University), Kazuaki Ishizaki (IBM Research - Tokyo), Gita Koblents (IBM Canada), Vivek Sarkar (Rice University). ACM International Conference on Principles and Practices of Programming on the Java Platform: virtual machines, languages, and tools (PPPJ’15)
  • 2. Background: Java 8 Parallel Streams APIs. Explicit parallelism with lambda expressions: IntStream.range(0, N).parallel().forEach(i -> <lambda>)
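A minimal runnable sketch of the pattern on slide 2, assuming vector addition as the lambda body (VecAdd is one of the benchmarked applications). The class name `VecAddExample` and method `add` are illustrative and do not come from the slides. On a stock JVM this runs on CPU worker threads; GPU offload of such a loop depends on the JIT compiler described in the talk.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Illustrative example; names are assumptions, not taken from the slides.
final class VecAddExample {
    // Element-wise vector addition expressed as a Java 8 parallel stream.
    // Each index i is written by exactly one iteration, so no synchronization is needed.
    static double[] add(double[] a, double[] b) {
        double[] c = new double[a.length];
        IntStream.range(0, a.length)
                 .parallel()
                 .forEach(i -> c[i] = a[i] + b[i]);
        return c;
    }

    public static void main(String[] args) {
        double[] c = add(new double[]{1, 2, 3}, new double[]{10, 20, 30});
        System.out.println(Arrays.toString(c)); // prints [11.0, 22.0, 33.0]
    }
}
```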
  • 3. Background: Explicit Parallelism with Java. High-level parallel programming with Java offers opportunities for preserving portability and for enabling the compiler to perform parallel-aware optimizations and code generation. [Figure: Java 8 programs (SW) use the Java 8 Parallel Stream API to target multi-core CPUs, many-core GPUs, and FPGAs (HW).]
  • 4. Background: JIT Compilation for GPU Execution. The IBM Java 8 compiler is built on top of the production version of the IBM Java 8 runtime environment. [Figure: method A is interpreted on the JVM for its 1st through Nth invocations; native code generation for multi-core CPUs or many-core GPUs follows at the (N+1)th invocation.]
  • 5. Background: The compilation flow of IBM Java 8 Compiler. [Figure: compilation flow from Java bytecode to native code. JIT compiler stages: bytecode translation into our IR, parallel streams identification in IR, analysis and optimizations (existing optimizations), and target machine code generation (PowerPC native code). Our new modules for GPU: feature extraction, NVVM IR generation from the IR for parallel streams, and runtime helpers. Modules by NVIDIA: libnvvm (NVVM IR to PTX) and the PTX2binary module (PTX to NVIDIA GPU native code).]
    Optimizations for GPUs. To improve performance: read-only cache utilization / buffer alignment; data transfer optimizations. To support Java’s language features: loop versioning for eliminating redundant exception checking on GPUs; virtual method invocation support with de-virtualization and loop versioning.
  • 6. Motivation: Runtime CPU/GPU Selection. Selecting the faster hardware device is a challenging problem. [Figure: at the (N+1)th invocation of method A, native code can be generated for multi-core CPUs or many-core GPUs.] (Problem) Which one is faster?
  • 7. Related Work: Linear regression. Regression-based cost estimation [1,2] is specific to an application. [Figure: kernel execution time (msec) versus the dynamic number of IR instructions on an NVIDIA Tesla K40 GPU and an IBM POWER8 CPU, for App 1) BlackScholes and App 2) Vector Addition.] [1] Leung et al. Automatic Parallelization for Graphics Processing Units (PPPJ’09). [2] Kerr et al. Modeling GPU-CPU Workloads and Systems (GPGPU-3).
  • 8. Open Question: Is an accurate cost model required? Constructing an accurate cost model would be too costly: considerable effort would be needed to update performance models for future generations of hardware. Which one is faster, multi-core CPUs or many-core GPUs? Machine-learning-based performance heuristics are proposed instead.
  • 9. Our Approach: ML-based Performance Heuristics. A binary prediction model is constructed by supervised machine learning with support vector machines (SVMs). [Figure: offline model construction. Training runs with the JIT compiler perform feature extraction on each bytecode/data pair (App A with data 1, App A with data 2, App B with data 3), yielding features 1 to 3; LIBSVM builds the prediction model from these features, and the Java runtime uses the model to choose between CPU and GPU.]
  • 10. Features of program that may affect performance. Loop range (parallel loop size). The dynamic number of instructions: memory access, arithmetic operations, math methods, branch instructions, other instructions.
  • 11. Features of program that may affect performance (Cont’d). The dynamic number of array accesses: coalesced access (a[i]) (aligned access), offset access (a[i+c]), stride access (a[c*i]), other access (a[b[i]]). Data transfer size: H2D transfer size, D2H transfer size. [Figure: bandwidth (GB/s) versus c, the offset or stride size, for offset access array[i+c] and stride access array[c*i].]
  • 12. An example of feature vector
      {
        "title": "MRIQ.runGPULambda()V",
        "lineNo": 114,
        "features": {
          "range": 32768,
          "IRs": { "Memory": 89128, "Arithmetic": 61447, "Math": 6144, "Branch": 3074, "Other": 58384 },
          "Array Accesses": { "Coalesced": 9218, "Offset": 0, "Stride": 0, "Other": 12288 },
          "H2D Transfer": [131088, 131088, 131088, 12304, 12304, 12304, 12304, 16, 16],
          "D2H Transfer": [131072, 131072, 0, 0, 0, 0, 0, 0, 0, 0]
        }
      }
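One way such a record could be turned into a numeric input for an SVM is sketched below. This is an assumption for illustration only: the slides do not say how the features are encoded, and summing the per-array H2D and D2H transfer sizes into single values, like the class name `FeatureVector` and its methods, is a choice made here rather than the authors' method.

```java
import java.util.Arrays;

// Hypothetical container for the features listed on slide 12.
final class FeatureVector {
    final long range;            // parallel loop size
    final long[] irCounts;       // memory, arithmetic, math, branch, other
    final long[] arrayAccesses;  // coalesced, offset, stride, other
    final long[] h2dTransfer;    // host-to-device transfer sizes, one per array
    final long[] d2hTransfer;    // device-to-host transfer sizes, one per array

    FeatureVector(long range, long[] irCounts, long[] arrayAccesses,
                  long[] h2dTransfer, long[] d2hTransfer) {
        this.range = range;
        this.irCounts = irCounts;
        this.arrayAccesses = arrayAccesses;
        this.h2dTransfer = h2dTransfer;
        this.d2hTransfer = d2hTransfer;
    }

    // Flattens the record into one numeric vector.
    // Assumption: transfer sizes are summed into one H2D and one D2H value.
    double[] flatten() {
        double[] v = new double[1 + irCounts.length + arrayAccesses.length + 2];
        int k = 0;
        v[k++] = range;
        for (long c : irCounts) v[k++] = c;
        for (long c : arrayAccesses) v[k++] = c;
        v[k++] = Arrays.stream(h2dTransfer).sum();
        v[k] = Arrays.stream(d2hTransfer).sum();
        return v;
    }

    // The MRIQ record from slide 12, copied value for value.
    static FeatureVector mriqExample() {
        return new FeatureVector(
            32768,
            new long[]{89128, 61447, 6144, 3074, 58384},
            new long[]{9218, 0, 0, 12288},
            new long[]{131088, 131088, 131088, 12304, 12304, 12304, 12304, 16, 16},
            new long[]{131072, 131072, 0, 0, 0, 0, 0, 0, 0, 0});
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(mriqExample().flatten()));
    }
}
```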
  • 13. Applications ("-" marks a cell with no value in the transcript)
      Application | Source | Field | Max Size | Data Type
      BlackScholes | - | Finance | 4,194,304 | double
      Crypt | JGF | Cryptography | Size C (N=50M) | byte
      SpMM | JGF | Numerical Computing | Size C (N=500K) | double
      MRIQ | Parboil | Medical | Large (64^3) | float
      Gemm | Polybench | Numerical Computing | 2K x 2K | int
      Gesummv | Polybench | - | 2K x 2K | int
      Doitgen | Polybench | - | 256x256x256 | int
      Jacobi-1D | Polybench | - | N=4M, T=1 | int
      Matrix Multiplication | - | - | 2K x 2K | double
      Matrix Transpose | - | - | 2K x 2K | double
      VecAdd | - | - | 4M | double
  • 14. Platform CPU IBM POWER8 @ 3.69GHz 20-cores 8 SMT threads per cores = up to 160 threads 256 GB of RAM GPU NVIDIA Tesla K40m 12GB of Global Memory 14
  • 15. Prediction Model Construction. Obtained 291 samples by running 11 applications with different data sets. The choice is either the GPU or 160 worker threads on the CPU. [Figure: offline model construction as on slide 9, with training runs, feature extraction, and LIBSVM 3.2 producing the prediction model used by the Java runtime.]
  • 16. Speedups and the accuracy with the max data size: 160 worker threads vs. GPU. [Figure: bar chart of speedup relative to sequential Java (log scale, higher is better). Data labels in chart order, 160 worker threads (fork/join): 40.6, 37.4, 82.0, 64.2, 27.6, 1.4, 1.0, 4.4, 36.7, 7.4, 5.7. GPU: 42.7, 34.6, 58.1, 844.7, 772.3, 1.0, 0.1, 1.9, 1164.8, 9.0, 1.2. Prediction: x for the first entry, ✔ for the remaining ten.]
  • 17. How to evaluate prediction models: TRUE/FALSE POSITIVE/NEGATIVE. There are 4 types of binary prediction results (let POSITIVE be CPU and NEGATIVE be GPU). TRUE POSITIVE: correctly predicted that CPU is faster. TRUE NEGATIVE: correctly predicted that GPU is faster. FALSE POSITIVE: predicted that CPU is faster, but GPU is actually faster. FALSE NEGATIVE: predicted that GPU is faster, but CPU is actually faster.
  • 18. How to evaluate prediction models: Accuracy, Precision, and Recall metrics
      Accuracy: the percentage of selections predicted correctly: (TP + TN) / (TP + TN + FP + FN)
      Precision X (X = CPU or GPU): how precise the model is when it predicts that X is faster.
        Precision CPU: TP / (TP + FP) = # of samples correctly predicted that CPU is faster / total # of samples predicted that CPU is faster
        Precision GPU: TN / (TN + FN)
      Recall X (X = CPU or GPU): how often the prediction is correct when X is actually faster.
        Recall CPU: TP / (TP + FN) = # of samples correctly predicted that CPU is faster / total # of samples labeled that CPU is actually faster
        Recall GPU: TN / (TN + FP)
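The formulas on slide 18 can be sketched directly in Java, with POSITIVE = CPU and NEGATIVE = GPU as on slide 17. The class name `SelectionMetrics`, the example counts in `main`, and the choice to return 0 when a denominator is zero (for instance, a model that never predicts GPU) are assumptions made for this illustration.

```java
// Illustrative helper; names and the zero-denominator convention are assumptions.
final class SelectionMetrics {
    // Returns num / den, or 0.0 when the denominator is zero.
    private static double ratio(int num, int den) {
        return den == 0 ? 0.0 : (double) num / den;
    }

    static double accuracy(int tp, int tn, int fp, int fn) {
        return ratio(tp + tn, tp + tn + fp + fn);
    }

    // Of all samples predicted "CPU is faster", the fraction that were right.
    static double precisionCpu(int tp, int fp) { return ratio(tp, tp + fp); }

    // Of all samples predicted "GPU is faster", the fraction that were right.
    static double precisionGpu(int tn, int fn) { return ratio(tn, tn + fn); }

    // Of all samples where CPU is actually faster, the fraction predicted as CPU.
    static double recallCpu(int tp, int fn) { return ratio(tp, tp + fn); }

    // Of all samples where GPU is actually faster, the fraction predicted as GPU.
    static double recallGpu(int tn, int fp) { return ratio(tn, tn + fp); }

    public static void main(String[] args) {
        // Hypothetical confusion counts, not taken from the evaluation.
        int tp = 6, tn = 2, fp = 1, fn = 1;
        System.out.println("accuracy      = " + accuracy(tp, tn, fp, fn));
        System.out.println("precision CPU = " + precisionCpu(tp, fp));
        System.out.println("recall CPU    = " + recallCpu(tp, fn));
        System.out.println("precision GPU = " + precisionGpu(tn, fn));
        System.out.println("recall GPU    = " + recallGpu(tn, fp));
    }
}
```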
  • 19. How to evaluate prediction models: 5-fold cross validation. Overfitting problem: the prediction model may be tailored to the eleven applications if training data = testing data. To avoid the overfitting problem, calculate the accuracy of the prediction model using 5-fold cross validation. [Figure: the training data (291 samples) is divided into Subsets 1 to 5. In one round, a prediction model is trained on Subsets 2-5 and Subset 1 is used for testing (Accuracy: X%, Precision: Y%, ...); in the next round, a model is trained on Subsets 1 and 3-5 and Subset 2 is used for testing (Accuracy: P%, Precision: Q%, ...).]
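The partitioning step of 5-fold cross validation can be sketched as follows. This shows only how 291 sample indices might be divided into five disjoint subsets; training and testing a model on each round is left out. The seeded shuffle, the round-robin assignment, and the names `KFoldSplit` and `folds` are assumptions, since the slides do not describe how the subsets were formed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustrative helper; the shuffling strategy is an assumption.
final class KFoldSplit {
    // Splits sample indices 0..n-1 into k disjoint folds of near-equal size.
    static List<List<Integer>> folds(int n, int k, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) indices.add(i);
        Collections.shuffle(indices, new Random(seed)); // seeded for reproducibility

        List<List<Integer>> result = new ArrayList<>();
        for (int f = 0; f < k; f++) result.add(new ArrayList<>());
        // Round-robin assignment keeps fold sizes within one sample of each other.
        for (int i = 0; i < n; i++) result.get(i % k).add(indices.get(i));
        return result;
    }

    public static void main(String[] args) {
        List<List<Integer>> folds = folds(291, 5, 42L);
        for (int test = 0; test < folds.size(); test++) {
            int trainSize = 291 - folds.get(test).size();
            // In each round, one fold is held out for testing and the rest train the model.
            System.out.println("round " + (test + 1) + ": test=" + folds.get(test).size()
                    + " train=" + trainSize);
        }
    }
}
```

Averaging the per-round accuracy, precision, and recall then gives the cross-validated figures.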
  • 20. Accuracies with cross-validation: 160 worker threads or GPU. [Figure: bar chart of accuracy (%), total number of samples = 291, higher is better. Range: 79.0%; +=nIRs: 97.6%; +=dIRs: 99.0%; +=Array: 99.0%; ALL (+=DT): 97.2%.]
  • 21. Precisions and Recalls with cross-validation (higher is better)
      Model | Precision CPU | Recall CPU | Precision GPU | Recall GPU
      Range | 79.0% | 100% | 0% | 0%
      +=nIRs | 97.8% | 99.1% | 96.5% | 91.8%
      +=dIRs | 98.7% | 100% | 100% | 95.0%
      +=Arrays | 98.7% | 100% | 100% | 95.0%
      ALL | 96.7% | 100% | 100% | 86.9%
      All prediction models except Range rarely make a bad decision.
  • 22. Discussion. Based on results with 291 samples, (range, # of detailed Insns, # of array accesses) shows the best accuracy. DT (data transfer size) does not contribute to improving the accuracy, since the DT optimizations do not make GPUs faster. Pros and cons: (+) future generations of hardware can be supported easily by re-running applications; (+) just add more training data to rebuild a prediction model; (-) it takes time to collect training data. Loop range, # of arithmetic operations, and # of coalesced accesses affect the decision.
  • 23. Related Work: Java + GPU
      Work | Lang | JIT | GPU Kernel | Device Selection
      JCUDA | Java | - | CUDA | GPU only
      Lime | Lime | ✔ | Override map/reduce | Static
      Firepile | Scala | ✔ | reduce | Static
      JaBEE | Java | ✔ | Override run | GPU only
      Aparapi | Java | ✔ | map | Static
      Hadoop-CL | Java | ✔ | Override map/reduce | Static
      RootBeer | Java | ✔ | Override run | Not Described
      HJ-OpenCL | HJ | - | forall | Static
      PPPJ09 (auto) | Java | ✔ | For-loop | Dynamic with Regression
      Our Work | Java | ✔ | Parallel Stream | Dynamic with Machine Learning
      None of these approaches considers Java 8 Parallel Stream APIs and dynamic device selection with machine learning.
  • 24. Conclusions. Machine-learning-based performance heuristics: up to 99% accuracy; a promising way to build performance heuristics. Future work: exploration of program features (e.g. CFG); selection of the best configuration from 1 worker, 2 workers, ..., 160 workers, GPU; parallelizing the training phase. For more details on GPU code generation: “Compiling and Optimizing Java 8 Programs for GPU execution”, PACT15, October 2015.
  • 25. Acknowledgements. Special thanks to IBM CAS, Marcel Mitran (IBM Canada), Jimmy Kwa (IBM Canada), the Habanero Group at Rice, and Yoichi Matsuyama (CMU).

Editor's notes

  1. Here is an example of a feature vector, expressed in JSON format.
  2. Convey special thanks to