Graphics processing units (GPUs) are becoming integral components of modern machine learning engines and platforms. These will provide an introduction to GPUs and their suitability for machine learning workloads. They also discuss enabling technologies, such as CUDA, and demonstrate GPU-accelerated machine learning with the H2O platform. These slides are targeted to machine learning practitioners new to GPUs.
Author: Wen Phan is a Senior Solutions Architect at H2O.ai. Wen works with customers and organizations to architect systems, smarter applications, and data products to make better decisions, achieve positive outcomes, and transform the way they do business. Internally, Wen uses his hard-earned field experiences, customer feedback, and market trends to drive product innovation and development. Wen holds a B.S. in Electrical Engineering and M.S. in Analytics and Decision Sciences.
Follow him on twitter: @wenphan
2. Agenda
• Context and Why GPUs?
– Matrix Multiplication Example
• CUDA
• GPU and Machine Learning
– Deep Learning
– Parallel Computing: GBM, GLM
• Getting Started
• Others
3. Need for More Compute
• Lots of Data
• Complex Architectures
• Many Models
4. Historic Ways for More Compute
• Faster Clock Rates
• Multi-Core
• Distributed Computing
5. CPU Trends
Original data collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten,
dotted line extrapolations by C. Moore
9. GPUs for Parallel Tasks
Traditional CPUs are
not economically feasible
2.3 PFlops 7000 homes
7.0
Megawatts
7.0
Megawatts
CPU
Optimized for
Serial Tasks
GPU Accelerator
Optimized for Many
Parallel Tasks
10x performance/socket
> 5x energy efficiency
Era of GPU-accelerated
computing is here
nVIDIA
10. GPU Devotes More Transistors to Data Processing
CUDA C Programming Guide
12. Latency Versus Throughput
• Latency: Time to do a task.
• Throughput: Number of tasks per unit time.
• Fictitious Example:
– CPU
• Latency: 1 ns per task
• Throughput: (1 task per ns) x (6 cores) = 6 task per ns
– GPU
• Latency: 10 ns per task
• Throughput: (0.1 task per ns) x (2000 cores) = 200 task per ns
• CPUs are latency optimized; GPUs are throughput optimized
20. CUDA
• Historically, GPUs were used for, well, graphics processing. But, people realized that the fine-
grained parallelism inherently in GPU architecture could be exploited for general purpose
computing.
• CUDA (Compute Unified Device Architecture)
– Parallel computing platform
– Programming model and API
– Allows enabled GPUs for general purpose processing
21. Speed Up Parallelizable Code
Application Code
GPU
Use GPU to
Parallelize
Compute-Intensive
Functions
CPU
Rest of Sequential
CPU Code
nVIDIA
46. Convolutional Neural Networks
• Leverages the fact that data has spatial structure
– Add idea of locality
• Tremendous success with computer vision tasks
• “Put deep learning on the map”
52. Convolutional Layer
• f = receptive field
(filter size)
• p = padding
• s = stride
• m = number of filters
Input Volume Output Volume
Convolution
wI
hI
dI
wO
dO
hO
wO =
wI f + 2p
s
+ 1
hO =
wI f + 2p
s
+ 1
dO = m
59. ImageNet Entries Using GPUs
https://devblogs.nvidia.com/parallelforall/nvidia-ibm-cloud-support-imagenet-large-scale-visual-recognition-challenge/
60. Deep Water: Next-Gen Distributed Deep Learning
One Interface - GPU Enabled - Significant Performance Gains
Inherits All H2O Properties in Scalability, Ease of Use and Deployment
Recurrent Neural Networks
enabling natural language
processing, sequences, time series,
and more
Convolutional Neural Networks
enabling Image, video, speech
recognition
Hybrid Neural Network Architectures
enabling speech to text translation,
image captioning, scene parsing and
more
H2O integrates with existing GPU
backends for significant performance
gains
H2O Deep Learning Algo
77. GBM Data Parallelism
1
2
K
X = {X1, . . . , XK}
math (X1)
math (X2)
math (XK)
{Xi; ti} = f(math (X1) , . . . , math (XK))
78. GBM Data Parallelism
1
2
K
X = {X1, . . . , XK}
math (X1)
math (X2)
math (XK)
{Xi; ti} = f(math (X1) , . . . , math (XK))
Full Data Parallelism for Each Level of Tree Growth!
80. GPU Cluster
• Level GPUs to accelerate processing and fine-grain parallelism on each node
CPU
GPU
math (X1) math (X2) math (XK)math (X3)
81. GBM on GPU
T1(x) T2(x) T3(x) TM (x)
Application Code
GPU
Use GPU to
Compute-Intensive
Functions
CPU
Rest of Sequential
CPU Code
82. Parallel Computing
• Model Parallelism: Split up a single model
• Data Parallelism: Split up data to train a single model
• Training Parallelism: Split up different parts of the training process
– Ensemble Base Learners
– Cross-Validation
– Hyperparameters