JMI Techtalk: 한재근 - How to use GPU for developing AI

한재근 | Solutions Architect | jahan@nvidia.com
HOW TO USE GPU
FOR DEVELOPING AI

2
YEARLY GRAPHICS CARD RELEASE
https://www.reddit.com/r/pcmasterrace/comments/6xpiex/yearly_graphics_card_releases/

3
NOW WE HAVE VOLTA AND TURING
Tesla, RTX, TITAN, …
Tesla
Titan
GeForce
V100 T4
Tensor Core
Enabled

4
APPS &
FRAMEWORKS
NVIDIA SDK
& LIBRARIES
TESLA UNIVERSAL ACCELERATION PLATFORM
Single Platform Drives Utilization and Productivity
MACHINE LEARNING/ ANALYTICS
cuMLcuDF cuGRAPH
CUDA
DEEP LEARNING
cuDNN cuBLAS CUTLASS NCCL TensorRT
HPC
CuBLAS OpenACCCuFFT
+550
Applications
Amber
NAMD
CUSTOMER
USECASES
CONSUMER INTERNET
Speech Translate Recommender
SCIENTIFIC APPLICATIONS
Molecular
Simulations
Weather
Forecasting
Seismic
Mapping
INDUSTRIAL APPLICATIONS
ManufacturingHealthcare Finance
TESLA GPUs
& SYSTEMS
SYSTEM OEM CLOUDTESLA GPU NVIDIA HGXNVIDIA DGX FAMILYVIRTUAL GPU

5
AGENDA
• GPU for Deep Learning
• Mixed precision training
• Inference optimization
• GPU for Machine Learning / Analytics
• Deploying your intelligence

6
Training
Device
GPU DEEP LEARNING
IS A NEW COMPUTING MODEL
Training
Billions of Trillions of Operations
GPU train larger models, accelerate
time to market
Inference
Datacenter infererence
10s of billions of image, voice, video
queries per day
GPU inference for fast response,
maximize data center throughput

8
TESLA PLATFORM ENABLES DRAMATIC
REDUCTION IN TIME TO TRAIN
0 20 40 60 80 100 120 140
2x CPU
Single Node
1X P100
Single Node
1X V100
DGX-1
8x V100
At scale
2176x V100
Relative Time to Train Improvements
(ResNet-50)
ResNet-50, 90 epochs to solution | CPU Server: dual socket Intel Xeon Gold 6140
Sony 2176x V100 record on https://nnabla.org/paper/imagenet_in_224sec.pdf
<4 Minutes
3.3 Hours
25 Days
30 Hours
4.8 Days
3.84x

9
TESLA V100
TENSOR CORE GPU
World’s Most Advanced
Data Center GPU
5,120 CUDA cores
640 NEW Tensor cores
7.8 FP64 TFLOPS | 15.7 FP32 TFLOPS
| 125 Tensor TFLOPS
20MB SM RF | 16MB Cache
32 GB HBM2 @ 900GB/s |
300GB/s NVLink

10
WHAT IS TENSOR CORE
Programmable matrix-multiply and accumulate units
• 125 Tensor TFLOPS in V100
• Each core provides 4x4x4 matrix processing
• Programmable using CUDA WMMA, CUDNN, CUBLAS
• Optimized NHWC Tensor Layout
D = A * B + C
https://devblogs.nvidia.com/tensor-core-ai-performance-milestones/

11
PERFORMANCE CONTRIBUTION
Using half precision and Volta your networks can be:
1. 2-4x faster
2. half the size
3. just as powerful
with no architecture change.

12
TRAINING WITH HALF PRECISION
Issue in FP16 Dynamic Range
[2-149, ~2128]
[~1.4e-45, ~3.4e38]
[2-24, ~215 ]
[~5.96e-8, 65,504]

13
A MIXED PRECISION SOLUTION
Imprecise weight updates
Gradients underflow
Reductions overflow
“master” weights in FP32
Loss (Gladient) Scaling
Accumulate to FP32

14
FP32 TRAINING
FP32
Weights
Forward
Pass
Backprop
FP32
Gradients
Apply
FP32 Loss

15
MIXED SOLUTION 1: FP32 MASTER WEIGHTS
FP32 Master
Weights
Forward
Pass
Backprop
FP32 Master
Gradients
FP16
Gradients
FP16
Weights
Apply
Cast toFP32
Copy
Cast to FP16
FP16 Loss
!?

16
GRADIENTS MAY UNDERFLOW
Shift by scaling
https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/

17
MIXED PRECISION TRAINING
FP32 Master
Weights
Forward
Pass
FP16
Loss
Scaled
FP16
Loss Backprop
Scaled
FP32
Gradients
FP32 Gradients
Scaled
FP16
Gradients
FP16
Weights
/ loss_scale,
updateApply
Cast tofp32
Copy
* loss_scale

18
REDUCTION MAY OVERFLOW
a = torch.cuda.HalfTensor(4094).fill(4.0)
a.norm(2, dim=0)
a = torch.cuda.HalfTensor(4095).fill(4.0)
a.norm(2, dim=0)
256
INF
Reductions like norm overflow if > 65,504 is encountered

19
MIXED PRECISION TRAINING
FP32 Master
Weights
Forward
Pass
FP32
Loss
Scaled
FP16
Loss Backprop
Scaled
FP32
Gradients
FP32 Gradients
Scaled
FP16
Gradients
FP16
Weights
/ loss_scale,
updateApply
Cast tofp32
Copy
cast to
fp16
* loss_scale

20
DON’T PANIC
all deep learning frameworks support since
September 2017 (CUDA 9.0)
Image Source: http://gtcarlot.com/data/Audi/R8/2012/54378351/Transmission-56139263.html

21
TENSORFLOW EXAMPLE
def build_forward_model(inputs):
_, _, h, w = inputs.get_shape().as_list()
top_layer = inputs
top_layer = tf.layers.conv2d(top_layer, 64, 7, use_bias=False,
data_format='channels_first', padding='SAME’)
top_layer = tf.contrib.layers.batch_norm(top_layer,
data_format='NCHW', fused=True)
top_layer = tf.layers.max_pooling2d(top_layer, 2, 2, data_format='channels_first’)
top_layer = tf.reshape(top_layer, (-1, 64 * (h // 2) * (w // 2)))
top_layer = tf.layers.dense(top_layer, 128, activation=tf.nn.relu)
return top_layer
Tensor Core
FP16 operation
Mixed Precision ops

22
TENSORFLOW EXAMPLE
def build_training_model(inputs, labels, nlabel):
top_layer = build_forward_model(inputs)
logits = tf.layers.dense(top_layer, nlabel, activation=None)
loss = tf.losses.sparse_softmax_cross_entropy(logits=logits, labels=labels)
optimizer = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)
gradvars = optimizer.compute_gradients(loss)
train_op = optimizer.apply_gradients(gradvars)
return inputs, labels, loss, train_op
FP32 Training

23
TENSORFLOW EXAMPLE
def build_training_model(inputs, labels, nlabel):
inputs = tf.cast(inputs, tf.float16)
with tf.variable_scope('fp32_vars’,
custom_getter=float32_variable_storage_getter):
top_layer = build_forward_model(inputs)
logits = tf.layers.dense(top_layer, nlabel, activation=None)
logits = tf.cast(logits, tf.float32)
loss = tf.losses.sparse_softmax_cross_entropy(logits=logits, labels=labels)
optimizer = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)
loss_scale = 128.0 # Value may need tuning depending on the model
gradients, variables = zip(*optimizer.compute_gradients(loss * loss_scale))
gradients = [grad / loss_scale for grad in gradients]
gradients, _ = tf.clip_by_global_norm(gradients, 5.0)
train_op = optimizer.apply_gradients(zip(gradients, variables))
return inputs, labels, loss, train_op
Casting FP16 for input
Casting FP32 for loss
Master weights in FP32
Scaling loss and
get gradients in FP16
Loss scale down with FP32
Gradients clipping (opt)

24
N, D_in, D_out = 64, 1000, 10
x = Variable(torch.randn(N, D_in )).cuda()
y = Variable(torch.randn(N, D_out)).cuda()
model = torch.nn.Linear(D_in, D_out).cuda()
model_params, master_params = prep_param_list(model)
optimizer = torch.optim.SGD(master_params, lr=1e-3)
for t in range(500):
y_pred = model(x)
loss = torch.nn.functional.mse_loss(y_pred)
optimizer.zero_grad()
loss.backward()
optimizer.step()
PyTorch Example

25
N, D_in, D_out = 64, 1000, 10
scale_factor = 128.0
x = Variable(torch.randn(N, D_in )).cuda().half()
y = Variable(torch.randn(N, D_out)).cuda().half()
model = torch.nn.Linear(D_in, D_out).cuda().half()
model_params, master_params = prep_param_list(model)
y_pred = model(x)
loss = torch.nn.functional.mse_loss(y_pred.float(),y.float())
scaled_loss = scale_factor * loss.float()
model.zero_grad()
loss.backward()
model_grads_to_master_grads(model_params, master_params)
for param in master.params:
param.grad.data.mul_(1./scale_factor)
optimizer.step()
master_params_to_model_params(model_params, master_params)
MIXED SOLUTION: LOSS (GRADIENT) SCALING
Gradients are now rescaled to be representable
The FP32 master gradients must be
"descaled"
loss is now FP32
(Model grads are still FP16)
Casting Weights
from apex.fp16_utils import *

26
from apex.fp16_utils import *
from apex.optimizers import FP16_Optimizer
N, D_in, D_out = 64, 1000, 10
x = Variable(torch.randn(N, D_in )).cuda().half()
y = Variable(torch.randn(N, D_out)).cuda().half()
model = torch.nn.Linear(D_in, D_out).cuda().half()
if loss_scale == 0:
optimizer = FP16_Optimizer(optimizer, dynamic_loss_scale=True)
else:
optimizer = FP16_Optimizer(optimizer, static_loss_scale=loss_scale)
optimizer.zero_grad()
y_pred = model(x)
loss = torch.nn.functional.mse_loss(y_pred)
if fp16:
optimizer.backward(loss)
else:
loss.backward()
optimizer.step()
Example
Dynamic loss scale
With FP16_Optimizer

27
CAVEATS
Use float32 for certain ops to avoid overflow or underflow
• Reductions (e.g., norm, softmax)
• Range-expanding math functions (e.g., exp, pow)
Apply loss scaling to avoid gradients underflow
https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/
https://github.com/NVIDIA/DeepLearningExamples

28
PERFORMANCE TRADEOFFS
FP32 Master
Weights
Forward
Pass
FP32
Loss
Scaled
FP16
Loss Backprop
Scaled
FP32
Gradients
FP32 Gradients
Scaled
FP16
Gradients
FP16
Weights
/ loss_scale,
updateApply
Cast tofp32
Copy
cast to
fp16
* loss_scale
Accelerated Overhead
Forward/Backward acceleration vs gradient type casting (fp16-fp32) overhead

29
FURTHER UNDERSTANDING
Mixed Precision Lectures
Training Neural Networks with Mixed Precision: Theory and Practice
http://on-demand.gputechconf.com/gtc/2018/video/S8923/
Training Neural Networks with Mixed Precision: Real Examples
http://on-demand.gputechconf.com/gtc/2018/video/S81012/
Training Neural Networks with Mixed Precision
http://on-demand.gputechconf.com/gtc-taiwan/2018/pdf/5-
1_Internal%20Speaker_Michael%20Carilli_PDF%20For%20Sharing.pdf

30
BERT OPTIMIZATION CASE
https://github.com/google-research/bert/pull/255

31
DOES IT WORKS?
NVIDIA Profiler for the detail
volta_s884cudnn_fp16_64x64_sliced1x4_ldg8_wgrad_idx_exp_interior_nhwc_nt
nvprof -t 20 --print-gpu-trace python main.py --fp16 /datasets
Architecture identifier
library
precision
Applied size

33
GPU INFERENCE ADOPTION IS ACCELERATING
60X Latency Improvement
Real-Time Search
12X Faster Inference
Live Video Analysis
40X Higher Performance
Real-Time Brand Impact
Tesla P4, TensorRT Adoption
Use Cases
VISUAL SEARCH VIDEO ANALYSIS ADVERTISING INFERENCE USE CASES
Video
MapsImage
NLP
Speech
Search

34
Kernel
Auto-Tuning
Layer &
Tensor Fusion
Dynamic
Tensor
Memory
Precision
Calibration
NVIDIA TensorRT 5
Inference Optimizer and Runtime
developer.nvidia.com/tensorrt
Data center, embedded & automotive
In-framework support for TensorFlow
Support for all other frameworks and ONNX
TensorRT inference server microservice with Docker and
Kubernetes integration
New layers and APIs
New OS support for Windows and CentOS
DRIVE PX 2
JETSON TX2
NVIDIA DLA
TESLA P4/T4
TESLA V100
FRAMEWORKS GPU PLATFORMS
TensorRT
Optimizer Runtime
*New in TRT5

35
BREAKTHROUGH RESNET-50
INFERENCE PERFORMANCE
4,365
6,379
0
1500
3000
4500
6000
7500
THROUGHPUT
Tesla T4 Tesla V100
63
22
0
16
32
48
64
80
ENERGY EFFICIENCY
1
0.89
0.8
0.85
0.9
0.95
1
1.05
LATENCY
GPU: Dual-Socket Xeon Gold 6140@3.6GHz with GPUs as shown 18.11-py3 |
TensorRT 5.0 | T4: INT8, V100: Mixed | Batch Size = 128
GPU: Dual-Socket Xeon Gold 6140@3.6GHz with GPUs as shown
18.11-py3 | TensorRT 5.0 | INT8 | Batch Size = 1
GPU: Dual-Socket Xeon Gold 6140@3.6GHz @3.6GHz with GPUs as shown
18.11-py3 | TensorRT 5.0 | T4: INT8, V100: Mixed | Batch Size = 128
Tesla T4 Tesla V100 Tesla T4 Tesla V100
img/s Milliseconds – lower is better img/s/watt

36
BREAKTHROUGH NMT
INFERENCE PERFORMANCE
34,122
67,124
0
15000
30000
45000
60000
75000
THROUGHPUT
Tesla T4 Tesla V100
763
924
0
200
400
600
800
1000
ENERGY EFFICIENCY
21
14
0
5
10
15
20
25
LATENCY
GPU: Dual-Socket Xeon E5-2698 v4@3.6GHz with GPU servers as shown |
18.11-py3 | TensorRT 5.0 | Mixed | Batch Size = 128
Tesla T4 Tesla V100 Tesla T4 Tesla V100
tokens/s Milliseconds – lower is better tokens/s/watt

37
INFERENCE SERVER ARCHITECTURE
Models supported
● TensorFlow GraphDef/SavedModel
● TensorFlow and TensorRT GraphDef
● TensorRT Plans
● Caffe2 NetDef (ONNX import)
Multi-GPU support
Concurrent model execution
Server HTTP REST API/gRPC
Python/C++ client libraries
Python/C++ Client Library
Available with Monthly Updates

38
FLEXIBLE MODEL DEPLOYMENT TO BALANCE
CONVENIENCE WITH PERFORMANCE
Native Framework
• Minimal conversion
• Least performant
• Support CPU and GPU
Framework +
TensorRT Runtime
• Some conversion
• Framework fallback for
unsupported layers
• TensorRT performance
with FP16 and INT8
• Supports GPU
TensorRT Runtime
• More conversion
• Most performant with low
memory footprint
• Precision control FP16
and INT8
• Supports GPU
More performance
+
Less conversion

39
GREAT PERFORMANCE FOR MULTIPLE
MODEL DEPLOYMENTS OF RN50
RN50 with 50ms latency SLA
across various deployments
• CPU: TensorFlow FP32
• GPU - V100 16GB: TensorFlow
FP32
• GPU - V100 16GB: TensorRT
FP16

40
INFERENCE SERVER ARCHITECTURE
KUBEFLOW
Kubeflow guest blog: https://www.kubeflow.org/blog/nvidia_tensorrt/

41
● One model per GPU
● Requests are steady across all models
● Utilization is low on all GPUs
● Spike in requests for blue model
● GPUs running blue model are being fully utilized
● Other GPUs remain underutilized
Before TensorRT Inference Server - 5,000 FPSBefore TensorRT Inference Server - 800 FPS
TENSORRT INFERENCE SERVER
METRICS FOR AUTOSCALING

42
● Load multiple models on every GPU
● Load is evenly distributed between all GPUs
● Spike in requests for blue model
● Each GPU can run the blue model concurrently
● Metrics to indicate time to scale up
○ GPU utilization
○ Power usage
○ Inference count
○ Queue time
○ Number of requests/sec
After TensorRT Inference Server - 15,000 FPSAfter TensorRT Inference Server - 5,000 FPS
TENSORRT INFERENCE SERVER
METRICS FOR AUTOSCALING

43
MACHINE LEARNING /
ANALYTICS

44
THE BIG PROBLEM IN DATA SCIENCE
All
Data
ETL
Manage Data
Structured
Data Store
Data
Preparation
Training
Model
Training
Visualization
Evaluate
Inference
Deploy
Slow Training Times for
Data Scientists

45
RAPIDS — OPEN GPU DATA SCIENCE
Software Stack
Data Preparation VisualizationModel Training
CUDA
PYTHON
APACHE ARROW
DASK
DEEP LEARNING
FRAMEWORKS
CUDNN
RAPIDS
CUMLCUDF CUGRAPH

DEPLOYING RAPIDS — FASTER SPEEDS,
REAL WORLD BENEFITS
ML NVIDIA DGX-2
0 1800 3600 5400 7200
DGX-2
100 CPU Nodes
50 CPU Nodes
20 CPU Nodes
0 600 1200 1800 2400 3000 3600
DGX-2
100 CPU Nodes
50 CPU Nodes
20 CPU Nodes
1 Hour
SECONDS
ETL
2 Hours
SECONDS
Fannie Mae
mortgage dataset
Fast Loading w/
cuDF
cuML
xgboost
GPU Visualization
https://www.youtube.com/watch?v=G1kx_7NJJGA&feature=youtu.be&t=4287

48
A CONSISTENT, HYBRID CLOUD EXPERIENCE
ACROSS COMPUTE PLATFORMS

49
Containerized Applications
TensorFlow PyTorch MXNet
TensorRT
Inference Server
CUDA RTCUDA RTCUDA RTCUDA RT
Linux Kernel + CUDA Driver
Tuned SW
CUDA RT
Other
Frameworks
and Apps. . .
OPTIMIZED AND UP-TO-DATE
The Top Deep Learning Containers are Tuned and Optimized Monthly to
Deliver Maximum Performance on NVIDIA GPUs
NVIDIA Container
Runtime for Docker
NVIDIA Container
Runtime for Docker
NVIDIA Container
Runtime for Docker
NVIDIA Container
Runtime for Docker
NVIDIA Container
Runtime for Docker

50
MONTHLY IMPROVEMENT
Over 12 months, up to 1.8X improvement with mixed-precision on ResNet-50
https://docs.nvidia.com/deeplearning/dgx/support-matrix/index.html
CUDA 10 based TensorFlow

JMI Techtalk: 한재근 - How to use GPU for developing AI

Recomendados

Recomendados

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Similar a JMI Techtalk: 한재근 - How to use GPU for developing AI

Similar a JMI Techtalk: 한재근 - How to use GPU for developing AI (20)

Más de Lablup Inc.

Más de Lablup Inc. (19)

Último

Último (20)

JMI Techtalk: 한재근 - How to use GPU for developing AI