This Techtalk surveys the techniques NVIDIA provides for improving performance when using GPUs for AI development, along with supporting technical resources. In particular, it covers in detail how to improve training performance by introducing mixed precision on the Volta architecture.
3. 3
NOW WE HAVE VOLTA AND TURING
Tesla, RTX, TITAN, …
Tesla
Titan
GeForce
V100 T4
Tensor Core
Enabled
4. 4
APPS &
FRAMEWORKS
NVIDIA SDK
& LIBRARIES
TESLA UNIVERSAL ACCELERATION PLATFORM
Single Platform Drives Utilization and Productivity
MACHINE LEARNING/ ANALYTICS
cuDF cuML cuGraph
CUDA
DEEP LEARNING
cuDNN cuBLAS CUTLASS NCCL TensorRT
HPC
cuBLAS cuFFT OpenACC
+550
Applications
Amber
NAMD
CUSTOMER
USECASES
CONSUMER INTERNET
Speech Translate Recommender
SCIENTIFIC APPLICATIONS
Molecular Simulations
Weather Forecasting
Seismic Mapping
INDUSTRIAL APPLICATIONS
Manufacturing Healthcare Finance
TESLA GPUs
& SYSTEMS
TESLA GPU | NVIDIA DGX FAMILY | NVIDIA HGX | SYSTEM OEM | CLOUD | VIRTUAL GPU
5. 5
AGENDA
• GPU for Deep Learning
• Mixed precision training
• Inference optimization
• GPU for Machine Learning / Analytics
• Deploying your intelligence
6. 6
Training
Device
GPU DEEP LEARNING
IS A NEW COMPUTING MODEL
Training
Billions of Trillions of Operations
GPUs train larger models and accelerate
time to market
Inference
Datacenter inference
10s of billions of image, voice, and video
queries per day
GPU inference delivers fast response and
maximizes data center throughput
8. 8
TESLA PLATFORM ENABLES DRAMATIC
REDUCTION IN TIME TO TRAIN
Relative Time to Train Improvements (ResNet-50):
2x CPU (single node): 25 Days
1X P100 (single node): 4.8 Days
1X V100 (single node): 30 Hours (3.84x vs. P100)
DGX-1 (8x V100): 3.3 Hours
At scale (2176x V100): <4 Minutes
ResNet-50, 90 epochs to solution | CPU Server: dual socket Intel Xeon Gold 6140
Sony 2176x V100 record on https://nnabla.org/paper/imagenet_in_224sec.pdf
9. 9
TESLA V100
TENSOR CORE GPU
World’s Most Advanced
Data Center GPU
5,120 CUDA cores
640 NEW Tensor cores
7.8 FP64 TFLOPS | 15.7 FP32 TFLOPS | 125 Tensor TFLOPS
20MB SM RF | 16MB Cache
32 GB HBM2 @ 900GB/s | 300GB/s NVLink
10. 10
WHAT IS A TENSOR CORE?
Programmable matrix-multiply and accumulate units
• 125 Tensor TFLOPS in V100
• Each core provides 4x4x4 matrix processing
• Programmable using CUDA WMMA, CUDNN, CUBLAS
• Optimized NHWC Tensor Layout
D = A * B + C
https://devblogs.nvidia.com/tensor-core-ai-performance-milestones/
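The D = A * B + C semantics can be sketched in NumPy to illustrate the numerics: the operands arrive in FP16, but the products are accumulated into an FP32 result. This is an illustration only, not how the hardware is programmed; real Tensor Cores are driven through CUDA WMMA, cuDNN, or cuBLAS as noted above.

```python
import numpy as np

# Sketch of Tensor Core semantics: D = A * B + C with FP16 inputs
# and FP32 accumulation (illustration only).
A = np.full((4, 4), 0.1, dtype=np.float16)
B = np.full((4, 4), 0.1, dtype=np.float16)
C = np.zeros((4, 4), dtype=np.float32)

# Multiply FP16 inputs, accumulate in FP32: this is what keeps long
# dot products from losing precision or overflowing in FP16.
D = A.astype(np.float32) @ B.astype(np.float32) + C
```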
11. 11
PERFORMANCE CONTRIBUTION
Using half precision on Volta, your networks can be:
1. 2-4x faster
2. half the size
3. just as powerful
with no architecture change.
12. 12
TRAINING WITH HALF PRECISION
Issue in FP16 Dynamic Range
FP32: [2^-149, ~2^128] ≈ [~1.4e-45, ~3.4e38]
FP16: [2^-24, ~2^16] ≈ [~5.96e-8, 65,504]
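These ranges are easy to confirm with NumPy's float16 type:

```python
import numpy as np

# FP16 range check: max finite value and smallest positive subnormal.
fp16 = np.finfo(np.float16)
print(fp16.max)        # 65504.0, the FP16 upper bound quoted above
print(2.0 ** -24)      # ~5.96e-8, the smallest positive FP16 subnormal

# Values below the subnormal range silently flush to zero in FP16:
underflows = np.float16(2.0 ** -26) == 0   # too small, becomes 0.0
survives = np.float16(2.0 ** -24) > 0      # smallest representable value

# FP32 for comparison:
fp32 = np.finfo(np.float32)
print(fp32.max)        # ~3.4e38
```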
13. 13
A MIXED PRECISION SOLUTION
Imprecise weight updates
Gradients underflow
Reductions overflow
“master” weights in FP32
Loss (Gradient) Scaling
Accumulate to FP32
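A quick NumPy illustration of why loss (gradient) scaling helps: a small gradient that underflows to zero in FP16 becomes representable once scaled, and the true value is recovered by descaling in FP32 before the weight update.

```python
import numpy as np

# A gradient of 1e-8 is below FP16's smallest subnormal (~5.96e-8),
# so it flushes to zero when stored in half precision:
grad = 1e-8
lost = np.float16(grad)                # 0.0, the update is silently lost

# Scaling the loss by 1024 scales every gradient by 1024 as well,
# moving it back into FP16's representable range:
loss_scale = 1024.0
kept = np.float16(grad * loss_scale)   # nonzero, representable in FP16

# Before the weight update, gradients are cast to FP32 and divided
# by the scale factor to recover the true value:
recovered = np.float32(kept) / loss_scale
```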
18. 18
REDUCTION MAY OVERFLOW
a = torch.cuda.HalfTensor(4094).fill_(4.0)
a.norm(2, dim=0)  # 256
a = torch.cuda.HalfTensor(4095).fill_(4.0)
a.norm(2, dim=0)  # inf
Reductions such as norm overflow once a partial sum exceeds 65,504
19. 19
MIXED PRECISION TRAINING
1. Keep "master" weights in FP32; cast/copy them to FP16 weights.
2. Run the forward pass on the FP16 weights to produce an FP32 loss.
3. Multiply the loss by loss_scale, giving a scaled FP16 loss.
4. Backprop the scaled loss, producing scaled FP16 gradients.
5. Cast the gradients to FP32 (scaled FP32 gradients), then divide by loss_scale to recover the FP32 gradients.
6. Apply the update to the FP32 master weights, then copy/cast them back to FP16 for the next iteration.
20. 20
DON’T PANIC
All major deep learning frameworks have supported mixed precision since
September 2017 (CUDA 9.0)
Image Source: http://gtcarlot.com/data/Audi/R8/2012/54378351/Transmission-56139263.html
23. 23
TENSORFLOW EXAMPLE
def build_training_model(inputs, labels, nlabel):
    inputs = tf.cast(inputs, tf.float16)  # cast inputs to FP16
    with tf.variable_scope('fp32_vars',
                           custom_getter=float32_variable_storage_getter):  # master weights in FP32
        top_layer = build_forward_model(inputs)
        logits = tf.layers.dense(top_layer, nlabel, activation=None)
    logits = tf.cast(logits, tf.float32)  # cast logits to FP32 for the loss
    loss = tf.losses.sparse_softmax_cross_entropy(logits=logits, labels=labels)
    optimizer = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)
    loss_scale = 128.0  # value may need tuning depending on the model
    gradients, variables = zip(*optimizer.compute_gradients(loss * loss_scale))  # scale the loss, get gradients in FP16
    gradients = [grad / loss_scale for grad in gradients]  # scale the gradients back down in FP32
    gradients, _ = tf.clip_by_global_norm(gradients, 5.0)  # gradient clipping (optional)
    train_op = optimizer.apply_gradients(zip(gradients, variables))
    return inputs, labels, loss, train_op
24. 24
PyTorch Example
N, D_in, D_out = 64, 1000, 10
x = Variable(torch.randn(N, D_in)).cuda()
y = Variable(torch.randn(N, D_out)).cuda()
model = torch.nn.Linear(D_in, D_out).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
for t in range(500):
    y_pred = model(x)
    loss = torch.nn.functional.mse_loss(y_pred, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
25. 25
MIXED SOLUTION: LOSS (GRADIENT) SCALING
from apex.fp16_utils import *
N, D_in, D_out = 64, 1000, 10
scale_factor = 128.0
x = Variable(torch.randn(N, D_in)).cuda().half()  # cast inputs to FP16
y = Variable(torch.randn(N, D_out)).cuda().half()
model = torch.nn.Linear(D_in, D_out).cuda().half()  # FP16 model weights
model_params, master_params = prep_param_list(model)  # FP32 master weights
optimizer = torch.optim.SGD(master_params, lr=1e-3)
for t in range(500):
    y_pred = model(x)
    loss = torch.nn.functional.mse_loss(y_pred.float(), y.float())  # loss is now FP32
    scaled_loss = scale_factor * loss.float()
    model.zero_grad()
    scaled_loss.backward()  # gradients are now rescaled to be representable
    model_grads_to_master_grads(model_params, master_params)  # model grads are still FP16
    for param in master_params:
        param.grad.data.mul_(1. / scale_factor)  # the FP32 master gradients must be "descaled"
    optimizer.step()
    master_params_to_model_params(model_params, master_params)
26. 26
Example: dynamic loss scale with FP16_Optimizer
from apex.fp16_utils import *
from apex.optimizers import FP16_Optimizer
N, D_in, D_out = 64, 1000, 10
x = Variable(torch.randn(N, D_in)).cuda().half()
y = Variable(torch.randn(N, D_out)).cuda().half()
model = torch.nn.Linear(D_in, D_out).cuda().half()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
if loss_scale == 0:
    optimizer = FP16_Optimizer(optimizer, dynamic_loss_scale=True)
else:
    optimizer = FP16_Optimizer(optimizer, static_loss_scale=loss_scale)
for t in range(500):
    optimizer.zero_grad()
    y_pred = model(x)
    loss = torch.nn.functional.mse_loss(y_pred.float(), y.float())
    if fp16:
        optimizer.backward(loss)  # FP16_Optimizer applies loss scaling internally
    else:
        loss.backward()
    optimizer.step()
27. 27
CAVEATS
Use float32 for certain ops to avoid overflow or underflow
• Reductions (e.g., norm, softmax)
• Range-expanding math functions (e.g., exp, pow)
Apply loss scaling to avoid gradients underflow
https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/
https://github.com/NVIDIA/DeepLearningExamples
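The reduction and exp caveats are easy to see with softmax: computing it directly in FP16 overflows, while upcasting to FP32 first (and subtracting the max, the standard stabilization) stays finite. A NumPy sketch:

```python
import numpy as np

logits = np.array([12.0, 1.0, 0.0], dtype=np.float16)

# Naive FP16 softmax: exp(12) ~ 162,755 > 65,504, so exp overflows
# to inf and the normalization turns the result into NaN.
naive = np.exp(logits) / np.exp(logits).sum()

# Safe version: upcast to FP32 and subtract the max before exp.
x = logits.astype(np.float32)
x -= x.max()
safe = np.exp(x) / np.exp(x).sum()   # finite, sums to 1
```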
29. 29
FURTHER UNDERSTANDING
Mixed Precision Lectures
Training Neural Networks with Mixed Precision: Theory and Practice
http://on-demand.gputechconf.com/gtc/2018/video/S8923/
Training Neural Networks with Mixed Precision: Real Examples
http://on-demand.gputechconf.com/gtc/2018/video/S81012/
Training Neural Networks with Mixed Precision
http://on-demand.gputechconf.com/gtc-taiwan/2018/pdf/5-1_Internal%20Speaker_Michael%20Carilli_PDF%20For%20Sharing.pdf
33. 33
GPU INFERENCE ADOPTION IS ACCELERATING
60X Latency Improvement
Real-Time Search
12X Faster Inference
Live Video Analysis
40X Higher Performance
Real-Time Brand Impact
Tesla P4, TensorRT Adoption
Use Cases
VISUAL SEARCH | VIDEO ANALYSIS | ADVERTISING
INFERENCE USE CASES: Image, Video, Maps, NLP, Speech, Search
34. 34
Kernel Auto-Tuning | Layer & Tensor Fusion | Dynamic Tensor Memory | Precision Calibration
NVIDIA TensorRT 5
Inference Optimizer and Runtime
developer.nvidia.com/tensorrt
Data center, embedded & automotive
In-framework support for TensorFlow
Support for all other frameworks and ONNX
TensorRT inference server microservice with Docker and
Kubernetes integration
New layers and APIs
New OS support for Windows and CentOS
DRIVE PX 2
JETSON TX2
NVIDIA DLA
TESLA P4/T4
TESLA V100
FRAMEWORKS GPU PLATFORMS
TensorRT
Optimizer Runtime
*New in TRT5
35. 35
BREAKTHROUGH RESNET-50
INFERENCE PERFORMANCE
THROUGHPUT (img/s): Tesla T4 4,365 | Tesla V100 6,379
ENERGY EFFICIENCY (img/s/watt): Tesla T4 63 | Tesla V100 22
LATENCY (milliseconds, lower is better): Tesla T4 1 | Tesla V100 0.89
GPU: Dual-Socket Xeon Gold 6140 @3.6GHz with GPUs as shown | 18.11-py3 | TensorRT 5.0 | T4: INT8, V100: Mixed | Batch Size = 128 (latency measured at Batch Size = 1, INT8)
36. 36
BREAKTHROUGH NMT
INFERENCE PERFORMANCE
THROUGHPUT (tokens/s): Tesla T4 34,122 | Tesla V100 67,124
ENERGY EFFICIENCY (tokens/s/watt): Tesla T4 763 | Tesla V100 924
LATENCY (milliseconds, lower is better): Tesla T4 21 | Tesla V100 14
GPU: Dual-Socket Xeon E5-2698 v4 @3.6GHz with GPU servers as shown | 18.11-py3 | TensorRT 5.0 | Mixed | Batch Size = 128 (latency measured at Batch Size = 1)
37. 37
INFERENCE SERVER ARCHITECTURE
Models supported
● TensorFlow GraphDef/SavedModel
● TensorFlow and TensorRT GraphDef
● TensorRT Plans
● Caffe2 NetDef (ONNX import)
Multi-GPU support
Concurrent model execution
Server HTTP REST API/gRPC
Python/C++ client libraries
Available with Monthly Updates
38. 38
FLEXIBLE MODEL DEPLOYMENT TO BALANCE
CONVENIENCE WITH PERFORMANCE
Native Framework
• Minimal conversion
• Least performant
• Supports CPU and GPU
Framework +
TensorRT Runtime
• Some conversion
• Framework fallback for
unsupported layers
• TensorRT performance
with FP16 and INT8
• Supports GPU
TensorRT Runtime
• More conversion
• Most performant with low
memory footprint
• Precision control FP16
and INT8
• Supports GPU
← Less conversion | More performance →
39. 39
GREAT PERFORMANCE FOR MULTIPLE
MODEL DEPLOYMENTS OF RN50
RN50 with 50ms latency SLA
across various deployments
• CPU: TensorFlow FP32
• GPU - V100 16GB: TensorFlow
FP32
• GPU - V100 16GB: TensorRT
FP16
41. 41
● One model per GPU
● Requests are steady across all models
● Utilization is low on all GPUs
● Spike in requests for blue model
● GPUs running blue model are being fully utilized
● Other GPUs remain underutilized
Before TensorRT Inference Server - 5,000 FPS
Before TensorRT Inference Server - 800 FPS
TENSORRT INFERENCE SERVER
METRICS FOR AUTOSCALING
42. 42
● Load multiple models on every GPU
● Load is evenly distributed between all GPUs
● Spike in requests for blue model
● Each GPU can run the blue model concurrently
● Metrics to indicate time to scale up
○ GPU utilization
○ Power usage
○ Inference count
○ Queue time
○ Number of requests/sec
After TensorRT Inference Server - 15,000 FPS
After TensorRT Inference Server - 5,000 FPS
TENSORRT INFERENCE SERVER
METRICS FOR AUTOSCALING
44. 44
THE BIG PROBLEM IN DATA SCIENCE
All Data → ETL / Manage Data → Structured Data Store / Data Preparation → Training / Model Training → Visualization / Evaluate → Inference / Deploy
Slow Training Times for Data Scientists
45. 45
RAPIDS — OPEN GPU DATA SCIENCE
Software Stack
Data Preparation | Model Training | Visualization
CUDA
PYTHON
APACHE ARROW
DASK
DEEP LEARNING
FRAMEWORKS
CUDNN
RAPIDS
cuDF cuML cuGraph
46. DEPLOYING RAPIDS — FASTER SPEEDS,
REAL WORLD BENEFITS
ML (cuML, xgboost), DGX-2 vs. 20/50/100 CPU nodes: CPU clusters take up to ~2 hours; the DGX-2 finishes in seconds
ETL (fast loading with cuDF), DGX-2 vs. 20/50/100 CPU nodes: CPU clusters take up to ~1 hour; the DGX-2 finishes in seconds
Benchmark: Fannie Mae mortgage dataset, with GPU visualization
https://www.youtube.com/watch?v=G1kx_7NJJGA&feature=youtu.be&t=4287
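cuDF deliberately mirrors the pandas API, so the ETL step ports with little more than an import swap. A minimal sketch, shown with pandas so it runs anywhere; with a GPU and RAPIDS installed, replacing `import pandas as pd` with `import cudf as pd` is the intended usage. The column names here are hypothetical, not the real Fannie Mae schema.

```python
import pandas as pd   # with RAPIDS installed: `import cudf as pd` runs this on GPU

# Hypothetical mortgage-style records (not the actual Fannie Mae schema).
df = pd.DataFrame({
    "loan_id": [1, 2, 3, 4],
    "balance": [250_000.0, 180_000.0, 320_000.0, 95_000.0],
    "state": ["CA", "TX", "CA", "NY"],
    "delinquent": [0, 1, 0, 1],
})

# Typical ETL: filter rows, derive a feature, aggregate by group.
df = df[df["balance"] > 100_000]
df["balance_k"] = df["balance"] / 1_000
by_state = df.groupby("state")["delinquent"].mean().reset_index()
```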
49. 49
Containerized Applications: TensorFlow | PyTorch | MXNet | TensorRT Inference Server | Other Frameworks and Apps ...
Each container bundles its own tuned SW stack (CUDA RT) and runs via the NVIDIA Container Runtime for Docker
Linux Kernel + CUDA Driver
OPTIMIZED AND UP-TO-DATE
The Top Deep Learning Containers are Tuned and Optimized Monthly to
Deliver Maximum Performance on NVIDIA GPUs
50. 50
MONTHLY IMPROVEMENT
Over 12 months, up to 1.8X improvement with mixed-precision on ResNet-50
https://docs.nvidia.com/deeplearning/dgx/support-matrix/index.html
CUDA 10 based TensorFlow