한재근 | Solutions Architect | jahan@nvidia.com
HOW TO USE GPU
FOR DEVELOPING AI
2
YEARLY GRAPHICS CARD RELEASE
https://www.reddit.com/r/pcmasterrace/comments/6xpiex/yearly_graphics_card_releases/
3
NOW WE HAVE VOLTA AND TURING
Tesla, RTX, TITAN, …
Tesla
Titan
GeForce
V100 T4
Tensor Core
Enabled
4
APPS &
FRAMEWORKS
NVIDIA SDK
& LIBRARIES
TESLA UNIVERSAL ACCELERATION PLATFORM
Single Platform Drives Utilization and Productivity
MACHINE LEARNING/ ANALYTICS
cuML cuDF cuGRAPH
CUDA
DEEP LEARNING
cuDNN cuBLAS CUTLASS NCCL TensorRT
HPC
cuBLAS OpenACC cuFFT
+550
Applications
Amber
NAMD
CUSTOMER
USECASES
CONSUMER INTERNET
Speech Translate Recommender
SCIENTIFIC APPLICATIONS
Molecular
Simulations
Weather
Forecasting
Seismic
Mapping
INDUSTRIAL APPLICATIONS
Manufacturing Healthcare Finance
TESLA GPUs
& SYSTEMS
SYSTEM OEM | CLOUD | TESLA GPU | NVIDIA HGX | NVIDIA DGX FAMILY | VIRTUAL GPU
5
AGENDA
• GPU for Deep Learning
• Mixed precision training
• Inference optimization
• GPU for Machine Learning / Analytics
• Deploying your intelligence
6
Training
Device
GPU DEEP LEARNING
IS A NEW COMPUTING MODEL
Training
Billions of Trillions of Operations
GPUs train larger models, accelerate
time to market
Inference
Datacenter inference
10s of billions of image, voice, video
queries per day
GPU inference for fast response,
maximize data center throughput
7
TRAINING WITH
TENSORCORE
8
TESLA PLATFORM ENABLES DRAMATIC
REDUCTION IN TIME TO TRAIN
Relative Time to Train Improvements (ResNet-50)
2x CPU: 25 Days
Single Node, 1X P100: 4.8 Days
Single Node, 1X V100: 30 Hours (3.84x faster than 1X P100)
DGX-1, 8x V100: 3.3 Hours
At scale, 2176x V100: <4 Minutes
ResNet-50, 90 epochs to solution | CPU Server: dual socket Intel Xeon Gold 6140
Sony 2176x V100 record on https://nnabla.org/paper/imagenet_in_224sec.pdf
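The speedup figure on this slide can be checked with a few lines of arithmetic. The sketch below assumes the durations pair with the bars as P100 = 4.8 days, V100 = 30 hours, CPU = 25 days, and DGX-1 = 3.3 hours; that pairing is inferred from the flattened chart, not stated explicitly in the text.

```python
# Training times read off the slide, converted to hours.
# The pairing of durations to systems is an assumption inferred from the chart.
HOURS_PER_DAY = 24.0

time_to_train_hours = {
    "2x CPU": 25 * HOURS_PER_DAY,
    "1x P100": 4.8 * HOURS_PER_DAY,
    "1x V100": 30.0,
    "DGX-1 (8x V100)": 3.3,
}


def speedup(baseline: str, candidate: str) -> float:
    """How many times faster `candidate` is than `baseline`."""
    return time_to_train_hours[baseline] / time_to_train_hours[candidate]


p100_to_v100 = speedup("1x P100", "1x V100")
cpu_to_v100 = speedup("2x CPU", "1x V100")
print(f"P100 -> V100: {p100_to_v100:.2f}x")
print(f"CPU  -> V100: {cpu_to_v100:.1f}x")
```

Under this pairing the P100 to V100 ratio reproduces the 3.84x label on the slide.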
9
TESLA V100
TENSOR CORE GPU
World’s Most Advanced
Data Center GPU
5,120 CUDA cores
640 NEW Tensor cores
7.8 FP64 TFLOPS | 15.7 FP32 TFLOPS
| 125 Tensor TFLOPS
20MB SM RF | 16MB Cache
32 GB HBM2 @ 900GB/s |
300GB/s NVLink
10
WHAT IS TENSOR CORE
Programmable matrix-multiply and accumulate units
• 125 Tensor TFLOPS in V100
• Each core provides 4x4x4 matrix processing
• Programmable using CUDA WMMA, CUDNN, CUBLAS
• Optimized NHWC Tensor Layout
D = A * B + C
https://devblogs.nvidia.com/tensor-core-ai-performance-milestones/
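The operation D = A * B + C can be emulated numerically in NumPy. This is only a sketch of the arithmetic, not of the hardware: it assumes FP16 inputs for A and B and an FP32 accumulator for C and D, and the matrix contents are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# FP16 inputs, the precision a Tensor Core consumes for A and B.
A = rng.standard_normal((4, 4)).astype(np.float16)
B = rng.standard_normal((4, 4)).astype(np.float16)
# FP32 accumulator input (assumption: FP32 accumulation mode).
C = rng.standard_normal((4, 4)).astype(np.float32)

# Products of the FP16 values are accumulated in FP32: D = A * B + C.
D = A.astype(np.float32) @ B.astype(np.float32) + C

print(D.dtype, D.shape)
```

Keeping the accumulator in FP32 is what lets long dot products stay accurate even though the inputs carry only FP16 precision.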
11
PERFORMANCE CONTRIBUTION
Using half precision and Volta your networks can be:
1. 2-4x faster
2. half the size
3. just as powerful
with no architecture change.
12
TRAINING WITH HALF PRECISION
Issue in FP16 Dynamic Range
FP32: [2^-149, ~2^128] = [~1.4e-45, ~3.4e38]
FP16: [2^-24, (2 - 2^-10) x 2^15] = [~5.96e-8, 65,504]
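The two ranges can be inspected directly with NumPy. The sketch below prints the FP16 and FP32 limits and shows what happens to values that fall outside the FP16 range when cast down; the sample values 1e-8 and 1e5 are arbitrary illustrations.

```python
import numpy as np

fp16 = np.finfo(np.float16)
fp32 = np.finfo(np.float32)

fp16_max = float(fp16.max)          # largest finite FP16 value
fp16_min_subnormal = 2.0 ** -24     # smallest positive FP16 value
fp32_max = float(fp32.max)

# Values outside the FP16 range are lost when cast down from FP32.
underflowed = np.float16(np.float32(1e-8))     # below 2**-24, becomes 0
with np.errstate(over="ignore"):
    overflowed = np.float16(np.float32(1e5))   # above 65504, becomes inf

print(fp16_max, fp16_min_subnormal, fp32_max)
print(underflowed, overflowed)
```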
13
A MIXED PRECISION SOLUTION
Imprecise weight updates -> "master" weights in FP32
Gradients underflow -> Loss (Gradient) Scaling
Reductions overflow -> Accumulate to FP32
14
FP32 TRAINING
FP32
Weights
Forward
Pass
Backprop
FP32
Gradients
Apply
FP32 Loss
15
MIXED SOLUTION 1: FP32 MASTER WEIGHTS
FP32 Master
Weights
Forward
Pass
Backprop
FP32 Master
Gradients
FP16
Gradients
FP16
Weights
Apply
Cast to FP32
Copy
Cast to FP16
FP16 Loss
!?
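Why the master copy matters can be shown with a single scalar weight. In this sketch the update size (1e-4) and the step count are hypothetical values chosen so the update is smaller than the FP16 spacing near 1.0.

```python
import numpy as np

lr_times_grad = np.float32(1e-4)   # hypothetical small weight update
steps = 100

w_fp16 = np.float16(1.0)           # weight stored and updated in FP16
w_master = np.float32(1.0)         # FP32 master copy

for _ in range(steps):
    # FP16 spacing near 1.0 is about 9.8e-4, so adding 1e-4 rounds back to 1.0.
    w_fp16 = np.float16(w_fp16 + np.float16(lr_times_grad))
    # The FP32 master weight keeps every update.
    w_master = np.float32(w_master + lr_times_grad)

# Cast the master weight to FP16 for the next forward pass.
w_for_forward = np.float16(w_master)
print(w_fp16, w_master, w_for_forward)
```

The FP16-only weight never moves, while the FP32 master weight accumulates the updates and eventually changes the FP16 copy used in the forward pass.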
16
GRADIENTS MAY UNDERFLOW
Shift by scaling
https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/
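The shift can be illustrated on one number. The sketch below uses a hypothetical gradient value of 2^-27 and a loss scale of 128; multiplying the gradient directly stands in for scaling the loss, since by the chain rule the scale propagates to every gradient.

```python
import numpy as np

loss_scale = np.float32(128.0)       # illustrative scale factor
true_grad = np.float32(2.0 ** -27)   # hypothetical gradient below the FP16 minimum (2**-24)

# Without scaling, the gradient underflows to zero when stored in FP16.
unscaled_fp16 = np.float16(true_grad)

# With scaling, the gradient lands inside the FP16 range ...
scaled_fp16 = np.float16(true_grad * loss_scale)
# ... and is divided by the scale in FP32 before the weight update.
recovered = np.float32(scaled_fp16) / loss_scale

print(unscaled_fp16, scaled_fp16, recovered)
```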
17
MIXED PRECISION TRAINING
FP32 Master
Weights
Forward
Pass
FP16
Loss
Scaled
FP16
Loss Backprop
Scaled
FP32
Gradients
FP32 Gradients
Scaled
FP16
Gradients
FP16
Weights
/ loss_scale,
update
Apply
Cast to fp32
Copy
* loss_scale
18
REDUCTION MAY OVERFLOW
import torch
a = torch.cuda.HalfTensor(4094).fill_(4.0)
a.norm(2, dim=0)  # 256
a = torch.cuda.HalfTensor(4095).fill_(4.0)
a.norm(2, dim=0)  # INF
Reductions like norm overflow if a value > 65,504 is encountered
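The same effect can be reproduced on the CPU with NumPy. This sketch uses a larger array (8192 elements, an arbitrary choice) than the slide so that the overflow does not depend on rounding right at the 65,504 boundary, and it compares an FP16 reduction result with an FP32 one.

```python
import numpy as np

a = np.full(8192, 4.0, dtype=np.float16)
squares = a * a   # 16.0 each, representable in FP16

with np.errstate(over="ignore"):
    # Result kept in FP16: 8192 * 16 = 131072 exceeds 65504.
    norm_fp16 = np.sqrt(np.sum(squares, dtype=np.float16))

# Same reduction with an FP32 result.
norm_fp32 = np.sqrt(np.sum(squares, dtype=np.float32))

print(norm_fp16, norm_fp32)
```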
19
MIXED PRECISION TRAINING
FP32 Master
Weights
Forward
Pass
FP32
Loss
Scaled
FP16
Loss Backprop
Scaled
FP32
Gradients
FP32 Gradients
Scaled
FP16
Gradients
FP16
Weights
/ loss_scale,
update
Apply
Cast to fp32
Copy
cast to
fp16
* loss_scale
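The full loop in the diagram can be sketched with NumPy on a small linear model. Everything here is illustrative: the data are random, the model is a single weight matrix, and the loss scale (128) and learning rate (0.1) are assumed values. The gradient is derived by hand for a mean-squared-error loss.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 8)).astype(np.float16)   # FP16 inputs
y = rng.standard_normal((64, 1)).astype(np.float32)   # FP32 targets
w_master = np.zeros((8, 1), dtype=np.float32)          # FP32 master weights
loss_scale, lr = np.float32(128.0), np.float32(0.1)    # assumed hyperparameters


def loss_fp32(w):
    """Mean squared error, computed in FP32 from an FP16 forward pass."""
    pred = (x @ w.astype(np.float16)).astype(np.float32)
    return float(np.mean((pred - y) ** 2))


loss_before = loss_fp32(w_master)
for _ in range(50):
    w16 = w_master.astype(np.float16)                  # copy, cast to FP16
    pred = (x @ w16).astype(np.float32)                # FP16 forward, FP32 loss
    # Gradient of (loss * loss_scale) w.r.t. pred, cast to FP16 for backprop.
    d_pred = ((2.0 / len(x)) * (pred - y) * loss_scale).astype(np.float16)
    grad16 = x.T @ d_pred                              # scaled FP16 gradients
    grad32 = grad16.astype(np.float32) / loss_scale    # cast to FP32, unscale
    w_master -= lr * grad32                            # update master weights
loss_after = loss_fp32(w_master)
print(loss_before, loss_after)
```

Each comment maps to one arrow in the diagram: cast to FP16, forward, scale, backprop, cast to FP32, divide by loss_scale, update.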
20
DON’T PANIC
All deep learning frameworks have supported mixed precision
since September 2017 (CUDA 9.0)
Image Source: http://gtcarlot.com/data/Audi/R8/2012/54378351/Transmission-56139263.html
21
TENSORFLOW EXAMPLE
import tensorflow as tf  # TensorFlow 1.x API (tf.layers, tf.contrib)

def build_forward_model(inputs):
    _, _, h, w = inputs.get_shape().as_list()
    top_layer = inputs
    top_layer = tf.layers.conv2d(top_layer, 64, 7, use_bias=False,
                                 data_format='channels_first', padding='SAME')
    top_layer = tf.contrib.layers.batch_norm(top_layer,
                                             data_format='NCHW', fused=True)
    top_layer = tf.layers.max_pooling2d(top_layer, 2, 2, data_format='channels_first')
    top_layer = tf.reshape(top_layer, (-1, 64 * (h // 2) * (w // 2)))
    top_layer = tf.layers.dense(top_layer, 128, activation=tf.nn.relu)
    return top_layer
Tensor Core
FP16 operation
Mixed Precision ops
22
TENSORFLOW EXAMPLE
def build_training_model(inputs, labels, nlabel):
    top_layer = build_forward_model(inputs)
    logits = tf.layers.dense(top_layer, nlabel, activation=None)
    loss = tf.losses.sparse_softmax_cross_entropy(logits=logits, labels=labels)
    optimizer = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)
    gradvars = optimizer.compute_gradients(loss)
    train_op = optimizer.apply_gradients(gradvars)
    return inputs, labels, loss, train_op
FP32 Training
23
TENSORFLOW EXAMPLE
# float32_variable_storage_getter is a custom variable getter that is not
# defined on the slide; it must store variables in FP32 and cast them on read.
def build_training_model(inputs, labels, nlabel):
    inputs = tf.cast(inputs, tf.float16)
    with tf.variable_scope('fp32_vars',
                           custom_getter=float32_variable_storage_getter):
        top_layer = build_forward_model(inputs)
        logits = tf.layers.dense(top_layer, nlabel, activation=None)
    logits = tf.cast(logits, tf.float32)
    loss = tf.losses.sparse_softmax_cross_entropy(logits=logits, labels=labels)
    optimizer = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)
    loss_scale = 128.0  # Value may need tuning depending on the model
    gradients, variables = zip(*optimizer.compute_gradients(loss * loss_scale))
    gradients = [grad / loss_scale for grad in gradients]
    gradients, _ = tf.clip_by_global_norm(gradients, 5.0)
    train_op = optimizer.apply_gradients(zip(gradients, variables))
    return inputs, labels, loss, train_op
Casting FP16 for input
Casting FP32 for loss
Master weights in FP32
Scaling loss and
get gradients in FP16
Loss scale down with FP32
Gradients clipping (opt)
24
PyTorch Example
import torch
from torch.autograd import Variable

N, D_in, D_out = 64, 1000, 10
x = Variable(torch.randn(N, D_in )).cuda()
y = Variable(torch.randn(N, D_out)).cuda()
model = torch.nn.Linear(D_in, D_out).cuda()
# FP32 baseline: the optimizer updates the model parameters directly
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
for t in range(500):
    y_pred = model(x)
    loss = torch.nn.functional.mse_loss(y_pred, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
25
MIXED SOLUTION: LOSS (GRADIENT) SCALING
import torch
from torch.autograd import Variable
from apex.fp16_utils import *

N, D_in, D_out = 64, 1000, 10
scale_factor = 128.0
x = Variable(torch.randn(N, D_in )).cuda().half()
y = Variable(torch.randn(N, D_out)).cuda().half()
model = torch.nn.Linear(D_in, D_out).cuda().half()
model_params, master_params = prep_param_list(model)
optimizer = torch.optim.SGD(master_params, lr=1e-3)
for t in range(500):
    y_pred = model(x)
    loss = torch.nn.functional.mse_loss(y_pred.float(), y.float())
    scaled_loss = scale_factor * loss.float()
    model.zero_grad()
    # Backpropagate the scaled loss so small gradients stay representable
    scaled_loss.backward()
    model_grads_to_master_grads(model_params, master_params)
    for param in master_params:
        param.grad.data.mul_(1./scale_factor)
    optimizer.step()
    master_params_to_model_params(model_params, master_params)
Gradients are now rescaled to be representable
The FP32 master gradients must be "descaled"
loss is now FP32 (Model grads are still FP16)
Casting Weights
26
Example: Dynamic loss scale with FP16_Optimizer
import torch
from torch.autograd import Variable
from apex.fp16_utils import *
from apex.optimizers import FP16_Optimizer

fp16 = True      # assumed flag, e.g. from a command-line option
loss_scale = 0   # assumed setting; 0 selects dynamic loss scaling
N, D_in, D_out = 64, 1000, 10
x = Variable(torch.randn(N, D_in )).cuda().half()
y = Variable(torch.randn(N, D_out)).cuda().half()
model = torch.nn.Linear(D_in, D_out).cuda().half()
# FP16_Optimizer creates and manages the FP32 master weights itself
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
if loss_scale == 0:
    optimizer = FP16_Optimizer(optimizer, dynamic_loss_scale=True)
else:
    optimizer = FP16_Optimizer(optimizer, static_loss_scale=loss_scale)
for t in range(500):
    optimizer.zero_grad()
    y_pred = model(x)
    loss = torch.nn.functional.mse_loss(y_pred.float(), y.float())
    if fp16:
        optimizer.backward(loss)
    else:
        loss.backward()
    optimizer.step()
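The idea behind dynamic loss scaling can be sketched in plain Python. This is an illustrative policy, not the Apex implementation: the class name, the factor of 2, and the window length are all assumptions. The scale is lowered whenever a gradient overflows (and that step is skipped), and raised again after a run of clean steps.

```python
import math


class DynamicLossScaler:
    """Illustrative dynamic loss scaling policy (hypothetical, not the Apex code)."""

    def __init__(self, init_scale=2.0 ** 15, factor=2.0, window=1000):
        self.scale = init_scale
        self.factor = factor
        self.window = window
        self.good_steps = 0

    def update(self, grads):
        """Return True if the step should be applied, False if it must be skipped."""
        overflow = any(math.isinf(g) or math.isnan(g) for g in grads)
        if overflow:
            # Overflow: shrink the scale and skip this weight update.
            self.scale /= self.factor
            self.good_steps = 0
            return False
        self.good_steps += 1
        if self.good_steps % self.window == 0:
            # A full window without overflow: try a larger scale.
            self.scale *= self.factor
        return True


scaler = DynamicLossScaler(init_scale=1024.0, window=2)
steps = ([0.1, float("inf")], [0.1, 0.2], [0.3, 0.1])
applied = [scaler.update(g) for g in steps]
print(applied, scaler.scale)
```

With this policy the user does not have to hand-tune a static value such as 128; the scale settles near the largest value that does not overflow.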
27
CAVEATS
Use float32 for certain ops to avoid overflow or underflow
• Reductions (e.g., norm, softmax)
• Range-expanding math functions (e.g., exp, pow)
Apply loss scaling to avoid gradient underflow
https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/
https://github.com/NVIDIA/DeepLearningExamples
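The first caveat can be demonstrated with a naive softmax in NumPy. The logits below are arbitrary illustrative values; the only point is that exp of a moderately large input already exceeds the FP16 maximum, while the same computation in FP32 stays finite.

```python
import numpy as np

logits = np.array([12.0, 1.0, -3.0], dtype=np.float16)

with np.errstate(over="ignore", invalid="ignore"):
    # exp(12) is about 162755, beyond the FP16 maximum of 65504.
    exp_fp16 = np.exp(logits)
    softmax_fp16 = exp_fp16 / exp_fp16.sum()

# Same computation with the inputs cast to FP32 first.
exp_fp32 = np.exp(logits.astype(np.float32))
softmax_fp32 = exp_fp32 / exp_fp32.sum()

print(softmax_fp16, softmax_fp32)
```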
28
PERFORMANCE TRADEOFFS
FP32 Master
Weights
Forward
Pass
FP32
Loss
Scaled
FP16
Loss Backprop
Scaled
FP32
Gradients
FP32 Gradients
Scaled
FP16
Gradients
FP16
Weights
/ loss_scale,
update
Apply
Cast to fp32
Copy
cast to
fp16
* loss_scale
Accelerated Overhead
Forward/Backward acceleration vs gradient type casting (fp16-fp32) overhead
29
FURTHER UNDERSTANDING
Mixed Precision Lectures
Training Neural Networks with Mixed Precision: Theory and Practice
http://on-demand.gputechconf.com/gtc/2018/video/S8923/
Training Neural Networks with Mixed Precision: Real Examples
http://on-demand.gputechconf.com/gtc/2018/video/S81012/
Training Neural Networks with Mixed Precision
http://on-demand.gputechconf.com/gtc-taiwan/2018/pdf/5-1_Internal%20Speaker_Michael%20Carilli_PDF%20For%20Sharing.pdf
30
BERT OPTIMIZATION CASE
https://github.com/google-research/bert/pull/255
31
DOES IT WORK?
Use the NVIDIA profiler to check the details:
nvprof -t 20 --print-gpu-trace python main.py --fp16 /datasets
Example kernel name from the trace:
volta_s884cudnn_fp16_64x64_sliced1x4_ldg8_wgrad_idx_exp_interior_nhwc_nt
volta: architecture identifier
cudnn: library
fp16: precision
64x64: applied size
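The fields called out on this slide can be pulled from a kernel name with a small helper. The function below is a hypothetical sketch: the field positions are inferred from this one example name, and treating "884" in the name as a sign of a Tensor Core kernel is a heuristic, not a documented parser.

```python
import re


def describe_kernel(name: str) -> dict:
    """Extract the fields highlighted on the slide from a cuDNN kernel name."""
    arch = name.split("_", 1)[0]
    precision = re.search(r"fp(16|32|64)", name)
    tile = re.search(r"_(\d+x\d+)_", name)
    return {
        "architecture": arch,
        "library": "cudnn" if "cudnn" in name else None,
        "precision": precision.group(0) if precision else None,
        "tile": tile.group(1) if tile else None,
        # Heuristic: "884" in a Volta kernel name suggests a Tensor Core kernel.
        "tensor_core": "884" in name,
    }


kernel = "volta_s884cudnn_fp16_64x64_sliced1x4_ldg8_wgrad_idx_exp_interior_nhwc_nt"
info = describe_kernel(kernel)
print(info)
```

Running a check like this over every kernel name in the profiler trace gives a quick indication of whether the FP16 path is actually being taken.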
32
INFERENCE
33
GPU INFERENCE ADOPTION IS ACCELERATING
60X Latency Improvement
Real-Time Search
12X Faster Inference
Live Video Analysis
40X Higher Performance
Real-Time Brand Impact
Tesla P4, TensorRT Adoption
Use Cases
VISUAL SEARCH VIDEO ANALYSIS ADVERTISING INFERENCE USE CASES
Video
MapsImage
NLP
Speech
Search
34
Kernel
Auto-Tuning
Layer &
Tensor Fusion
Dynamic
Tensor
Memory
Precision
Calibration
NVIDIA TensorRT 5
Inference Optimizer and Runtime
developer.nvidia.com/tensorrt
Data center, embedded & automotive
In-framework support for TensorFlow
Support for all other frameworks and ONNX
TensorRT inference server microservice with Docker and
Kubernetes integration
New layers and APIs
New OS support for Windows and CentOS
DRIVE PX 2
JETSON TX2
NVIDIA DLA
TESLA P4/T4
TESLA V100
FRAMEWORKS GPU PLATFORMS
TensorRT
Optimizer Runtime
*New in TRT5
35
BREAKTHROUGH RESNET-50
INFERENCE PERFORMANCE
Tesla T4 vs. Tesla V100 (values listed as T4 | V100)
THROUGHPUT (img/s): 4,365 | 6,379
ENERGY EFFICIENCY (img/s/watt): 63 | 22
LATENCY (milliseconds, lower is better): 1 | 0.89
Configurations: Dual-Socket Xeon Gold 6140@3.6GHz with GPUs as shown | 18.11-py3 | TensorRT 5.0
Batch Size = 128: T4: INT8, V100: Mixed
Batch Size = 1: INT8
36
BREAKTHROUGH NMT
INFERENCE PERFORMANCE
Tesla T4 vs. Tesla V100 (values listed as T4 | V100)
THROUGHPUT (tokens/s): 34,122 | 67,124
ENERGY EFFICIENCY (tokens/s/watt): 763 | 924
LATENCY (milliseconds, lower is better): 21 | 14
Configurations: Dual-Socket Xeon E5-2698 v4@3.6GHz with GPU servers as shown | 18.11-py3 | TensorRT 5.0 | Mixed
Batch Size = 128 and Batch Size = 1
37
INFERENCE SERVER ARCHITECTURE
Models supported
● TensorFlow GraphDef/SavedModel
● TensorFlow and TensorRT GraphDef
● TensorRT Plans
● Caffe2 NetDef (ONNX import)
Multi-GPU support
Concurrent model execution
Server HTTP REST API/gRPC
Python/C++ client libraries
Available with Monthly Updates
38
FLEXIBLE MODEL DEPLOYMENT TO BALANCE
CONVENIENCE WITH PERFORMANCE
Native Framework
• Minimal conversion
• Least performant
• Support CPU and GPU
Framework +
TensorRT Runtime
• Some conversion
• Framework fallback for
unsupported layers
• TensorRT performance
with FP16 and INT8
• Supports GPU
TensorRT Runtime
• More conversion
• Most performant with low
memory footprint
• Precision control FP16
and INT8
• Supports GPU
More performance
+
Less conversion
39
GREAT PERFORMANCE FOR MULTIPLE
MODEL DEPLOYMENTS OF RN50
RN50 with 50ms latency SLA
across various deployments
• CPU: TensorFlow FP32
• GPU - V100 16GB: TensorFlow
FP32
• GPU - V100 16GB: TensorRT
FP16
40
INFERENCE SERVER ARCHITECTURE
KUBEFLOW
Kubeflow guest blog: https://www.kubeflow.org/blog/nvidia_tensorrt/
41
● One model per GPU
● Requests are steady across all models
● Utilization is low on all GPUs
● Spike in requests for blue model
● GPUs running blue model are being fully utilized
● Other GPUs remain underutilized
Before TensorRT Inference Server - 5,000 FPS
Before TensorRT Inference Server - 800 FPS
TENSORRT INFERENCE SERVER
METRICS FOR AUTOSCALING
42
● Load multiple models on every GPU
● Load is evenly distributed between all GPUs
● Spike in requests for blue model
● Each GPU can run the blue model concurrently
● Metrics to indicate time to scale up
○ GPU utilization
○ Power usage
○ Inference count
○ Queue time
○ Number of requests/sec
After TensorRT Inference Server - 15,000 FPS
After TensorRT Inference Server - 5,000 FPS
TENSORRT INFERENCE SERVER
METRICS FOR AUTOSCALING
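How such metrics might feed a scale-up decision can be sketched as follows. This is a hypothetical policy, not part of TensorRT Inference Server: the `GpuMetrics` type, the threshold values, and the sample numbers are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class GpuMetrics:
    """One sample of the metrics listed on the slide (hypothetical container)."""
    gpu_utilization: float   # fraction in [0, 1]
    queue_time_ms: float     # average time a request waits before execution
    requests_per_sec: float


def should_scale_up(samples, util_threshold=0.8, queue_threshold_ms=20.0):
    """Scale up when every recent sample is busy and requests are queuing."""
    if not samples:
        return False
    return all(
        s.gpu_utilization >= util_threshold and s.queue_time_ms >= queue_threshold_ms
        for s in samples
    )


steady = [GpuMetrics(0.35, 2.0, 800.0), GpuMetrics(0.40, 3.0, 850.0)]
spike = [GpuMetrics(0.95, 45.0, 5000.0), GpuMetrics(0.97, 60.0, 5200.0)]
print(should_scale_up(steady), should_scale_up(spike))
```

Requiring both high utilization and growing queue time avoids scaling up when GPUs are busy but still keeping pace with incoming requests.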
43
MACHINE LEARNING /
ANALYTICS
44
THE BIG PROBLEM IN DATA SCIENCE
All
Data
ETL
Manage Data
Structured
Data Store
Data
Preparation
Training
Model
Training
Visualization
Evaluate
Inference
Deploy
Slow Training Times for
Data Scientists
45
RAPIDS — OPEN GPU DATA SCIENCE
Software Stack
Data Preparation | Model Training | Visualization
CUDA
PYTHON
APACHE ARROW
DASK
DEEP LEARNING
FRAMEWORKS
CUDNN
RAPIDS
CUML CUDF CUGRAPH
DEPLOYING RAPIDS — FASTER SPEEDS,
REAL WORLD BENEFITS
[Charts: ETL and ML run times in seconds, NVIDIA DGX-2 vs. 20, 50, and 100 CPU nodes; axis markers at 1 Hour and 2 Hours]
Fannie Mae
mortgage dataset
Fast Loading w/
cuDF
cuML
xgboost
GPU Visualization
https://www.youtube.com/watch?v=G1kx_7NJJGA&feature=youtu.be&t=4287
47
NVIDIA GPU CLOUD
48
A CONSISTENT, HYBRID CLOUD EXPERIENCE
ACROSS COMPUTE PLATFORMS
49
Containerized Applications
TensorFlow PyTorch MXNet
TensorRT
Inference Server
CUDA RT | CUDA RT | CUDA RT | CUDA RT
Linux Kernel + CUDA Driver
Tuned SW
CUDA RT
Other
Frameworks
and Apps. . .
OPTIMIZED AND UP-TO-DATE
The Top Deep Learning Containers are Tuned and Optimized Monthly to
Deliver Maximum Performance on NVIDIA GPUs
NVIDIA Container
Runtime for Docker
50
MONTHLY IMPROVEMENT
Over 12 months, up to 1.8X improvement with mixed-precision on ResNet-50
https://docs.nvidia.com/deeplearning/dgx/support-matrix/index.html
CUDA 10 based TensorFlow
한재근 | jahan@nvidia.com

Más contenido relacionado

La actualidad más candente

Rapport de projet de fin d'études - SIEMENS 2016
Rapport de projet de fin d'études - SIEMENS 2016Rapport de projet de fin d'études - SIEMENS 2016
Rapport de projet de fin d'études - SIEMENS 2016Soufiane KALLIDA
 
029-3 - CONCEPTION PIECES PLASTIQUE 2010.ppt
029-3 - CONCEPTION PIECES PLASTIQUE 2010.ppt029-3 - CONCEPTION PIECES PLASTIQUE 2010.ppt
029-3 - CONCEPTION PIECES PLASTIQUE 2010.pptChokriGadri1
 
Dcs lec02 - z-transform
Dcs   lec02 - z-transformDcs   lec02 - z-transform
Dcs lec02 - z-transformAmr E. Mohamed
 
pfe_final.pptx
pfe_final.pptxpfe_final.pptx
pfe_final.pptxhani911563
 
Soutenance de stage Ingénieur
Soutenance de stage IngénieurSoutenance de stage Ingénieur
Soutenance de stage IngénieurFaten Chalbi
 
47811458 exercices-systemes-echantillonnes
47811458 exercices-systemes-echantillonnes47811458 exercices-systemes-echantillonnes
47811458 exercices-systemes-echantillonnesTRIKI BILEL
 
Analyse fonctionnelle: tronc commun technologique
Analyse fonctionnelle: tronc commun technologiqueAnalyse fonctionnelle: tronc commun technologique
Analyse fonctionnelle: tronc commun technologiquemariya808
 
Développement d’une Application Mobile Android StreetArtPlanet
Développement d’une Application Mobile Android StreetArtPlanetDéveloppement d’une Application Mobile Android StreetArtPlanet
Développement d’une Application Mobile Android StreetArtPlanet Slim Namouchi
 
Convolution linear and circular using z transform day 5
Convolution   linear and circular using z transform day 5Convolution   linear and circular using z transform day 5
Convolution linear and circular using z transform day 5vijayanand Kandaswamy
 
Modélisation d'un système de prévention des incendies
Modélisation d'un système de prévention des incendiesModélisation d'un système de prévention des incendies
Modélisation d'un système de prévention des incendiesMEJDAOUI Soufiane
 
敏捷开发技术最佳实践(统一敏捷开发过程)
敏捷开发技术最佳实践(统一敏捷开发过程)敏捷开发技术最佳实践(统一敏捷开发过程)
敏捷开发技术最佳实践(统一敏捷开发过程)Weijun Zhong
 
Damped force vibrating Model Laplace Transforms
Damped force vibrating Model Laplace Transforms Damped force vibrating Model Laplace Transforms
Damped force vibrating Model Laplace Transforms Student
 

La actualidad más candente (20)

Rapport de projet de fin d'études - SIEMENS 2016
Rapport de projet de fin d'études - SIEMENS 2016Rapport de projet de fin d'études - SIEMENS 2016
Rapport de projet de fin d'études - SIEMENS 2016
 
Properties of Fourier transform
Properties of Fourier transformProperties of Fourier transform
Properties of Fourier transform
 
Cours robotique
Cours robotiqueCours robotique
Cours robotique
 
Fourier transform
Fourier transformFourier transform
Fourier transform
 
029-3 - CONCEPTION PIECES PLASTIQUE 2010.ppt
029-3 - CONCEPTION PIECES PLASTIQUE 2010.ppt029-3 - CONCEPTION PIECES PLASTIQUE 2010.ppt
029-3 - CONCEPTION PIECES PLASTIQUE 2010.ppt
 
Dcs lec02 - z-transform
Dcs   lec02 - z-transformDcs   lec02 - z-transform
Dcs lec02 - z-transform
 
pfe_final.pptx
pfe_final.pptxpfe_final.pptx
pfe_final.pptx
 
Soutenance de stage Ingénieur
Soutenance de stage IngénieurSoutenance de stage Ingénieur
Soutenance de stage Ingénieur
 
47811458 exercices-systemes-echantillonnes
47811458 exercices-systemes-echantillonnes47811458 exercices-systemes-echantillonnes
47811458 exercices-systemes-echantillonnes
 
Lista de precios etex
Lista de precios etexLista de precios etex
Lista de precios etex
 
Analyse fonctionnelle: tronc commun technologique
Analyse fonctionnelle: tronc commun technologiqueAnalyse fonctionnelle: tronc commun technologique
Analyse fonctionnelle: tronc commun technologique
 
IHM
IHMIHM
IHM
 
Deep learning
Deep learningDeep learning
Deep learning
 
Développement d’une Application Mobile Android StreetArtPlanet
Développement d’une Application Mobile Android StreetArtPlanetDéveloppement d’une Application Mobile Android StreetArtPlanet
Développement d’une Application Mobile Android StreetArtPlanet
 
Convolution linear and circular using z transform day 5
Convolution   linear and circular using z transform day 5Convolution   linear and circular using z transform day 5
Convolution linear and circular using z transform day 5
 
Scrum
ScrumScrum
Scrum
 
Modélisation d'un système de prévention des incendies
Modélisation d'un système de prévention des incendiesModélisation d'un système de prévention des incendies
Modélisation d'un système de prévention des incendies
 
Présentation pfe finale
Présentation pfe finalePrésentation pfe finale
Présentation pfe finale
 
敏捷开发技术最佳实践(统一敏捷开发过程)
敏捷开发技术最佳实践(统一敏捷开发过程)敏捷开发技术最佳实践(统一敏捷开发过程)
敏捷开发技术最佳实践(统一敏捷开发过程)
 
Damped force vibrating Model Laplace Transforms
Damped force vibrating Model Laplace Transforms Damped force vibrating Model Laplace Transforms
Damped force vibrating Model Laplace Transforms
 

Similar a JMI Techtalk: 한재근 - How to use GPU for developing AI

Accelerated Training of Transformer Models
Accelerated Training of Transformer ModelsAccelerated Training of Transformer Models
Accelerated Training of Transformer ModelsDatabricks
 
Profiling deep learning network using NVIDIA nsight systems
Profiling deep learning network using NVIDIA nsight systemsProfiling deep learning network using NVIDIA nsight systems
Profiling deep learning network using NVIDIA nsight systemsJack (Jaegeun) Han
 
RAPIDS: ускоряем Pandas и scikit-learn на GPU Павел Клеменков, NVidia
RAPIDS: ускоряем Pandas и scikit-learn на GPU  Павел Клеменков, NVidiaRAPIDS: ускоряем Pandas и scikit-learn на GPU  Павел Клеменков, NVidia
RAPIDS: ускоряем Pandas и scikit-learn на GPU Павел Клеменков, NVidiaMail.ru Group
 
Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLinside-BigData.com
 
OpenTuesday: Neues aus der RRDtool Welt
OpenTuesday: Neues aus der RRDtool WeltOpenTuesday: Neues aus der RRDtool Welt
OpenTuesday: Neues aus der RRDtool WeltDigicomp Academy AG
 
Accelerating Data Science With GPUs
Accelerating Data Science With GPUsAccelerating Data Science With GPUs
Accelerating Data Science With GPUsiguazio
 
20171206 PGconf.ASIA LT gstore_fdw
20171206 PGconf.ASIA LT gstore_fdw20171206 PGconf.ASIA LT gstore_fdw
20171206 PGconf.ASIA LT gstore_fdwKohei KaiGai
 
Deep Learning, Microsoft Cognitive Toolkit (CNTK) and Azure Machine Learning ...
Deep Learning, Microsoft Cognitive Toolkit (CNTK) and Azure Machine Learning ...Deep Learning, Microsoft Cognitive Toolkit (CNTK) and Azure Machine Learning ...
Deep Learning, Microsoft Cognitive Toolkit (CNTK) and Azure Machine Learning ...Naoki (Neo) SATO
 
Achitecture Aware Algorithms and Software for Peta and Exascale
Achitecture Aware Algorithms and Software for Peta and ExascaleAchitecture Aware Algorithms and Software for Peta and Exascale
Achitecture Aware Algorithms and Software for Peta and Exascaleinside-BigData.com
 
Leveraging R in Big Data of Mobile Ads (R在行動廣告大數據的應用)
Leveraging R in Big Data of Mobile Ads (R在行動廣告大數據的應用)Leveraging R in Big Data of Mobile Ads (R在行動廣告大數據的應用)
Leveraging R in Big Data of Mobile Ads (R在行動廣告大數據的應用)Craig Chao
 
pgconfasia2016 plcuda en
pgconfasia2016 plcuda enpgconfasia2016 plcuda en
pgconfasia2016 plcuda enKohei KaiGai
 
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdf
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdfS51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdf
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdfDLow6
 
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsPL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsKohei KaiGai
 
Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUs
Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUsEarly Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUs
Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUsJeff Larkin
 
High Performance Pedestrian Detection On TEGRA X1
High Performance Pedestrian Detection On TEGRA X1High Performance Pedestrian Detection On TEGRA X1
High Performance Pedestrian Detection On TEGRA X1NVIDIA
 
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...Chetan Khatri
 
AI optimizing HPC simulations (presentation from 6th EULAG Workshop)
AI optimizing HPC simulations (presentation from  6th EULAG Workshop)AI optimizing HPC simulations (presentation from  6th EULAG Workshop)
AI optimizing HPC simulations (presentation from 6th EULAG Workshop)byteLAKE
 
Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner ...
Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner ...Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner ...
Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner ...Databricks
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...Data Con LA
 

Similar a JMI Techtalk: 한재근 - How to use GPU for developing AI (20)

Accelerated Training of Transformer Models
Accelerated Training of Transformer ModelsAccelerated Training of Transformer Models
Accelerated Training of Transformer Models
 
Profiling deep learning network using NVIDIA nsight systems
Profiling deep learning network using NVIDIA nsight systemsProfiling deep learning network using NVIDIA nsight systems
Profiling deep learning network using NVIDIA nsight systems
 
Joel Falcou, Boost.SIMD
Joel Falcou, Boost.SIMDJoel Falcou, Boost.SIMD
Joel Falcou, Boost.SIMD
 
RAPIDS: ускоряем Pandas и scikit-learn на GPU Павел Клеменков, NVidia
RAPIDS: ускоряем Pandas и scikit-learn на GPU  Павел Клеменков, NVidiaRAPIDS: ускоряем Pandas и scikit-learn на GPU  Павел Клеменков, NVidia
RAPIDS: ускоряем Pandas и scikit-learn на GPU Павел Клеменков, NVidia
 
Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and ML
 
OpenTuesday: Neues aus der RRDtool Welt
OpenTuesday: Neues aus der RRDtool WeltOpenTuesday: Neues aus der RRDtool Welt
OpenTuesday: Neues aus der RRDtool Welt
 
Accelerating Data Science With GPUs
Accelerating Data Science With GPUsAccelerating Data Science With GPUs
Accelerating Data Science With GPUs
 
20171206 PGconf.ASIA LT gstore_fdw
20171206 PGconf.ASIA LT gstore_fdw20171206 PGconf.ASIA LT gstore_fdw
20171206 PGconf.ASIA LT gstore_fdw
 
Deep Learning, Microsoft Cognitive Toolkit (CNTK) and Azure Machine Learning ...
Deep Learning, Microsoft Cognitive Toolkit (CNTK) and Azure Machine Learning ...Deep Learning, Microsoft Cognitive Toolkit (CNTK) and Azure Machine Learning ...
Deep Learning, Microsoft Cognitive Toolkit (CNTK) and Azure Machine Learning ...
 
Achitecture Aware Algorithms and Software for Peta and Exascale
Achitecture Aware Algorithms and Software for Peta and ExascaleAchitecture Aware Algorithms and Software for Peta and Exascale
Achitecture Aware Algorithms and Software for Peta and Exascale
 
Leveraging R in Big Data of Mobile Ads (R在行動廣告大數據的應用)
Leveraging R in Big Data of Mobile Ads (R在行動廣告大數據的應用)Leveraging R in Big Data of Mobile Ads (R在行動廣告大數據的應用)
Leveraging R in Big Data of Mobile Ads (R在行動廣告大數據的應用)
 
pgconfasia2016 plcuda en
pgconfasia2016 plcuda enpgconfasia2016 plcuda en
pgconfasia2016 plcuda en
 
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdf
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdfS51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdf
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdf
 
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsPL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
 
Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUs
Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUsEarly Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUs
Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUs
 
High Performance Pedestrian Detection On TEGRA X1
High Performance Pedestrian Detection On TEGRA X1High Performance Pedestrian Detection On TEGRA X1
High Performance Pedestrian Detection On TEGRA X1
 
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
 
AI optimizing HPC simulations (presentation from 6th EULAG Workshop)
AI optimizing HPC simulations (presentation from  6th EULAG Workshop)AI optimizing HPC simulations (presentation from  6th EULAG Workshop)
AI optimizing HPC simulations (presentation from 6th EULAG Workshop)
 
Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner ...
Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner ...Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner ...
Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner ...
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...
 

Más de Lablup Inc.

Lablupconf session1-2 "거대한 백엔드에 벽돌 끼워넣기"
Lablupconf session1-2 "거대한 백엔드에 벽돌 끼워넣기"Lablupconf session1-2 "거대한 백엔드에 벽돌 끼워넣기"
Lablupconf session1-2 "거대한 백엔드에 벽돌 끼워넣기"Lablup Inc.
 
Lablupconf session8 "Paving the road to AI-powered world"
Lablupconf session8 "Paving the road to AI-powered world"Lablupconf session8 "Paving the road to AI-powered world"
Lablupconf session8 "Paving the road to AI-powered world"Lablup Inc.
 




JMI Techtalk: 한재근 - How to use GPU for developing AI

  • 1. 한재근 | Solutions Architect | jahan@nvidia.com HOW TO USE GPU FOR DEVELOPING AI
  • 2. YEARLY GRAPHICS CARD RELEASE https://www.reddit.com/r/pcmasterrace/comments/6xpiex/yearly_graphics_card_releases/
  • 3. NOW WE HAVE VOLTA AND TURING (Tesla, RTX, TITAN, …). Slide labels: Tesla, Titan, GeForce; V100, T4; Tensor Core Enabled.
  • 4. TESLA UNIVERSAL ACCELERATION PLATFORM: Single Platform Drives Utilization and Productivity. CUSTOMER USE CASES: Consumer Internet (Speech, Translate, Recommender); Scientific Applications (Molecular Simulations, Weather Forecasting, Seismic Mapping); Industrial Applications (Manufacturing, Healthcare, Finance). APPS & FRAMEWORKS: Amber, NAMD, +550 Applications. NVIDIA SDK & LIBRARIES on CUDA: Machine Learning / Analytics (cuML, cuDF, cuGraph); Deep Learning (cuDNN, cuBLAS, CUTLASS, NCCL, TensorRT); HPC (cuBLAS, OpenACC, cuFFT). TESLA GPUs & SYSTEMS: Tesla GPU, Virtual GPU, NVIDIA DGX Family, NVIDIA HGX, System OEM, Cloud.
  • 5. AGENDA • GPU for Deep Learning • Mixed precision training • Inference optimization • GPU for Machine Learning / Analytics • Deploying your intelligence
  • 6. GPU DEEP LEARNING IS A NEW COMPUTING MODEL (diagram labels: Training, Datacenter inference, Device). Training: billions of trillions of operations; GPUs train larger models and accelerate time to market. Inference: 10s of billions of image, voice, and video queries per day; GPU inference gives fast response and maximizes data center throughput.
  • 8. TESLA PLATFORM ENABLES DRAMATIC REDUCTION IN TIME TO TRAIN. Relative Time to Train Improvements (ResNet-50): 2x CPU: 25 Days; Single Node 1X P100: 4.8 Days; Single Node 1X V100: 30 Hours (3.84x faster than 1X P100); DGX-1 8x V100: 3.3 Hours; At scale 2176x V100: <4 Minutes. ResNet-50, 90 epochs to solution | CPU Server: dual socket Intel Xeon Gold 6140 | Sony 2176x V100 record: https://nnabla.org/paper/imagenet_in_224sec.pdf
  • 9. TESLA V100 TENSOR CORE GPU: World’s Most Advanced Data Center GPU. 5,120 CUDA cores | 640 NEW Tensor cores | 7.8 FP64 TFLOPS | 15.7 FP32 TFLOPS | 125 Tensor TFLOPS | 20MB SM RF | 16MB Cache | 32 GB HBM2 @ 900GB/s | 300GB/s NVLink
  • 10. WHAT IS TENSOR CORE: Programmable matrix-multiply and accumulate units (D = A * B + C) • 125 Tensor TFLOPS in V100 • Each core provides 4x4x4 matrix processing • Programmable using CUDA WMMA, cuDNN, cuBLAS • Optimized NHWC tensor layout https://devblogs.nvidia.com/tensor-core-ai-performance-milestones/
  • 11. PERFORMANCE CONTRIBUTION Using half precision and Volta your networks can be: 1. 2-4x faster 2. half the size 3. just as powerful with no architecture change.
  • 12. TRAINING WITH HALF PRECISION. Issue in FP16: dynamic range. FP32: [2^-149, ~2^128], i.e. [~1.4e-45, ~3.4e38]. FP16: [2^-24, ~2^16], i.e. [~5.96e-8, 65,504].
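The FP16 limits quoted on this slide can be checked directly. The following is a minimal sketch that assumes NumPy is available (the talk itself uses TensorFlow and PyTorch); it only inspects the IEEE half-precision format, and the probe value `1e-8` is an arbitrary illustration.

```python
import numpy as np

fp16_max = float(np.finfo(np.float16).max)   # largest finite FP16 value
fp32_max = float(np.finfo(np.float32).max)   # largest finite FP32 value
fp16_min_subnormal = 2.0 ** -24              # smallest positive FP16 value

# A value below half of the smallest FP16 subnormal rounds to zero in FP16 ...
tiny_in_fp16 = float(np.float16(1e-8))
# ... while FP32 still represents it.
tiny_in_fp32 = float(np.float32(1e-8))

print(fp16_max, fp16_min_subnormal, fp32_max)
print(tiny_in_fp16, tiny_in_fp32)
```
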
  • 13. A MIXED PRECISION SOLUTION. Imprecise weight updates -> "master" weights in FP32. Gradients underflow -> loss (gradient) scaling. Reductions overflow -> accumulate to FP32.
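The first problem on this slide, imprecise weight updates, can be illustrated with a small sketch. It assumes NumPy, and the weight value, update size, and step count are made up for illustration: an update much smaller than the FP16 spacing near 1.0 (about 0.001) is rounded away when the weight is stored in FP16, but accumulates when the weight is kept in FP32.

```python
import numpy as np

def accumulate(dtype, steps=1000, update=1e-4):
    """Repeatedly add a small update to a weight of 1.0 stored in `dtype`."""
    w = dtype(1.0)
    for _ in range(steps):
        w = dtype(w + dtype(update))
    return float(w)

w_fp16 = accumulate(np.float16)   # every update is lost to rounding
w_fp32 = accumulate(np.float32)   # updates accumulate to roughly 1.1

print(w_fp16, w_fp32)
```
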
  • 15. MIXED SOLUTION 1: FP32 MASTER WEIGHTS. Flow: FP32 Master Weights -> copy, cast to FP16 -> FP16 Weights -> Forward Pass -> FP16 Loss -> Backprop -> FP16 Gradients -> cast to FP32 -> FP32 Master Gradients -> Apply -> FP32 Master Weights. Open question on the slide: FP16 Loss !?
  • 16. GRADIENTS MAY UNDERFLOW: shift them into the representable range by scaling. https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/
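A minimal sketch of the "shift by scaling" idea, assuming NumPy; the gradient value `1e-8` and the scale `1024` are illustrative choices, not values from the talk. Multiplying by the scale before the cast to FP16 keeps the value representable, and dividing in FP32 afterwards recovers it to within FP16 rounding error.

```python
import numpy as np

grad = 1e-8            # a small gradient value (illustrative)
loss_scale = 1024.0    # illustrative scale factor, a power of two

unscaled_fp16 = np.float16(grad)                # underflows to zero
scaled_fp16 = np.float16(grad * loss_scale)     # about 1.02e-5, representable in FP16
# Unscale in FP32 so that the small value is not lost again.
recovered = float(np.float32(scaled_fp16) / np.float32(loss_scale))

print(float(unscaled_fp16), float(scaled_fp16), recovered)
```
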
  • 17. MIXED PRECISION TRAINING. Flow: FP32 Master Weights -> copy -> FP16 Weights -> Forward Pass -> FP16 Loss -> * loss_scale -> Scaled FP16 Loss -> Backprop -> Scaled FP16 Gradients -> cast to FP32 -> Scaled FP32 Gradients -> / loss_scale -> FP32 Gradients -> Apply (update) -> FP32 Master Weights.
  • 18. REDUCTION MAY OVERFLOW. Reductions like norm overflow if a value > 65,504 is encountered:
      a = torch.cuda.HalfTensor(4094).fill_(4.0)
      a.norm(2, dim=0)  # slide shows 256
      a = torch.cuda.HalfTensor(4095).fill_(4.0)
      a.norm(2, dim=0)  # slide shows INF
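The remedy named earlier in the deck, accumulating in FP32, can be sketched without a GPU. This version assumes NumPy rather than the PyTorch tensors on the slide, and uses 8192 elements (instead of the slide's 4095) so that the sum of squares is far above the FP16 maximum; `l2_norm` is a hypothetical helper written for this illustration.

```python
import numpy as np

def l2_norm(values, acc_dtype):
    """L2 norm with the squares and their sum computed in `acc_dtype`."""
    v = values.astype(acc_dtype)
    with np.errstate(over="ignore"):   # silence the expected overflow warning
        return np.sqrt(np.sum(v * v, dtype=acc_dtype))

a = np.full(8192, 4.0, dtype=np.float16)
norm_fp16 = l2_norm(a, np.float16)   # sum of squares is 131072 > 65504, so the result is inf
norm_fp32 = l2_norm(a, np.float32)   # sqrt(131072), about 362.04

print(norm_fp16, norm_fp32)
```
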
  • 19. MIXED PRECISION TRAINING. Flow: FP32 Master Weights -> copy, cast to FP16 -> FP16 Weights -> Forward Pass -> FP32 Loss -> * loss_scale -> Scaled FP16 Loss -> Backprop -> Scaled FP16 Gradients -> cast to FP32 -> Scaled FP32 Gradients -> / loss_scale -> FP32 Gradients -> Apply (update) -> FP32 Master Weights.
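The data flow on this slide can be traced end to end on a toy problem. The sketch below is an illustration only, assuming NumPy and a linear least-squares model; the function name `mixed_precision_step`, the data, the learning rate, and the loss scale are all made up here and are not the talk's code. Weights are stored in FP32, the forward pass and the gradient computation run in FP16, and the loss, unscaling, and update happen in FP32.

```python
import numpy as np

def mixed_precision_step(master_w, x, y, lr=0.1, loss_scale=128.0):
    """One SGD step for y ~ x @ w with FP32 master weights and FP16 compute."""
    w16 = master_w.astype(np.float16)              # copy, cast to FP16
    x16 = x.astype(np.float16)
    pred16 = x16 @ w16                             # FP16 forward pass
    err = pred16.astype(np.float32) - y            # loss is computed in FP32
    loss = float(np.mean(err ** 2))
    # Backprop of (loss * loss_scale): scaled gradient w.r.t. the predictions, in FP16.
    scaled_dpred16 = (loss_scale * 2.0 * err / err.size).astype(np.float16)
    scaled_grad16 = x16.T @ scaled_dpred16         # scaled FP16 gradients
    grad32 = scaled_grad16.astype(np.float32) / loss_scale   # cast to FP32, then unscale
    return master_w - lr * grad32, loss            # update the FP32 master weights

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 4)).astype(np.float32)
true_w = np.array([1.0, -2.0, 0.5, 3.0], dtype=np.float32)
y = x @ true_w

w = np.zeros(4, dtype=np.float32)
losses = []
for _ in range(50):
    w, loss = mixed_precision_step(w, x, y)
    losses.append(loss)

print(losses[0], losses[-1])
```

The loss should fall by several orders of magnitude before it flattens at a floor set by FP16 rounding in the forward pass.
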
  • 20. DON’T PANIC: all deep learning frameworks have supported this since September 2017 (CUDA 9.0). Image Source: http://gtcarlot.com/data/Audi/R8/2012/54378351/Transmission-56139263.html
  • 21. TENSORFLOW EXAMPLE (TensorFlow 1.x API). Slide callouts: Tensor Core FP16 operation; Mixed Precision ops.
      import tensorflow as tf
      def build_forward_model(inputs):
          _, _, h, w = inputs.get_shape().as_list()
          top_layer = inputs
          top_layer = tf.layers.conv2d(top_layer, 64, 7, use_bias=False,
                                       data_format='channels_first', padding='SAME')
          top_layer = tf.contrib.layers.batch_norm(top_layer, data_format='NCHW', fused=True)
          top_layer = tf.layers.max_pooling2d(top_layer, 2, 2, data_format='channels_first')
          top_layer = tf.reshape(top_layer, (-1, 64 * (h // 2) * (w // 2)))
          top_layer = tf.layers.dense(top_layer, 128, activation=tf.nn.relu)
          return top_layer
  • 22. TENSORFLOW EXAMPLE (FP32 training)
      def build_training_model(inputs, labels, nlabel):
          top_layer = build_forward_model(inputs)
          logits = tf.layers.dense(top_layer, nlabel, activation=None)
          loss = tf.losses.sparse_softmax_cross_entropy(logits=logits, labels=labels)
          optimizer = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)
          gradvars = optimizer.compute_gradients(loss)
          train_op = optimizer.apply_gradients(gradvars)
          return inputs, labels, loss, train_op
  • 23. TENSORFLOW EXAMPLE (mixed precision training)
      def build_training_model(inputs, labels, nlabel):
          inputs = tf.cast(inputs, tf.float16)  # cast the input to FP16
          # Master weights in FP32; float32_variable_storage_getter is a custom getter not shown on the slide
          with tf.variable_scope('fp32_vars', custom_getter=float32_variable_storage_getter):
              top_layer = build_forward_model(inputs)
              logits = tf.layers.dense(top_layer, nlabel, activation=None)
          logits = tf.cast(logits, tf.float32)  # cast to FP32 for the loss
          loss = tf.losses.sparse_softmax_cross_entropy(logits=logits, labels=labels)
          optimizer = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)
          loss_scale = 128.0  # Value may need tuning depending on the model
          # Scale the loss and get the gradients in FP16
          gradients, variables = zip(*optimizer.compute_gradients(loss * loss_scale))
          gradients = [grad / loss_scale for grad in gradients]  # scale back down in FP32
          gradients, _ = tf.clip_by_global_norm(gradients, 5.0)  # gradient clipping (optional)
          train_op = optimizer.apply_gradients(zip(gradients, variables))
          return inputs, labels, loss, train_op
  • 24. PyTorch Example (FP32 baseline)
      import torch
      from torch.autograd import Variable
      N, D_in, D_out = 64, 1000, 10
      x = Variable(torch.randn(N, D_in)).cuda()
      y = Variable(torch.randn(N, D_out)).cuda()
      model = torch.nn.Linear(D_in, D_out).cuda()
      optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)  # no separate master copy is needed in FP32
      for t in range(500):
          y_pred = model(x)
          loss = torch.nn.functional.mse_loss(y_pred, y)
          optimizer.zero_grad()
          loss.backward()
          optimizer.step()
  • 25. MIXED SOLUTION: LOSS (GRADIENT) SCALING
      import torch
      from torch.autograd import Variable
      from apex.fp16_utils import *
      N, D_in, D_out = 64, 1000, 10
      scale_factor = 128.0
      x = Variable(torch.randn(N, D_in)).cuda().half()
      y = Variable(torch.randn(N, D_out)).cuda().half()
      model = torch.nn.Linear(D_in, D_out).cuda().half()  # cast the weights to FP16
      model_params, master_params = prep_param_list(model)  # FP16 model params, FP32 master params
      optimizer = torch.optim.SGD(master_params, lr=1e-3)
      for t in range(500):
          y_pred = model(x)
          loss = torch.nn.functional.mse_loss(y_pred.float(), y.float())  # loss is now FP32
          scaled_loss = scale_factor * loss.float()
          model.zero_grad()
          scaled_loss.backward()  # gradients are now rescaled to be representable (model grads are still FP16)
          model_grads_to_master_grads(model_params, master_params)
          for param in master_params:  # the FP32 master gradients must be "descaled"
              param.grad.data.mul_(1. / scale_factor)
          optimizer.step()
          master_params_to_model_params(model_params, master_params)
  • 26. Example: dynamic loss scale with FP16_Optimizer
      import torch
      from torch.autograd import Variable
      from apex.fp16_utils import *
      from apex.optimizers import FP16_Optimizer
      fp16, loss_scale = True, 0  # assumed script flags, not defined on the slide; loss_scale == 0 selects dynamic scaling
      N, D_in, D_out = 64, 1000, 10
      x = Variable(torch.randn(N, D_in)).cuda().half()
      y = Variable(torch.randn(N, D_out)).cuda().half()
      model = torch.nn.Linear(D_in, D_out).cuda().half()
      optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)  # FP16_Optimizer keeps its own FP32 master copy
      if loss_scale == 0:
          optimizer = FP16_Optimizer(optimizer, dynamic_loss_scale=True)
      else:
          optimizer = FP16_Optimizer(optimizer, static_loss_scale=loss_scale)
      for t in range(500):
          optimizer.zero_grad()
          y_pred = model(x)
          loss = torch.nn.functional.mse_loss(y_pred.float(), y.float())
          if fp16:
              optimizer.backward(loss)  # scales the loss before backprop
          else:
              loss.backward()
          optimizer.step()
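Dynamic loss scaling is commonly described as: skip the step and lower the scale when the gradients overflow, and raise the scale again after a run of clean steps. The sketch below illustrates that policy in plain Python. It is not apex's implementation; `DynamicLossScaler`, `has_overflow`, and the default values are hypothetical names and numbers chosen for this example.

```python
class DynamicLossScaler:
    """Hypothetical helper: halve the scale on overflow, double it after a clean streak."""

    def __init__(self, init_scale=2.0 ** 15, factor=2.0, window=1000):
        self.scale = init_scale
        self.factor = factor
        self.window = window       # number of clean steps before the scale grows
        self.clean_steps = 0

    def update(self, found_overflow):
        """Adjust the scale; return True if the optimizer step should be applied."""
        if found_overflow:
            self.scale = max(self.scale / self.factor, 1.0)
            self.clean_steps = 0
            return False           # skip this step, the gradients are unusable
        self.clean_steps += 1
        if self.clean_steps >= self.window:
            self.scale *= self.factor
            self.clean_steps = 0
        return True


def has_overflow(grads):
    """True if any gradient value is inf or NaN."""
    return any(g != g or g in (float("inf"), float("-inf")) for g in grads)


scaler = DynamicLossScaler(init_scale=8.0, window=2)
step_applied = scaler.update(has_overflow([0.5, float("inf")]))
print(step_applied, scaler.scale)
```
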
  • 27. CAVEATS. Use float32 for certain ops to avoid overflow or underflow: • Reductions (e.g., norm, softmax) • Range-expanding math functions (e.g., exp, pow). Apply loss scaling to avoid gradient underflow. https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/ https://github.com/NVIDIA/DeepLearningExamples
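Both caveats show up in softmax, which combines a range-expanding function (exp) with a reduction (sum). A minimal sketch, assuming NumPy and illustrative logits: the computation is done in FP32 and only the result is cast back to FP16. `softmax_fp32` is a hypothetical helper, and the max-subtraction inside it is the usual stability trick rather than something taken from the slide.

```python
import numpy as np

def softmax_fp32(logits_fp16):
    """Softmax computed in FP32 on FP16 inputs, cast back to FP16 at the end."""
    z = logits_fp16.astype(np.float32)
    z = z - z.max()                  # standard max-subtraction for stability
    e = np.exp(z)
    return (e / e.sum()).astype(np.float16)

logits = np.array([12.0, 11.0, 10.0], dtype=np.float16)

with np.errstate(over="ignore"):
    naive_exp = np.exp(logits)       # exp(12) is about 162755 > 65504, so inf in FP16

probs = softmax_fp32(logits)
print(naive_exp, probs)
```
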
  • 28. PERFORMANCE TRADEOFFS: forward/backward acceleration vs. gradient type casting (FP16-FP32) overhead. The slide marks parts of the mixed precision training flow as "Accelerated" or "Overhead": FP32 Master Weights -> copy, cast to FP16 -> FP16 Weights -> Forward Pass -> FP32 Loss -> * loss_scale -> Scaled FP16 Loss -> Backprop -> Scaled FP16 Gradients -> cast to FP32 -> Scaled FP32 Gradients -> / loss_scale -> FP32 Gradients -> Apply (update) -> FP32 Master Weights.
  • 29. FURTHER UNDERSTANDING: Mixed Precision Lectures. Training Neural Networks with Mixed Precision: Theory and Practice http://on-demand.gputechconf.com/gtc/2018/video/S8923/ | Training Neural Networks with Mixed Precision: Real Examples http://on-demand.gputechconf.com/gtc/2018/video/S81012/ | Training Neural Networks with Mixed Precision http://on-demand.gputechconf.com/gtc-taiwan/2018/pdf/5-1_Internal%20Speaker_Michael%20Carilli_PDF%20For%20Sharing.pdf
  • 31. DOES IT WORK? Use the NVIDIA profiler for details: nvprof -t 20 --print-gpu-trace python main.py --fp16 /datasets. Example kernel name: volta_s884cudnn_fp16_64x64_sliced1x4_ldg8_wgrad_idx_exp_interior_nhwc_nt. Fields called out on the slide: architecture, identifier, library, precision, applied size.
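When scanning a long profiler trace, a small filter over kernel names can help. The sketch below is hypothetical: `describe_kernel` is not an NVIDIA tool, and it assumes that on Volta an `884` tag in the second name field (as in `s884cudnn` above) marks a Tensor Core kernel, which should be checked against NVIDIA's documentation for the GPU and library versions in use.

```python
def describe_kernel(name):
    """Split a profiler kernel name into a few of the fields called out on the slide."""
    arch, rest = name.split("_", 1)
    fields = rest.split("_")
    return {
        "architecture": arch,
        # Assumption: an "884" tag in the identifier/library field means Tensor Cores were used.
        "uses_tensor_cores": "884" in fields[0],
        "fp16": "fp16" in fields,
    }

info = describe_kernel(
    "volta_s884cudnn_fp16_64x64_sliced1x4_ldg8_wgrad_idx_exp_interior_nhwc_nt"
)
print(info)
```
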
  • 33. GPU INFERENCE ADOPTION IS ACCELERATING. Tesla P4, TensorRT adoption use cases: VISUAL SEARCH (60X latency improvement, real-time search); VIDEO ANALYSIS (12X faster inference, live video analysis); ADVERTISING (40X higher performance, real-time brand impact). INFERENCE USE CASES: Video, Maps, Image, NLP, Speech, Search.
  • 34. NVIDIA TensorRT 5: Inference Optimizer and Runtime (developer.nvidia.com/tensorrt). TensorRT Optimizer: Kernel Auto-Tuning, Layer & Tensor Fusion, Dynamic Tensor Memory, Precision Calibration. Data center, embedded & automotive. In-framework support for TensorFlow; support for all other frameworks and ONNX. TensorRT inference server microservice with Docker and Kubernetes integration. New layers and APIs. New OS support for Windows and CentOS. GPU PLATFORMS: DRIVE PX 2, JETSON TX2, NVIDIA DLA, TESLA P4/T4, TESLA V100. (*New in TRT5)
  • 35. BREAKTHROUGH RESNET-50 INFERENCE PERFORMANCE. Throughput (img/s): Tesla T4 4,365; Tesla V100 6,379. Energy efficiency (img/s/watt): Tesla T4 63; Tesla V100 22. Latency (milliseconds, lower is better): Tesla T4 1; Tesla V100 0.89. Footnotes in slide order: (1) GPU: Dual-Socket Xeon Gold 6140@3.6GHz with GPUs as shown | 18.11-py3 | TensorRT 5.0 | T4: INT8, V100: Mixed | Batch Size = 128; (2) GPU: Dual-Socket Xeon Gold 6140@3.6GHz with GPUs as shown | 18.11-py3 | TensorRT 5.0 | INT8 | Batch Size = 1; (3) GPU: Dual-Socket Xeon Gold 6140@3.6GHz with GPUs as shown | 18.11-py3 | TensorRT 5.0 | T4: INT8, V100: Mixed | Batch Size = 128.
  • 36. BREAKTHROUGH NMT INFERENCE PERFORMANCE. Throughput (tokens/s): Tesla T4 34,122; Tesla V100 67,124. Energy efficiency (tokens/s/watt): Tesla T4 763; Tesla V100 924. Latency (milliseconds, lower is better): Tesla T4 21; Tesla V100 14. Footnotes in slide order: (1) GPU: Dual-Socket Xeon E5-2698 v4@3.6GHz with GPU servers as shown | 18.11-py3 | TensorRT 5.0 | Mixed | Batch Size = 128; (2) GPU: Dual-Socket Xeon E5-2698 v4@3.6GHz with GPU servers as shown | 18.11-py3 | TensorRT 5.0 | Mixed | Batch Size = 1; (3) GPU: Dual-Socket Xeon E5-2698 v4@3.6GHz with GPU servers as shown | 18.11-py3 | TensorRT 5.0 | Mixed | Batch Size = 128.
  • 37. INFERENCE SERVER ARCHITECTURE. Available with monthly updates. Models supported: ● TensorFlow GraphDef/SavedModel ● TensorFlow and TensorRT GraphDef ● TensorRT Plans ● Caffe2 NetDef (ONNX import). Multi-GPU support. Concurrent model execution. Server HTTP REST API/gRPC. Python/C++ client libraries.
  • 38. FLEXIBLE MODEL DEPLOYMENT TO BALANCE CONVENIENCE WITH PERFORMANCE (axis: less conversion vs. more performance). Native Framework: • Minimal conversion • Least performant • Supports CPU and GPU. Framework + TensorRT Runtime: • Some conversion • Framework fallback for unsupported layers • TensorRT performance with FP16 and INT8 • Supports GPU. TensorRT Runtime: • More conversion • Most performant with low memory footprint • Precision control FP16 and INT8 • Supports GPU.
  • 39. GREAT PERFORMANCE FOR MULTIPLE MODEL DEPLOYMENTS OF RN50. RN50 with 50ms latency SLA across various deployments: • CPU: TensorFlow FP32 • GPU - V100 16GB: TensorFlow FP32 • GPU - V100 16GB: TensorRT FP16
  • 40. INFERENCE SERVER ARCHITECTURE: KUBEFLOW. Kubeflow guest blog: https://www.kubeflow.org/blog/nvidia_tensorrt/
  • 41. TENSORRT INFERENCE SERVER METRICS FOR AUTOSCALING (before). ● One model per GPU ● Requests are steady across all models ● Utilization is low on all GPUs ● Spike in requests for blue model ● GPUs running blue model are being fully utilized ● Other GPUs remain underutilized. Panel labels: Before TensorRT Inference Server - 5,000 FPS; Before TensorRT Inference Server - 800 FPS.
  • 42. TENSORRT INFERENCE SERVER METRICS FOR AUTOSCALING (after). ● Load multiple models on every GPU ● Load is evenly distributed between all GPUs ● Spike in requests for blue model ● Each GPU can run the blue model concurrently ● Metrics to indicate time to scale up ○ GPU utilization ○ Power usage ○ Inference count ○ Queue time ○ Number of requests/sec. Panel labels: After TensorRT Inference Server - 15,000 FPS; After TensorRT Inference Server - 5,000 FPS.
  • 44. THE BIG PROBLEM IN DATA SCIENCE: Slow Training Times for Data Scientists. Workflow labels on the slide: All Data, ETL, Manage Data, Structured Data Store, Data Preparation, Training, Model Training, Visualization, Evaluate, Inference, Deploy.
  • 45. RAPIDS: OPEN GPU DATA SCIENCE. Software stack: Data Preparation, Model Training, Visualization; RAPIDS (cuDF, cuML, cuGraph); Deep Learning Frameworks (cuDNN); Python, Dask, Apache Arrow, CUDA.
  • 46. DEPLOYING RAPIDS: FASTER SPEEDS, REAL WORLD BENEFITS. Fannie Mae mortgage dataset: fast loading with cuDF, cuML xgboost, GPU visualization. ETL and ML charts compare an NVIDIA DGX-2 with 20, 50, and 100 CPU nodes; time axes are in seconds, with 1 Hour and 2 Hours marked. https://www.youtube.com/watch?v=G1kx_7NJJGA&feature=youtu.be&t=4287
  • 48. A CONSISTENT, HYBRID CLOUD EXPERIENCE ACROSS COMPUTE PLATFORMS
  • 49. OPTIMIZED AND UP-TO-DATE: The Top Deep Learning Containers are Tuned and Optimized Monthly to Deliver Maximum Performance on NVIDIA GPUs. Containerized applications (TensorFlow, PyTorch, MXNet, TensorRT Inference Server, other frameworks and apps), each with tuned software and its own CUDA RT, run on the NVIDIA Container Runtime for Docker on top of the Linux kernel + CUDA driver.
  • 50. MONTHLY IMPROVEMENT: Over 12 months, up to 1.8X improvement with mixed-precision on ResNet-50 (CUDA 10 based TensorFlow). https://docs.nvidia.com/deeplearning/dgx/support-matrix/index.html