From my talk at the Data & AI Summit: the latest update on the PyTorch Profiler and how you can use it to optimize for efficiency. The talk also dives into the future and what we need to do together as an industry to move toward Sustainable AI.
Profiling PyTorch for Efficiency & Sustainability
1. PROFILING PYTORCH FOR EFFICIENCY & SUSTAINABILITY
NOV 17, 2021
GEETA CHAUHAN
PYTORCH PARTNER ENGINEERING
META AI
2. PyTorch Profiler Talk
AGENDA
01 GPU PERFORMANCE TUNING
02 PYTORCH PROFILER
03 TIMELINE TRACING
04 OPTIMIZATION EXAMPLES
05 FUTURE: SUSTAINABLE AI
4. GPU PERFORMANCE TUNING
A DIFFERENT MENTAL MODEL REQUIRED

CPU
• Optimized for single-thread performance
- Majority of chip area is control logic & caches
• Complex and deep out-of-order pipelines
- Extract instruction-level parallelism
• The brain
- Job is to keep the accelerator busy

GPU
• Optimized for throughput on data-parallel problems
- Majority of chip area is functional units
• Simple, relatively slow in-order pipelines
- Achieves much higher total throughput
• Accelerator attached via PCIe
- Order of magnitude faster, but off to the side
5. GPU PERFORMANCE TUNING
NVIDIA Volta V100 GPU
• Composed of Streaming Multiprocessors (SMs)
• Volta V100: 80 SMs
• Ampere A100: 108 SMs
• DGX A100 with 8 GPUs: 864 SMs vs 128 CPU cores
6. GPU PERFORMANCE TUNING
Streaming Multiprocessor
• 64x FP32 units
• 64x INT, 32x FP64, 32x LD/ST
• 8x Tensor Cores
5120 FP32 EXECUTION UNITS PER GPU (6912 ON A100)
7. GPU PERFORMANCE TUNING
Common Pitfalls: Visibility Is Key
• Excessive CPU/GPU interactions, e.g. a for loop launching GPU operations
- Dominated by launch overheads
• Short GPU kernel durations, e.g. small inputs
- Need enough data to feed tens of thousands of threads
• CPU overheads and I/O bottlenecks starving the GPU
- Small operations on the CPU can quickly become dominant
• Framework inefficiencies
- E.g. unnecessary copies and hidden CPU-side overheads
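The first pitfall is easy to reproduce. A minimal sketch (illustrative only, not from the talk): the Python loop below dispatches one tiny operation per element, while the vectorized form issues a single op over the whole tensor; on a GPU the looped version is dominated by launch overhead.

```python
import torch

x = torch.randn(1000)

# Anti-pattern: a Python loop dispatching one tiny op per element
out_slow = torch.empty_like(x)
for i in range(x.numel()):
    out_slow[i] = x[i] * 2.0  # one op dispatch (a kernel launch on GPU) per iteration

# Better: a single vectorized op over the whole tensor
out_fast = x * 2.0

assert torch.allclose(out_slow, out_fast)
```

Both produce identical results; only the dispatch pattern differs, and that difference is exactly what the profiler makes visible.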
8. PyTorch Profiler
With Integrated GPU Profiling Library
9. THE PYTORCH PROFILER
CONTRIBUTED BY MICROSOFT & FACEBOOK
• PyTorch and GPU level information
• Automatic bottleneck detection
• Actionable performance recommendations
• Data-scientist-friendly lifecycle and tools
• TensorBoard plugin: Chrome trace visualization
• OSS Kineto library, built on CUPTI
• Easy-to-use Python API
• VS Code integration
[Architecture diagram: the PyTorch process (Python / C++ / CUDA, aten operators) emits Python events and CPU operator traces to the PyTorch Profiler and queues GPU ops; libkineto, built on libCUPTI, collects CUDA activities from GPU 1…n via the NVIDIA driver and OS, and the traces feed the TensorBoard Profiler plugin.]
10. THE PYTORCH PROFILER
Profiling API: Base Usage
https://pytorch.org/tutorials/recipes/recipes/profiler.html

import torch
import torchvision.models as models
import torch.profiler as profiler

model = models.resnet18()
inputs = torch.randn(5, 3, 224, 224)

with profiler.profile(record_shapes=True) as prof:
    with profiler.record_function("model_inference"):
        model(inputs)

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
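Because record_shapes=True is set, the aggregated stats can also be grouped by input shape, which is handy for spotting which tensor sizes dominate. A small sketch, with a toy Linear model standing in for ResNet so the snippet runs without torchvision:

```python
import torch
import torch.profiler as profiler

model = torch.nn.Linear(128, 64)

with profiler.profile(record_shapes=True) as prof:
    model(torch.randn(4, 128))
    model(torch.randn(32, 128))  # same op, different input shape

# Group averages by operator *and* input shape
print(prof.key_averages(group_by_input_shape=True).table(
    sort_by="cpu_time_total", row_limit=10))
```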
11. THE PYTORCH PROFILER
Profiling API: TensorBoard Plugin

import torch
import torchvision.models as models
import torch.profiler as profiler

model = models.resnet18()
inputs = torch.randn(5, 3, 224, 224)

with profiler.profile(
    record_shapes=True,
    on_trace_ready=torch.profiler.tensorboard_trace_handler('results')
) as prof:
    model(inputs)

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
13. THE PYTORCH PROFILER
Advanced
• When to trigger
• How many steps to profile
• Which activities to profile
• Results callable handler
• Extra metadata, e.g. shapes, stacks, memory
• Output options, e.g. Chrome tracing, TensorBoard
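These knobs come together in torch.profiler.schedule plus the on_trace_ready callable. A sketch of a windowed profile over a training-style loop (toy model and illustrative step counts, not from the talk):

```python
import torch
import torch.profiler as profiler

model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU())
inputs = torch.randn(8, 64)

tables = []  # collect one summary table per completed profiling window

with profiler.profile(
    activities=[profiler.ProfilerActivity.CPU],       # which activities
    schedule=profiler.schedule(wait=1, warmup=1, active=2),  # when / how many steps
    on_trace_ready=lambda p: tables.append(           # results callable handler
        p.key_averages().table(sort_by="cpu_time_total", row_limit=5)),
) as prof:
    for _ in range(6):
        model(inputs)
        prof.step()  # advance the profiling schedule each iteration

print(tables[0])
```

Swapping the lambda for torch.profiler.tensorboard_trace_handler(...) writes the traces for the TensorBoard plugin instead.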
18. TIMELINE TRACES: CPU + GPU ACTIVITIES
19. TIMELINE TRACING
Chrome Trace Viewer: CPU and GPU timelines
20. TIMELINE TRACING
• Can be left in permanently, no perf overhead
21. TIMELINE TRACING
See how CPU and GPU ops are connected
22. TIMELINE TRACING
Inspect stats for individual activities
nvidia-smi shows 86% utilization, but only a fraction of SMs are actually used by these kernels!
23. TIMELINE TRACING
Inspect stats for individual activities
Looks much better after increasing input sizes
24. Trace Analysis
Examples from Meta workloads
Thanks to Lei Tian, Natalia Gimelshein, Lingyi Liu, Feng Shi & Zhicheng Yan for the examples
25. TRACE ANALYSIS
Anti-pattern: Long GPU idle time
Issue:
1. Large periods of GPU inactivity
2. Trace does not show why
Solution:
1. Use record_function to reveal bottlenecks on CPU
2. Parallelize CPU operations
3. Overlap CPU and GPU operations

temp = ""
num_substr = len(emb[k])
with record_function("## join_string {} ##".format(num_substr)):
    temp = ",".join(str(x) for x in emb[k])  # string concatenation
with record_function("## append_record_in_else ##"):
    records.append(f"{input_df.id[i + k]}\t{temp}\n")  # list append
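The custom record_function labels show up as first-class entries in the profiler output, so the CPU-side string work becomes visible in the trace. A self-contained sketch, with toy data standing in for the workload's emb/input_df:

```python
import torch
import torch.profiler as profiler
from torch.profiler import record_function

rows = [list(range(5)) for _ in range(4)]
records = []

with profiler.profile(activities=[profiler.ProfilerActivity.CPU]) as prof:
    for row in rows:
        with record_function("## join_string ##"):
            temp = ",".join(str(x) for x in row)
        with record_function("## append_record ##"):
            records.append(temp + "\n")

# The user annotations appear alongside the aten operators
names = {evt.key for evt in prof.key_averages()}
assert "## join_string ##" in names
```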
26. TRACE ANALYSIS
Anti-pattern: Excessive CPU/GPU interactions
First issue:
• Exponential moving average hook function has a loop: CPU bottleneck
• Can rewrite using torch._foreach ops: loop now on GPU
EMA HOOK 100X FASTER
ITERATION TIME: 860MS -> 770MS

BEFORE
def on_step(self, task) -> None:
    ...
    with torch.no_grad():
        it = model_state_iterator(task.base_model)
        # iterate on every name & param
        for name, param in it:
            s = self.state.ema_model_state
            s[name] = (self.decay * s[name]
                       + (1 - self.decay) * param.to(device=self.device))

AFTER
def on_step(self, task) -> None:
    ...
    with torch.no_grad():
        torch._foreach_mul_(
            self.ema_model_state_list, self.decay)
        torch._foreach_add_(
            self.ema_model_state_list,
            self.param_list,
            alpha=(1 - self.decay))
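The rewrite is safe because the fused _foreach ops compute exactly the same update as the per-parameter loop; they just do it in two batched calls instead of one op pair per tensor. A quick equivalence check on toy tensors (standing in for the task's model state):

```python
import torch

decay = 0.9
ema = [torch.randn(4) for _ in range(3)]     # stands in for ema_model_state_list
params = [torch.randn(4) for _ in range(3)]  # stands in for param_list

# Reference: the original per-tensor loop
expected = [decay * e + (1 - decay) * p for e, p in zip(ema, params)]

# Fused multi-tensor version, as in the AFTER snippet
torch._foreach_mul_(ema, decay)
torch._foreach_add_(ema, params, alpha=1 - decay)

assert all(torch.allclose(g, w) for g, w in zip(ema, expected))
```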
27. TRACE ANALYSIS
Anti-pattern: Excessive CPU/GPU interactions
Second issue:
• Optimizer step uses a naïve implementation of RMSProp
• PyTorch provides an optimized multi-tensor version, using torch._foreach
• Switch to the optimized version!
OPTIMIZER 12X FASTER
ITERATION TIME: 770MS -> 600MS

BEFORE
def prepare(self, param_groups):
    self.optimizer = RMSpropTFV2Optimizer(
        param_groups,
        …

AFTER
import torch.optim._multi_tensor as optim_mt

def prepare(self, param_groups):
    self.optimizer = optim_mt.RMSprop(
        param_groups,
        …
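torch.optim._multi_tensor was where these fused optimizers lived at the time of the talk; in recent PyTorch releases the same multi-tensor path is exposed through the foreach flag on the stock optimizers. A sketch with a hypothetical toy model:

```python
import torch

model = torch.nn.Linear(8, 4)

# Recent PyTorch: request the multi-tensor implementation directly
opt = torch.optim.RMSprop(model.parameters(), lr=1e-2, foreach=True)

before = [p.detach().clone() for p in model.parameters()]
loss = model(torch.randn(16, 8)).pow(2).mean()
loss.backward()
opt.step()  # one fused _foreach-based update over all parameters

changed = any(not torch.equal(b, p.detach())
              for b, p in zip(before, model.parameters()))
```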
28. TRACE ANALYSIS
Anti-pattern: Excessive CPU/GPU interactions
Third issue:
• Forward & backward pass dominated by SyncBatchNorm
• 84x SyncBatchNorm in fwd pass
• 3x ncclAllGather per SyncBatchNorm
• Another 2x ncclAllReduce per SyncBatchNorm in bwd pass
FORWARD PASS 1.5X FASTER
BACKWARD PASS 1.3X FASTER
ITERATION TIME: 600MS -> 450MS
[Trace screenshot annotations: 2.2 ms, 1.7 ms]
29. TRACE ANALYSIS
BERT Performance Optimization Case Study
From 2.4 req/s to 1,400+ req/s
• CPU inference
- torch.set_num_threads(1)
- Intel IPEX
- Quantization
• GPU inference on 1 T4 GPU
- model.half()
- DistilBERT
- Increase batch size
- Do not overpad
- Faster Transformer

Configuration                                  Throughput     P99
BERT unoptimized, bs=1                         70.67 seq/s    20.44 ms
BERT, model.half(), bs=8                       359 seq/s      23.58 ms
DistilBERT, model.half(), bs=16                689 seq/s      22.8 ms
BERT, Faster Transformer                       885 seq/s      19.83 ms
DistilBERT, no padding, model.half(), bs=32    1423 seq/s     19.7 ms
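One of the CPU-side levers above, quantization, can be sketched with PyTorch's dynamic quantization API. This is a toy two-layer model standing in for BERT (illustrative only): Linear weights are stored as int8 and activations are quantized on the fly.

```python
import torch

torch.set_num_threads(1)  # one thread per worker, as in the CPU serving setup above

model = torch.nn.Sequential(
    torch.nn.Linear(128, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 10),
).eval()

# Dynamic quantization: replace Linear layers with int8-weight equivalents
qmodel = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8)

out = qmodel(torch.randn(1, 128))
assert out.shape == (1, 10)
```

Profiling before and after a change like this is how the table above was built up, one lever at a time.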
32. MODEL DEPLOYMENT PHASES: POWER CONSUMPTION
33. OPTIMIZATIONS FOR CARBON FOOTPRINT OF LM
• Platform-level caching: 6.7x improvement
• GPU acceleration: unlocks 10.1x energy efficiency
• Algorithmic optimizations: 10x improvement
34. SUSTAINABILITY MINDSET
1. Data Utilization Efficiency: data scaling & sampling, data perishability
2. Experimentation and Training Efficiency: NAS, HPO, multi-objective optimizations, resource-efficient architectures
3. Efficient, Environmentally Scalable Infrastructure: carbon-efficient scheduling, on-device learning, …
4. Develop easy-to-adopt telemetry: measure and publish, carbon impact statements & model cards
https://arxiv.org/pdf/2111.00364.pdf
Source: https://docs.cohere.ai/environmental-impact
35. REFERENCES
• What's new in PyTorch Profiler 1.9: https://pytorch.org/blog/pytorch-profiler-1.9-released/
• Introducing PyTorch Profiler: https://pytorch.org/blog/introducing-pytorch-profiler-the-new-and-improved-performance-tool/
• Profiler: https://pytorch.org/docs/stable/profiler.html
• Profiler recipes: https://pytorch.org/tutorials/recipes/recipes/profiler.html
• VS Code TensorBoard support: https://devblogs.microsoft.com/python/python-in-visual-studio-code-february-2021-release/
• PyTorch Profiler talk, Profiling PyTorch Models for NVIDIA GPUs: https://gtc21.event.nvidia.com/media/Profiling%20PyTorch%20Models%20for%20NVIDIA%20GPUs%20%5BS31644%5D/1_nuwnw731
• Optimizing PyTorch performance: batch size with PyTorch Profiler: https://opendatascience.com/optimizing-pytorch-performance-batch-size-with-pytorch-profiler/
• Kubeflow PyTorch samples: https://github.com/kubeflow/pipelines/tree/master/samples/contrib/pytorch-samples
• PyTorch Lightning profiler example: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pl_examples/basic_examples/profiler_example.py
• Sustainable AI paper: https://arxiv.org/pdf/2111.00364.pdf
• Cohere.ai environmental impact model cards: https://docs.cohere.ai/environmental-impact