PROFILING PYTORCH FOR EFFICIENCY & SUSTAINABILITY
Nov 17, 2021
Geeta Chauhan
PyTorch Partner Engineering
Meta AI
PyTorch Profiler Talk
AGENDA
01 GPU Performance Tuning
02 PyTorch Profiler
03 Timeline Tracing
04 Optimization Examples
05 Future: Sustainable AI
GPU Performance Tuning
GPU PERFORMANCE TUNING: A DIFFERENT MENTAL MODEL REQUIRED
CPU
Optimized for single thread performance
- Majority of chip area is control logic & caches
Complex and deep out-of-order pipelines
- Extract instruction level parallelism
The brain
- Job is to keep the accelerator busy
GPU
Optimized for throughput of data-parallel problems
- Majority of chip area is functional units
Simple, relatively slow in-order pipelines
- Achieves much higher total throughput
Accelerator attached via PCIe
- Order of magnitude faster but off to the side
NVIDIA Volta V100 GPU
Composed of Streaming Multiprocessors (SMs)
Volta V100: 80 SMs
Ampere A100: 108 SMs
DGX A100 with 8 GPUs: 864 SMs vs 128 CPU cores
Streaming Multiprocessor
64x FP32 units
64x INT, 32x FP64, 32x LD/ST
8x Tensor Cores
FP32 execution units per GPU: 5120 (6912 on A100)
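The per-GPU totals follow from multiplying the SM count by the FP32 units per SM. The short sketch below redoes that arithmetic with the figures quoted on these slides; it assumes an A100 SM also carries 64 FP32 units, which the slide states only for the SM it depicts.

```python
# Figures quoted on the slides; the A100 per-SM count is an assumption.
FP32_UNITS_PER_SM = 64
SMS_PER_GPU = {"V100": 80, "A100": 108}
GPUS_PER_DGX_A100 = 8


def fp32_units(gpu: str) -> int:
    """FP32 execution units per GPU = SMs per GPU x FP32 units per SM."""
    return SMS_PER_GPU[gpu] * FP32_UNITS_PER_SM


v100_units = fp32_units("V100")
a100_units = fp32_units("A100")
dgx_sms = GPUS_PER_DGX_A100 * SMS_PER_GPU["A100"]

print(f"V100 FP32 units: {v100_units}")
print(f"A100 FP32 units: {a100_units}")
print(f"SMs in a DGX A100: {dgx_sms}")
```

With these inputs the A100 product comes to 6912 and the DGX A100 total to 864 SMs.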
Common Pitfalls
• Excessive CPU/GPU interactions – e.g. for loop launching GPU operations
- Dominated by launch overheads
• Short GPU kernel durations – e.g. small inputs
- Need enough data to feed 10s of thousands of threads
• CPU overheads and I/O bottlenecks are starving the GPU
- Small operations on the CPU can quickly become dominant
• Framework inefficiencies
- E.g. unnecessary copies and hidden CPU-side overheads
VISIBILITY IS KEY
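As a rough illustration of the first pitfall, the sketch below contrasts a Python loop that issues a few tiny operations per row with a single broadcasted operation that covers every row. The function names and tensor sizes are made up for the example, and it falls back to the CPU when no GPU is present.

```python
import torch


def scale_rows_loop(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Anti-pattern: several tiny ops (and kernel launches on a GPU) per row."""
    out = torch.empty_like(x)
    for i in range(x.shape[0]):
        out[i] = x[i] * w[i]
    return out


def scale_rows_vectorized(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """One broadcasted multiply over all rows at once."""
    return x * w.unsqueeze(1)


device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1024, 64, device=device)
w = torch.randn(1024, device=device)

# Both versions compute the same values; only the number of launches differs.
same = torch.allclose(scale_rows_loop(x, w), scale_rows_vectorized(x, w))
print(same)
```

Wrapping each variant in the profiler should show the loop version dominated by many short operations, which is the pattern the slide warns about.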
PyTorch Profiler
With Integrated GPU Profiling Library
CONTRIBUTED BY MICROSOFT & FACEBOOK
• PyTorch and GPU level information
• Automatic bottleneck detection
• Actionable performance recommendations
• Data Scientist friendly lifecycle and tools
• TensorBoard Plugin - chrome traces visualization
• OSS Kineto library - built on CUPTI
• Easy-to-use python API
• VS Code integration
[Figure: architecture diagram. Components shown: PyTorch process (aten operators; Python, C++, CUDA), PyTorch Profiler, libkineto, libCUPTI, NVIDIA driver, OS, GPU 1 … GPU n, TensorBoard Profiler Plugin. Flows shown: Python events, CPU operators, queued GPU ops, CUDA activities, traces.]
THE PYTORCH PROFILER
Profiling API: Base Usage
https://pytorch.org/tutorials/recipes/recipes/profiler.html
    import torch
    import torchvision.models as models
    import torch.profiler as profiler

    model = models.resnet18()
    inputs = torch.randn(5, 3, 224, 224)

    with profiler.profile(record_shapes=True) as prof:
        with profiler.record_function("model_inference"):
            model(inputs)

    print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
Profiling API: TensorBoard Plugin
    import torch
    import torchvision.models as models
    import torch.profiler as profiler

    model = models.resnet18()
    inputs = torch.randn(5, 3, 224, 224)

    with profiler.profile(
        record_shapes=True,
        on_trace_ready=torch.profiler.tensorboard_trace_handler('results')
    ) as prof:
        model(inputs)

    print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
Advanced
• When to trigger
• How many steps to profile
• Which activities to profile
• Results callable handler
• Extra metadata, e.g. shapes, stacks, memory
• Output options, e.g. Chrome tracing, TensorBoard
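The options listed above map onto arguments of `torch.profiler.profile`. The following is a minimal sketch rather than a recommended configuration: the tiny `Linear` model, the temporary output directory, and the schedule values are placeholders chosen so the example runs quickly on a CPU-only machine.

```python
import os
import tempfile

import torch
from torch.profiler import ProfilerActivity, profile, schedule

trace_dir = tempfile.mkdtemp()  # placeholder output location
trace_files = []


def on_trace_ready(prof):
    """Results handler: write one Chrome trace per completed profiling cycle."""
    path = os.path.join(trace_dir, f"trace_step{prof.step_num}.json")
    prof.export_chrome_trace(path)
    trace_files.append(path)


# Which activities to profile: always the CPU, plus CUDA when available.
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

model = torch.nn.Linear(128, 64)  # stand-in for a real model
inputs = torch.randn(32, 128)

with profile(
    activities=activities,
    # When to trigger and how many steps: skip 1, warm up 1, record 2, once.
    schedule=schedule(wait=1, warmup=1, active=2, repeat=1),
    on_trace_ready=on_trace_ready,
    record_shapes=True,    # extra metadata: input shapes
    profile_memory=True,   # extra metadata: memory usage
) as prof:
    for _ in range(6):
        model(inputs)
        prof.step()  # tell the profiler that one step has finished

print(trace_files)
```

Swapping the custom handler for `torch.profiler.tensorboard_trace_handler(...)` sends the same data to the TensorBoard plugin instead of a standalone Chrome trace file.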
DISTRIBUTED TRAINING VIEW
VS CODE DATA WRANGLER
Timeline Tracing
TIMELINE TRACES: CPU + GPU ACTIVITIES
Chrome Trace Viewer: CPU and GPU timelines
• Can leave in permanently, no perf overhead
See how CPU and GPU ops are connected
Inspect stats for individual activities
nvidia-smi shows 86% utilization, but only a fraction of SMs are actually used by these kernels!
Inspect stats for individual activities
Looks much better after increasing input sizes
Trace Analysis
Examples from Meta workloads
Thanks to Lei Tian, Natalia Gimelshein, Lingyi Liu, Feng Shi & Zhicheng Yan for the examples
Anti-pattern: Long GPU idle time
Issue:
1. Large periods of GPU inactivity
2. Trace does not show why
Solution:
1. Use record_function to reveal bottlenecks on CPU
2. Parallelize CPU operations
3. Overlap CPU and GPU operations
    from torch.profiler import record_function

    temp = ""
    num_substr = len(emb[k])
    # Label the string join so it shows up as a named range in the trace
    with record_function("## join_string {} ##".format(num_substr)):
        temp = ",".join(str(x) for x in emb[k])  # string concatenation
    # Label the list append separately to see which step dominates
    with record_function("## append_record_in_else ##"):
        records.append(f"{input_df.id[i + k]}\t{temp}\n")  # list append
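A self-contained variant of the `record_function` labelling idea is sketched below. The data is made up and the labels are simplified; the point is only that CPU-side Python work wrapped in `record_function` appears as a named row in the profiler output, where it would otherwise be invisible.

```python
from torch.profiler import ProfilerActivity, profile, record_function

# Made-up stand-in for the embedding rows being serialized.
rows = [[float(i + j) for j in range(256)] for i in range(200)]
records = []

with profile(activities=[ProfilerActivity.CPU]) as prof:
    for row in rows:
        with record_function("## join_string ##"):
            joined = ",".join(str(x) for x in row)
        with record_function("## append_record ##"):
            records.append(f"{joined}\n")

# Each label becomes a key in the aggregated statistics.
labels = {evt.key for evt in prof.key_averages()}
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```

Comparing the totals for the two labels shows which CPU step is worth parallelizing or overlapping with GPU work.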
Anti-pattern: Excessive CPU/GPU interactions
First issue:
• Exponential moving avg hook function has a loop – CPU bottleneck
• Can rewrite using torch._foreach ops – loop now on GPU
BEFORE
    def on_step(self, task) -> None:
        ...
        with torch.no_grad():
            it = model_state_iterator(task.base_model)
            # iterate on every name & param
            for name, param in it:
                s = self.state.ema_model_state
                s[name] = (self.decay * s[name]
                           + (1 - self.decay) * param.to(device=self.device))
AFTER
    def on_step(self, task) -> None:
        ...
        with torch.no_grad():
            # scale every EMA tensor by decay in one multi-tensor call
            torch._foreach_mul_(self.ema_model_state_list, self.decay)
            # add (1 - decay) * param to every EMA tensor in one call
            torch._foreach_add_(
                self.ema_model_state_list,
                self.param_list,
                alpha=(1 - self.decay))
EMA HOOK 100X FASTER
ITERATION TIME: 860MS -> 770MS
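To check that the `torch._foreach` form computes the same exponential moving average as the per-tensor loop, a small standalone comparison can be run. The tensors and decay value here are arbitrary, and `torch._foreach_*` functions are underscore-prefixed, so their stability across releases is an assumption.

```python
import torch

decay = 0.9
params = [torch.randn(4, 4), torch.randn(8)]

# Two identical copies of the EMA state, one per implementation.
ema_loop = [torch.ones_like(p) for p in params]
ema_foreach = [torch.ones_like(p) for p in params]

with torch.no_grad():
    # Reference: one Python-level update per tensor.
    for i, p in enumerate(params):
        ema_loop[i] = decay * ema_loop[i] + (1 - decay) * p

    # Multi-tensor: two in-place calls over the whole lists.
    torch._foreach_mul_(ema_foreach, decay)
    torch._foreach_add_(ema_foreach, params, alpha=1 - decay)

match = all(torch.allclose(a, b) for a, b in zip(ema_loop, ema_foreach))
print(match)
```

The gain reported on the slide comes from replacing many small per-parameter operations with a couple of list-wide calls, not from any change in the arithmetic.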
Anti-pattern: Excessive CPU/GPU interactions
Second issue:
• Optimizer step uses a naïve implementation of RMSProp
• PyTorch provides an optimized multi-tensor version – using torch._foreach
• Switch to optimized version!
BEFORE
    def prepare(self, param_groups):
        self.optimizer = RMSpropTFV2Optimizer(
            param_groups,
            …
AFTER
    import torch.optim._multi_tensor as optim_mt

    def prepare(self, param_groups):
        self.optimizer = optim_mt.RMSprop(
            param_groups,
            …
OPTIMIZER 12X FASTER
ITERATION TIME: 770MS -> 600MS
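The `torch.optim._multi_tensor` import reflects the PyTorch releases current when this talk was given. In more recent releases the multi-tensor code path is, as far as I know, selected with the `foreach` argument on the regular optimizer classes instead. The sketch below assumes such a release; the model and learning rate are placeholders.

```python
import torch

model = torch.nn.Linear(16, 4)  # stand-in for a real model
before = [p.detach().clone() for p in model.parameters()]

# Assumption: a PyTorch release whose RMSprop constructor accepts `foreach`.
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-2, foreach=True)

loss = model(torch.randn(8, 16)).pow(2).mean()
loss.backward()
optimizer.step()  # parameters are updated through multi-tensor ops

changed = any(
    not torch.equal(b, p.detach()) for b, p in zip(before, model.parameters())
)
print(changed)
```

Note that this swaps in the stock `torch.optim.RMSprop`, which is not claimed to be numerically identical to the custom `RMSpropTFV2Optimizer` named on the slide.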
Anti-pattern: Excessive CPU/GPU interactions
Third issue:
• Forward & backward pass dominated by SyncBatchNorm
• 84x SyncBatchNorm in fwd pass
• 3x ncclAllGather per SyncBatchNorm
• Another 2x ncclAllReduce per SyncBatchNorm in bwd pass
FORWARD PASS 1.5X FASTER
BACKWARD PASS 1.3X FASTER
ITERATION TIME: 600MS -> 450MS
[Figure labels: 2.2ms, 1.7ms]
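The three iteration-time figures quoted across these trace-analysis slides can be chained to see the cumulative effect. The sketch below only redoes that arithmetic; the stage labels are shorthand added for readability.

```python
# Iteration times in milliseconds, as quoted on the slides.
iteration_ms = [
    ("baseline", 860),
    ("after first issue (EMA hook)", 770),
    ("after second issue (optimizer)", 600),
    ("after third issue (SyncBatchNorm)", 450),
]

baseline = iteration_ms[0][1]
speedups = []
prev = baseline
for label, ms in iteration_ms[1:]:
    step_gain = prev / ms       # speedup relative to the previous stage
    total_gain = baseline / ms  # speedup relative to the starting point
    speedups.append((label, step_gain, total_gain))
    print(f"{label}: {ms} ms, step x{step_gain:.2f}, total x{total_gain:.2f}")
    prev = ms
```

Taken together, the three fixes bring the iteration time from 860 ms to 450 ms, a bit under a 2x overall speedup.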
BERT PERFORMANCE OPTIMIZATION CASE STUDY
• From 2.4 req/s to 1,400+ req/s
• CPU inference
  - torch.set_num_threads(1)
  - Intel IPEX
  - Quantization
• GPU inference on 1 T4 GPU
  - model.half()
  - DistilBERT
  - Increase batch size
  - Do not overpad
  - Faster Transformer

Configuration                              Throughput     P99
BERT unoptimized, bs=1                     70.67 seq/s    20.44 ms
BERT, model.half(), bs=8                   359 seq/s      23.58 ms
DistilBERT, model.half(), bs=16            689 seq/s      22.8 ms
BERT, Faster Transformer                   885 seq/s      19.83 ms
DistilBERT, no padding, model.half(), bs=32  1423 seq/s   19.7 ms
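One common reading of "do not overpad" is to pad each batch only to its longest sequence instead of a fixed maximum length. The slides do not say how the padding was removed in this case study, so the sketch below is an illustration of that general idea with made-up token ids and a hypothetical fixed length of 128.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

MAX_LEN = 128  # hypothetical fixed maximum length

# Made-up token-id sequences of different lengths.
seqs = [torch.arange(1, n + 1) for n in (5, 9, 12, 7)]


def pad_fixed(seqs, max_len=MAX_LEN, pad_id=0):
    """Pad every sequence to the same fixed length."""
    out = torch.full((len(seqs), max_len), pad_id, dtype=torch.long)
    for i, s in enumerate(seqs):
        out[i, : len(s)] = s
    return out


def pad_dynamic(seqs, pad_id=0):
    """Pad only up to the longest sequence in this batch."""
    return pad_sequence(seqs, batch_first=True, padding_value=pad_id)


fixed = pad_fixed(seqs)
dynamic = pad_dynamic(seqs)
real_tokens = sum(len(s) for s in seqs)

print(fixed.shape, dynamic.shape)
print(f"useful fraction, fixed:   {real_tokens / fixed.numel():.2f}")
print(f"useful fraction, dynamic: {real_tokens / dynamic.numel():.2f}")
```

The smaller the padded tensor, the less compute is spent on pad tokens, which is consistent with the throughput gain in the last table row.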
FUTURE
Sustainable AI
AI MODEL GROWTH
MODEL DEPLOYMENT PHASES – POWER CONSUMPTION
OPTIMIZATIONS FOR CARBON FOOTPRINT OF LM
• Platform level caching – 6.7x improvements
• GPU Acceleration – unlocks 10.1x energy efficiency
• Algorithmic Optimizations – 10x improvements
SUSTAINABILITY MINDSET
1. Data Utilization Efficiency: Data Scaling & Sampling, Data perishability
2. Experimentation and Training Efficiency: NAS, HPO, Multi-Objective Optimizations, Resource Efficient Architectures
3. Efficient Environment Scalable Infrastructure: Carbon efficient scheduling, On-device Learning, …
4. Develop easy to adopt Telemetry: Measure and publish, Carbon impact statement & model cards
https://arxiv.org/pdf/2111.00364.pdf
Source: https://docs.cohere.ai/environmental-impact
REFERENCES
• What’s new in PyTorch Profiler 1.9: https://pytorch.org/blog/pytorch-profiler-1.9-released/
• Introducing PyTorch Profiler: https://pytorch.org/blog/introducing-pytorch-profiler-the-new-and-improved-performance-tool/
• Profiler: https://pytorch.org/docs/stable/profiler.html
• Profiler Recipes: https://pytorch.org/tutorials/recipes/recipes/profiler.html
• VSCode TensorBoard support: https://devblogs.microsoft.com/python/python-in-visual-studio-code-february-2021-release/
• PyTorch Profiler Talk – PROFILING PYTORCH MODELS FOR NVIDIA GPUS: https://gtc21.event.nvidia.com/media/Profiling%20PyTorch%20Models%20for%20NVIDIA%20GPUs%20%5BS31644%5D/1_nuwnw731
• Optimizing PyTorch Performance batch size with PyTorch Profiler: https://opendatascience.com/optimizing-pytorch-performance-batch-size-with-pytorch-profiler/
• Kubeflow PyTorch Samples: https://github.com/kubeflow/pipelines/tree/master/samples/contrib/pytorch-samples
• PyTorch Lightning Profiler example: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pl_examples/basic_examples/profiler_example.py
• Sustainable AI Paper: https://arxiv.org/pdf/2111.00364.pdf
• Cohere.ai Environmental Impact model cards: https://docs.cohere.ai/environmental-impact
Questions?
Contact:
Email: gchauhan@fb.com
Linkedin: https://www.linkedin.com/in/geetachauhan/
Thank You

Más contenido relacionado

La actualidad más candente

Reinventing Deep Learning
 with Hugging Face Transformers
Reinventing Deep Learning
 with Hugging Face TransformersReinventing Deep Learning
 with Hugging Face Transformers
Reinventing Deep Learning
 with Hugging Face TransformersJulien SIMON
 
A Peek into Google's Edge TPU
A Peek into Google's Edge TPUA Peek into Google's Edge TPU
A Peek into Google's Edge TPUKoan-Sin Tan
 
Deep Learning Workflows: Training and Inference
Deep Learning Workflows: Training and InferenceDeep Learning Workflows: Training and Inference
Deep Learning Workflows: Training and InferenceNVIDIA
 
TensorFlow Tutorial | Deep Learning With TensorFlow | TensorFlow Tutorial For...
TensorFlow Tutorial | Deep Learning With TensorFlow | TensorFlow Tutorial For...TensorFlow Tutorial | Deep Learning With TensorFlow | TensorFlow Tutorial For...
TensorFlow Tutorial | Deep Learning With TensorFlow | TensorFlow Tutorial For...Simplilearn
 
“Practical DNN Quantization Techniques and Tools,” a Presentation from Facebook
“Practical DNN Quantization Techniques and Tools,” a Presentation from Facebook“Practical DNN Quantization Techniques and Tools,” a Presentation from Facebook
“Practical DNN Quantization Techniques and Tools,” a Presentation from FacebookEdge AI and Vision Alliance
 
Squeeze Excitation Networks, The simple idea that won the final ImageNet Chal...
Squeeze Excitation Networks, The simple idea that won the final ImageNet Chal...Squeeze Excitation Networks, The simple idea that won the final ImageNet Chal...
Squeeze Excitation Networks, The simple idea that won the final ImageNet Chal...Joonhyung Lee
 
Cutting edge hyperparameter tuning made simple with ray tune
Cutting edge hyperparameter tuning made simple with ray tuneCutting edge hyperparameter tuning made simple with ray tune
Cutting edge hyperparameter tuning made simple with ray tuneXiaoweiJiang7
 
AI Hardware Landscape 2021
AI Hardware Landscape 2021AI Hardware Landscape 2021
AI Hardware Landscape 2021Grigory Sapunov
 
Introduction to Deep Learning (NVIDIA)
Introduction to Deep Learning (NVIDIA)Introduction to Deep Learning (NVIDIA)
Introduction to Deep Learning (NVIDIA)Rakuten Group, Inc.
 
Introduction to TensorFlow 2.0
Introduction to TensorFlow 2.0Introduction to TensorFlow 2.0
Introduction to TensorFlow 2.0Databricks
 
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...Databricks
 
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + FugueIntuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + FugueDatabricks
 
Apache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In PracticeApache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In PracticeDremio Corporation
 
Simplifying AI Infrastructure: Lessons in Scaling on DGX Systems
Simplifying AI Infrastructure: Lessons in Scaling on DGX SystemsSimplifying AI Infrastructure: Lessons in Scaling on DGX Systems
Simplifying AI Infrastructure: Lessons in Scaling on DGX SystemsRenee Yao
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature EngineeringSri Ambati
 
Graph Neural Network - Introduction
Graph Neural Network - IntroductionGraph Neural Network - Introduction
Graph Neural Network - IntroductionJungwon Kim
 
Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020Julien Le Dem
 
Introduction to Chainer 11 may,2018
Introduction to Chainer 11 may,2018Introduction to Chainer 11 may,2018
Introduction to Chainer 11 may,2018Preferred Networks
 

La actualidad más candente (20)

Reinventing Deep Learning
 with Hugging Face Transformers
Reinventing Deep Learning
 with Hugging Face TransformersReinventing Deep Learning
 with Hugging Face Transformers
Reinventing Deep Learning
 with Hugging Face Transformers
 
A Peek into Google's Edge TPU
A Peek into Google's Edge TPUA Peek into Google's Edge TPU
A Peek into Google's Edge TPU
 
Serving models using KFServing
Serving models using KFServingServing models using KFServing
Serving models using KFServing
 
Deep Learning Workflows: Training and Inference
Deep Learning Workflows: Training and InferenceDeep Learning Workflows: Training and Inference
Deep Learning Workflows: Training and Inference
 
TensorFlow Tutorial | Deep Learning With TensorFlow | TensorFlow Tutorial For...
TensorFlow Tutorial | Deep Learning With TensorFlow | TensorFlow Tutorial For...TensorFlow Tutorial | Deep Learning With TensorFlow | TensorFlow Tutorial For...
TensorFlow Tutorial | Deep Learning With TensorFlow | TensorFlow Tutorial For...
 
“Practical DNN Quantization Techniques and Tools,” a Presentation from Facebook
“Practical DNN Quantization Techniques and Tools,” a Presentation from Facebook“Practical DNN Quantization Techniques and Tools,” a Presentation from Facebook
“Practical DNN Quantization Techniques and Tools,” a Presentation from Facebook
 
Squeeze Excitation Networks, The simple idea that won the final ImageNet Chal...
Squeeze Excitation Networks, The simple idea that won the final ImageNet Chal...Squeeze Excitation Networks, The simple idea that won the final ImageNet Chal...
Squeeze Excitation Networks, The simple idea that won the final ImageNet Chal...
 
Cutting edge hyperparameter tuning made simple with ray tune
Cutting edge hyperparameter tuning made simple with ray tuneCutting edge hyperparameter tuning made simple with ray tune
Cutting edge hyperparameter tuning made simple with ray tune
 
AI Hardware Landscape 2021
AI Hardware Landscape 2021AI Hardware Landscape 2021
AI Hardware Landscape 2021
 
Introduction to Deep Learning (NVIDIA)
Introduction to Deep Learning (NVIDIA)Introduction to Deep Learning (NVIDIA)
Introduction to Deep Learning (NVIDIA)
 
Introduction to TensorFlow 2.0
Introduction to TensorFlow 2.0Introduction to TensorFlow 2.0
Introduction to TensorFlow 2.0
 
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
 
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + FugueIntuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue
 
Apache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In PracticeApache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In Practice
 
Simplifying AI Infrastructure: Lessons in Scaling on DGX Systems
Simplifying AI Infrastructure: Lessons in Scaling on DGX SystemsSimplifying AI Infrastructure: Lessons in Scaling on DGX Systems
Simplifying AI Infrastructure: Lessons in Scaling on DGX Systems
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature Engineering
 
Graph Neural Network - Introduction
Graph Neural Network - IntroductionGraph Neural Network - Introduction
Graph Neural Network - Introduction
 
Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020
 
Introduction to Chainer 11 may,2018
Introduction to Chainer 11 may,2018Introduction to Chainer 11 may,2018
Introduction to Chainer 11 may,2018
 
Apache Nifi Crash Course
Apache Nifi Crash CourseApache Nifi Crash Course
Apache Nifi Crash Course
 

Similar a Profiling PyTorch for Efficiency & Sustainability

Introduction to Java Profiling
Introduction to Java ProfilingIntroduction to Java Profiling
Introduction to Java ProfilingJerry Yoakum
 
Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...
Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...
Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...Intel® Software
 
Scaling Up AI Research to Production with PyTorch and MLFlow
Scaling Up AI Research to Production with PyTorch and MLFlowScaling Up AI Research to Production with PyTorch and MLFlow
Scaling Up AI Research to Production with PyTorch and MLFlowDatabricks
 
Reproducible AI using MLflow and PyTorch
Reproducible AI using MLflow and PyTorchReproducible AI using MLflow and PyTorch
Reproducible AI using MLflow and PyTorchDatabricks
 
GPU Computing for Data Science
GPU Computing for Data Science GPU Computing for Data Science
GPU Computing for Data Science Domino Data Lab
 
GPU profiling for computer vision applications
GPU profiling for computer vision applicationsGPU profiling for computer vision applications
GPU profiling for computer vision applicationsMai Nishimura
 
May2010 hex-core-opt
May2010 hex-core-optMay2010 hex-core-opt
May2010 hex-core-optJeff Larkin
 
Profiling your Applications using the Linux Perf Tools
Profiling your Applications using the Linux Perf ToolsProfiling your Applications using the Linux Perf Tools
Profiling your Applications using the Linux Perf ToolsemBO_Conference
 
Scaling AI in production using PyTorch
Scaling AI in production using PyTorchScaling AI in production using PyTorch
Scaling AI in production using PyTorchgeetachauhan
 
Pragmatic Optimization in Modern Programming - Ordering Optimization Approaches
Pragmatic Optimization in Modern Programming - Ordering Optimization ApproachesPragmatic Optimization in Modern Programming - Ordering Optimization Approaches
Pragmatic Optimization in Modern Programming - Ordering Optimization ApproachesMarina Kolpakova
 
Linux Systems Performance 2016
Linux Systems Performance 2016Linux Systems Performance 2016
Linux Systems Performance 2016Brendan Gregg
 
Continuous Go Profiling & Observability
Continuous Go Profiling & ObservabilityContinuous Go Profiling & Observability
Continuous Go Profiling & ObservabilityScyllaDB
 
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdfSteve Caron
 
OXiGen: Automated FPGA design flow from C applications to dataflow kernels - ...
OXiGen: Automated FPGA design flow from C applications to dataflow kernels - ...OXiGen: Automated FPGA design flow from C applications to dataflow kernels - ...
OXiGen: Automated FPGA design flow from C applications to dataflow kernels - ...NECST Lab @ Politecnico di Milano
 
State-Of-The Art Machine Learning Algorithms and How They Are Affected By Nea...
State-Of-The Art Machine Learning Algorithms and How They Are Affected By Nea...State-Of-The Art Machine Learning Algorithms and How They Are Affected By Nea...
State-Of-The Art Machine Learning Algorithms and How They Are Affected By Nea...inside-BigData.com
 
1032 cs208 g operation system ip camera case share.v0.2
1032 cs208 g operation system ip camera case share.v0.21032 cs208 g operation system ip camera case share.v0.2
1032 cs208 g operation system ip camera case share.v0.2Stanley Ho
 
Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLinside-BigData.com
 
Code GPU with CUDA - Identifying performance limiters
Code GPU with CUDA - Identifying performance limitersCode GPU with CUDA - Identifying performance limiters
Code GPU with CUDA - Identifying performance limitersMarina Kolpakova
 

Similar a Profiling PyTorch for Efficiency & Sustainability (20)

Introduction to Java Profiling
Introduction to Java ProfilingIntroduction to Java Profiling
Introduction to Java Profiling
 
Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...
Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...
Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...
 
Scaling Up AI Research to Production with PyTorch and MLFlow
Scaling Up AI Research to Production with PyTorch and MLFlowScaling Up AI Research to Production with PyTorch and MLFlow
Scaling Up AI Research to Production with PyTorch and MLFlow
 
Reproducible AI using MLflow and PyTorch
Reproducible AI using MLflow and PyTorchReproducible AI using MLflow and PyTorch
Reproducible AI using MLflow and PyTorch
 
GPU Computing for Data Science
GPU Computing for Data Science GPU Computing for Data Science
GPU Computing for Data Science
 
GPU profiling for computer vision applications
GPU profiling for computer vision applicationsGPU profiling for computer vision applications
GPU profiling for computer vision applications
 
May2010 hex-core-opt
May2010 hex-core-optMay2010 hex-core-opt
May2010 hex-core-opt
 
Profiling your Applications using the Linux Perf Tools
Profiling your Applications using the Linux Perf ToolsProfiling your Applications using the Linux Perf Tools
Profiling your Applications using the Linux Perf Tools
 
Scaling AI in production using PyTorch
Scaling AI in production using PyTorchScaling AI in production using PyTorch
Scaling AI in production using PyTorch
 
Pragmatic Optimization in Modern Programming - Ordering Optimization Approaches
Pragmatic Optimization in Modern Programming - Ordering Optimization ApproachesPragmatic Optimization in Modern Programming - Ordering Optimization Approaches
Pragmatic Optimization in Modern Programming - Ordering Optimization Approaches
 
Linux Systems Performance 2016
Linux Systems Performance 2016Linux Systems Performance 2016
Linux Systems Performance 2016
 
Continuous Go Profiling & Observability
Continuous Go Profiling & ObservabilityContinuous Go Profiling & Observability
Continuous Go Profiling & Observability
 
Dpdk applications
Dpdk applicationsDpdk applications
Dpdk applications
 
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
 
OXiGen: Automated FPGA design flow from C applications to dataflow kernels - ...
OXiGen: Automated FPGA design flow from C applications to dataflow kernels - ...OXiGen: Automated FPGA design flow from C applications to dataflow kernels - ...
OXiGen: Automated FPGA design flow from C applications to dataflow kernels - ...
 
State-Of-The Art Machine Learning Algorithms and How They Are Affected By Nea...
State-Of-The Art Machine Learning Algorithms and How They Are Affected By Nea...State-Of-The Art Machine Learning Algorithms and How They Are Affected By Nea...
State-Of-The Art Machine Learning Algorithms and How They Are Affected By Nea...
 
SOFA Tutorial
SOFA TutorialSOFA Tutorial
SOFA Tutorial
 
1032 cs208 g operation system ip camera case share.v0.2
1032 cs208 g operation system ip camera case share.v0.21032 cs208 g operation system ip camera case share.v0.2
1032 cs208 g operation system ip camera case share.v0.2
 
Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and ML
 
Code GPU with CUDA - Identifying performance limiters
Code GPU with CUDA - Identifying performance limitersCode GPU with CUDA - Identifying performance limiters
Code GPU with CUDA - Identifying performance limiters
 

Más de geetachauhan

Building AI with Security Privacy in Mind
Building AI with Security Privacy in MindBuilding AI with Security Privacy in Mind
Building AI with Security Privacy in Mindgeetachauhan
 
Building AI with Security and Privacy in mind
Building AI with Security and Privacy in mindBuilding AI with Security and Privacy in mind
Building AI with Security and Privacy in mindgeetachauhan
 
Building Interpretable & Secure AI Systems using PyTorch
Building Interpretable & Secure AI Systems using PyTorchBuilding Interpretable & Secure AI Systems using PyTorch
Building Interpretable & Secure AI Systems using PyTorchgeetachauhan
 
Future is private intel dev fest
Future is private   intel dev festFuture is private   intel dev fest
Future is private intel dev festgeetachauhan
 
Decentralized AI Draper
Decentralized AI   DraperDecentralized AI   Draper
Decentralized AI Drapergeetachauhan
 
Decentralized AI: Convergence of AI + Blockchain
Decentralized AI: Convergence of AI + Blockchain Decentralized AI: Convergence of AI + Blockchain
Decentralized AI: Convergence of AI + Blockchain geetachauhan
 
Decentralized AI: Convergence of Blockchain + AI
Decentralized AI: Convergence of Blockchain + AIDecentralized AI: Convergence of Blockchain + AI
Decentralized AI: Convergence of Blockchain + AIgeetachauhan
 
Decentralized AI: Convergence of Blockchain + AI
Decentralized AI: Convergence of Blockchain + AIDecentralized AI: Convergence of Blockchain + AI
Decentralized AI: Convergence of Blockchain + AIgeetachauhan
 
Deep learning for medical imaging
Deep learning for medical imagingDeep learning for medical imaging
Deep learning for medical imaginggeetachauhan
 
Deep learning for FinTech
Deep learning for FinTechDeep learning for FinTech
Deep learning for FinTechgeetachauhan
 
NIPS - Deep learning @ Edge using Intel's NCS
NIPS - Deep learning @ Edge using Intel's NCSNIPS - Deep learning @ Edge using Intel's NCS
NIPS - Deep learning @ Edge using Intel's NCSgeetachauhan
 
Best Practices for On-Demand HPC in Enterprises
Best Practices for On-Demand HPC in EnterprisesBest Practices for On-Demand HPC in Enterprises
Best Practices for On-Demand HPC in Enterprisesgeetachauhan
 
Deep learning @ Edge using Intel's Neural Compute Stick
Deep learning @ Edge using Intel's Neural Compute StickDeep learning @ Edge using Intel's Neural Compute Stick
Deep learning @ Edge using Intel's Neural Compute Stickgeetachauhan
 
Distributed deep learning optimizations for Finance
Distributed deep learning optimizations for FinanceDistributed deep learning optimizations for Finance
Distributed deep learning optimizations for Financegeetachauhan
 
Distributed deep learning optimizations - AI WithTheBest
Distributed deep learning optimizations - AI WithTheBestDistributed deep learning optimizations - AI WithTheBest
Distributed deep learning optimizations - AI WithTheBestgeetachauhan
 
Distributed deep learning optimizations
Distributed deep learning optimizationsDistributed deep learning optimizations
Distributed deep learning optimizationsgeetachauhan
 
Tensorflow IoT - 1 Wk coding challenge
Tensorflow IoT - 1 Wk coding challengeTensorflow IoT - 1 Wk coding challenge
Tensorflow IoT - 1 Wk coding challengegeetachauhan
 
Intel optimized tensorflow, distributed deep learning
Intel optimized tensorflow, distributed deep learningIntel optimized tensorflow, distributed deep learning
Intel optimized tensorflow, distributed deep learninggeetachauhan
 
Transfer learning for IoT
Transfer learning for IoTTransfer learning for IoT
Transfer learning for IoTgeetachauhan
 
Tensorflow for IoT
Tensorflow for IoTTensorflow for IoT
Tensorflow for IoTgeetachauhan
 

Profiling PyTorch for Efficiency & Sustainability

  • 1. Profiling PyTorch for Efficiency & Sustainability. Nov 17, 2021. Geeta Chauhan, PyTorch Partner Engineering, Meta AI
  • 2. PyTorch Profiler Talk. Agenda: 01 GPU Performance Tuning; 02 PyTorch Profiler; 03 Timeline Tracing; 04 Optimization Examples; 05 Future: Sustainable AI
  • 4. GPU Performance Tuning: A different mental model required
      CPU: optimized for single-thread performance
      - Majority of chip area is control logic and caches
      - Complex, deep out-of-order pipelines that extract instruction-level parallelism
      - The brain: its job is to keep the accelerator busy
      GPU: optimized for throughput of data-parallel problems
      - Majority of chip area is functional units
      - Simple, relatively slow in-order pipelines that achieve much higher total throughput
      - Accelerator attached via PCIe: an order of magnitude faster, but off to the side
  • 5. GPU Performance Tuning: NVIDIA Volta V100 GPU
      - Composed of Streaming Multiprocessors (SMs)
      - Volta V100: 80 SMs
      - Ampere A100: 108 SMs
      - DGX A100 with 8 GPUs: 864 SMs vs 128 CPU cores
  • 6. GPU Performance Tuning: Streaming Multiprocessor
      - Per SM: 64x FP32 units, 64x INT, 32x FP64, 32x LD/ST, 8x Tensor Cores
      - FP32 execution units per GPU: 5120 on V100 (6912 on A100)
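The per-GPU totals on slides 5 and 6 follow from multiplying the SM counts by the per-SM unit counts. The sketch below redoes that arithmetic; it assumes (this is not stated on the slide) that an A100 SM also has 64 FP32 units, and the helper names are made up for illustration.

```python
# SM counts per GPU, as given on the slides.
SMS_PER_GPU = {"V100": 80, "A100": 108}

# FP32 units per SM from slide 6; assumed here to apply to both GPU generations.
FP32_UNITS_PER_SM = 64


def fp32_units(gpu: str) -> int:
    """Total FP32 execution units on one GPU."""
    return SMS_PER_GPU[gpu] * FP32_UNITS_PER_SM


def total_sms(gpu: str, num_gpus: int) -> int:
    """Total SMs in a box with num_gpus identical GPUs."""
    return SMS_PER_GPU[gpu] * num_gpus


print(fp32_units("V100"))    # 80 * 64
print(fp32_units("A100"))    # 108 * 64
print(total_sms("A100", 8))  # DGX A100: 8 GPUs
```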
  • 7. GPU Performance Tuning: Common Pitfalls. Visibility is key
      - Excessive CPU/GPU interactions, e.g. a for loop launching GPU operations: dominated by launch overheads
      - Short GPU kernel durations, e.g. small inputs: need enough data to feed tens of thousands of threads
      - CPU overheads and I/O bottlenecks starving the GPU: small operations on the CPU can quickly become dominant
      - Framework inefficiencies, e.g. unnecessary copies and hidden CPU-side overheads
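The first pitfall can be illustrated with a small sketch. It is not taken from the talk: the function names and tensor sizes are invented. It contrasts a Python loop that issues one tiny operation per row with a single broadcasted operation that produces the same result.

```python
import torch


def scale_rows_loop(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # Anti-pattern: one small operation per row. On a GPU each iteration
    # launches its own kernel, so launch overhead dominates.
    out = torch.empty_like(x)
    for i in range(x.shape[0]):
        out[i] = x[i] * w[i]
    return out


def scale_rows_batched(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # Same result from one broadcasted multiply over the whole tensor.
    return x * w.unsqueeze(1)


# Runs on the CPU when no GPU is present; the launch-overhead effect
# itself only shows up on a GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1024, 64, device=device)
w = torch.randn(1024, device=device)

same = torch.allclose(scale_rows_loop(x, w), scale_rows_batched(x, w))
print(same)
```

Profiling both versions on a GPU should show many short kernels for the loop and one larger kernel for the batched form.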
  • 8. PyTorch Profiler: With Integrated GPU Profiling Library
  • 9. The PyTorch Profiler: contributed by Microsoft & Facebook
      - PyTorch and GPU level information
      - Automatic bottleneck detection
      - Actionable performance recommendations
      - Data Scientist friendly lifecycle and tools
      - TensorBoard Plugin: Chrome traces visualization
      - OSS Kineto library, built on CUPTI
      - Easy-to-use Python API
      - VS Code integration
      [Figure: profiler architecture. Labels: PyTorch process (Python, C++, CUDA, aten operators), PyTorch Profiler, libkineto, libCUPTI, NVIDIA driver, OS, GPU 1 to GPU n, Python events, CUDA activities, CPU operator and GPU op traces, TensorBoard Profiler Plugin]
  • 10. The PyTorch Profiler. Profiling API: Base Usage (https://pytorch.org/tutorials/recipes/recipes/profiler.html)
        import torch
        import torchvision.models as models
        import torch.profiler as profiler

        model = models.resnet18()
        inputs = torch.randn(5, 3, 224, 224)

        with profiler.profile(record_shapes=True) as prof:
            with profiler.record_function("model_inference"):
                model(inputs)

        print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
  • 11. The PyTorch Profiler. Profiling API: TensorBoard Plugin
        import torch
        import torchvision.models as models
        import torch.profiler as profiler

        model = models.resnet18()
        inputs = torch.randn(5, 3, 224, 224)

        with profiler.profile(
            record_shapes=True,
            # write traces to the 'results' directory for the TensorBoard plugin
            on_trace_ready=torch.profiler.tensorboard_trace_handler('results'),
        ) as prof:
            model(inputs)

        print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
  • 13. The PyTorch Profiler: Advanced
      - When to trigger
      - How many steps to profile
      - Which activities to profile
      - Results callable handler
      - Extra metadata, e.g. shapes, stacks, memory
      - Output options, e.g. Chrome tracing, TensorBoard
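The advanced options listed above map onto arguments of `torch.profiler.profile`. The following is a minimal sketch rather than the code shown in the talk: the tiny `Linear` model, the step counts, and the `reports` list are placeholders chosen so the example runs quickly on a CPU.

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule

model = torch.nn.Linear(128, 64)  # stand-in for a real model
inputs = torch.randn(32, 128)

reports = []  # filled in by the handler below


def on_trace_ready(prof):
    # Results handler: called when an active profiling window finishes.
    reports.append(
        prof.key_averages().table(sort_by="cpu_time_total", row_limit=5)
    )


with profile(
    activities=[ProfilerActivity.CPU],  # which activities to profile
    # When to trigger and for how long: skip 1 step, warm up for 1 step,
    # record 2 steps, and do this once.
    schedule=schedule(wait=1, warmup=1, active=2, repeat=1),
    on_trace_ready=on_trace_ready,
    record_shapes=True,                 # extra metadata: input shapes
) as prof:
    for _ in range(6):
        model(inputs)
        prof.step()  # tells the profiler that one step has finished

print(reports[0])
```

On a machine with a GPU, adding `ProfilerActivity.CUDA` to `activities` would also record GPU kernels.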
  • 14. The PyTorch Profiler
  • 15. Distributed Training View
  • 16. VS Code Data Wrangler
  • 18. Timeline Traces: CPU + GPU Activities
  • 19. Timeline Tracing. Chrome Trace Viewer: CPU and GPU timelines
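A timeline like the one on this slide can be produced by exporting a trace file and opening it in the Chrome trace viewer. This is a small sketch under assumptions: the model is a placeholder, and the file is written to a temporary directory chosen only for this example.

```python
import os
import tempfile

import torch
from torch.profiler import ProfilerActivity, profile

model = torch.nn.Linear(128, 64)  # stand-in for a real model
inputs = torch.randn(32, 128)

# Record CPU activity; GPU activity is added only when a GPU is present.
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)
    model, inputs = model.cuda(), inputs.cuda()

with profile(activities=activities) as prof:
    model(inputs)

# Write a trace file that can be loaded in chrome://tracing.
trace_path = os.path.join(tempfile.mkdtemp(), "trace.json")
prof.export_chrome_trace(trace_path)
print(trace_path, os.path.getsize(trace_path), "bytes")
```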
  • 20. Timeline Tracing: Can leave in permanently, no perf overhead
  • 21. Timeline Tracing: See how CPU and GPU ops are connected
  • 22. Timeline Tracing: Inspect stats for individual activities. nvidia-smi shows 86% utilization, but only a fraction of the SMs are actually used by these kernels!
  • 23. Timeline Tracing: Inspect stats for individual activities. Looks much better after increasing input sizes
  • 24. Trace Analysis: Examples from Meta workloads. Thanks to Lei Tian, Natalia Gimelshein, Lingyi Liu, Feng Shi & Zhicheng Yan for examples
  • 25. Trace Analysis. Anti-pattern: Long GPU idle time
      Issue:
      1. Large periods of GPU inactivity
      2. Trace does not show why
      Solution:
      1. Use record_function to reveal bottlenecks on CPU
      2. Parallelize CPU operations
      3. Overlap CPU and GPU operations
        # Fragment: emb, records, input_df, i and k come from the surrounding workload
        from torch.profiler import record_function

        temp = ""
        num_substr = len(emb[k])
        with record_function("## join_string {} ##".format(num_substr)):
            temp = ",".join(str(x) for x in emb[k])  # string concatenation
        with record_function("## append_record_in_else ##"):
            records.append(f"{input_df.id[i + k]}\t{temp}\n")  # list append
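To show how `record_function` labels surface in the profiler output, here is a self-contained sketch. The `emb` rows and the record format are made-up stand-ins for the workload on the slide, which is not available here.

```python
from torch.profiler import ProfilerActivity, profile, record_function

emb = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # made-up stand-in for embedding rows
records = []

with profile(activities=[ProfilerActivity.CPU]) as prof:
    for k, row in enumerate(emb):
        # Each labelled region appears as its own entry in the trace,
        # which makes pure-Python CPU work visible.
        with record_function("## join_string ##"):
            temp = ",".join(str(x) for x in row)
        with record_function("## append_record ##"):
            records.append(f"{k}\t{temp}\n")

# Collect the names of all aggregated events, including the custom labels.
labels = {evt.key for evt in prof.key_averages()}
print(sorted(label for label in labels if label.startswith("##")))
```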
  • 26. Trace Analysis. Anti-pattern: Excessive CPU/GPU interactions
      First issue:
      - Exponential moving average (EMA) hook function has a loop: CPU bottleneck
      - Can rewrite using torch._foreach ops: loop now on GPU
      Result: EMA hook 100x faster. Iteration time: 860 ms -> 770 ms
      Before:
        def on_step(self, task) -> None:
            ...
            with torch.no_grad():
                it = model_state_iterator(task.base_model)
                # iterate on every name & param
                for name, param in it:
                    s = self.state.ema_model_state
                    s[name] = self.decay * s[name] + (1 - self.decay) * param.to(device=self.device)
      After:
        def on_step(self, task) -> None:
            ...
            with torch.no_grad():
                torch._foreach_mul_(self.ema_model_state_list, self.decay)
                torch._foreach_add_(self.ema_model_state_list, self.param_list, alpha=(1 - self.decay))
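The before and after versions compute the same update, which can be checked on toy data. The sketch below uses invented tensors and a made-up decay value; note that the `torch._foreach_*` functions are private PyTorch APIs and may change between releases.

```python
import torch

decay = 0.99
torch.manual_seed(0)

# Toy "parameters" and two copies of the EMA state, one per implementation.
params = [torch.randn(4, 4), torch.randn(8)]
ema_loop = [torch.ones_like(p) for p in params]
ema_foreach = [torch.ones_like(p) for p in params]

with torch.no_grad():
    for _ in range(3):  # a few update steps
        # Per-tensor Python loop: one multiply and one add per parameter.
        for i, p in enumerate(params):
            ema_loop[i] = decay * ema_loop[i] + (1 - decay) * p

        # Multi-tensor version: two calls cover the whole parameter list.
        torch._foreach_mul_(ema_foreach, decay)
        torch._foreach_add_(ema_foreach, params, alpha=1 - decay)

matches = all(torch.allclose(a, b) for a, b in zip(ema_loop, ema_foreach))
print(matches)
```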
  • 27. Trace Analysis. Anti-pattern: Excessive CPU/GPU interactions
      Second issue:
      - Optimizer step uses a naïve implementation of RMSProp
      - PyTorch provides an optimized multi-tensor version, using torch._foreach
      - Switch to optimized version!
      Result: optimizer 12x faster. Iteration time: 770 ms -> 600 ms
      Before:
        def prepare(self, param_groups):
            self.optimizer = RMSpropTFV2Optimizer(
                param_groups, …
      After:
        import torch.optim._multi_tensor as optim_mt

        def prepare(self, param_groups):
            self.optimizer = optim_mt.RMSprop(
                param_groups, …
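The slide uses the `torch.optim._multi_tensor` module from the PyTorch version current at the time of the talk. As a hedged sketch, and assuming a more recent PyTorch release where optimizers accept a `foreach` argument, the multi-tensor path can be requested as shown below. The model and data are placeholders.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(16, 4)  # stand-in for a real model

# Assumption: this PyTorch version supports foreach=True on RMSprop,
# which selects the multi-tensor implementation.
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.01, foreach=True)

before = [p.detach().clone() for p in model.parameters()]

loss = model(torch.randn(8, 16)).pow(2).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Check that the optimizer step actually updated the parameters.
changed = any(
    not torch.equal(b, p.detach()) for b, p in zip(before, model.parameters())
)
print(changed)
```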
  • 28. Trace Analysis. Anti-pattern: Excessive CPU/GPU interactions
      Third issue:
      - Forward & backward pass dominated by SyncBatchNorm
      - 84x SyncBatchNorm in fwd pass
      - 3x ncclAllGather per SyncBatchNorm
      - Another 2x ncclAllReduce per SyncBatchNorm in bwd pass
      Result: forward pass 1.5x faster, backward pass 1.3x faster. Iteration time: 600 ms -> 450 ms
      [Figure labels: 2.2 ms, 1.7 ms]
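The counts on this slide and the iteration times reported across the three fixes can be combined with simple arithmetic. The sketch below only reuses numbers from the slides; the variable names are made up.

```python
# Collective calls per iteration implied by the SyncBatchNorm counts.
SYNC_BN_LAYERS = 84
ALLGATHER_PER_LAYER_FWD = 3
ALLREDUCE_PER_LAYER_BWD = 2

fwd_collectives = SYNC_BN_LAYERS * ALLGATHER_PER_LAYER_FWD
bwd_collectives = SYNC_BN_LAYERS * ALLREDUCE_PER_LAYER_BWD

# Iteration time in ms: original, after the EMA fix, after the optimizer
# fix, and after the SyncBatchNorm fix.
iteration_ms = [860, 770, 600, 450]
step_speedups = [
    earlier / later for earlier, later in zip(iteration_ms, iteration_ms[1:])
]
overall_speedup = iteration_ms[0] / iteration_ms[-1]

print(fwd_collectives, bwd_collectives)
print([round(s, 2) for s in step_speedups], round(overall_speedup, 2))
```

Taken together, the three fixes roughly halve the iteration time.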
  • 29. Trace Analysis: BERT Performance Optimization Case Study
      - From 2.4 req/s to 1,400+ req/s
      - CPU inference: torch.set_num_threads(1), Intel IPEX, Quantization
      - GPU inference on 1 T4 GPU: model.half(), DistilBERT, Increase batch size, Do not overpad, Faster Transformer
      Configuration                                  Throughput    P99
      BERT unoptimized, bs=1                         70.67 seq/s   20.44 ms
      BERT model.half(), bs=8                        359 seq/s     23.58 ms
      DistilBERT model.half(), bs=16                 689 seq/s     22.8 ms
      BERT Faster Transformer                        885 seq/s     19.83 ms
      DistilBERT no padding, model.half(), bs=32     1423 seq/s    19.7 ms
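The throughput gains in the table can be expressed relative to the unoptimized GPU baseline. This sketch only reuses the table's numbers; the tuple layout and variable names are choices made for this example.

```python
# (configuration, throughput in seq/s, P99 in ms), copied from the table.
results = [
    ("BERT unoptimized, bs=1", 70.67, 20.44),
    ("BERT model.half(), bs=8", 359.0, 23.58),
    ("DistilBERT model.half(), bs=16", 689.0, 22.8),
    ("BERT Faster Transformer", 885.0, 19.83),
    ("DistilBERT no padding, model.half(), bs=32", 1423.0, 19.7),
]

baseline_throughput = results[0][1]

# Throughput of each configuration relative to the first row.
speedups = {name: tput / baseline_throughput for name, tput, _ in results}

for name, tput, p99 in results:
    print(f"{name:45s} {tput:8.2f} seq/s  {speedups[name]:5.1f}x  P99 {p99} ms")
```

The P99 values stay within a few milliseconds of each other across rows while throughput rises by roughly a factor of twenty.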
  • 30. Future: Sustainable AI
  • 31. AI Model Growth
  • 32. Model Deployment Phases: Power Consumption
  • 33. Optimizations for Carbon Footprint of LM
      - Platform level caching: 6.7x improvements
      - GPU Acceleration: unlocks 10.1x energy efficiency
      - Algorithmic Optimizations: 10x improvements
  • 34. Sustainability Mindset
      1. Data Utilization Efficiency: Data Scaling & Sampling, Data perishability
      2. Experimentation and Training Efficiency: NAS, HPO, Multi-Objective Optimizations, Resource Efficient Architectures
      3. Efficient Environment Scalable Infrastructure: Carbon efficient scheduling, On-device Learning, …
      4. Develop easy to adopt Telemetry: Measure and publish, Carbon impact statement & model cards
      Sources: https://arxiv.org/pdf/2111.00364.pdf and https://docs.cohere.ai/environmental-impact
  • 35. References
      - What's new in PyTorch Profiler 1.9: https://pytorch.org/blog/pytorch-profiler-1.9-released/
      - Introducing PyTorch Profiler: https://pytorch.org/blog/introducing-pytorch-profiler-the-new-and-improved-performance-tool/
      - Profiler: https://pytorch.org/docs/stable/profiler.html
      - Profiler Recipes: https://pytorch.org/tutorials/recipes/recipes/profiler.html
      - VSCode TensorBoard support: https://devblogs.microsoft.com/python/python-in-visual-studio-code-february-2021-release/
      - PyTorch Profiler Talk, Profiling PyTorch Models for NVIDIA GPUs: https://gtc21.event.nvidia.com/media/Profiling%20PyTorch%20Models%20for%20NVIDIA%20GPUs%20%5BS31644%5D/1_nuwnw731
      - Optimizing PyTorch Performance batch size with PyTorch Profiler: https://opendatascience.com/optimizing-pytorch-performance-batch-size-with-pytorch-profiler/
      - Kubeflow PyTorch Samples: https://github.com/kubeflow/pipelines/tree/master/samples/contrib/pytorch-samples
      - PyTorch Lightning Profiler example: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pl_examples/basic_examples/profiler_example.py
      - Sustainable AI Paper: https://arxiv.org/pdf/2111.00364.pdf
      - Cohere.ai Environmental Impact model cards: https://docs.cohere.ai/environmental-impact
  • 36. Questions? Contact: Email: gchauhan@fb.com; LinkedIn: https://www.linkedin.com/in/geetachauhan/