MONITORING OF GPU USAGE
WITH TENSORFLOW MODEL TRAINING USING PROMETHEUS
Diane Feddema, Principal Software Engineer
Zak Hassan, Senior Software Engineer
#RED_HAT #AICOE #CTO_OFFICE
YOUR SPEAKERS
DIANE FEDDEMA
PRINCIPAL SOFTWARE ENGINEER - ARTIFICIAL INTELLIGENCE CENTER OF EXCELLENCE, CTO OFFICE
● Currently focused on developing and applying Data Science and Machine Learning techniques for performance
analysis, automating these analyses and displaying data in novel ways.
● Previously worked as a performance engineer at the National Center for Atmospheric Research, NCAR, working on
optimizations and tuning in parallel global climate models.
ZAK HASSAN
SENIOR SOFTWARE ENGINEER - ARTIFICIAL INTELLIGENCE CENTER OF EXCELLENCE, CTO OFFICE
● Leading the log anomaly detection project within the aiops team and building a user feedback service for improved
accuracy of machine learning predictions.
● Developing data science apps and working on improved observability of machine learning systems such as spark and
tensorflow.
Outline
● Story
● Concepts
○ Comparing CPU vs GPU
○ What is CUDA and anatomy of CUDA on Kubernetes
○ Monitoring GPU and custom metrics with Pushgateway
○ TF with Prometheus integration
○ What is TensorFlow and PyTorch
○ A PyTorch example from MLPerf
○ TensorFlow Tracing
● Examples:
○ Running Jupyter (CPU, GPU, targeting specific gpu type)
○ Mounting Training data into notebook/tf job
○ Uses of Nvidia-smi
● Demo
○ Running Detectron on a Tesla V100 with Prometheus & Grafana
monitoring
“Design the factory like you
would design an advanced
computer… In fact use
engineers that are used to doing
that and have them work on
this.”
-- Elon Musk (2016)
https://youtu.be/f9uveu-c5us
Source: https://flic.kr/p/chEftd
WHY IS DEEP LEARNING A BIG DEAL?
Online
• Netflix.com
• Amazon.com
• Targeted ads
Mobile
• unlocking phones
Automotive
• self driving
• voice assistant
Source: https://bit.ly/2I8zIcs
Source: https://bit.ly/2HVCaUC
PARALLEL PROCESSING
MOST LANGUAGES SUPPORT
● MODERN HARDWARE SUPPORTS EXECUTION OF PARALLEL PROCESSES/THREADS, AND MOST LANGUAGES HAVE APIS TO SPAWN PROCESSES IN PARALLEL
● YOUR ONLY LIMIT IS HOW MANY CPU CORES YOU HAVE ON YOUR MACHINE
● CPU USED TO BE A KEY COMPONENT OF HPC
● GPU HAS A DIFFERENT ARCHITECTURE & NUMBER OF CORES
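As a minimal sketch of the point above, Python's standard library can report the core count and fan independent tasks out over a pool of workers. The `square` helper and the input range are hypothetical. Threads are used only to keep the sketch short; CPU-bound Python work would normally use `ProcessPoolExecutor` instead.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def square(x):
    """Hypothetical unit of work; stands in for any independent task."""
    return x * x


def parallel_map(func, items, workers=None):
    """Apply func to every item using a pool of worker threads.

    workers defaults to the number of CPU cores reported by the OS.
    """
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order even though tasks run concurrently
        return list(pool.map(func, items))


if __name__ == "__main__":
    print("CPU cores available:", os.cpu_count())
    print(parallel_map(square, range(8)))
```

The pool size is capped by `os.cpu_count()` here to mirror the slide's remark that the core count is the practical limit.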
CPU
INSTRUCTION MEMORY
DATA MEMORY
INPUT/OUTPUT
ARITHMETIC LOGIC UNIT
CONTROL UNIT
Project Thoth
Hardware accelerators
● GPU
○ CUDA
○ OpenCL
● TPU
Performance Goals
● Latency: decreased
● Bandwidth: increased
● Throughput: increased
WHAT IS CUDA?
PROPRIETARY TOOLING
● hardware/software for HPC
● prerequisite: an NVIDIA graphics card that supports CUDA
● ML frameworks like TensorFlow, Theano and PyTorch use CUDA for hardware acceleration
● You may get 10x faster performance for machine learning jobs by utilizing CUDA
ANATOMY OF A CUDA WORKLOAD ON K8S
[Diagram, top to bottom]
CONTAINER: JUPYTER, TENSORFLOW, CUDA LIBS
CONTAINER RUNTIME
NVIDIA LIBS
HOST OS
/dev/nvidiaX
HARDWARE: SERVER, GPU
CLI monitoring tool
nvidia-smi
● Tool used to display usage metrics on what is running on your GPU.
TFJob + Prometheus
[Diagram components: TENSORFLOW JOBS, TRAINING DATA, GPU NODE EXPLORER, PUSH GATEWAY, PROMETHEUS, ALERT MANAGER, NOTIFICATION (EMAIL, MESSAGING, WEBHOOK); arrow labels: PULL, PUSH, PUSH]
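The push side of this setup can be sketched with the standard library alone: a training job renders a sample in the Prometheus text exposition format and POSTs it to the Pushgateway under `/metrics/job/<job>`. The gateway address, job name and metric name used in the usage note are hypothetical, and label values are not escaped in this simplified sketch.

```python
import urllib.request


def render_metric(name, value, labels=None):
    """Render one sample in the Prometheus text exposition format."""
    label_text = ""
    if labels:
        # Simplification: label values are assumed to need no escaping
        pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        label_text = "{" + pairs + "}"
    return f"{name}{label_text} {value}\n"


def build_push_request(gateway, job, body):
    """Build (but do not send) the POST request for one job's metrics."""
    url = f"{gateway.rstrip('/')}/metrics/job/{job}"
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        method="POST",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


def push_metric(gateway, job, name, value, labels=None):
    """Send a single sample to the Pushgateway; raises on HTTP errors."""
    body = render_metric(name, value, labels)
    request = build_push_request(gateway, job, body)
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status
```

A training loop might call `push_metric("http://pushgateway:9091", "tf_training", "tf_training_loss", 0.42, {"model": "demo"})` after each epoch, assuming a gateway is reachable at that address. The `prometheus_client` package offers a higher-level alternative if it is available in the training image.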
Idle GPU Alert
● Alert Manager can notify via:
○ Slack chat notification
○ email
○ webhook
○ more
● Get notified when your GPU isn’t being utilized and shut down your VMs in the cloud to save on cost.
Alert On Idle GPU
groups:
- name: nvidia_gpu.rules
  rules:
  - alert: UnusedResources
    expr: nvidia_gpu_duty_cycle == 0
    for: 10m
    labels:
      severity: critical
    annotations:
      description: GPU is not being utilized; you should scale down your GPU node
      summary: GPU node isn't being utilized
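The same expression can be checked by hand against the Prometheus HTTP API before wiring it into an alert. The sketch below builds an instant-query URL and extracts the label sets of matching series. The server address is an assumption, and the labels that come back depend on the GPU exporter in use.

```python
import json
import urllib.parse
import urllib.request

IDLE_QUERY = "nvidia_gpu_duty_cycle == 0"  # same expression as the alert rule


def query_url(prometheus, expr):
    """Build the instant-query URL for the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": expr})
    return f"{prometheus.rstrip('/')}/api/v1/query?{params}"


def idle_series(payload):
    """Return the label sets of all series in an instant-query response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    return [sample["metric"] for sample in payload["data"]["result"]]


def find_idle_gpus(prometheus):
    """Ask a live Prometheus server which GPUs currently report 0 duty cycle."""
    with urllib.request.urlopen(query_url(prometheus, IDLE_QUERY), timeout=5) as resp:
        return idle_series(json.load(resp))
```

For example, `find_idle_gpus("http://prometheus:9090")` would return one label dictionary per idle GPU, assuming a server is reachable at that hypothetical address.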
CPU vs GPU
Jupyter + TF on CPU
apiVersion: v1
kind: Pod
metadata:
  name: jupyter-tf-cpu
spec:
  restartPolicy: OnFailure
  containers:
  - name: jupyter-tf-cpu
    image: "quay.io/zmhassan/fedora28:tensorflow-cpu-2.0.0-alpha0"
Jupyter + TF on GPU
apiVersion: v1
kind: Pod
metadata:
  name: jupyter-tf-gpu
spec:
  restartPolicy: OnFailure
  containers:
  - name: jupyter-tf-gpu
    image: "tensorflow/tensorflow:nightly-gpu-py3-jupyter"
    resources:
      limits:
        nvidia.com/gpu: 1
Specific GPU Node Target
apiVersion: v1
kind: Pod
metadata:
  name: jupyter-tf-gpu
spec:
  containers:
  - name: jupyter-tf-gpu
    image: "tensorflow/tensorflow:nightly-gpu-py3-jupyter"
    resources:
      limits:
        nvidia.com/gpu: 1
  nodeSelector:
    accelerator: nvidia-tesla-v100
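When many such pods are needed, the manifest can be generated instead of copied. The following is a sketch, not part of the original deck: `gpu_pod_manifest` is a hypothetical helper that builds the same structure as a Python dict and prints it as JSON, which `kubectl apply -f` accepts as well as YAML.

```python
import json


def gpu_pod_manifest(name, image, gpus=1, accelerator=None):
    """Build a Pod manifest (as a dict) that optionally requests NVIDIA GPUs."""
    container = {"name": name, "image": image}
    if gpus:
        # GPUs are requested through resource limits on the container
        container["resources"] = {"limits": {"nvidia.com/gpu": gpus}}
    spec = {"containers": [container]}
    if accelerator:
        # Schedule only on nodes labelled accelerator=<value>
        spec["nodeSelector"] = {"accelerator": accelerator}
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": spec,
    }


if __name__ == "__main__":
    manifest = gpu_pod_manifest(
        "jupyter-tf-gpu",
        "tensorflow/tensorflow:nightly-gpu-py3-jupyter",
        accelerator="nvidia-tesla-v100",
    )
    print(json.dumps(manifest, indent=2))
```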
Relabel Kubernetes node
kubectl label node <node_name> accelerator=nvidia-tesla-k80
# or
kubectl label node <node_name> accelerator=nvidia-tesla-v100
Mount Training Data
AzureDisk
GlusterFS
NFS
AzureFile
GCE Persistent Disk
AWS Elastic Block Store
CephFS
… more
Persistent Volume Claim
● native k8s resource
● lets you access a PV
● can be used to share data across different pods.
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: nfs
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: ""
  resources:
    requests:
      storage: 100Gi
Persistent Volume
● native k8s resource
● can be ReadOnlyMany, ReadWriteOnce or ReadWriteMany
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs
spec:
  capacity:
    storage: 100Gi
  accessModes:
  - ReadWriteMany
  nfs:
    server: 0.0.0.0
    path: "/"
Mounting Training Data
● use persistent volume claims to access your data
● in this example we use NFS but you can choose another type.
apiVersion: v1
kind: Pod
metadata:
  name: jp-notebook
spec:
  containers:
  - name: jp-notebook
    image: tensorflow/tensorflow:nightly-gpu-py3-jupyter
    volumeMounts:
    - name: my-pvc-nfs
      mountPath: "/tf/data"
  volumes:
  - name: my-pvc-nfs
    persistentVolumeClaim:
      claimName: nfs
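Once the pod is running, a quick check from the notebook can confirm that the volume is mounted and populated before training starts. This is a sketch: `summarize_dataset` is a hypothetical helper, and the default path simply mirrors the `mountPath` used above.

```python
from pathlib import Path


def summarize_dataset(root="/tf/data"):
    """Count files and total bytes under the mounted training-data directory."""
    root_path = Path(root)
    if not root_path.is_dir():
        raise FileNotFoundError(f"{root} is not mounted or is not a directory")
    # Walk the whole tree; directories themselves are not counted
    files = [p for p in root_path.rglob("*") if p.is_file()]
    return {"files": len(files), "bytes": sum(p.stat().st_size for p in files)}
```

Calling `summarize_dataset()` in a notebook cell returns a small dictionary, and a `FileNotFoundError` points to a missing or misnamed volume mount.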
Additional Tips
● Kubernetes doesn’t support sharing GPUs
● If you’re running in the cloud, consider stopping your VM when no workloads are running and restarting it when you need it. The costs can add up.
● Use volumes to mount your data for training and share it across your environment
Monitoring and Performance
of ML on GPUs
● Benchmarking ML on GPUs
○ Monitoring
○ Performance
● Example using MLperf together with Prometheus
and Grafana
● Computing requirements & why GPUs for ML
Why do we need GPUs to solve these problems
● Neural networks rely heavily on floating point matrix multiplication
● These algorithms also require a lot of training data, large memory (GBs) and high speed networks to complete in a reasonable amount of time
● Faster Deep Learning training
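To give the first bullet a sense of scale, the sketch below counts the floating point operations in one dense matrix product. The 5000 x 5000 size matches the matrices used in the tracing example later in the deck. The device throughput passed to `seconds_at` is a made-up round number, not a measured figure.

```python
def matmul_flops(m, k, n):
    """Floating point operations for an (m x k) @ (k x n) product.

    Each of the m*n outputs needs k multiplications and k-1 additions,
    conventionally rounded to 2*k operations.
    """
    return 2 * m * k * n


def seconds_at(flops, flops_per_second):
    """Ideal runtime if the device sustained flops_per_second."""
    return flops / flops_per_second


if __name__ == "__main__":
    work = matmul_flops(5000, 5000, 5000)
    print(f"{work:.2e} floating point operations")
    # Hypothetical sustained rate of 1e12 FLOP/s, for illustration only
    print(f"{seconds_at(work, 1e12):.2f} s at 1 TFLOP/s")
```

A single product of this size already costs 2.5e11 operations, and training repeats such products many times per step, which is why throughput on matrix multiplication dominates training time.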
Nvidia DGX-2
[Diagram: 16 V100 GPUs, each paired with its own DRAM]
Source: Nvidia
Benchmarks in MLPerf
Application Area | Vision | Language | Commerce | Reinforcement Learning
Problem | Image classification; Object Detection (light weight and heavy weight) | Translation | Recommendations | Games (Go)
Datasets | ImageNet; COCO | WMT English-German | MovieLens-20M | Go
Models | ResNet-50; Detectron | Transformer; OpenNMT | Neural Collaborative Filtering | Mini Go
Metrics | COCO mAP; Prediction accuracy | BLEU | Prediction accuracy | Prediction accuracy; Win/Loss
MLPerf Project Sponsors
University research contributors
Industry contributors
What is TensorFlow?
● Open source Python library used to implement deep neural networks (released by Google in 2015)
● A machine learning framework
● Tools to write your own models in Python, JavaScript or Swift
● Collection of datasets ready to use with TensorFlow
● TF runs in Eager and Graph modes
● TF can run on CPUs or GPUs
What is Pytorch?
● Python-based open source deep learning library
● Used to build Neural Networks
● Replacement for NumPy for use with GPUs
● Can run on CPUs or GPUs
● Uses GPUs to accelerate numerical computations
● Pytorch performs computations
COCO Dataset
85,000 images
Identify 91 objects
Source: Cornell Project
Detectron - Example Output
MLPerf Results
Source: Nvidia Developer News Dec 2018
MLPerf Results - Single Node
Source: Nvidia Developer News Dec 2018
How to monitor GPUs with nvidia-smi
$ nvidia-smi --query-gpu=timestamp,name,pci.bus_id,driver_version,pstate,pcie.link.gen.max,pcie.link.gen.current,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv -l 5
Monitoring GPUs with nvidia-smi
2019/04/17 14:41:35.223, Tesla V100-SXM2-32GB, 00000000:06:00.0, 418.40.04, P0, 3, 3, 44, 100 %, 0 %, 32480 MiB, 24052 MiB, 8428 MiB
2019/04/17 14:41:35.225, Tesla V100-SXM2-32GB, 00000000:07:00.0, 418.40.04, P0, 3, 3, 48, 100 %, 0 %, 32480 MiB, 14565 MiB, 17915 MiB
2019/04/17 14:41:35.227, Tesla V100-SXM2-32GB, 00000000:0A:00.0, 418.40.04, P0, 3, 3, 47, 100 %, 0 %, 32480 MiB, 15773 MiB, 16707 MiB
2019/04/17 14:41:35.229, Tesla V100-SXM2-32GB, 00000000:0B:00.0, 418.40.04, P0, 3, 3, 43, 100 %, 0 %, 32480 MiB, 14363 MiB, 18117 MiB
2019/04/17 14:41:35.231, Tesla V100-SXM2-32GB, 00000000:85:00.0, 418.40.04, P0, 3, 3, 46, 100 %, 0 %, 32480 MiB, 13363 MiB, 19117 MiB
2019/04/17 14:41:35.233, Tesla V100-SXM2-32GB, 00000000:86:00.0, 418.40.04, P0, 3, 3, 46, 100 %, 0 %, 32480 MiB, 14719 MiB, 17761 MiB
2019/04/17 14:41:35.234, Tesla V100-SXM2-32GB, 00000000:89:00.0, 418.40.04, P0, 3, 3, 49, 100 %, 0 %, 32480 MiB, 15861 MiB, 16619 MiB
2019/04/17 14:41:35.236, Tesla V100-SXM2-32GB, 00000000:8A:00.0, 418.40.04, P0, 3, 3, 44, 100 %, 0 %, 32480 MiB, 12317 MiB, 20163 MiB
2019/04/17 14:41:40.239, Tesla V100-SXM2-32GB, 00000000:06:00.0, 418.40.04, P0, 3, 3, 44, 100 %, 0 %, 32480 MiB, 24052 MiB, 8428 MiB
2019/04/17 14:41:40.240, Tesla V100-SXM2-32GB, 00000000:07:00.0, 418.40.04, P0, 3, 3, 48, 100 %, 1 %, 32480 MiB, 14565 MiB, 17915 MiB
2019/04/17 14:41:40.240, Tesla V100-SXM2-32GB, 00000000:0A:00.0, 418.40.04, P0, 3, 3, 47, 100 %, 1 %, 32480 MiB, 15773 MiB, 16707 MiB
2019/04/17 14:41:40.241, Tesla V100-SXM2-32GB, 00000000:0B:00.0, 418.40.04, P0, 3, 3, 43, 100 %, 1 %, 32480 MiB, 14363 MiB, 18117 MiB
Columns, in order: timestamp, name, pci.bus_id, driver_version, pstate, pcie.link.gen.max, pcie.link.gen.current, temperature GPU, utilization GPU [%], utilization memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
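Rows like the ones above are easier to work with once parsed. The sketch below turns the CSV rows into dictionaries and strips the unit suffixes. The field list mirrors the `--query-gpu` argument shown earlier, and `parse_rows` is a hypothetical helper name.

```python
import csv
import io

FIELDS = [
    "timestamp", "name", "pci.bus_id", "driver_version", "pstate",
    "pcie.link.gen.max", "pcie.link.gen.current", "temperature.gpu",
    "utilization.gpu", "utilization.memory",
    "memory.total", "memory.free", "memory.used",
]

# Fields whose values are integers, possibly followed by " %" or " MiB"
NUMERIC = {
    "temperature.gpu", "utilization.gpu", "utilization.memory",
    "memory.total", "memory.free", "memory.used",
}


def parse_rows(text):
    """Parse nvidia-smi CSV output into a list of dicts, one per GPU sample."""
    rows = []
    for raw in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(raw) != len(FIELDS) or raw[0] == "timestamp":
            continue  # skip blank, malformed and header lines
        row = dict(zip(FIELDS, raw))
        for key in NUMERIC:
            row[key] = int(row[key].split()[0])  # "100 %" -> 100
        rows.append(row)
    return rows
```

With the rows parsed, a check such as `row["memory.free"] + row["memory.used"] == row["memory.total"]` or a filter on `utilization.gpu` becomes a one-liner.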
How to log nvidia-smi metric data (long/short term logging)
[cephagent@asgnode021 object_detection]$ nvidia-smi --query-gpu=index,timestamp,power.draw,clocks.sm,clocks.mem,clocks.gr --format=csv
index, timestamp, power.draw [W], clocks.current.sm [MHz], clocks.current.memory [MHz], clocks.current.graphics [MHz]
0, 2019/04/17 15:25:33.862, 68.71 W, 1530 MHz, 877 MHz, 1530 MHz
1, 2019/04/17 15:25:33.865, 77.53 W, 1530 MHz, 877 MHz, 1530 MHz
2, 2019/04/17 15:25:33.868, 74.54 W, 1530 MHz, 877 MHz, 1530 MHz
3, 2019/04/17 15:25:33.870, 146.91 W, 1530 MHz, 877 MHz, 1530 MHz
4, 2019/04/17 15:25:33.873, 143.57 W, 1530 MHz, 877 MHz, 1530 MHz
5, 2019/04/17 15:25:33.875, 76.06 W, 1530 MHz, 877 MHz, 1530 MHz
6, 2019/04/17 15:25:33.878, 77.58 W, 1530 MHz, 877 MHz, 1530 MHz
7, 2019/04/17 15:25:33.881, 74.15 W, 1530 MHz, 877 MHz, 1530 MHz
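For longer-term logging, the same query can be sampled from a script and appended to a file. This is a sketch under stated assumptions: `read_samples` and `append_samples` are hypothetical helpers, the log path is up to the caller, and the `run` parameter exists only so the command can be replaced by a stub on machines without a GPU.

```python
import subprocess

QUERY = "index,timestamp,power.draw,clocks.sm,clocks.mem,clocks.gr"


def read_samples(run=subprocess.check_output):
    """Run nvidia-smi once and return its CSV rows (without header)."""
    cmd = ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"]
    output = run(cmd, universal_newlines=True)
    return [line.strip() for line in output.splitlines() if line.strip()]


def append_samples(path, rows):
    """Append rows to a log file, one sample per line; returns rows written."""
    with open(path, "a", encoding="utf-8") as log:
        for row in rows:
            log.write(row + "\n")
    return len(rows)
```

Calling `append_samples("gpu_power.csv", read_samples())` from a loop or a cron job builds up a history that can be plotted or compared across training runs.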
Tensorflow Tracing
# Uses the TensorFlow 1.x API (tf.Session, tf.random_uniform)
import tensorflow as tf
from tensorflow.python.client import timeline

shape = (5000, 5000)
device_name = "/gpu:0"

# Build the graph on the chosen device: multiply two random matrices
with tf.device(device_name):
    random_matrix = tf.random_uniform(shape=shape, minval=0, maxval=1)
    random_matrix2 = tf.random_uniform(shape=shape, minval=0, maxval=1)
    dot_operation = tf.matmul(random_matrix, tf.transpose(random_matrix2))

with tf.Session() as sess:
    # add options to trace the session execution
    options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()
    result = sess.run(dot_operation, options=options, run_metadata=run_metadata)
    print(result)

    # Create the Timeline object and write it to a json file
    fetched_timeline = timeline.Timeline(run_metadata.step_stats)
    chrome_trace = fetched_timeline.generate_chrome_trace_format()
    with open('timeline_01.json', 'w') as f:
        f.write(chrome_trace)
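The resulting `timeline_01.json` can be opened in Chrome's tracing viewer, and it can also be summarized in a few lines of Python. The sketch below assumes the file follows the Chrome trace event format, with a top-level `traceEvents` list in which complete events carry `"ph": "X"` and a `dur` in microseconds. `op_durations` and `slowest_ops` are hypothetical helper names.

```python
import json
from collections import defaultdict


def op_durations(trace):
    """Sum the duration (microseconds) of complete events per event name."""
    totals = defaultdict(int)
    for event in trace.get("traceEvents", []):
        # "X" marks a complete event with a start time and a duration
        if event.get("ph") == "X":
            totals[event["name"]] += event.get("dur", 0)
    return dict(totals)


def slowest_ops(path, top=5):
    """Load a trace file and return the top (name, microseconds) pairs."""
    with open(path, encoding="utf-8") as f:
        totals = op_durations(json.load(f))
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:top]
```

Running `slowest_ops("timeline_01.json")` after the tracing script gives a quick text view of where the time went without opening a browser.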
Tensorflow Tracing
DEMO
Questions?

More Related Content

What's hot

Deep Learning Tutorial | Deep Learning TensorFlow | Deep Learning With Neural...
Deep Learning Tutorial | Deep Learning TensorFlow | Deep Learning With Neural...Deep Learning Tutorial | Deep Learning TensorFlow | Deep Learning With Neural...
Deep Learning Tutorial | Deep Learning TensorFlow | Deep Learning With Neural...
Simplilearn
 
NVIDIA GTC 2020 October Summary
NVIDIA GTC 2020 October SummaryNVIDIA GTC 2020 October Summary
NVIDIA GTC 2020 October Summary
NVIDIA
 
Rapids: Data Science on GPUs
Rapids: Data Science on GPUsRapids: Data Science on GPUs
Rapids: Data Science on GPUs
inside-BigData.com
 
Memory Efficient Graph Convolutional Network based Distributed Link Prediction
Memory Efficient Graph Convolutional Network based Distributed Link PredictionMemory Efficient Graph Convolutional Network based Distributed Link Prediction
Memory Efficient Graph Convolutional Network based Distributed Link Prediction
miyurud
 

What's hot (20)

[PR12] PixelRNN- Jaejun Yoo
[PR12] PixelRNN- Jaejun Yoo[PR12] PixelRNN- Jaejun Yoo
[PR12] PixelRNN- Jaejun Yoo
 
rnn BASICS
rnn BASICSrnn BASICS
rnn BASICS
 
Attention Is All You Need
Attention Is All You NeedAttention Is All You Need
Attention Is All You Need
 
Deep Learning Tutorial | Deep Learning TensorFlow | Deep Learning With Neural...
Deep Learning Tutorial | Deep Learning TensorFlow | Deep Learning With Neural...Deep Learning Tutorial | Deep Learning TensorFlow | Deep Learning With Neural...
Deep Learning Tutorial | Deep Learning TensorFlow | Deep Learning With Neural...
 
Deep Generative Models
Deep Generative Models Deep Generative Models
Deep Generative Models
 
Recursive Neural Networks
Recursive Neural NetworksRecursive Neural Networks
Recursive Neural Networks
 
NVIDIA GTC 2020 October Summary
NVIDIA GTC 2020 October SummaryNVIDIA GTC 2020 October Summary
NVIDIA GTC 2020 October Summary
 
Graph Neural Network - Introduction
Graph Neural Network - IntroductionGraph Neural Network - Introduction
Graph Neural Network - Introduction
 
Optimization for Deep Learning
Optimization for Deep LearningOptimization for Deep Learning
Optimization for Deep Learning
 
Neuromorphic computing for neural networks
Neuromorphic computing for neural networksNeuromorphic computing for neural networks
Neuromorphic computing for neural networks
 
RNN & LSTM: Neural Network for Sequential Data
RNN & LSTM: Neural Network for Sequential DataRNN & LSTM: Neural Network for Sequential Data
RNN & LSTM: Neural Network for Sequential Data
 
On-device ML with TFLite
On-device ML with TFLiteOn-device ML with TFLite
On-device ML with TFLite
 
Transformer Introduction (Seminar Material)
Transformer Introduction (Seminar Material)Transformer Introduction (Seminar Material)
Transformer Introduction (Seminar Material)
 
Rapids: Data Science on GPUs
Rapids: Data Science on GPUsRapids: Data Science on GPUs
Rapids: Data Science on GPUs
 
Introduction to Spiking Neural Networks: From a Computational Neuroscience pe...
Introduction to Spiking Neural Networks: From a Computational Neuroscience pe...Introduction to Spiking Neural Networks: From a Computational Neuroscience pe...
Introduction to Spiking Neural Networks: From a Computational Neuroscience pe...
 
Natural Language Processing (NLP) - Introduction
Natural Language Processing (NLP) - IntroductionNatural Language Processing (NLP) - Introduction
Natural Language Processing (NLP) - Introduction
 
Transformers AI PPT.pptx
Transformers AI PPT.pptxTransformers AI PPT.pptx
Transformers AI PPT.pptx
 
Recurrent Neural Networks
Recurrent Neural NetworksRecurrent Neural Networks
Recurrent Neural Networks
 
LSTM Tutorial
LSTM TutorialLSTM Tutorial
LSTM Tutorial
 
Memory Efficient Graph Convolutional Network based Distributed Link Prediction
Memory Efficient Graph Convolutional Network based Distributed Link PredictionMemory Efficient Graph Convolutional Network based Distributed Link Prediction
Memory Efficient Graph Convolutional Network based Distributed Link Prediction
 

Similar to Monitoring of GPU Usage with Tensorflow Models Using Prometheus

GTC Taiwan 2017 在 Google Cloud 當中使用 GPU 進行效能最佳化
GTC Taiwan 2017 在 Google Cloud 當中使用 GPU 進行效能最佳化GTC Taiwan 2017 在 Google Cloud 當中使用 GPU 進行效能最佳化
GTC Taiwan 2017 在 Google Cloud 當中使用 GPU 進行效能最佳化
NVIDIA Taiwan
 

Similar to Monitoring of GPU Usage with Tensorflow Models Using Prometheus (20)

Implementing AI: High Performace Architectures
Implementing AI: High Performace ArchitecturesImplementing AI: High Performace Architectures
Implementing AI: High Performace Architectures
 
Infrastructure and Tooling - Full Stack Deep Learning
Infrastructure and Tooling - Full Stack Deep LearningInfrastructure and Tooling - Full Stack Deep Learning
Infrastructure and Tooling - Full Stack Deep Learning
 
HKNOG 6.0 Next Generation Networks - will automation put us out of jobs?
HKNOG 6.0 Next Generation Networks - will automation put us out of jobs?HKNOG 6.0 Next Generation Networks - will automation put us out of jobs?
HKNOG 6.0 Next Generation Networks - will automation put us out of jobs?
 
OpenACC Monthly Highlights September 2020
OpenACC Monthly Highlights September 2020OpenACC Monthly Highlights September 2020
OpenACC Monthly Highlights September 2020
 
Accelerating Data Science With GPUs
Accelerating Data Science With GPUsAccelerating Data Science With GPUs
Accelerating Data Science With GPUs
 
Deep learning at scale in Azure
Deep learning at scale in AzureDeep learning at scale in Azure
Deep learning at scale in Azure
 
Deep learning for FinTech
Deep learning for FinTechDeep learning for FinTech
Deep learning for FinTech
 
infoShare AI Roadshow 2018 - Tomasz Kopacz (Microsoft) - jakie możliwości daj...
infoShare AI Roadshow 2018 - Tomasz Kopacz (Microsoft) - jakie możliwości daj...infoShare AI Roadshow 2018 - Tomasz Kopacz (Microsoft) - jakie możliwości daj...
infoShare AI Roadshow 2018 - Tomasz Kopacz (Microsoft) - jakie możliwości daj...
 
Enabling Artificial Intelligence - Alison B. Lowndes
Enabling Artificial Intelligence - Alison B. LowndesEnabling Artificial Intelligence - Alison B. Lowndes
Enabling Artificial Intelligence - Alison B. Lowndes
 
Introduction to PowerAI - The Enterprise AI Platform
Introduction to PowerAI - The Enterprise AI PlatformIntroduction to PowerAI - The Enterprise AI Platform
Introduction to PowerAI - The Enterprise AI Platform
 
Introducing Amazon EC2 P3 Instance - Featuring the Most Powerful GPU for Mach...
Introducing Amazon EC2 P3 Instance - Featuring the Most Powerful GPU for Mach...Introducing Amazon EC2 P3 Instance - Featuring the Most Powerful GPU for Mach...
Introducing Amazon EC2 P3 Instance - Featuring the Most Powerful GPU for Mach...
 
Cracking the nut, solving edge ai with apache tools and frameworks
Cracking the nut, solving edge ai with apache tools and frameworksCracking the nut, solving edge ai with apache tools and frameworks
Cracking the nut, solving edge ai with apache tools and frameworks
 
Using apache mx net in production deep learning streaming pipelines
Using apache mx net in production deep learning streaming pipelinesUsing apache mx net in production deep learning streaming pipelines
Using apache mx net in production deep learning streaming pipelines
 
Harnessing the virtual realm for successful real world artificial intelligence
Harnessing the virtual realm for successful real world artificial intelligenceHarnessing the virtual realm for successful real world artificial intelligence
Harnessing the virtual realm for successful real world artificial intelligence
 
Session 1 - The Current Landscape of Big Data Benchmarks
Session 1 - The Current Landscape of Big Data BenchmarksSession 1 - The Current Landscape of Big Data Benchmarks
Session 1 - The Current Landscape of Big Data Benchmarks
 
Nervana and the Future of Computing
Nervana and the Future of ComputingNervana and the Future of Computing
Nervana and the Future of Computing
 
Explore Deep Learning Architecture using Tensorflow 2.0 now! Part 2
Explore Deep Learning Architecture using Tensorflow 2.0 now! Part 2Explore Deep Learning Architecture using Tensorflow 2.0 now! Part 2
Explore Deep Learning Architecture using Tensorflow 2.0 now! Part 2
 
2016 06 nvidia-isc_supercomputing_car_v02
2016 06 nvidia-isc_supercomputing_car_v022016 06 nvidia-isc_supercomputing_car_v02
2016 06 nvidia-isc_supercomputing_car_v02
 
RAPIDS – Open GPU-accelerated Data Science
RAPIDS – Open GPU-accelerated Data ScienceRAPIDS – Open GPU-accelerated Data Science
RAPIDS – Open GPU-accelerated Data Science
 
GTC Taiwan 2017 在 Google Cloud 當中使用 GPU 進行效能最佳化
GTC Taiwan 2017 在 Google Cloud 當中使用 GPU 進行效能最佳化GTC Taiwan 2017 在 Google Cloud 當中使用 GPU 進行效能最佳化
GTC Taiwan 2017 在 Google Cloud 當中使用 GPU 進行效能最佳化
 

More from Databricks

Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Valters Lauzums
 
如何办理澳洲悉尼大学毕业证(USYD毕业证书)学位证成绩单原版一比一
如何办理澳洲悉尼大学毕业证(USYD毕业证书)学位证成绩单原版一比一如何办理澳洲悉尼大学毕业证(USYD毕业证书)学位证成绩单原版一比一
如何办理澳洲悉尼大学毕业证(USYD毕业证书)学位证成绩单原版一比一
hwhqz6r1y
 
Toko Jual Viagra Asli Di Salatiga 081229400522 Obat Kuat Viagra
Toko Jual Viagra Asli Di Salatiga 081229400522 Obat Kuat ViagraToko Jual Viagra Asli Di Salatiga 081229400522 Obat Kuat Viagra
Toko Jual Viagra Asli Di Salatiga 081229400522 Obat Kuat Viagra
adet6151
 
如何办理英国卡迪夫大学毕业证(Cardiff毕业证书)成绩单留信学历认证
如何办理英国卡迪夫大学毕业证(Cardiff毕业证书)成绩单留信学历认证如何办理英国卡迪夫大学毕业证(Cardiff毕业证书)成绩单留信学历认证
如何办理英国卡迪夫大学毕业证(Cardiff毕业证书)成绩单留信学历认证
ju0dztxtn
 
如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一
如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一
如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一
0uyfyq0q4
 
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
Amil baba
 
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
pyhepag
 
1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证
1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证
1:1原版定制利物浦大学毕业证(Liverpool毕业证)成绩单学位证书留信学历认证
ppy8zfkfm
 
一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理
pyhepag
 
一比一原版麦考瑞大学毕业证成绩单如何办理
一比一原版麦考瑞大学毕业证成绩单如何办理一比一原版麦考瑞大学毕业证成绩单如何办理
一比一原版麦考瑞大学毕业证成绩单如何办理
cyebo
 
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotecAbortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Exploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptxExploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptx
DilipVasan
 
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
pyhepag
 

Recently uploaded (20)

Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
 
如何办理澳洲悉尼大学毕业证(USYD毕业证书)学位证成绩单原版一比一
如何办理澳洲悉尼大学毕业证(USYD毕业证书)学位证成绩单原版一比一如何办理澳洲悉尼大学毕业证(USYD毕业证书)学位证成绩单原版一比一
如何办理澳洲悉尼大学毕业证(USYD毕业证书)学位证成绩单原版一比一
 
Toko Jual Viagra Asli Di Salatiga 081229400522 Obat Kuat Viagra
Toko Jual Viagra Asli Di Salatiga 081229400522 Obat Kuat ViagraToko Jual Viagra Asli Di Salatiga 081229400522 Obat Kuat Viagra
Toko Jual Viagra Asli Di Salatiga 081229400522 Obat Kuat Viagra
 
社内勉強会資料  Mamba - A new era or ephemeral
社内勉強会資料   Mamba - A new era or ephemeral社内勉強会資料   Mamba - A new era or ephemeral
社内勉強会資料  Mamba - A new era or ephemeral
 
如何办理英国卡迪夫大学毕业证(Cardiff毕业证书)成绩单留信学历认证
如何办理英国卡迪夫大学毕业证(Cardiff毕业证书)成绩单留信学历认证如何办理英国卡迪夫大学毕业证(Cardiff毕业证书)成绩单留信学历认证
如何办理英国卡迪夫大学毕业证(Cardiff毕业证书)成绩单留信学历认证
 
如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一
如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一

Monitoring of GPU Usage with Tensorflow Models Using Prometheus

  • 1. MONITORING OF GPU USAGE WITH TENSORFLOW MODEL TRAINING USING PROMETHEUS Diane Feddema, Principal Software Engineer Zak Hassan, Senior Software Engineer #RED_HAT #AICOE #CTO_OFFICE
  • 2. YOUR SPEAKERS
    DIANE FEDDEMA, PRINCIPAL SOFTWARE ENGINEER - ARTIFICIAL INTELLIGENCE CENTER OF EXCELLENCE, CTO OFFICE
    ● Currently focused on developing and applying data science and machine learning techniques for performance analysis, automating these analyses, and displaying data in novel ways.
    ● Previously worked as a performance engineer at the National Center for Atmospheric Research (NCAR), working on optimization and tuning of parallel global climate models.
    ZAK HASSAN, SENIOR SOFTWARE ENGINEER - ARTIFICIAL INTELLIGENCE CENTER OF EXCELLENCE, CTO OFFICE
    ● Leading the log anomaly detection project within the AIOps team and building a user feedback service to improve the accuracy of machine learning predictions.
    ● Developing data science apps and working on improved observability of machine learning systems such as Spark and TensorFlow.
  • 3. Outline
    ● Story
    ● Concepts
      ○ Comparing CPU vs GPU
      ○ What is CUDA and anatomy of CUDA on Kubernetes
      ○ Monitoring GPU and custom metrics with Pushgateway
      ○ TF with Prometheus integration
      ○ What is TensorFlow and PyTorch
      ○ A PyTorch example from MLPerf
      ○ TensorFlow tracing
    ● Examples:
      ○ Running Jupyter (CPU, GPU, targeting specific GPU type)
      ○ Mounting training data into notebook/TF job
      ○ Uses of nvidia-smi
    ● Demo
      ○ Running Detectron on a Tesla V100 with Prometheus & Grafana monitoring
  • 4. “Design the factory like you would design an advanced computer… In fact use engineers that are used to doing that and have them work on this.” -- Elon Musk (2016) https://youtu.be/f9uveu-c5us Source: https://flic.kr/p/chEftd
  • 5. WHY IS DEEP LEARNING A BIG DEAL?
    ● Mobile
      ○ unlocking phones
    ● Online
      ○ Netflix.com
      ○ Amazon.com
      ○ Targeted ads
    ● Automotive
      ○ self driving
      ○ voice assistant
  • 8. PARALLEL PROCESSING: MOST LANGUAGES SUPPORT IT
    ● MODERN HARDWARE SUPPORTS EXECUTION OF PARALLEL PROCESSES/THREADS, AND LANGUAGES HAVE APIS TO SPAWN PROCESSES IN PARALLEL
    ● YOUR ONLY LIMIT IS HOW MANY CPU CORES YOU HAVE ON YOUR MACHINE
    ● CPU USED TO BE A KEY COMPONENT OF HPC
    ● GPU HAS A DIFFERENT ARCHITECTURE & # OF CORES
    Diagram labels: CPU, INSTRUCTION MEMORY, DATA MEMORY, INPUT/OUTPUT, ARITHMETIC LOGIC UNIT, CONTROL UNIT
  • 12. Hardware accelerators
    ● GPU
      ○ CUDA
      ○ OpenCL
    ● TPU
  • 15. WHAT IS CUDA? PROPRIETARY TOOLING
    ● Hardware/software for HPC
    ● Prerequisite: you need NVIDIA graphics cards that support CUDA
    ● ML frameworks like TensorFlow, Theano and PyTorch use CUDA for hardware acceleration
    ● You may get 10x faster performance for machine learning jobs by utilizing CUDA
  • 16. ANATOMY OF A CUDA WORKLOAD ON K8S
    Diagram labels: JUPYTER, TENSORFLOW and CUDA LIBS (CONTAINER); CONTAINER RUNTIME; NVIDIA LIBS; HOST OS; SERVER; GPU exposed as /dev/nvidiaX (HARDWARE)
  • 17. CLI monitoring tool: nvidia-smi
    ● Tool used to display usage metrics for what is running on your GPU.
  • 19. Idle GPU Alert
    ● Alert Manager can notify via:
      ○ Slack chat notification
      ○ email
      ○ webhook
      ○ more
    ● Get notified when your GPU isn’t being utilized and shut down your VMs in the cloud to save on cost.
    groups:
    - name: nvidia_gpu.rules
      rules:
      - alert: UnusedResources
        expr: nvidia_gpu_duty_cycle == 0
        for: 10m
        labels:
          severity: critical
        annotations:
          description: GPU is not being utilized; you should scale down your GPU node
          summary: GPU node isn't being utilized
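The rule above only fires once `nvidia_gpu_duty_cycle == 0` has held for 10 minutes. As a rough illustration of that "for: 10m" behaviour, here is a minimal Python sketch. It is not how Prometheus evaluates rules internally; the function name `idle_alert_fires` and the `(timestamp, duty_cycle)` sample layout are assumptions made for this example.

```python
from typing import Iterable, Tuple


def idle_alert_fires(samples: Iterable[Tuple[float, float]],
                     hold_seconds: float = 600.0) -> bool:
    """Approximate `expr: duty_cycle == 0` combined with `for: 10m`.

    samples: (unix_timestamp_seconds, duty_cycle_percent) pairs.
    Returns True when the duty cycle has been 0 without interruption
    for at least hold_seconds, measured up to the newest sample.
    """
    idle_since = None  # start of the current uninterrupted idle streak
    last_ts = None
    for ts, duty_cycle in sorted(samples):
        if duty_cycle == 0:
            if idle_since is None:
                idle_since = ts
        else:
            idle_since = None  # any activity resets the streak
        last_ts = ts
    return idle_since is not None and (last_ts - idle_since) >= hold_seconds


# Hypothetical scrape results, one sample per minute for ten minutes.
always_idle = [(t, 0) for t in range(0, 601, 60)]
print(idle_alert_fires(always_idle))
```

A single busy sample inside the window resets the streak, which is why a short burst of GPU work keeps the alert from firing.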
  • 23. Jupyter + TF on CPU
    apiVersion: v1
    kind: Pod
    metadata:
      name: jupyter-tf-gpu
    spec:
      restartPolicy: OnFailure
      containers:
      - name: jupyter-tf-gpu
        image: "quay.io/zmhassan/fedora28:tensorflow-cpu-2.0.0-alpha0"
  • 24. Jupyter + TF on GPU
    apiVersion: v1
    kind: Pod
    metadata:
      name: jupyter-tf-gpu
    spec:
      restartPolicy: OnFailure
      containers:
      - name: jupyter-tf-gpu
        image: "tensorflow/tensorflow:nightly-gpu-py3-jupyter"
        resources:
          limits:
            nvidia.com/gpu: 1
  • 25. Specific GPU Node Target
    apiVersion: v1
    kind: Pod
    metadata:
      name: jupyter-tf-gpu
    spec:
      containers:
      - name: jupyter-tf-gpu
        image: "tensorflow/tensorflow:nightly-gpu-py3-jupyter"
        resources:
          limits:
            nvidia.com/gpu: 1
      nodeSelector:
        accelerator: nvidia-tesla-v100
  • 26. Relabel Kubernetes node
    kubectl label node <node_name> accelerator=nvidia-tesla-k80
    # or
    kubectl label node <node_name> accelerator=nvidia-tesla-v100
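The GPU pod manifests on the preceding slides differ only in the image, the GPU limit, and the optional node selector. A small Python sketch can generate them from parameters. The helper name `gpu_pod_manifest` is hypothetical; the field names and label values are taken from the slides.

```python
import json
from typing import Optional


def gpu_pod_manifest(name: str, image: str, gpus: int = 1,
                     accelerator: Optional[str] = None) -> dict:
    """Build a Pod manifest that requests `gpus` NVIDIA GPUs.

    When `accelerator` is given, a nodeSelector is added so the pod only
    lands on nodes labelled accelerator=<value>.
    """
    pod = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "OnFailure",
            "containers": [{
                "name": name,
                "image": image,
                "resources": {"limits": {"nvidia.com/gpu": gpus}},
            }],
        },
    }
    if accelerator is not None:
        # nodeSelector sits at the pod spec level, not inside the container.
        pod["spec"]["nodeSelector"] = {"accelerator": accelerator}
    return pod


manifest = gpu_pod_manifest(
    "jupyter-tf-gpu",
    "tensorflow/tensorflow:nightly-gpu-py3-jupyter",
    accelerator="nvidia-tesla-v100",
)
print(json.dumps(manifest, indent=2))
```

kubectl accepts JSON manifests as well as YAML, so the printed output can be saved to a file and applied with `kubectl apply -f`.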
  • 27. Mount Training Data
    ● AzureDisk
    ● GlusterFS
    ● NFS
    ● AzureFile
    ● GCE Persistent Disk
    ● AWS Elastic Block Store
    ● CephFS
    ● … more
  • 28. Persistent Volume Claim
    ● Native k8s resource
    ● Lets you access a PV
    ● Can be used to share data across different pods
    kind: PersistentVolumeClaim
    apiVersion: v1
    metadata:
      name: nfs
    spec:
      accessModes:
      - ReadWriteMany
      storageClassName: ""
      resources:
        requests:
          storage: 100Gi
  • 29. Persistent Volume
    ● Native k8s resource
    ● Access mode can be ReadOnlyMany, ReadWriteOnce or ReadWriteMany
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: nfs
    spec:
      capacity:
        storage: 100Gi
      accessModes:
      - ReadWriteMany
      nfs:
        server: 0.0.0.0
        path: "/"
  • 30. Mounting Training Data
    ● Use persistent volume claims to access your data
    ● In this example we use NFS, but you can choose another type
    apiVersion: v1
    kind: Pod
    metadata:
      name: jp-notebook
    spec:
      containers:
      - name: jp-notebook
        image: tensorflow/tensorflow:nightly-gpu-py3-jupyter
        volumeMounts:
        - name: my-pvc-nfs
          mountPath: "/tf/data"
      volumes:
      - name: my-pvc-nfs
        persistentVolumeClaim:
          claimName: nfs
  • 31. Additional Tips
    ● Kubernetes doesn’t support sharing GPUs
    ● If you’re running in the cloud, consider stopping your VM when no workloads are running and restarting it when you need it. The costs can add up.
    ● Use volumes to mount your data for training and share it across your environment
  • 32. Monitoring and Performance of ML on GPUs
    ● Benchmarking ML on GPUs
      ○ Monitoring
      ○ Performance
    ● Example using MLPerf together with Prometheus and Grafana
    ● Computing requirements & why GPUs for ML
  • 33. Why do we need GPUs to solve these problems?
    ● Neural networks rely heavily on floating point matrix multiplication
    ● These algorithms also require a lot of training data, large memory (GBs) and high-speed networks to complete in a reasonable amount of time
    ● Faster deep learning training
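To put a number on "rely heavily on floating point matrix multiplication", the sketch below estimates the floating point operations in one matrix product, using the 5000 x 5000 shape from the TensorFlow tracing slide. The function names are hypothetical, the 2·m·k·n count is the usual approximation (one multiply and one add per inner-product term), and the throughput figure is an assumed round number rather than a measurement of any device.

```python
def matmul_flops(m: int, k: int, n: int) -> int:
    """Approximate FLOPs for an (m x k) @ (k x n) product.

    Each of the m*n output entries needs about k multiplies and k adds.
    """
    return 2 * m * k * n


def seconds_at(flops: float, flops_per_second: float) -> float:
    """Ideal runtime if the hardware sustained flops_per_second."""
    return flops / flops_per_second


flops = matmul_flops(5000, 5000, 5000)
print(f"{flops:.3e} FLOPs for one 5000x5000 product")

# Assumed sustained throughput of 1e12 FLOP/s, for illustration only.
print(f"{seconds_at(flops, 1e12):.2f} s at 1 TFLOP/s")
```

Training repeats products like this many times per step, so sustained throughput dominates the total training time.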
  • 34. Nvidia DGX-2
    [Diagram: 16 V100 GPUs, each with its own DRAM] Source: Nvidia
  • 35. Benchmarks in MLPerf
    Application Area: Vision; Language; Commerce; Reinforcement Learning
    Problem: Image classification; Object Detection (light weight and heavy weight); Translation; Recommendations; Games Go
    Datasets: ImageNet; COCO; WMT English-German; MovieLens-20M; Go
    Models: ResNet-50; Detectron; Transformer; OpenNMT; Neural Collaborative Filtering; Mini Go
    Metrics: COCO mAP; Prediction accuracy; BLEU; Prediction accuracy; Prediction accuracy; Win/Loss
  • 36. MLPerf Project Sponsors University research contributors Industry contributors
  • 37. What is TensorFlow?
    ● Open source Python library used to implement deep neural networks (released by Google in 2015)
    ● A machine learning framework
    ● Tools to write your own models in Python, JavaScript or Swift
    ● Collection of datasets ready to use with TensorFlow
    ● TF runs in Eager and Graph modes
    ● TF can run on CPUs or GPUs
  • 38. What is PyTorch?
    ● Python-based open source deep learning library
    ● Used to build neural networks
    ● Replacement for NumPy for use with GPUs
    ● Can run on CPUs or GPUs
    ● Uses GPUs to accelerate numerical computations
    ● PyTorch performs computations
  • 39. COCO Dataset
    ● 85,000 images
    ● Identify 91 objects
    Source: Cornell Project
  • 41. MLPerf Results
    Source: Nvidia Developer News Dec 2018
  • 42. MLPerf Results - Single Node
    Source: Nvidia Developer News Dec 2018
  • 43. How to monitor GPUs with nvidia-smi
    $ nvidia-smi --query-gpu=timestamp,name,pci.bus_id,driver_version,pstate,pcie.link.gen.max,pcie.link.gen.current,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv -l 5
  • 44. Monitoring GPUs with nvidia-smi
    $ nvidia-smi --query-gpu=timestamp,name,pci.bus_id,driver_version,pstate,pcie.link.gen.max,pcie.link.gen.current,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv -l 5
    2019/04/17 14:41:35.223, Tesla V100-SXM2-32GB, 00000000:06:00.0, 418.40.04, P0, 3, 3, 44, 100 %, 0 %, 32480 MiB, 24052 MiB, 8428 MiB
    2019/04/17 14:41:35.225, Tesla V100-SXM2-32GB, 00000000:07:00.0, 418.40.04, P0, 3, 3, 48, 100 %, 0 %, 32480 MiB, 14565 MiB, 17915 MiB
    2019/04/17 14:41:35.227, Tesla V100-SXM2-32GB, 00000000:0A:00.0, 418.40.04, P0, 3, 3, 47, 100 %, 0 %, 32480 MiB, 15773 MiB, 16707 MiB
    2019/04/17 14:41:35.229, Tesla V100-SXM2-32GB, 00000000:0B:00.0, 418.40.04, P0, 3, 3, 43, 100 %, 0 %, 32480 MiB, 14363 MiB, 18117 MiB
    2019/04/17 14:41:35.231, Tesla V100-SXM2-32GB, 00000000:85:00.0, 418.40.04, P0, 3, 3, 46, 100 %, 0 %, 32480 MiB, 13363 MiB, 19117 MiB
    2019/04/17 14:41:35.233, Tesla V100-SXM2-32GB, 00000000:86:00.0, 418.40.04, P0, 3, 3, 46, 100 %, 0 %, 32480 MiB, 14719 MiB, 17761 MiB
    2019/04/17 14:41:35.234, Tesla V100-SXM2-32GB, 00000000:89:00.0, 418.40.04, P0, 3, 3, 49, 100 %, 0 %, 32480 MiB, 15861 MiB, 16619 MiB
    2019/04/17 14:41:35.236, Tesla V100-SXM2-32GB, 00000000:8A:00.0, 418.40.04, P0, 3, 3, 44, 100 %, 0 %, 32480 MiB, 12317 MiB, 20163 MiB
    2019/04/17 14:41:40.239, Tesla V100-SXM2-32GB, 00000000:06:00.0, 418.40.04, P0, 3, 3, 44, 100 %, 0 %, 32480 MiB, 24052 MiB, 8428 MiB
    2019/04/17 14:41:40.240, Tesla V100-SXM2-32GB, 00000000:07:00.0, 418.40.04, P0, 3, 3, 48, 100 %, 1 %, 32480 MiB, 14565 MiB, 17915 MiB
    2019/04/17 14:41:40.240, Tesla V100-SXM2-32GB, 00000000:0A:00.0, 418.40.04, P0, 3, 3, 47, 100 %, 1 %, 32480 MiB, 15773 MiB, 16707 MiB
    2019/04/17 14:41:40.241, Tesla V100-SXM2-32GB, 00000000:0B:00.0, 418.40.04, P0, 3, 3, 43, 100 %, 1 %, 32480 MiB, 14363 MiB, 18117 MiB
    Columns, in order: timestamp, name, pci.bus_id, driver_version, pstate, pcie.link.gen.max, pcie.link.gen.current, temperature.gpu, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
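The CSV rows printed by nvidia-smi are easy to post-process. The following is a minimal Python sketch that turns rows like the ones on this slide into dictionaries with numeric fields. The function name `parse_nvidia_smi_csv` is hypothetical, and the field list assumes exactly the `--query-gpu` order used above. Values such as `[Not Supported]` are not handled.

```python
import csv
import io

# Field order must match the --query-gpu list passed to nvidia-smi.
FIELDS = [
    "timestamp", "name", "pci.bus_id", "driver_version", "pstate",
    "pcie.link.gen.max", "pcie.link.gen.current", "temperature.gpu",
    "utilization.gpu", "utilization.memory",
    "memory.total", "memory.free", "memory.used",
]
NUMERIC = FIELDS[5:]  # everything from pcie.link.gen.max onwards


def parse_nvidia_smi_csv(text: str) -> list:
    """Parse `nvidia-smi --format=csv` rows into a list of dicts.

    Units ("%", "MiB") are dropped and numeric fields become ints.
    """
    rows = []
    for raw in csv.reader(io.StringIO(text), skipinitialspace=True):
        # Skip blank lines and the optional header row.
        if len(raw) != len(FIELDS) or raw[0].startswith("timestamp"):
            continue
        row = dict(zip(FIELDS, (value.strip() for value in raw)))
        for key in NUMERIC:
            row[key] = int(row[key].split()[0])  # "8428 MiB" -> 8428
        rows.append(row)
    return rows


sample = (
    "2019/04/17 14:41:35.223, Tesla V100-SXM2-32GB, 00000000:06:00.0, "
    "418.40.04, P0, 3, 3, 44, 100 %, 0 %, 32480 MiB, 24052 MiB, 8428 MiB\n"
)
for gpu in parse_nvidia_smi_csv(sample):
    print(gpu["pci.bus_id"], gpu["utilization.gpu"], gpu["memory.used"])
```

Once the values are numeric, they can be checked against thresholds or exported as custom metrics.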
  • 45. How to log nvidia-smi metric data (long/short term logging)
    [cephagent@asgnode021 object_detection]$ nvidia-smi --query-gpu=index,timestamp,power.draw,clocks.sm,clocks.mem,clocks.gr --format=csv
    index, timestamp, power.draw [W], clocks.current.sm [MHz], clocks.current.memory [MHz], clocks.current.graphics [MHz]
    0, 2019/04/17 15:25:33.862, 68.71 W, 1530 MHz, 877 MHz, 1530 MHz
    1, 2019/04/17 15:25:33.865, 77.53 W, 1530 MHz, 877 MHz, 1530 MHz
    2, 2019/04/17 15:25:33.868, 74.54 W, 1530 MHz, 877 MHz, 1530 MHz
    3, 2019/04/17 15:25:33.870, 146.91 W, 1530 MHz, 877 MHz, 1530 MHz
    4, 2019/04/17 15:25:33.873, 143.57 W, 1530 MHz, 877 MHz, 1530 MHz
    5, 2019/04/17 15:25:33.875, 76.06 W, 1530 MHz, 877 MHz, 1530 MHz
    6, 2019/04/17 15:25:33.878, 77.58 W, 1530 MHz, 877 MHz, 1530 MHz
    7, 2019/04/17 15:25:33.881, 74.15 W, 1530 MHz, 877 MHz, 1530 MHz
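Readings like these can be turned into custom metrics for the Pushgateway mentioned in the outline. The sketch below uses only the Python standard library to render per-GPU values in the Prometheus text exposition format and, optionally, to send them to a Pushgateway. The metric name `nvidia_gpu_power_draw_watts`, the job name, and the gateway URL are assumptions for illustration; the talk may well have used a client library instead.

```python
import urllib.request
from typing import Mapping


def to_exposition(metric: str, help_text: str,
                  per_gpu: Mapping[int, float]) -> str:
    """Render one gauge with a `gpu` label in Prometheus text format."""
    lines = [f"# HELP {metric} {help_text}", f"# TYPE {metric} gauge"]
    for gpu_index in sorted(per_gpu):
        lines.append(f'{metric}{{gpu="{gpu_index}"}} {per_gpu[gpu_index]}')
    # The Pushgateway expects the body to end with a newline.
    return "\n".join(lines) + "\n"


def push(body: str, gateway_url: str, job: str) -> int:
    """PUT the metrics to <gateway_url>/metrics/job/<job>; returns HTTP status."""
    request = urllib.request.Request(
        f"{gateway_url}/metrics/job/{job}",
        data=body.encode("utf-8"),
        method="PUT",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Power draw of the first two GPUs from the slide above.
body = to_exposition(
    "nvidia_gpu_power_draw_watts",  # hypothetical metric name
    "GPU power draw in watts",
    {0: 68.71, 1: 77.53},
)
print(body)
# push(body, "http://pushgateway.example:9091", "gpu_logger")  # hypothetical URL and job
```

The `push` call is left commented out because it needs a reachable Pushgateway. Prometheus then scrapes the gateway like any other target.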
  • 46. TensorFlow Tracing
    # Uses the TensorFlow 1.x API (tf.Session, tf.random_uniform).
    import tensorflow as tf
    import numpy as np
    from tensorflow.python.client import timeline

    shape = (5000, 5000)
    device_name = "/gpu:0"

    # Two random 5000x5000 matrices and their product.
    random_matrix = tf.random_uniform(shape=shape, minval=0, maxval=1)
    random_matrix2 = tf.random_uniform(shape=shape, minval=0, maxval=1)
    dot_operation = tf.matmul(random_matrix, tf.transpose(random_matrix2))

    with tf.Session() as sess:
        # add options to trace the session execution
        options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
        run_metadata = tf.RunMetadata()
        result = sess.run(dot_operation, options=options, run_metadata=run_metadata)
        print(result)

        # Create the Timeline object and write it to a json file
        fetched_timeline = timeline.Timeline(run_metadata.step_stats)
        chrome_trace = fetched_timeline.generate_chrome_trace_format()
        with open('timeline_01.json', 'w') as f:
            f.write(chrome_trace)
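The resulting `timeline_01.json` can be opened in Chrome's tracing viewer, or summarized with a few lines of Python. The sketch below assumes the file follows the Chrome trace event format, with a top-level `traceEvents` list whose complete events have `"ph": "X"` and a duration `dur` in microseconds. The function name and the synthetic events are made up for illustration and are not output from the run above.

```python
import json
from collections import defaultdict


def summarize_trace(trace_json: str, top: int = 5) -> list:
    """Return the `top` event names by total duration (microseconds)."""
    events = json.loads(trace_json).get("traceEvents", [])
    totals = defaultdict(int)
    for event in events:
        if event.get("ph") == "X":  # complete events carry a duration
            totals[event.get("name", "?")] += event.get("dur", 0)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:top]


# Synthetic trace for illustration only.
synthetic = json.dumps({"traceEvents": [
    {"ph": "M", "name": "process_name", "pid": 0},
    {"ph": "X", "name": "MatMul", "ts": 0, "dur": 1200},
    {"ph": "X", "name": "Transpose", "ts": 10, "dur": 400},
    {"ph": "X", "name": "MatMul", "ts": 1300, "dur": 300},
]})
for name, microseconds in summarize_trace(synthetic):
    print(f"{name}: {microseconds} us")
```

For a real trace, read the file contents and pass them in, for example `summarize_trace(open('timeline_01.json').read())`.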
  • 48. DEMO