Quantizing Deep Networks for Efficient Inference at the Edge
Raghu Krishnamoorthi, Facebook
Questions/Feedback: raghuraman@fb.com
Acknowledgements
• The results presented here are from work done at Google as part of the TensorFlow Lite team and at Facebook as part of the PyTorch team.
• Acknowledging contributions from several colleagues at Google, including Benoit Jacob, Skirmantas Kligys, Dmitry Kalenichenko, Suharsh Sivakumar and Pete Warden.
• Also acknowledging work from colleagues at Facebook: Jongsoo Park, Maxim Naumov, Summer Deng, Marat Dukhan, Bichen Wu, Peizhao Zhang, Jerry Zhang, Dmytro Dzhulgakov, Daya Khudia, Jianyu Huang, James Reed, Mikhail Z, Haixin Liu and Peter Vajda.
Outline
• Motivation
• Quantization: Overview
• Quantizing deep networks
• Post-training quantization
• Quantization-aware training
• Lower precision inference
• Hardware accelerator recommendations
• Model system co-design
• Looking ahead
Motivation (1)
• Data-center power consumption is doubling every year.
Source: Deep Learning Inference in Facebook Data Centers [1]
Motivation (2)
• The number of edge devices is growing rapidly, and many of these devices are resource-constrained.
Source: https://www.statista.com/statistics/471264/iot-number-of-connected-devices-worldwide/
Motivation (3)
• While models are becoming more efficient, high accuracy still implies high complexity.
From: Benchmark Analysis of Representative Deep Neural Network Architectures, Simone Bianco et al. [2]
Quantization
• There are many approaches to the problems outlined above:
• Better hardware accelerators: TPUs => requires new custom hardware
• Optimized kernels: cuDNN, Intel MKL-DNN
• Efficient deep network architectures: NASNet, MobileNet, FBNet => requires new architectures
• A simpler approach that requires neither re-designed models nor new hardware is quantization.
• Quantization refers to techniques that perform computation and storage at reduced precision.
• Works in combination with the approaches above
• Requires optimized kernels to efficiently use existing hardware
Background: Quantization (1)
• Quantization refers to mapping values from fp32 to a lower-precision format.
• Specified by:
• Format
• Mapping type
• Granularity
• Formats (in decreasing precision): fp32, fp16, bfloat16, int8, int4, binary
Image from: https://commons.wikimedia.org/w/index.php?curid=69415943
Background: Quantization (2)
• We also consider different granularities of quantization:
• Per-layer quantization:
• Same mapping for all elements in a layer.
• Per-row / per-channel quantization:
• Quantizer parameters chosen independently for each row (fc layers) or each conv kernel (conv layers); see the sketch below.
• Outlier-aware quantization:
• Separate out the outliers so that lower-precision arithmetic can be used for the bulk of the weights.
• Dense computation for the inliers, with sparse computation for the large-magnitude outliers.
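To make the per-layer vs. per-channel distinction concrete, here is a minimal NumPy sketch. The function names and the symmetric int8 mapping are illustrative assumptions, not the exact scheme used by any particular framework.

```python
import numpy as np

def symmetric_scales(weight, per_channel=True):
    """Toy per-tensor vs. per-channel symmetric int8 quantizer scales.

    weight: 2-D array (out_channels, in_features), as in an fc layer.
    Returns one scale per output channel (per-channel) or a single
    scale for the whole tensor (per-layer).
    """
    if per_channel:
        max_abs = np.abs(weight).max(axis=1)   # one range per row / conv kernel
    else:
        max_abs = np.abs(weight).max()         # one range for the whole layer
    return np.maximum(max_abs, 1e-8) / 127.0   # map |max| to the int8 limit

def quantize_int8(weight, scale):
    # Reshape a per-channel scale vector so it broadcasts across each row.
    s = scale.reshape(-1, 1) if np.ndim(scale) == 1 else scale
    return np.clip(np.round(weight / s), -127, 127).astype(np.int8)

w = np.random.randn(64, 128).astype(np.float32)
wq_per_channel = quantize_int8(w, symmetric_scales(w, per_channel=True))
wq_per_layer = quantize_int8(w, symmetric_scales(w, per_channel=False))
```

Per-channel scales cost almost nothing extra to store, but they track the per-kernel weight ranges much more tightly than a single per-layer scale.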
Modeling quantization during training
• Emulate quantization by quantizing and de-quantizing in succession (see the sketch below)
• Values are still in floating point, but with reduced precision
• x_out = FakeQuant(x) = s · (Clamp(round(x/s) − z) + z) = DeQuant(Quant(x))
• Can also model quantization as a stochastic rounding operation
Fake quantizer (top), showing the quantization of output values. Approximation for purposes of derivative calculation (bottom).
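A minimal sketch of this fake-quantization step, using a straight-through estimator for the rounding derivative. The function name and the [qmin, qmax] clamp values are illustrative assumptions; TensorFlow and PyTorch ship their own fake-quant ops.

```python
import torch

def fake_quant(x, scale, zero_point, qmin=-128, qmax=127):
    """Quantize then de-quantize, keeping values in floating point.

    Mirrors x_out = s * (Clamp(round(x/s) - z) + z) from the slide, with the
    clamp applied in the integer domain.
    """
    q = torch.clamp(torch.round(x / scale) - zero_point, qmin, qmax)
    x_dq = scale * (q + zero_point)
    # Straight-through estimator: the forward pass returns the quantized
    # value, while the backward pass sees an identity, so gradients flow as
    # if no rounding happened (the "approximation for derivative
    # calculation" on the slide).
    return x + (x_dq - x).detach()
```

Production fake-quant ops typically also zero the gradient outside the clamp range; this sketch keeps the simplest straight-through form.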
Quantization: Benefits

| Benefit | Quantization |
|---|---|
| Applicability | Broad applicability across models and use cases |
| Hardware support | Supported by x86, Nvidia Volta, ARM, Mali, QDSP |
| Software support | Kernel libraries widely available |
| Memory size | 4x reduction |
| Memory bandwidth / cache | 4x reduction |
| Compute | 2x to 4x speedup, depending on ISA |
| Power | Typically 4x (dominated by memory access) |

• Comparing float32 implementations with 8-bit inference
Quantization: Challenges

| Challenge | Notes | Mitigation |
|---|---|---|
| Accuracy drop | Loss in accuracy can be too high for certain applications | Quantization-aware training |
| Kernel support | Wide variety of operators + multiple hardware platforms | Improving software tool-chains (e.g. TVM) to handle varied backends |
| "Complexity" | Non-trivial: requires calibration/training in some cases | Support in software packages: TensorRT, TensorFlow and PyTorch |
Quantizing deep networks
Model Quantization: Overview
[Pipeline diagram: Train → Convert for inference → Graph Optimization → Kernel Implementation, shown three times: the float baseline; post-training quantization added at the convert/graph-optimization stage; and quantization-aware training, with fake quantization inserted during training.]
What to quantize?
• Only quantize the parts of the network that contribute significantly to performance
• Use roofline analysis to identify compute-bound vs. memory-bandwidth-bound operations (see the sketch after the table below)
• May need to further reduce the set based on accuracy impact
• There are multiple ways to quantize a network, with different impact:

| Quantization scheme | Memory bandwidth reduction (weights) | Memory bandwidth reduction (activations) | Compute speedup | Notes |
|---|---|---|---|---|
| Weight-only quantization to int8 | 4x | 1x | 1x | Suitable for embedding lookups |
| Dynamic quantization | 4x | 1x | 2x | Suitable for fc layers with small batches |
| Static quantization (int32 accumulators) | 4x | 4x | 2x | Suited for all layers, important for convolutions |
| Static quantization (int16 accumulators) | 4x | 4x | 4x | Requires lower-precision weights/activations |
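To make the roofline screening concrete, here is a back-of-the-envelope sketch; all layer shapes and hardware numbers are made-up assumptions, used only to show the comparison.

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs per byte of weight + activation traffic."""
    return flops / bytes_moved

# Hypothetical fc layer: 1024x1024 weights, batch size 1, fp32.
flops = 2 * 1024 * 1024                      # multiply-accumulates
bytes_moved = (1024 * 1024 + 2 * 1024) * 4   # weights + input/output activations
ai = arithmetic_intensity(flops, bytes_moved)

# Hypothetical accelerator: 4 TFLOP/s compute, 50 GB/s DRAM bandwidth.
machine_balance = 4e12 / 50e9                # FLOPs per byte at the roofline ridge
print("memory-bound" if ai < machine_balance else "compute-bound")
# A memory-bound layer benefits most from weight-only / dynamic quantization,
# while compute-bound convolutions need full static int8 kernels.
```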
Post-training quantization: Weight compression
• The simplest quantization scheme is to compress the weights to lower precision
• Requires no input data and can be done statically as part of preparing a model for inference
• Hardware accelerators can benefit if de-compression is done after the memory access
• Trivial for the case of fp16/int8 quantization of weights
• K-means compression is also supported on select platforms and is amenable to simple de-compression (see the sketch below)
• Scatter/gather operation in select processors
• Supported in CoreML
[Pipeline diagram: quantization applied at the convert-for-inference stage.]
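As an illustration of the K-means variant, here is a toy sketch using scikit-learn; the codebook size, layer shape and function names are arbitrary assumptions, not a production compression scheme.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_compress(w_fp32, n_clusters=256):
    """Toy k-means weight compression: store a small float codebook plus one
    uint8 index per weight. De-compression is a table lookup (a gather),
    which is why scatter/gather hardware support helps."""
    km = KMeans(n_clusters=n_clusters, n_init=4).fit(w_fp32.reshape(-1, 1))
    codebook = km.cluster_centers_.astype(np.float32).ravel()
    indices = km.labels_.astype(np.uint8).reshape(w_fp32.shape)
    return codebook, indices

def kmeans_decompress(codebook, indices):
    return codebook[indices]   # gather: one lookup per weight

w = np.random.randn(128, 256).astype(np.float32)
codebook, idx = kmeans_compress(w)      # ~4x smaller than fp32, plus a ~1 KB codebook
w_hat = kmeans_decompress(codebook, idx)
```

For fp16 weight compression the "codebook" step disappears entirely; the weights are simply cast down and cast back up after the memory access.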
Dynamic quantization
• Dynamic quantization refers to schemes where the activations are read/written in fp32 and are dynamically quantized to lower precision for compute.
• Requires no calibration data
• Data exchanged between operations stays in floating point, so there is no need to worry about format conversion.
• Provides performance improvements close to static quantization when memory access is dominated by weights
• Suitable for inference in RNNs
• Smaller gains for conv layers
• Supported by PyTorch and TensorFlow Lite (see the sketch below)
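As a usage sketch, PyTorch exposes dynamic quantization as a one-call model transform. The API shown is from PyTorch 1.3+, which post-dates this talk; it is an illustration of the idea rather than the exact workflow behind the results above, and the toy model is an assumption.

```python
import torch
import torch.nn as nn

# A small fully connected model; fc and RNN layers are where dynamic
# quantization pays off, since their cost is dominated by weight bandwidth.
model = nn.Sequential(
    nn.Linear(256, 512),
    nn.ReLU(),
    nn.Linear(512, 10),
)

# Weights are converted to int8 ahead of time; activations are quantized on
# the fly inside the quantized kernels, so no calibration data is needed.
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 256)
out = quantized_model(x)
```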
Quantizing weights and activations
• Post-training quantization refers to quantizing both weights and activations to reduced precision, typically int8.
• Requires estimating statistics of the activations to determine the quantizer parameters (see the calibration sketch below).
• Quantizer parameters are determined by minimizing an error metric:
• KL divergence: TensorRT
• Saturation error: TensorFlow Lite
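A minimal sketch of the calibration step, using a simple min/max (saturation-style) observer. The observer class and the asymmetric uint8 mapping are illustrative assumptions, not a specific framework's implementation.

```python
import numpy as np

class MinMaxObserver:
    """Tracks the running min/max of an activation tensor over calibration
    batches and derives an asymmetric uint8 quantizer (scale, zero_point)."""

    def __init__(self):
        self.min_val, self.max_val = float("inf"), float("-inf")

    def observe(self, x):
        self.min_val = min(self.min_val, float(x.min()))
        self.max_val = max(self.max_val, float(x.max()))

    def quantizer_params(self, qmin=0, qmax=255):
        lo, hi = min(self.min_val, 0.0), max(self.max_val, 0.0)  # keep 0 exactly representable
        scale = (hi - lo) / (qmax - qmin) or 1.0
        zero_point = int(round(qmin - lo / scale))
        return scale, int(np.clip(zero_point, qmin, qmax))

# Run a handful of representative inputs through the float model, observing
# each activation of interest, then freeze (scale, zero_point) for inference.
obs = MinMaxObserver()
for _ in range(10):                      # stand-in for calibration batches
    activation = np.random.randn(1, 64).astype(np.float32)
    obs.observe(activation)
scale, zero_point = obs.quantizer_params()
```

A KL-divergence-based calibrator (as in TensorRT) replaces the raw min/max with a clipping threshold chosen to minimize the information lost by saturation.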
Results
Setup
• Standard classification model architectures
• Classification evaluated on the ImageNet validation dataset
• Results obtained using TensorFlow; more details at: https://arxiv.org/abs/1806.08342
• More results, and quantization support in PyTorch, to be announced at PyTorch DevCon on October 10th
Post-training quantization: Results

| Network | Asymmetric, per-layer | Symmetric, per-channel | Asymmetric, per-channel | Activation-only quantized | Weight-only, symmetric, per-channel | Floating point |
|---|---|---|---|---|---|---|
| Mobilenet-v2-1-224 | 0.001 | 0.698 | 0.697 | 0.7 | 0.698 | 0.719 |
| Mobilenet-v1-1-224 | 0.001 | 0.591 | 0.703 | 0.708 | 0.591 | 0.709 |
| Nasnet-Mobile | 0.72 | 0.72 | 0.74 | 0.74 | 0.72 | 0.74 |
| Inception-v3 | 0.78 | 0.78 | 0.78 | 0.78 | 0.78 | 0.78 |
| Resnet-v1-50 | 0.75 | 0.75 | 0.75 | 0.751 | 0.75 | 0.752 |

• 8 bits for weights and activations is sufficient for common CV classification tasks
• Smaller networks are "harder" to quantize
• At 8 bits, the accuracy drop is dominated by weight quantization
Quantization-aware training
Results

| Network | Asymmetric, per-layer, post-training quant | Symmetric, per-channel, post-training quant | Asymmetric, per-layer, QAT | Symmetric, per-channel, QAT | Floating point |
|---|---|---|---|---|---|
| Mobilenet-v2-1-224 | 0.001 | 0.698 | 0.709 | 0.711 | 0.719 |
| Mobilenet-v1-1-224 | 0.001 | 0.591 | 0.70 | 0.707 | 0.709 |
| Nasnet-Mobile | 0.72 | 0.72 | 0.73 | 0.73 | 0.74 |
| Inception-v3 | 0.78 | 0.78 | 0.78 | 0.78 | 0.78 |
| Resnet-v1-50 | 0.75 | 0.75 | 0.75 | 0.751 | 0.752 |

• Quantization-aware training provides the best accuracy and allows for simpler quantization schemes.
Performance: Operator-level benchmarks
Server: FBGEMM (quantized) vs. MKL-DNN (fp32)
Performance: Model-level benchmarks (Mobile: TensorFlow Lite)
Mobile: inference time, float vs. quantized, TFLite on Pixel 2
QNNPACK kernels provide an additional 2x speedup
Lower precision inference (1)
• Four-bit precision for weights provides good accuracy, but needs to be applied selectively
• Larger networks are more robust to lower precision
• Quantization-aware training is critical
• Selectively quantizing layers of a network to different precisions can reduce the accuracy drop
4-bit weights, 8-bit activations: top-1 accuracy results
Lower precision inference (2)
• Different layers of a neural network have different sensitivity to quantization errors
• Exciting work on differentiable architecture search [7] for determining precision allocations across layers, showing excellent performance
Architecture trade-offs (1)
• Clear trade-off between the number of parameters and robustness to quantization
• One can also trade off the number of feature maps vs. precision
• Having 2x the number of feature maps at 4 bits is better than 8-bit quantization of the base network
Architecture trade-offs (2)
• Restricting the ranges of activations a priori can hurt accuracy
• Preferable to learn activation ranges instead of fixing them beforehand (see the sketch below)
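A minimal sketch of learning the activation clipping range as a trainable parameter rather than fixing it. This is an illustrative PyTorch module with assumed names; it is not the specific method behind the results above.

```python
import torch
import torch.nn as nn

class LearnedClipReLU(nn.Module):
    """ReLU whose upper clipping value (the activation range) is learned
    during training instead of being fixed a priori (e.g. ReLU6)."""

    def __init__(self, init_range=6.0):
        super().__init__()
        self.upper = nn.Parameter(torch.tensor(init_range))

    def forward(self, x):
        # Clamp to [0, upper]; the gradient w.r.t. `upper` lets the network
        # pick the range that best trades off clipping vs. quantization error.
        return torch.clamp(x, min=0.0) - torch.relu(x - self.upper)
```

During quantization-aware training, the learned `upper` value then serves as the activation quantizer's range.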
Co-design: Training quantized models
• Designing models that quantize well requires co-design of the model architecture, the training algorithm and the hardware.
• Specific training enhancements include (see the sketch below):
• Fine-tune from floating-point models when building quantized models.
• Freeze batch-normalization statistics updates to exactly model inference, for further benefit.
• Model the exact rounding arithmetic done in hardware during training.
• Stochastic quantization produces models robust to random perturbations of the weights, but underperforms techniques that model quantization as done at inference.
• Other enhancements to improve accuracy:
• Use distillation to train a quantized student from a floating-point teacher network [3]
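A schematic of the fine-tuning recipe, written with generic names as a sketch of the training schedule only; frameworks such as TensorFlow and PyTorch provide built-in quantization-aware-training utilities that cover these steps.

```python
import torch

def finetune_qat(model, train_loader, epochs=5, freeze_bn_after=2):
    """Fine-tune a pretrained float model that already has fake-quant
    modules inserted on weights/activations (e.g. via a framework's
    prepare-for-QAT transform)."""
    opt = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
    loss_fn = torch.nn.CrossEntropyLoss()
    for epoch in range(epochs):
        if epoch >= freeze_bn_after:
            # Freeze batch-norm statistics so training exactly models
            # inference-time behaviour.
            for m in model.modules():
                if isinstance(m, (torch.nn.BatchNorm1d,
                                  torch.nn.BatchNorm2d,
                                  torch.nn.BatchNorm3d)):
                    m.eval()
        for x, y in train_loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()   # gradients flow through the fake-quant ops (STE)
            opt.step()
    return model
```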
Conclusions
Hardware accelerator recommendations: Basics
• Optimize memory bandwidth
• First-order predictor of power consumption
• Don't ignore activations: most of the literature focuses on weights, but activations can be very significant for high-resolution inputs
• Fuse multiple operations
• Have floating-point support as a backup
• Avoid switching compute to different hardware
• Optimize for GEMM
• Still the workhorse for most DNN applications
• Support low-precision inference
• 8-bit is required, but supporting lower precision can provide really high throughput
Hardware accelerator recommendations: Software
• Don't forget the software toolchain!
• Need to make it easy for customers to use the hardware
• Integration with TensorFlow/PyTorch is important
• Writing optimized kernels for new hardware is hard
• Most implementations optimize for a specific set of models, with poor performance for kernels needed by other models
Hardware accelerator recommendations: Differentiation
• Build a strategy for operator support
• Take a close look at the TVM/MLIR efforts
• Code generation along with hand-written kernels
• To get the best out of hardware with quantization:
• Provide exact models of the HW kernels for integration with training frameworks
• Consider other techniques beyond quantization:
• Sparsity
• K-means compression
• Dynamic/adaptive execution
• Don't forget privacy
• Secure aggregation / homomorphic encryption are becoming increasingly important
• Training at the edge:
• Depending on the application, this can be very important for privacy/personalization
References
1. J. Park, M. Naumov et al., "Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations and Hardware Implications"
2. S. Bianco et al., "Benchmark Analysis of Representative Deep Neural Network Architectures"
3. A. Polino et al., "Model Compression via Distillation and Quantization"
4. B. Jacob, S. Kligys et al., "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference"
5. M. Courbariaux, Y. Bengio et al., "BinaryConnect: Training Deep Neural Networks with Binary Weights during Propagations"
6. R. Krishnamoorthi, "Quantizing Deep Convolutional Networks for Efficient Inference: A Whitepaper"
7. B. Wu et al., "Mixed Precision Quantization of ConvNets via Differentiable Neural Architecture Search"