Deep Compression and EIE:
——Deep Neural Network Model Compression 

and Efficient Inference Engine
Song Han
CVA group, Stanford University
Mar 1, 2015
A few words about us
Song Han
• Fourth-year PhD student with Prof. Bill Dally at Stanford.
• Research interest: computer architecture for deep
learning, improving the energy efficiency of neural
networks running on mobile and embedded systems.
• Recent work on “Deep Compression” and “EIE: Efficient
Inference Engine” covered by TheNextPlatform & O’Reilly.
Bill Dally
• Professor at Stanford University and former chairman of CS
department, leads the Concurrent VLSI Architecture Group.
• Chief Scientist of NVIDIA.
• Member of the National Academy of Engineering, Fellow of
the American Academy of Arts & Sciences, Fellow of the
IEEE, Fellow of the ACM…
This Talk:
• Deep Compression: A Deep Neural Network
Model Compression Pipeline.
• EIE Accelerator: Efficient Inference Engine
that Accelerates the Compressed Deep
Neural Network Model.
• => SW / HW co-design
[1]. Han et al. Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
[2]. Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
[3]. Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network” arXiv 2016
[4]. Iandola et al. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size
Deep Learning: the Next Wave of AI
Image
Recognition
Speech
Recognition
Natural Language
Processing
Applications
App developers suffer from the model size
“At Baidu, our #1 motivation for compressing networks is to bring down the size of the binary file.
As a mobile-first company, we frequently update various apps via different app stores. We're very
sensitive to the size of our binary files, and a feature that increases the binary size by 100MB
will receive much more scrutiny than one that increases it by 10MB.” —Andrew Ng
The Problem:
If Running DNN on Mobile…
Hardware engineers suffer from the model size

(embedded system, limited resource)
The Problem:
Intelligent but Inefficient
Network
Delay
Power
Budget
User
Privacy
If Running DNN on the Cloud…
Solver 1: Deep Compression
Deep Neural Network Model Compression
Smaller Size
Compress Mobile App 

Size by 35x-50x
Accuracy
no loss of accuracy
improved accuracy

Speedup
make inference faster
Solver 2: EIE Accelerator
ASIC accelerator: EIE (Efficient Inference Engine)
Offline
No dependency on 

network connection
Real Time
No network delay
high frame rate
Low Power
High energy efficiency

that preserves battery
Deep Compression
• AlexNet: 35×, 240MB => 6.9MB => 0.52MB
• VGG16: 49×, 552MB => 11.3MB
• Both with no loss of accuracy on ImageNet12
• Weights fit in on-chip SRAM, taking 120x less energy than DRAM
1. Han et al. Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
2. Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and
Huffman Coding, ICLR 2016
3. Iandola et al. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size
1. Pruning
Han et al. Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
Pruning: Motivation
• Trillions of synapses are generated in the human brain during the first few months after birth.
• At 1 year old, the count peaks at 1,000 trillion.
• Pruning begins to occur.
• By 10 years old, a child has nearly 500 trillion synapses.
• This ‘pruning’ mechanism removes redundant connections in the brain.
[1] Christopher A. Walsh. Peter Huttenlocher (1931–2013). Nature, 502(7470):172–172, 2013.

Retrain to Recover Accuracy
[Figure: accuracy loss vs. fraction of parameters pruned away (40%–100%), comparing L1/L2 regularization with and without retraining. L2 regularization with iterative prune-and-retrain loses the least accuracy.]
Han et al. Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
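The prune-and-retrain loop can be sketched in a few lines of NumPy. This is a hedged illustration, not the paper's implementation: a toy least-squares layer stands in for a real network, and `prune_by_magnitude` is a hypothetical helper name.

```python
import numpy as np

def prune_by_magnitude(w, sparsity):
    """Zero out the smallest-magnitude weights; return pruned weights and mask."""
    k = int(w.size * sparsity)
    threshold = np.sort(np.abs(w), axis=None)[k]
    mask = np.abs(w) >= threshold
    return w * mask, mask

# Toy "retraining": gradient steps on a least-squares layer, with the
# mask applied so pruned connections stay at zero and receive no updates.
rng = np.random.default_rng(0)
x = rng.normal(size=(256, 32))
y = x @ (rng.normal(size=32) * (rng.random(32) > 0.5))  # sparse ground truth

w = rng.normal(size=32) * 0.1
for sparsity in (0.3, 0.6, 0.9):        # iterative prune-and-retrain
    w, mask = prune_by_magnitude(w, sparsity)
    for _ in range(200):                # retrain the surviving weights
        grad = x.T @ (x @ w - y) / len(x)
        w -= 0.1 * grad * mask
```

The iterative schedule is the point of the figure above: pruning to 90% in one shot and retraining once loses more accuracy than ramping the sparsity up and retraining at each step.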
Pruning: Result on 4 Convnets
Han et al. Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
AlexNet & VGGNet
Han et al. Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
Mask Visualization
Visualization of the first FC layer’s sparsity pattern of
LeNet-300-100. It has a banded structure repeated 28 times,
corresponding to the unpruned parameters at the center of
the image, since the digits are written in the center.
Pruning NeuralTalk and LSTM
[1] Thanks to Shijian Tang for pruning NeuralTalk
• Original: a basketball player in a white
uniform is playing with a ball
• Pruned 90%: a basketball player in a white
uniform is playing with a basketball
• Original : a brown dog is running through a
grassy field
• Pruned 90%: a brown dog is running
through a grassy area
• Original : a soccer player in red is running
in the field
• Pruned 95%: a man in a red shirt and
black and white black shirt is running
through a field
• Original : a man is riding a surfboard on a
wave
• Pruned 90%: a man in a wetsuit is riding a
wave on a beach
Speedup (FC layer)
• Intel Core i7 5930K: MKL CBLAS GEMV, MKL SPBLAS CSRMV
• NVIDIA GeForce GTX Titan X: cuBLAS GEMV, cuSPARSE CSRMV
• NVIDIA Tegra K1: cuBLAS GEMV, cuSPARSE CSRMV
Energy Efficiency (FC layer)
• Intel Core i7 5930K: CPU socket and DRAM power are reported by pcm-power utility
• NVIDIA GeForce GTX Titan X: reported by nvidia-smi utility
• NVIDIA Tegra K1: measured total power with a power meter; accounting for 15% AC-
to-DC conversion loss, 85% regulator efficiency, and 15% of power consumed by
peripheral components => ~60% of measured power is AP+DRAM
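The ~60% figure follows from chaining the three correction factors; a quick check of the arithmetic:

```python
# Chain the measured-power correction factors from the slide:
# 15% AC-to-DC conversion loss, 85% regulator efficiency, 15% peripheral draw.
ap_dram_fraction = (1 - 0.15) * 0.85 * (1 - 0.15)
print(round(ap_dram_fraction, 2))  # 0.61, i.e. roughly 60% of wall power
```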
2. Weight Sharing
(Trained Quantization)
Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained
Quantization and Huffman Coding, ICLR 2016
Weight Sharing: Overview
Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained
Quantization and Huffman Coding, ICLR 2016
Finetune Centroids
Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained
Quantization and Huffman Coding, ICLR 2016
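The weight-sharing and centroid fine-tuning steps can be sketched as below. This is an illustrative sketch, not the paper's code: `kmeans_1d` is a hypothetical helper implementing plain 1-D Lloyd iterations, with the linear initialization over the weight range that the paper reports works best; centroid fine-tuning sums the gradients of all weights sharing a centroid.

```python
import numpy as np

def kmeans_1d(w, n_clusters, iters=50):
    """Tiny 1-D k-means: returns (codebook, index per weight)."""
    codebook = np.linspace(w.min(), w.max(), n_clusters)  # linear init
    for _ in range(iters):
        idx = np.argmin(np.abs(w[:, None] - codebook[None, :]), axis=1)
        for c in range(n_clusters):
            if np.any(idx == c):
                codebook[c] = w[idx == c].mean()
    return codebook, idx

rng = np.random.default_rng(0)
w = rng.normal(size=1024)                    # weights of one layer
codebook, idx = kmeans_1d(w, n_clusters=16)  # 2^4 centroids => 4-bit indices
w_q = codebook[idx]                          # shared-weight reconstruction

# Centroid fine-tuning: gradients of all weights sharing a centroid are
# summed (group-by on the index) and applied to that single centroid.
grad_w = rng.normal(size=w.shape)            # stand-in for dL/dW
grad_codebook = np.bincount(idx, weights=grad_w, minlength=16)
codebook -= 0.01 * grad_codebook
```

Storage drops from 1024 32-bit floats to 1024 4-bit indices plus a 16-entry codebook.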
Accuracy vs. #Bits on 5 Conv Layers + 3 FC Layers
Quantization: Result
• 16 million weights => 2^4 = 16 shared weights
• 8/5-bit quantization results in no accuracy loss
• 8/4-bit quantization results in no top-5 accuracy loss,
0.1% top-1 accuracy loss
• 4/2-bit quantization results in 1.99% top-1 accuracy
loss and 2.60% top-5 accuracy loss, not that bad
Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained
Quantization and Huffman Coding, ICLR 2016
Reduced Precision for Inference
Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained
Quantization and Huffman Coding, ICLR 2016
Pruning and Quantization Works Well Together
Figure 6: Accuracy vs. compression rate under different compression methods. Pruning and
quantization work best when combined.
Figure 7: Pruning doesn’t hurt quantization. Dashed: quantization on the unpruned network. Solid:
quantization on the pruned network. Accuracy begins to drop at the same number of quantization bits
whether or not the network has been pruned. Although pruning reduces the number of parameters,
quantization works as well on the pruned network, or even better (3-bit case in the left figure), than on the unpruned network.
Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained
Quantization and Huffman Coding, ICLR 2016
3. Huffman Coding
Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained
Quantization and Huffman Coding, ICLR 2016
Huffman Coding
Huffman coding is an optimal prefix code commonly used for
lossless data compression. It produces a variable-length code table for
encoding source symbols, derived from the occurrence
probability of each symbol. As in other entropy encoding methods, more
common symbols are represented with fewer bits than less common
symbols, thus saving space overall.
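A minimal sketch of the idea, applied to a skewed stream of quantized weight indices (the names and the toy distribution are illustrative, not from the paper):

```python
import heapq
from collections import Counter

def huffman_codes(symbols):
    """Build a Huffman code table {symbol: bitstring} from a symbol stream."""
    freq = Counter(symbols)
    # heap entries: (frequency, tiebreak, tree); a tree is a symbol or a pair
    heap = [(f, i, s) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:                 # merge the two rarest subtrees
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, count, (t1, t2)))
        count += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"  # single-symbol edge case
    walk(heap[0][2], "")
    return codes

# Skewed 2-bit index distribution: common indices get shorter codes.
indices = [0] * 50 + [1] * 25 + [2] * 15 + [3] * 10
codes = huffman_codes(indices)
bits = sum(len(codes[s]) for s in indices)  # vs. 200 bits at fixed 2 bits/symbol
```

Because pruned, quantized weights have a very non-uniform index distribution, this variable-length coding gives Deep Compression its final size reduction for free (it is lossless).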
Deep Compression Result on 4 Convnets
Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained
Quantization and Huffman Coding, ICLR 2016
Result: AlexNet
Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained
Quantization and Huffman Coding, ICLR 2016
AlexNet: Breakdown
Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained
Quantization and Huffman Coding, ICLR 2016
The Big Gun:
New Network Topology + Deep Compression
Iandola et al. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size
520KB model, AlexNet-accuracy
(1) Smaller DNNs require less communication across servers during distributed training. 

(2) Smaller DNNs require less bandwidth to export a new model from the cloud to an autonomous car.
(3) Smaller DNNs are more feasible to deploy on FPGAs and other hardware with limited memory. 

(4) Smaller DNNs make it easier to download deep-learning-powered apps from app stores.
Iandola et al. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size
Conclusion
• We have presented a method to compress neural networks without
affecting accuracy, by finding the right connections and quantizing the
weights.
• Prune the unimportant connections => quantize the network and
enforce weight sharing => apply Huffman encoding.
• In our ImageNet experiments, AlexNet’s weight storage is reduced
by 35× and VGG16’s by 49×, without loss of accuracy.
• Now the weights can fit in cache
Product: A Model Compression Tool for 

Deep Learning Developers
• Easy Version:
✓ No training needed
✓ Fast
x 5x - 10x compression rate
x 1% loss of accuracy
• Advanced Version:
✓ 35x - 50x compression rate
✓ no loss of accuracy
x Training is needed
x Slow
EIE: Efficient Inference Engine on
Compressed Deep Neural
Network
Song Han
CVA group, Stanford University
Jan 6, 2015
Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network” arXiv 2016
ASIC Accelerator that Runs DNN on Mobile
Offline
No dependency on 

network connection
Real Time
No network delay
high frame rate
Low Power
High energy efficiency

that preserves battery
Solution: Everything on Chip
• We present a sparse, indirectly indexed, weight-shared MxV
accelerator.
• Large DNN models fit in on-chip SRAM: 120× energy savings.
• EIE exploits the sparsity of activations (30% non-zero).
• EIE works on the compressed model (30x model reduction).
• We distribute both storage and computation across multiple PEs,
achieving load balance and good scalability.
• We evaluated EIE on a wide range of deep learning models,
including CNNs for object detection and LSTMs for natural language
processing and image captioning, and compared EIE to
CPUs, GPUs, and other accelerators.
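The computation EIE accelerates can be sketched in NumPy: a column-wise sparse matrix whose non-zeros are 4-bit indices into a shared-weight codebook, with the kernel skipping both zero weights (static sparsity) and zero activations (dynamic sparsity). This is a functional model only; the names and the 10%/30% densities are illustrative stand-ins for a 90%-pruned FC layer.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_cols = 64, 64
codebook = rng.normal(size=16)               # 16 shared weights (4-bit codes)

# Column-wise sparse storage: per column, (row indices, weight codes).
# ~10% of entries are non-zero, mimicking a 90%-pruned FC layer.
columns, dense = [], np.zeros((n_rows, n_cols))
for j in range(n_cols):
    rows = np.flatnonzero(rng.random(n_rows) < 0.1)
    codes = rng.integers(0, 16, size=rows.size)
    dense[rows, j] = codebook[codes]
    columns.append((rows, codes))

def eie_spmv(columns, codebook, a, n_rows):
    """b = W @ a, touching only non-zero activations and non-zero weights."""
    b = np.zeros(n_rows)
    for j in np.flatnonzero(a):              # skip zero activations
        rows, codes = columns[j]
        b[rows] += codebook[codes] * a[j]    # decode 4-bit codes via lookup
    return b

a = rng.normal(size=n_cols) * (rng.random(n_cols) < 0.3)  # ~30% non-zero
b = eie_spmv(columns, codebook, a, n_rows)
```

In hardware the columns are interleaved across PEs so each non-zero activation is broadcast once and every PE walks its own slice of the column.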
Distribute Storage and Processing
PE PE PE PE
PE PE PE PE
PE PE PE PE
PE PE PE PE
Central Control
Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network” arXiv 2016
Inside each PE:
Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network” arXiv 2016
Evaluation
1. Cycle-accurate C++ simulator with two abstract methods, Propagate and
Update. Used for design space exploration and verification.
2. RTL in Verilog, verified its output result with the golden model in
Modelsim.
3. Synthesized EIE using the Synopsys Design Compiler (DC) under the
TSMC 45nm GP standard VT library with worst case PVT corner.
4. Placed and routed the PE using the Synopsys IC compiler (ICC). We
used Cacti to get SRAM area and energy numbers.
5. Annotated the toggle rate from the RTL simulation to the gate-level
netlist, which was dumped to switching activity interchange format
(SAIF), and estimated the power using Prime-Time PX.
Baseline and Benchmark
• CPU: Intel Core-i7 5930k
• GPU: NVIDIA TitanX GPU
• Mobile GPU: Jetson TK1 with NVIDIA Tegra K1
Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network” arXiv 2016
Layout of an EIE PE
Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network” arXiv 2016
Result: Speedup / Energy Efficiency
Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network” arXiv 2016
Where are the savings from?
• Three factors for energy saving:
• Matrix is compressed by 35×:
less work to do; fewer bricks to carry
• DRAM => SRAM, no need to go off-chip: 120×:
carrying bricks from SF to NY => SF to SJ
• Sparse activations: 3×:
lighter bricks to carry
Result: Speedup
Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network” arXiv 2016
Unit: µs
Scalability
Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network” arXiv 2016
Useful Computation / Load Balance
Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network” arXiv 2016
Load Balance
Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network” arXiv 2016
Design Space Exploration
Media Coverage: TheNextPlatform
http://www.nextplatform.com/2015/12/08/emergent-chip-vastly-accelerates-deep-neural-networks/
Media Coverage: O’Reilly
https://www.oreilly.com/ideas/compressed-representations-in-the-age-of-big-data
Media Coverage: Hacker News
https://news.ycombinator.com/item?id=10881683
Conclusion
• We present EIE, an energy-efficient engine optimized to
operate on compressed deep neural networks.
• By leveraging sparsity in both the activations and the
weights, EIE reduces the energy needed to compute a
typical FC layer by 3,000×.
• With wrapper logic on top of EIE, 1x1 and 3x3
convolutions are possible.
Hardware for Deep Learning
PC Computation => Mobile Computation => Intelligent Mobile Computation?
Thank you!
songhan@stanford.edu
[1]. Han et al. Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
[2]. Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
[3]. Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network” arXiv 2016
[4]. Iandola et al. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size
backup slides
A robot asking the cloud is like a child asking mom.
That’s not the real intelligence.

Generative AI for Technical Writer or Information Developers
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 

"Techniques for Efficient Implementation of Deep Neural Networks," a Presentation from Stanford

  • 1. Deep Compression and EIE: ——Deep Neural Network Model Compression 
 and Efficient Inference Engine Song Han CVA group, Stanford University Mar 1, 2015
  • 2. A few words about us Song Han: • Fourth-year PhD student with Prof. Bill Dally at Stanford. • Research interest: computer architecture for deep learning, improving the energy efficiency of neural networks running on mobile and embedded systems. • Recent work on “Deep Compression” and “EIE: Efficient Inference Engine” covered by TheNextPlatform & O’Reilly. Bill Dally: • Professor at Stanford University and former chairman of the CS department; leads the Concurrent VLSI Architecture Group. • Chief Scientist of NVIDIA. • Member of the National Academy of Engineering, Fellow of the American Academy of Arts & Sciences, Fellow of the IEEE, Fellow of the ACM…
  • 3. This Talk: • Deep Compression: A Deep Neural Network Model Compression Pipeline. • EIE Accelerator: Efficient Inference Engine that Accelerates the Compressed Deep Neural Network Model. • => SW / HW co-design [1]. Han et al. Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015 [2]. Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016 [3]. Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network” arXiv 2016 [4]. Iandola et al. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size
  • 4. Deep Learning Next Wave of AI Image Recognition Speech Recognition Natural Language Processing
  • 6. App developers suffer from the model size “At Baidu, our #1 motivation for compressing networks is to bring down the size of the binary file. As a mobile-first company, we frequently update various apps via different app stores. We're very sensitive to the size of our binary files, and a feature that increases the binary size by 100MB will receive much more scrutiny than one that increases it by 10MB.” —Andrew Ng The Problem: If Running DNN on Mobile…
  • 7. Hardware engineers suffer from the model size
 (embedded systems, limited resources) The Problem: If Running DNN on Mobile…
  • 8. The Problem: Intelligent but Inefficient Network Delay Power Budget User Privacy If Running DNN on the Cloud…
  • 9. Solution 1: Deep Compression Deep Neural Network Model Compression Smaller Size Compress Mobile App 
 Size by 35x-50x Accuracy no loss of accuracy improved accuracy
 Speedup make inference faster
  • 10. Solution 2: EIE Accelerator ASIC accelerator: EIE (Efficient Inference Engine) Offline No dependency on 
 network connection Real Time No network delay high frame rate Low Power High energy efficiency
 that preserves battery
  • 11. Deep Compression • AlexNet: 35×, 240MB => 6.9MB => 0.52MB • VGG16: 49× 552MB => 11.3MB • Both with no loss of accuracy on ImageNet12 • Weights fits on-chip SRAM, taking 120x less energy than DRAM 1. Han et al. Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015 2. Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016 3. Iandola et al. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size
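A quick sanity check of the headline numbers on this slide (pure arithmetic, no assumptions beyond the quoted rates):

```python
# Compressed size = original size / compression rate
for name, size_mb, rate in [("AlexNet", 240, 35), ("VGG16", 552, 49)]:
    print(f"{name}: {size_mb} MB / {rate}x = {size_mb / rate:.1f} MB")
# Matches the 6.9MB and 11.3MB figures quoted above.
```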
  • 12. 1. Pruning Han et al. Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
  • 13. Pruning: Motivation • Trillions of synapses are generated in the human brain during the first few months after birth. • At 1 year old, the synapse count peaks at 1,000 trillion. • Pruning then begins to occur. • By 10 years old, a child has nearly 500 trillion synapses. • This ’pruning’ mechanism removes redundant connections in the brain. [1] Christopher A Walsh. Peter huttenlocher (1931-2013). Nature, 502(7470):172–172, 2013. 

  • 14. Retrain to Recover Accuracy: accuracy loss vs. parameters pruned away, comparing L2 regularization w/o retrain, L1 regularization w/o retrain, L1 regularization w/ retrain, L2 regularization w/ retrain, and L2 regularization w/ iterative prune and retrain. Han et al. Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
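The prune-then-retrain loop on these slides can be sketched in a few lines of NumPy. The magnitude threshold and the masked gradient step are the core ideas; the layer shape, learning rate, and random data here are illustrative, not the paper's training setup.

```python
import numpy as np

def prune_by_magnitude(weights, sparsity):
    """Zero out the smallest-magnitude weights; return pruned weights and a binary mask."""
    k = int(weights.size * sparsity)
    threshold = np.sort(np.abs(weights), axis=None)[k]
    mask = (np.abs(weights) >= threshold).astype(weights.dtype)
    return weights * mask, mask

def retrain_step(weights, mask, grad, lr=0.01):
    """SGD step that re-applies the mask so pruned connections stay at zero."""
    return (weights - lr * grad) * mask

rng = np.random.default_rng(0)
w = rng.standard_normal((300, 100))            # e.g. one FC layer of Lenet-300-100
w_pruned, mask = prune_by_magnitude(w, sparsity=0.9)
w_retrained = retrain_step(w_pruned, mask, grad=rng.standard_normal(w.shape))
print(f"nonzero fraction: {np.count_nonzero(w_retrained) / w.size:.2f}")  # ~0.10
```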
  • 15. Pruning: Result on 4 Convnets Han et al. Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
  • 16. AlexNet & VGGNet Han et al. Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
  • 17. Mask Visualization Visualization of the first FC layer’s sparsity pattern of Lenet-300-100. It has a banded structure repeated 28 times, which corresponds to the un-pruned parameters in the center of the images, since the digits are written in the center.
  • 18. Pruning NeuralTalk and LSTM [1] Thanks Shijian Tang pruning Neural Talk
  • 19. • Original: a basketball player in a white uniform is playing with a ball • Pruned 90%: a basketball player in a white uniform is playing with a basketball • Original : a brown dog is running through a grassy field • Pruned 90%: a brown dog is running through a grassy area • Original : a soccer player in red is running in the field • Pruned 95%: a man in a red shirt and black and white black shirt is running through a field • Original : a man is riding a surfboard on a wave • Pruned 90%: a man in a wetsuit is riding a wave on a beach
  • 20. Speedup (FC layer) • Intel Core i7 5930K: MKL CBLAS GEMV, MKL SPBLAS CSRMV • NVIDIA GeForce GTX Titan X: cuBLAS GEMV, cuSPARSE CSRMV • NVIDIA Tegra K1: cuBLAS GEMV, cuSPARSE CSRMV
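The CSRMV kernels benchmarked above compute a sparse matrix-vector product over the pruned weights. Below is a minimal pure-NumPy sketch of the CSR format and kernel; the matrix size and sparsity are made up for illustration.

```python
import numpy as np

def to_csr(dense):
    """Convert a pruned weight matrix to CSR: values, column indices, row pointers."""
    values, cols, rowptr = [], [], [0]
    for row in dense:
        nz = np.nonzero(row)[0]
        values.extend(row[nz]); cols.extend(nz)
        rowptr.append(len(values))
    return np.array(values), np.array(cols), np.array(rowptr)

def csrmv(values, cols, rowptr, x):
    """Sparse matrix-vector product: only the surviving weights are touched."""
    y = np.zeros(len(rowptr) - 1)
    for i in range(len(y)):
        lo, hi = rowptr[i], rowptr[i + 1]
        y[i] = values[lo:hi] @ x[cols[lo:hi]]
    return y

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128))
W[np.abs(W) < 1.5] = 0.0                      # keep ~13% of weights, like a pruned FC layer
x = rng.standard_normal(128)
vals, cols, ptr = to_csr(W)
print(np.allclose(csrmv(vals, cols, ptr, x), W @ x))  # True: same result as dense GEMV
```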
  • 21. Energy Efficiency (FC layer) • Intel Core i7 5930K: CPU socket and DRAM power are reported by the pcm-power utility • NVIDIA GeForce GTX Titan X: reported by the nvidia-smi utility • NVIDIA Tegra K1: measured the total power consumption with a power-meter; assuming 15% AC-to-DC conversion loss, 85% regulator efficiency, and 15% of power consumed by peripheral components, 60% of the measured power is attributed to AP+DRAM
  • 22. 2. Weight Sharing (Trained Quantization) Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
  • 23. Weight Sharing: Overview Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
  • 24. Finetune Centroids Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
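A toy NumPy version of trained quantization as sketched on these slides: cluster the weights with 1-D k-means (linear centroid initialization, the scheme the paper found best), replace each weight by a 4-bit index into a shared codebook, and fine-tune the centroids by summing the gradients of the weights that share them. Shapes, data, and learning rate are illustrative.

```python
import numpy as np

def kmeans_1d(weights, k=16, iters=20):
    """1-D k-means over weight values: returns centroids and per-weight cluster indices."""
    w = weights.ravel()
    centroids = np.linspace(w.min(), w.max(), k)   # linear initialization
    for _ in range(iters):
        idx = np.abs(w[:, None] - centroids[None, :]).argmin(axis=1)
        for j in range(k):
            if np.any(idx == j):
                centroids[j] = w[idx == j].mean()
    idx = np.abs(w[:, None] - centroids[None, :]).argmin(axis=1)
    return centroids, idx.reshape(weights.shape)

def finetune_centroids(centroids, idx, grad, lr=0.01):
    """Group gradients by shared-weight index and update each centroid once."""
    for j in range(len(centroids)):
        centroids[j] -= lr * grad[idx == j].sum()
    return centroids

rng = np.random.default_rng(0)
w = rng.standard_normal((32, 32))
centroids, idx = kmeans_1d(w, k=16)
w_quant = centroids[idx]                           # 4-bit indices + 16 fp32 centroids
print(f"max quantization error: {np.abs(w - w_quant).max():.3f}")
```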
  • 25. Accuracy ~ #Bits on 5 Conv Layer + 3 FC Layer
  • 26. Quantization: Result • 16 million weights => 2^4 = 16 shared weights • 8/5-bit quantization (conv/FC layers) results in no accuracy loss • 8/4-bit quantization results in no top-5 accuracy loss, 0.1% top-1 accuracy loss • 4/2-bit quantization results in 1.99% top-1 accuracy loss and 2.60% top-5 accuracy loss; not that bad :-) Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
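The storage saving from weight sharing follows from bit counting: n weights of b bits each become n indices of log2(k) bits plus a codebook of k b-bit centroids. A quick sketch; the 4M-connection layer is a hypothetical example:

```python
import math

def weight_sharing_rate(n, k, b=32):
    """Compression rate r = n*b / (n*log2(k) + k*b)."""
    return (n * b) / (n * math.log2(k) + k * b)

# A hypothetical 4M-connection FC layer quantized to 16 shared weights (4-bit indices):
print(f"{weight_sharing_rate(4 * 1024 * 1024, 16):.1f}x")   # close to 32/4 = 8x
```

The codebook term k*b is negligible for large layers, so the rate is essentially b / log2(k).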
  • 27. Reduced Precision for Inference Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
  • 28. Pruning and Quantization Work Well Together Figure 6: Accuracy vs. compression rate under different compression methods. Pruning and quantization work best when combined. Figure 7: Pruning doesn’t hurt quantization. Dashed: quantization on the unpruned network. Solid: quantization on the pruned network. Accuracy begins to drop at the same number of quantization bits whether or not the network has been pruned. Although pruning reduces the number of parameters, quantization still works as well as on the unpruned network, or even better (the 3-bit case in the left figure). Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
  • 29. 3. Huffman Coding Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
  • 30. Huffman Coding A Huffman code is a type of optimal prefix code commonly used for lossless data compression. It produces a variable-length code table for encoding source symbols, derived from the occurrence probability of each symbol. As in other entropy encoding methods, more common symbols are represented with fewer bits than less common symbols, saving space overall.
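A minimal Huffman coder over quantized weight indices, using Python's heapq; the skewed index distribution below is illustrative of what pruning plus quantization produces.

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a prefix code from symbol frequencies: common symbols get shorter codes."""
    freq = Counter(symbols)
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)                      # unique counter so tuples never compare dicts
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)       # merge the two least-frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

# A skewed index distribution, as after pruning + quantization
indices = [0] * 50 + [1] * 25 + [2] * 15 + [3] * 10
code = huffman_code(indices)
bits = sum(len(code[s]) for s in indices)
print(f"{bits} bits vs {2 * len(indices)} bits fixed-length")
```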
  • 31. Deep Compression Result on 4 Convnets Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
  • 32. Result: AlexNet Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
  • 33. AlexNet: Breakdown Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016
  • 34. New Network Topology 
 + Deep Compression The Big Gun: Iandola et al. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size
  • 35. Iandola et al. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size
  • 36. 520KB model, AlexNet-accuracy (1) Smaller DNNs require less communication across servers during distributed training. 
 (2) Smaller DNNs require less bandwidth to export a new model from the cloud to an autonomous car. (3) Smaller DNNs are more feasible to deploy on FPGAs and other hardware with limited memory. 
 (4) Smaller DNNs make it easier to download deep-learning-powered apps from the App Store. Iandola et al. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size
  • 37. Conclusion • We have presented a method to compress neural networks without affecting accuracy, by finding the right connections and quantizing the weights. • Prune the unimportant connections => quantize the network and enforce weight sharing => apply Huffman encoding. • We highlight our experiments on ImageNet: AlexNet’s weight storage is reduced by 35× and VGG16’s by 49×, without loss of accuracy. • Now weights can fit in cache
  • 38. Product: A Model Compression Tool for 
 Deep Learning Developers • Easy Version: ✓ No training needed ✓ Fast x 5x - 10x compression rate x 1% loss of accuracy • Advanced Version: ✓ 35x - 50x compression rate ✓ no loss of accuracy x Training is needed x Slow
  • 39. EIE: Efficient Inference Engine on Compressed Deep Neural Network Song Han CVA group, Stanford University Jan 6, 2015 Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network” arXiv 2016
  • 40. ASIC Accelerator that Runs DNN on Mobile Offline No dependency on 
 network connection Real Time No network delay high frame rate Low Power High energy efficiency
 that preserves battery
  • 41. Solution: Everything on Chip • We present a sparse, indirectly indexed, weight-shared MxV accelerator. • Large DNN models fit in on-chip SRAM: 120× energy savings. • EIE exploits the sparsity of activations (30% non-zero). • EIE works on the compressed model (30x model reduction). • We distribute both storage and computation across multiple PEs, achieving load balance and good scalability. • We evaluated EIE on a wide range of deep learning models, including CNN for object detection and LSTM for natural language processing and image captioning, and compare EIE to CPUs, GPUs, and other accelerators.
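Functionally, the computation EIE accelerates is a sparse-matrix × sparse-vector product over codebook indices: skip zero activations, and for each nonzero activation walk that column's nonzero weights, decoding each through the shared 16-entry codebook. The scalar Python sketch below shows that dataflow on a tiny made-up layer (the real design distributes it across PEs; the linear codebook here stands in for the trained one).

```python
import numpy as np

def eie_like_spmv(codebook, w_idx, rows, colptr, a, n_rows):
    """y = W @ a with W stored column-wise (CSC) as 4-bit codebook indices."""
    y = np.zeros(n_rows)
    for j in np.nonzero(a)[0]:                      # skip zero activations
        for p in range(colptr[j], colptr[j + 1]):   # walk nonzero weights in column j
            y[rows[p]] += codebook[w_idx[p]] * a[j] # decode weight via shared codebook
    return y

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))
W[rng.random(W.shape) < 0.9] = 0.0                  # pruned weight matrix
codebook = np.linspace(W.min(), W.max(), 16)        # 16 shared weights (4-bit index)
idx = np.abs(W[:, :, None] - codebook).argmin(axis=2)
Wq = np.where(W != 0, codebook[idx], 0.0)           # quantized matrix, for reference

rows_l, widx_l, colptr = [], [], [0]                # pack nonzeros column by column
for j in range(W.shape[1]):
    nz = np.nonzero(W[:, j])[0]
    rows_l.extend(nz); widx_l.extend(idx[nz, j])
    colptr.append(len(rows_l))

a = rng.standard_normal(16)
a[rng.random(16) < 0.7] = 0.0                       # sparse activations after ReLU
y = eie_like_spmv(codebook, np.array(widx_l), np.array(rows_l), np.array(colptr), a, 8)
print(np.allclose(y, Wq @ a))                       # True
```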
  • 42. Distribute Storage and Processing PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE Central Control Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network” arXiv 2016
  • 43. Inside each PE: Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network” arXiv 2016
  • 44. Evaluation 1. Cycle-accurate C++ simulator. Two abstract methods: Propagate and Update. Used for DSE and verification. 2. RTL in Verilog, verified its output result with the golden model in Modelsim. 3. Synthesized EIE using the Synopsys Design Compiler (DC) under the TSMC 45nm GP standard VT library with worst case PVT corner. 4. Placed and routed the PE using the Synopsys IC compiler (ICC). We used Cacti to get SRAM area and energy numbers. 5. Annotated the toggle rate from the RTL simulation to the gate-level netlist, which was dumped to switching activity interchange format (SAIF), and estimated the power using Prime-Time PX.
  • 45. Baseline and Benchmark • CPU: Intel Core-i7 5930k • GPU: NVIDIA TitanX GPU • Mobile GPU: NVIDIA Jetson TK1 (Tegra K1) Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network” arXiv 2016
  • 46. Layout of an EIE PE Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network” arXiv 2016
  • 47. Result: Speedup / Energy Efficiency Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network” arXiv 2016
  • 48. Where are the savings from? • Three factors for energy saving: • Matrix is compressed by 35×: 
 less work to do; fewer bricks to carry • DRAM => SRAM, no need to go off-chip: 120×;
 carrying bricks from SF to NY => SF to SJ • Sparse activation: 3×;
 lighter bricks to carry
  • 49. Result: Speedup Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network” arXiv 2016 Unit: us
  • 50. Scalability Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network” arXiv 2016
  • 51. Useful Computation / Load Balance Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network” arXiv 2016
  • 52. Load Balance Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network” arXiv 2016
  • 56. Media Coverage: Hacker News https://news.ycombinator.com/item?id=10881683
  • 57. Conclusion • We present EIE, an energy-efficient engine optimized to operate on compressed deep neural networks. • By leveraging sparsity in both the activations and the weights, EIE reduces the energy needed to compute a typical FC layer by 3,000×. • With wrapper logic on top of EIE, 1x1 and 3x3 convolutions are possible.
  • 58. Hardware for Deep Learning: PC computation => mobile computation => intelligent mobile computation?
  • 59. Thank you! songhan@stanford.edu [1]. Han et al. Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015 [2]. Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016 [3]. Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network” arXiv 2016 [4]. Iandola et al. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size
  • 61. A robot asking the cloud is like a child asking mom. That’s not the real intelligence.