Copyright © 2018 Massachusetts Institute of Technology 1
Vivienne Sze
May 23, 2018
Approaches for Energy Efficient
Implementation of Deep Neural Networks
In collaboration with Yu-Hsin Chen, Joel Emer, Tien-Ju Yang
Copyright © 2018 Massachusetts Institute of Technology 2
Video is the Biggest Big Data
Need energy-efficient pixel processing!
Over 70% of today's Internet traffic is video
Over 300 hours of video uploaded to YouTube every minute
Over 500 million hours of video surveillance collected every day
Energy limited due to battery capacity
Power limited due to heat dissipation
Copyright © 2018 Massachusetts Institute of Technology 3
Increased Accuracy with Deep Learning
Deep Learning requires significantly more computation than previous approaches
[Chart: ImageNet Top-5 Classification Error (%), 2010 – 2015 and Human; hand-crafted feature-based designs vs. Deep Learning-based designs, with a large error reduction due to Deep Learning]
[O. Russakovsky et al., IJCV, 2015]
Copyright © 2018 Massachusetts Institute of Technology 4
Deep Convolutional Neural Networks
Classes
FC
Layers
Modern deep CNN: up to 1000 CONV layers
CONV
Layer
CONV
Layer
Low-level
Features
High-level
Features
Copyright © 2018 Massachusetts Institute of Technology 5
CONV
Layer
CONV
Layer
Low-level
Features
High-level
Features
Classes
FC
Layers
1 – 3 layers
Deep Convolutional Neural Networks
Copyright © 2018 Massachusetts Institute of Technology 6
Deep Convolutional Neural Networks
Classes
CONV
Layer
CONV
Layer
FC
Layers
Convolutions account for more
than 90% of overall computation,
dominating runtime and energy
consumption
Copyright © 2018 Massachusetts Institute of Technology 7
High-Dimensional CNN Convolution
R
S
H
a plane of input activations
a.k.a. input feature map (fmap)
filter (weights)
W
Copyright © 2018 Massachusetts Institute of Technology 8
High-Dimensional CNN Convolution
R
filter (weights)
S
E
F
Partial Sum (psum)
Accumulation
input fmap output fmap
Element-wise
Multiplication
H
W
an output
activation
Copyright © 2018 Massachusetts Institute of Technology 9
High-Dimensional CNN Convolution
H
R
filter (weights)
S
E
Sliding Window Processing
input fmap
an output
activation
output fmap
W F
Copyright © 2018 Massachusetts Institute of Technology 10
High-Dimensional CNN Convolution
[Diagram: M filters, each R × S × C, applied to a C × H × W input fmap produce an E × F output fmap with M output channels]
Copyright © 2018 Massachusetts Institute of Technology 11
High-Dimensional CNN Convolution
[Diagram: N input fmaps (each C × H × W) processed with the same M filters produce N output fmaps (each E × F × M)]
Image batch size: 1 – 256 (N)
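To make the loop nest behind these slides concrete, here is a minimal NumPy sketch of the high-dimensional convolution (not any particular accelerator's implementation); dimension names follow the slides, and stride 1 with no padding is a simplifying assumption.

```python
import numpy as np

# Minimal sketch of the high-dimensional CNN convolution shown above.
# Dimensions follow the slide notation: N (batch), C (input channels),
# M (filters / output channels), H x W (input fmap), R x S (filter),
# E x F (output fmap). Stride = 1 and no padding are assumptions.
def conv_layer(ifmaps, filters):
    N, C, H, W = ifmaps.shape
    M, _, R, S = filters.shape
    E, F = H - R + 1, W - S + 1
    ofmaps = np.zeros((N, M, E, F))
    for n in range(N):                      # many input fmaps (batch)
        for m in range(M):                  # many filters / output channels
            for e in range(E):              # sliding window, output rows
                for f in range(F):          # sliding window, output cols
                    # element-wise multiplication + partial sum accumulation
                    ofmaps[n, m, e, f] = np.sum(
                        ifmaps[n, :, e:e + R, f:f + S] * filters[m])
    return ofmaps

# Tiny example: N=2, C=3, 8x8 fmaps, M=4 filters of size 3x3
out = conv_layer(np.random.rand(2, 3, 8, 8), np.random.rand(4, 3, 3, 3))
print(out.shape)  # (2, 4, 6, 6)
```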
Copyright © 2018 Massachusetts Institute of Technology 12
Large Size with Varying Shapes
Layer Filter Size (R) # Filters (M) # Channels (C) Stride
1 11x11 96 3 4
2 5x5 256 48 1
3 3x3 384 256 1
4 3x3 384 192 1
5 3x3 256 192 1
AlexNet Convolutional Layer Configurations [Krizhevsky, NIPS 2012]
Layer 1: 34k Params, 105M MACs
Layer 2: 307k Params, 224M MACs
Layer 3: 885k Params, 150M MACs
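The parameter and MAC counts above follow directly from the shape parameters; the sketch below reproduces them, assuming the standard AlexNet output fmap sizes (55, 27, and 13), which are not listed on the slide.

```python
# Rough check of the AlexNet CONV layer numbers above. Output fmap sizes
# (55, 27, 13) are taken from AlexNet; the reduced channel counts in
# layers 2, 4, 5 (48, 192) reflect the grouped convolutions.
layers = [  # (filter size R, # filters M, # channels C, output size E)
    (11, 96, 3, 55),
    (5, 256, 48, 27),
    (3, 384, 256, 13),
]
for i, (R, M, C, E) in enumerate(layers, 1):
    params = R * R * C * M        # one weight per filter element
    macs = params * E * E         # each weight used at every output position
    print(f"Layer {i}: {params:,} params, {macs / 1e6:.0f}M MACs")
# Layer 1: 34,848 params, 105M MACs
# Layer 2: 307,200 params, 224M MACs
# Layer 3: 884,736 params, 150M MACs
```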
Copyright © 2018 Massachusetts Institute of Technology 13
Popular DNNs
•  LeNet	(1998)	
•  AlexNet	(2012)	
•  OverFeat	(2013)	
•  VGGNet	(2014)	
•  GoogleNet	(2014)	
•  ResNet	(2015)	
[Chart: ImageNet Top-5 error (%), 2012 – 2015 and Human, showing AlexNet, Clarifai, OverFeat, VGGNet, GoogLeNet, ResNet]
[O. Russakovsky et al., IJCV 2015]
ImageNet: Large Scale Visual Recognition Challenge (ILSVRC)
Copyright © 2018 Massachusetts Institute of Technology 14
Popular DNNs
Metrics LeNet-5 AlexNet VGG-16 GoogLeNet
(v1)
ResNet-50
Top-5 error n/a 16.4 7.4 6.7 5.3
Input Size 28x28 227x227 224x224 224x224 224x224
# of CONV Layers 2 5 16 21 (depth) 49
# of Weights 2.6k 2.3M 14.7M 6.0M 23.5M
# of MACs 283k 666M 15.3G 1.43G 3.86G
# of FC layers 2 3 3 1 1
# of Weights 58k 58.6M 124M 1M 2M
# of MACs 58k 58.6M 124M 1M 2M
Total Weights 60k 61M 138M 7M 25.5M
Total MACs 341k 724M 15.5G 1.43G 3.9G
CONV Layers increasingly important!
Copyright © 2018 Massachusetts Institute of Technology 15
Training versus Inference
Training
(determine weights)
Weights
Large Datasets
Inference
(use weights)
Copyright © 2018 Massachusetts Institute of Technology 16
•  Accuracy
•  Well defined dataset, DNN Model and task
•  Programmability
•  Support various DNN Models with different filter weights
•  Energy/Power:
•  Energy per operation and DRAM Bandwidth
•  Throughput/Latency
•  GOPS, frame rate, delay, batch size
•  Cost
•  Area (memory and logic size)
Key Metrics
[Images: ImageNet dataset, DRAM + chip, computer vision, speech recognition]
[Sze et al., CICC 2017]
Copyright © 2018 Massachusetts Institute of Technology 17
GPUs and CPUs Targeting Deep Learning
Xeon Phi “optimized for deep learning”
Intel Knights Landing (2016)
Intel Knights Mills (2017)
Nvidia PASCAL GP100 (2016)
Nvidia VOLTA GV100 (2017)
Use matrix multiplication libraries on CPUs and GPUs
Copyright © 2018 Massachusetts Institute of Technology 18
Accelerate Matrix Multiplication
•  Implementation: Matrix Multiplication (GEMM)
•  CPU: OpenBLAS, Intel MKL, etc
•  GPU: cuBLAS, cuDNN, etc
•  Optimized by tiling to storage hierarchy
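As a rough illustration of what "tiling to the storage hierarchy" means (this is not the code of the libraries listed above), a blocked GEMM keeps TILE × TILE sub-matrices resident in a fast memory level while they are reused; the tile size of 64 is an arbitrary assumption.

```python
import numpy as np

# Illustrative blocked (tiled) matrix multiplication: each tile of A, B and C
# is small enough to stay in a fast level of the storage hierarchy while it
# is reused across the inner loop.
def tiled_matmul(A, B, tile=64):
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))
    for i0 in range(0, M, tile):
        for j0 in range(0, N, tile):
            for k0 in range(0, K, tile):
                # one tile of C is updated from one tile of A and one of B
                C[i0:i0 + tile, j0:j0 + tile] += (
                    A[i0:i0 + tile, k0:k0 + tile] @ B[k0:k0 + tile, j0:j0 + tile])
    return C

A, B = np.random.rand(200, 300), np.random.rand(300, 150)
assert np.allclose(tiled_matmul(A, B), A @ B)
```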
Copyright © 2018 Massachusetts Institute of Technology 19
Map DNN to a Matrix Multiplication
•  Convert to matrix mult. using the Toeplitz Matrix
Convolution:
Filter        Input Fmap       Output Fmap
[1 2]    *    [1 2 3]     =    [1 2]
[3 4]         [4 5 6]          [3 4]
              [7 8 9]

Matrix Mult:
              [1 2 4 5]
[1 2 3 4]  ×  [2 3 5 6]   =    [1 2 3 4]
              [4 5 7 8]
              [5 6 8 9]
              Toeplitz Matrix
              (w/ redundant data)

Data is repeated
Goal: Reduce the number of operations to increase throughput
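A minimal sketch of this Toeplitz (im2col) mapping on the toy example above: each 2×2 input patch is flattened, so the convolution becomes a single matrix multiplication at the cost of repeating input data.

```python
import numpy as np

# Toeplitz / im2col sketch for the 3x3 input and 2x2 filter shown above.
fmap = np.arange(1, 10).reshape(3, 3)       # input fmap: 1..9
filt = np.array([[1, 2], [3, 4]])           # 2x2 filter

# One flattened 2x2 patch per output activation (note the repeated data).
cols = np.array([fmap[i:i + 2, j:j + 2].ravel()
                 for i in range(2) for j in range(2)]).T   # shape (4, 4)

out = filt.ravel() @ cols                   # 1x4 filter row x Toeplitz matrix
print(out.reshape(2, 2))                    # [[37 47]
                                            #  [67 77]] = direct convolution
```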
Copyright © 2018 Massachusetts Institute of Technology 20
•  Goal: Bitwise same result, but reduce number of operations
•  Focuses mostly on compute
Computation Transformations
Copyright © 2018 Massachusetts Institute of Technology 21
Analogy: Gauss’s Multiplication Algorithm
4 multiplications + 3 additions
3 multiplications + 5 additions
Reduce number of
multiplications, but increase
number of additions
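For reference, the trick behind the analogy is Gauss's complex-multiplication algorithm, sketched below: the product (a + bi)(c + di) is formed with three real multiplications and five additions instead of four and three.

```python
# Gauss's complex multiplication: 3 multiplications + 5 additions.
def gauss_complex_mult(a, b, c, d):
    k1 = c * (a + b)            # mult 1, add 1
    k2 = a * (d - c)            # mult 2, add 2
    k3 = b * (c + d)            # mult 3, add 3
    return k1 - k3, k1 + k2     # adds 4 and 5 -> (real, imaginary)

print(gauss_complex_mult(1, 2, 3, 4))       # (-5, 10), i.e. (1+2j)*(3+4j)
```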
Copyright © 2018 Massachusetts Institute of Technology 22
Reduce Operations in Matrix Multiplication
•  Winograd [Lavin, CVPR 2016]
–  Pro: 2.25x speed up for 3x3 filter
–  Con: Specialized processing depending on filter size
•  Fast Fourier Transform [Mathieu, ICLR 2014] (see the sketch after this list)
–  Pro: Direct convolution O(No²Nf²) reduced to O(No² log2 No)
–  Con: Increases storage requirements
•  Strassen [Cong, ICANN 2014]
–  Pro: O(N³) to O(N^2.807)
–  Con: Numerical stability
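The sketch below illustrates the FFT approach from the list above for a single-channel 2-D fmap: transform, multiply element-wise in the frequency domain, and transform back; the larger frequency-domain buffers are the storage cost noted above.

```python
import numpy as np

# FFT-based convolution sketch (single channel, "valid" output, stride 1).
# CNNs use cross-correlation, i.e. convolution with a flipped filter.
def fft_conv2d(fmap, filt):
    H, W = fmap.shape
    R, S = filt.shape
    F = np.fft.rfft2(fmap, s=(H, W))
    G = np.fft.rfft2(filt[::-1, ::-1], s=(H, W))   # zero-padded, flipped filter
    full = np.fft.irfft2(F * G, s=(H, W))          # circular convolution
    return full[R - 1:, S - 1:]                    # un-aliased "valid" region

fmap, filt = np.random.rand(8, 8), np.random.rand(3, 3)
direct = np.array([[np.sum(fmap[i:i + 3, j:j + 3] * filt)
                    for j in range(6)] for i in range(6)])
assert np.allclose(fft_conv2d(fmap, filt), direct)
```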
Copyright © 2018 Massachusetts Institute of Technology 23
cuDNN: Speed up with Transformations
Source: Nvidia
Copyright © 2018 Massachusetts Institute of Technology 24
Designing Specialized Hardware
(Accelerators) for DNNs
Copyright © 2018 Massachusetts Institute of Technology 25
Properties We Can Leverage
•  Operations exhibit high parallelism → high throughput possible
•  Memory Access is the Bottleneck
[Diagram: each MAC* requires a memory read of the filter weight, image pixel, and partial sum, plus a memory write of the updated partial sum; in the worst case all of these accesses go to DRAM, at ~200× the energy cost of the MAC itself (1×)]
* multiply-and-accumulate
Copyright © 2018 Massachusetts Institute of Technology 26
Properties We Can Leverage
•  Operations exhibit high parallelism → high throughput possible
•  Input data reuse opportunities (up to 500x) → exploit low-cost memory
[Diagram: Convolutional Reuse: pixels and weights reused within one filter/image pair; Image Reuse: pixels reused across multiple filters; Filter Reuse: weights reused across multiple images]
Copyright © 2018 Massachusetts Institute of Technology 27
Highly-Parallel Compute Paradigms
Temporal Architecture (SIMD/SIMT): memory hierarchy and register file feeding an array of ALUs under centralized control
Spatial Architecture (Dataflow Processing): memory hierarchy feeding a grid of ALUs that pass data directly between one another
Copyright © 2018 Massachusetts Institute of Technology 28
Advantages of Spatial Architecture
[Diagram: Temporal Architecture (SIMD/SIMT) with a centralized register file vs. Spatial Architecture (Dataflow Processing) built from Processing Elements (PEs), each with control, an ALU, and a 0.5 – 1.0 kB Reg File]
Efficient Data Reuse: distributed local storage (RF)
Inter-PE Communication: sharing among regions of PEs
Copyright © 2018 Massachusetts Institute of Technology 29
Data Movement is Expensive
Maximize data reuse at low-cost levels of the hierarchy
[Diagram: fetch data through the storage hierarchy to run a MAC at the ALU]
Normalized Energy Cost* (per access, relative to one ALU operation):
DRAM → ALU: 200×
Global Buffer (100 – 500 kB) → ALU: 6×
PE → ALU (NoC of 200 – 1000 PEs): 2×
RF (0.5 – 1.0 kB) → ALU: 1×
ALU: 1× (reference)
* measured from a commercial 65nm process
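These relative costs suggest a simple first-order energy model: weight each access count by the cost of the level that serves it. The sketch below uses the normalized costs from this slide; the access counts are made-up numbers purely for illustration.

```python
# First-order energy model using the normalized costs on this slide
# (relative to one MAC): RF 1x, inter-PE 2x, global buffer 6x, DRAM 200x.
COST = {"MAC": 1, "RF": 1, "PE": 2, "buffer": 6, "DRAM": 200}

def total_energy(access_counts):
    return sum(count * COST[kind] for kind, count in access_counts.items())

# Illustrative access counts for 1M MACs (not measured data):
all_from_dram = {"MAC": 1e6, "DRAM": 3e6}     # every operand fetched from DRAM
with_reuse    = {"MAC": 1e6, "RF": 2.7e6, "buffer": 2e5, "DRAM": 1e5}
print(total_energy(all_from_dram) / total_energy(with_reuse))   # ~24x lower
```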
Copyright © 2018 Massachusetts Institute of Technology 30
Weight Stationary (WS)
Global Buffer
W0 W1 W2 W3 W4 W5 W6 W7
Psum Activation
PE
Weight
•  Minimize weight read energy consumption
−  maximize convolutional and filter reuse of weights
•  Examples: [nn-X (NeuFlow), CVPRW 2014] [Park, ISSCC 2015]
[Origami, GLSVLSI 2015] [Google TPU, ISCA 2017]
Copyright © 2018 Massachusetts Institute of Technology 31
Output Stationary (OS)
Global Buffer
P0 P1 P2 P3 P4 P5 P6 P7
Pixel Weight
PE
Psum
•  Minimize partial sum R/W energy consumption
−  maximize local accumulation
•  Examples: [Gupta, ICML 2015] [ShiDianNao, ISCA 2015]
[ENVISION, ISSCC 2017] [Thinker, JSSC 2017]
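To make the contrast between the two dataflows concrete, here is a toy 1-D sketch (not modeled on any particular chip): the orderings differ only in which operand stays local across the inner loop.

```python
import numpy as np

# Loop-order sketch of two dataflows for a 1-D convolution: weight stationary
# keeps one weight fixed in a PE register while streaming activations and
# psums past it; output stationary keeps one partial sum local until it is
# fully accumulated.
x = np.random.rand(16)                  # input activations
w = np.random.rand(3)                   # filter weights
E = len(x) - len(w) + 1

y_ws = np.zeros(E)
for s, weight in enumerate(w):          # weight stays "stationary"
    for e in range(E):                  # stream activations / psums past it
        y_ws[e] += weight * x[e + s]

y_os = np.zeros(E)
for e in range(E):                      # output psum stays "stationary"
    for s, weight in enumerate(w):      # accumulate locally, write out once
        y_os[e] += weight * x[e + s]

assert np.allclose(y_ws, y_os)
```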
Copyright © 2018 Massachusetts Institute of Technology 32
No Local Reuse (NLR)
PE
Pixel
Psum
Global Buffer
Weight
•  Use a large global buffer as shared storage
−  Reduce DRAM access energy consumption
•  Examples: [DianNao, ASPLOS 2014] [DaDianNao, MICRO 2014]
[Zhang, FPGA 2015]
Copyright © 2018 Massachusetts Institute of Technology 33
Row Stationary Dataflow
[Diagram: 3×3 PE array, each PE computing a 1-D row convolution:
Output Row 1 = (filter Row 1 * fmap Row 1) in PE 1 + (Row 2 * Row 2) in PE 2 + (Row 3 * Row 3) in PE 3
Output Row 2 = (filter Row 1 * fmap Row 2) in PE 4 + (Row 2 * Row 3) in PE 5 + (Row 3 * Row 4) in PE 6
Output Row 3 = (filter Row 1 * fmap Row 3) in PE 7 + (Row 2 * Row 4) in PE 8 + (Row 3 * Row 5) in PE 9]
Optimize for overall energy efficiency instead of for only a certain data type
Copyright © 2018 Massachusetts Institute of Technology 34
Dataflow Comparison: CONV Layers
[Chart: Normalized Energy/MAC (0 – 2) for the DNN dataflows WS, OSA, OSB, OSC, NLR, and RS, broken down into psums, weights, and pixels]
RS optimizes for the best overall energy efficiency, resulting in 1.4× – 2.5× lower energy than the other dataflows
Copyright © 2018 Massachusetts Institute of Technology 35
Eyeriss Deep CNN Accelerator
[Diagram: DCNN Accelerator: 14×12 PE array, 108 KB Global Buffer SRAM, 64-bit interface to off-chip DRAM, on-chip ReLU and compression/decompression of fmaps; separate link and core clocks]
[Chen et al., ISSCC 2016]
[Die photo: 4000 µm × 4000 µm, showing the Global Buffer and the Spatial Array (168 PEs)]
Fabricated in a 65nm process
AlexNet @ 35 fps while consuming 278mW
>10x more energy efficient than a mobile GPU
Copyright © 2018 Massachusetts Institute of Technology 36
Features: Energy versus Accuracy
[Chart: Energy/Pixel (nJ, log scale from 0.1 to 10000) vs. Accuracy (Average Precision, 0 – 80); Video Compression and HOG¹ sit at the low-energy end, AlexNet² and VGG16² at the high-energy end; annotations: "Exponential" (energy) and "Linear" (accuracy)]
Measured in 65nm*
1. [Suleiman, VLSI 2016]
2. [Chen, ISSCC 2016]
* Only feature extraction. Does not include data augmentation, ensemble, and classification energy, etc.
Accuracy measured on the VOC 2007 Dataset:
1. DPM v5 [Girshick, 2012]
2. Fast R-CNN [Girshick, CVPR 2015]
[Suleiman et al., ISCAS 2017]
Copyright © 2018 Massachusetts Institute of Technology 37
Designing Efficient DNN Models
Copyright © 2018 Massachusetts Institute of Technology 38
•  Reduce size of operands for storage/compute
•  Floating point → Fixed point
•  Bit-width reduction
•  Non-linear quantization
•  Reduce number of operations for storage/compute
•  Exploit Activation Statistics (Compression)
•  Network Pruning
•  Compact Network Architectures
Approaches
Copyright © 2018 Massachusetts Institute of Technology 39
Commercial Products using 8-bit Integer
Nvidia’s Pascal (2016) Google’s TPU (2016)
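As a sketch of the kind of fixed-point conversion these products rely on (the per-chip details differ and are not shown on the slide), symmetric linear quantization maps floating-point values to 8-bit integers with a single scale factor:

```python
import numpy as np

# Symmetric linear quantization to signed 8-bit integers (illustrative only).
def quantize_int8(x):
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(1000).astype(np.float32)
q, s = quantize_int8(w)
print("max abs error:", np.max(np.abs(dequantize(q, s) - w)))   # about scale/2
```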
Copyright © 2018 Massachusetts Institute of Technology 40
•  Reduce number of bits
•  Binary Nets [Courbariaux, NIPS 2015]
•  Reduce number of unique weights
•  Ternary Weight Nets [Li, arXiv 2016]
•  XNOR-Net [Rastegari, ECCV 2016]
•  Non-Linear Quantization
•  LogNet [Lee, ICASSP 2017]
Reduced Precision in Research
Log Domain Quantization
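A minimal sketch of log-domain quantization in the spirit of LogNet (not the paper's exact scheme): each weight is rounded to the nearest signed power of two, so multiplications can become bit shifts.

```python
import numpy as np

# Quantize each weight to the nearest signed power of two within a range of
# exponents (min_exp/max_exp are illustrative choices).
def log_quantize(w, min_exp=-7, max_exp=0):
    sign = np.sign(w)
    exp = np.clip(np.round(np.log2(np.abs(w) + 1e-12)), min_exp, max_exp)
    return sign * np.exp2(exp)

w = np.random.randn(8) * 0.5
print(w)
print(log_quantize(w))      # every entry is 0 or +/- 2^k
```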
Copyright © 2018 Massachusetts Institute of Technology 41
Sparsity in Feature Map
9 -1 -3
1 -5 5
-2 6 -1
Many zeros in output fmaps after ReLU
ReLU
9 0 0
1 0 5
0 6 0
[Chart: # of activations vs. # of non-zero activations (normalized) across CONV Layers 1 – 5]
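A small sketch of where that sparsity comes from, using the 3×3 example on this slide: ReLU clamps all negative activations to zero.

```python
import numpy as np

# ReLU zeroes out negative activations, leaving a sparse output fmap.
fmap = np.array([[9, -1, -3],
                 [1, -5,  5],
                 [-2, 6, -1]])
relu = np.maximum(fmap, 0)
print(relu)
print("non-zero fraction:", np.count_nonzero(relu) / relu.size)   # 4/9 ~ 0.44
```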
Copyright © 2018 Massachusetts Institute of Technology 42
Exploit Sparsity
Method 1: Skip memory access and computation
[Diagram: a zero check (== 0) on the data from the scratch pad / register file gates the read/write (no R/W) and the datapath (no switching), enabling zero data skipping via a zero buffer]
45% energy savings
[Chen et al., ISSCC 2016]
Copyright © 2018 Massachusetts Institute of Technology 43
Exploit Sparsity
Method 2: Compress data to reduce storage and data movement
[Chart: DRAM Access (MB, 0 – 6) for AlexNet Conv Layers 1 – 5, Uncompressed Fmaps + Weights vs. RLE Compressed Fmaps + Weights; compression reduces DRAM access by 1.2×, 1.4×, 1.7×, 1.8×, and 1.9× for layers 1 – 5]
[Chen et al., ISSCC 2016]
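The sketch below shows the idea behind run-length coding of sparse activations; it is an illustrative format, not the exact encoding used in the Eyeriss chip.

```python
import numpy as np

# Simple zero run-length encoding: store (zero-run length, non-zero value)
# pairs instead of the raw activation stream.
def rle_encode(x):
    pairs, run = [], 0
    for v in x:
        if v == 0:
            run += 1
        else:
            pairs.append((int(run), int(v)))
            run = 0
    return pairs, run                      # trailing zero run kept separately

acts = np.array([9, 0, 0, 1, 0, 5, 0, 6, 0])
print(rle_encode(acts))                    # ([(0, 9), (2, 1), (1, 5), (1, 6)], 1)
```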
Copyright © 2018 Massachusetts Institute of Technology 44
Pruning – Make Weights Sparse
[Diagram: pruning followed by retraining]
Optimal Brain Damage [LeCun et al., NIPS 1989]
Prune DNN based on magnitude of weights [Han et al., NIPS 2015]
Example: AlexNet
Weight Reduction:
CONV layers 2.7x, FC layers 9.9x
Overall Reduction:
Weights 9x, MACs 3x
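A minimal sketch of the magnitude-based criterion (the retraining step that recovers accuracy is omitted): weights below a percentile threshold are zeroed and masked out.

```python
import numpy as np

# Magnitude-based pruning in the spirit of [Han et al., NIPS 2015]:
# remove the smallest-magnitude weights, then fine-tune with the mask fixed.
def prune_by_magnitude(weights, sparsity=0.9):
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) > threshold
    return weights * mask, mask            # mask is reused during retraining

w = np.random.randn(256, 256)
w_pruned, mask = prune_by_magnitude(w, sparsity=0.9)
print("fraction of weights kept:", mask.mean())   # ~0.1
```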
Copyright © 2018 Massachusetts Institute of Technology 45
Network Architecture Design
Build Network with a series of Small Filters
[Diagram (VGG-16): decompose one 5x5 filter into two 3x3 filters applied sequentially]
[Diagram (GoogleNet/Inception v3): decompose a 5x5 filter into separable 5x1 and 1x5 filters applied sequentially]
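A back-of-the-envelope comparison of the decompositions above, counting MACs per output position; the channel counts, and the assumption that intermediate layers keep the same number of channels, are purely illustrative.

```python
# MACs per output position for a C-channel input and M output channels
# (C = M = 64 here, chosen only for illustration).
C, M = 64, 64
print("one 5x5 filter:   ", 5 * 5 * C * M)              # 102,400
print("two 3x3 filters:  ", 2 * (3 * 3 * C * M))        #  73,728 (~28% fewer)
print("5x1 then 1x5:     ", 5 * C * M + 5 * M * M)      #  40,960 (separable)
```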
Copyright © 2018 Massachusetts Institute of Technology 46
1x1 Bottleneck in Popular DNN models
[Diagram: 1x1 convolutions compress the number of channels before the larger filters (GoogleNet, SqueezeNet) and compress then expand them (ResNet)]
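The same kind of count shows why the 1x1 bottleneck helps: the expensive 3x3 convolution runs on a compressed channel dimension. The channel counts below are illustrative, loosely modeled on a ResNet-style block.

```python
# MACs per output position, direct 3x3 vs. a 1x1 compress / 3x3 / 1x1 expand
# bottleneck (illustrative channel counts).
C_in, C_mid, C_out = 256, 64, 256
direct     = 3 * 3 * C_in * C_out                 # 3x3 on all channels
bottleneck = (1 * 1 * C_in * C_mid                # 1x1 compress
              + 3 * 3 * C_mid * C_mid             # 3x3 on reduced channels
              + 1 * 1 * C_mid * C_out)            # 1x1 expand
print(direct, bottleneck, round(direct / bottleneck, 1))   # ~8x fewer MACs
```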
Copyright © 2018 Massachusetts Institute of Technology 47
Understanding the Limitations of
Existing Energy-Efficient Design
Approaches for Deep Neural Networks
[Y.-H. Chen et al., SysML Conference, February 2018]
Copyright © 2018 Massachusetts Institute of Technology 48
Energy-Efficient Processing of DNNs
V. Sze, Y.-H. Chen, T-J. Yang, J. Emer, "Efficient Processing of Deep Neural Networks: A Tutorial and Survey," Proceedings of the IEEE, Dec. 2017
A significant amount of algorithm and hardware research on energy-efficient processing of DNNs
We identified various limitations to existing approaches
http://eyeriss.mit.edu/tutorial.html
Copyright © 2018 Massachusetts Institute of Technology 49
Design of Efficient DNN Algorithms
•  Popular efficient DNN algorithm approaches
[Diagram: Network Pruning (removing weights from R×S×C filters); Compact Network Architectures (replacing R×S×C filters with smaller ones, e.g., 1×1 filters); examples: SqueezeNet, MobileNet]
... also reduced precision
•  Focus on reducing number of MACs and weights
•  Does it translate to energy savings?
Copyright © 2018 Massachusetts Institute of Technology 50
Energy-Evaluation Methodology
[Flow: inputs are the CNN Shape Configuration (# of channels, # of filters, etc.) and the CNN Weights and Input Data (e.g., [0.3, 0, -0.4, 0.7, 0, 0, 0.1, …]) → Memory Accesses Optimization (# of accesses at memory levels 1 … n) and # of MACs Calculation → combined with the Hardware Energy Costs of each MAC and Memory Access (Edata, Ecomp) → CNN Energy Consumption per layer (L1, L2, L3, …)]
[Yang et al., CVPR 2017]
Energy estimation tool
available at
http://eyeriss.mit.edu
Copyright © 2018 Massachusetts Institute of Technology 51
Key Observations
•  Number of weights alone is not a good metric for energy
•  All data types should be considered
Energy Consumption of GoogLeNet: Output Feature Map 43%, Input Feature Map 25%, Weights 22%, Computation 10%
[Yang et al., CVPR 2017]
Copyright © 2018 Massachusetts Institute of Technology 52
Energy Consumption of Existing DNNs
[Chart: Top-5 Accuracy (77% – 93%) vs. Normalized Energy Consumption (5E+08 – 5E+10, log scale); series: Original DNN (AlexNet, SqueezeNet, GoogLeNet, ResNet-50, VGG-16)]
[Yang et al., CVPR 2017]
Deeper CNNs with fewer weights do not necessarily consume less energy than shallower CNNs with more weights
(v1.0; batch sizes between 44 and 48)
Copyright © 2018 Massachusetts Institute of Technology 53
Magnitude-based Weight Pruning
[Chart: Top-5 Accuracy vs. Normalized Energy Consumption; series: Original DNN (AlexNet, SqueezeNet, GoogLeNet, ResNet-50, VGG-16) and Magnitude-based Pruning [6] [Han et al., NIPS 2015] (AlexNet, SqueezeNet)]
Reduce number of weights by removing small magnitude weights
(v1.0)
Copyright © 2018 Massachusetts Institute of Technology 54
Energy-Aware Pruning
[Chart: Top-5 Accuracy vs. Normalized Energy Consumption; series: Original DNN, Magnitude-based Pruning [6], and Energy-aware Pruning (This Work) for AlexNet, SqueezeNet, GoogLeNet; a 1.74× energy reduction is annotated]
[Yang et al., CVPR 2017]
Directly target energy and incorporate it into the optimization of DNNs to provide greater energy savings
(v1.0)
Copyright © 2018 Massachusetts Institute of Technology 55
•  Automatically adapt
DNN to a mobile platform
to reach a target latency or
energy budget
•  Use empirical
measurements to guide
optimization (avoid
modeling of tool chain or
platform architecture)
NetAdapt: Platform-Aware DNN Adaptation
[Diagram: Pretrained Network + Budget (e.g., Latency 3.8, Energy 10.5) → NetAdapt generates Network Proposals A … Z → Empirical Measurements on the target Platform (e.g., Proposal A: Latency 15.6, Energy 41; Proposal Z: Latency 14.3, Energy 46) → Adapted Network]
[Yang et al., arXiv 2018] In collaboration with Google's Mobile Vision Team
Copyright © 2018 Massachusetts Institute of Technology 56
Latency vs. Accuracy Tradeoff with NetAdapt
•  NetAdapt boosts the real inference speed of MobileNet by 1.7x with higher accuracy
[Chart annotations: +0.3% accuracy at 1.7x faster; +0.3% accuracy at 1.6x faster]
*Tested on the ImageNet dataset and a Google Pixel 1 CPU
Copyright © 2018 Massachusetts Institute of Technology 57
Many Efficient DNN Design Approaches
[Diagram: Network Pruning; Compact Network Architectures; Reduce Precision (32-bit float → 8-bit fixed → binary)]
No guarantee that a DNN algorithm designer will use a given approach.
Need flexible hardware!
Copyright © 2018 Massachusetts Institute of Technology 58
•  Specialized DNN hardware often relies on certain properties of the DNN in order to achieve high energy efficiency
•  Example: Reduce memory access by amortizing across the MAC array
Existing DNN Architectures
[Diagram: MAC array fed by a Weight Memory (weight reuse) and an Activation Memory (activation reuse)]
Copyright © 2018 Massachusetts Institute of Technology 59
•  Example: reuse depends on # of channels, feature map/batch size
•  Not efficient across all network architectures (e.g., compact DNNs)
Limitation of Existing DNN Architectures
[Diagram: MAC array with spatial accumulation: rows map to the number of input channels, columns to the number of filters (output channels); MAC array with temporal accumulation: rows map to feature map or batch size, columns to the number of filters (output channels)]
Copyright © 2018 Massachusetts Institute of Technology 60
Eyexam: Understanding Sources of Inefficiencies in DNN Accelerators
A systematic way to evaluate how each architectural decision affects performance (throughput) for a given DNN workload; tightens the roofline model
[Roofline diagram: performance (MAC/cycle) vs. workload operational intensity (MAC/data); the sloped region is bounded by the BW to only the active PEs, the flat region by the theoretical peak performance]
Step 1: maximum workload parallelism
Step 2: maximum dataflow parallelism
Step 3: # of active PEs under a finite PE array size (number of PEs)
Step 4: # of active PEs under fixed PE array dimensions
Step 5: # of active PEs under fixed storage capacity (peak perf.)
Step 6: lower active PE utilization due to insufficient average BW
Step 7: lower active PE utilization due to insufficient instantaneous BW
[Chen et al., In Submission]
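For reference, the classic roofline bound that Eyexam refines is simple to state: attainable throughput is the minimum of the peak compute rate and bandwidth times operational intensity. The sketch below uses the 168-PE Eyeriss array size with an illustrative (assumed) bandwidth.

```python
# Classic roofline bound: performance is limited by either peak compute or
# by bandwidth x operational intensity (MACs performed per data fetched).
def roofline(peak_macs_per_cycle, bw_data_per_cycle, operational_intensity):
    return min(peak_macs_per_cycle, bw_data_per_cycle * operational_intensity)

# 168 PEs (one MAC each per cycle); 4 data/cycle bandwidth is an assumption.
for oi in [1, 10, 42, 100]:
    print(oi, roofline(168, 4, oi))    # bandwidth-bound until OI reaches 42
```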
Copyright © 2018 Massachusetts Institute of Technology 61
To efficiently support:
•  Wide range of filter shapes
•  Large and Compact
•  Different Layers
•  e.g., CONV and FC
•  Wide range of sparsity
•  Dense and Sparse
Eyeriss v2
[Diagram: Eyeriss (v1) with on-chip buffer and spatial PE array]
[Chen et al. ISSCC 2016, ISCA 2016]
Copyright © 2018 Massachusetts Institute of Technology 62
Benchmarking Metrics for DNN Hardware
How can we compare designs?
V. Sze, Y.-H. Chen, T-J. Yang, J. Emer,
“Efficient Processing of Deep Neural Networks: A Tutorial and Survey,”
Proceedings of the IEEE, Dec. 2017
Copyright © 2018 Massachusetts Institute of Technology 63
•  Accuracy
•  Quality of result for a given task
•  Throughput
•  Analytics on high volume data
•  Real-time performance (e.g., video at 30 fps)
•  Latency
•  For interactive applications (e.g., autonomous navigation)
•  Energy and Power
•  Edge and embedded devices have limited battery capacity
•  Data centers have stringent power ceilings due to cooling costs
•  Hardware Cost
•  $$$
Metrics for DNN Hardware
Copyright © 2018 Massachusetts Institute of Technology 64
•  Accuracy
•  Difficulty of dataset and/or task should be considered
•  Throughput
•  Number of cores (include utilization along with peak performance)
•  Runtime for running specific DNN models
•  Latency
•  Include batch size used in evaluation
•  Energy and Power
•  Power consumption for running specific DNN models
•  Include external memory access
•  Hardware Cost
•  On-chip storage, number of cores, chip area + process technology
Specifications to Evaluate Metrics
Copyright © 2018 Massachusetts Institute of Technology 65
Example: Metrics of Eyeriss Chip
Metric (Units): Input
Name of CNN Model (Text): AlexNet
Top-5 error, classification on ImageNet (#): 19.8
Supported Layers: All CONV
Bits per weight (#): 16
Bits per input activation (#): 16
Batch Size (#): 4
Runtime (ms): 115.3
Power (mW): 278
Off-chip Access per Image Inference (MBytes): 3.85
Number of Images Tested (#): 100

ASIC Specs: Input
Process Technology: 65nm LP TSMC (1.0V)
Total Core Area (mm2): 12.25
Total On-Chip Memory (kB): 192
Number of Multipliers: 168
Clock Frequency (MHz): 200
Core area (mm2) / multiplier: 0.073
On-Chip memory (kB) / multiplier: 1.14
Measured or Simulated: Measured
Copyright © 2018 Massachusetts Institute of Technology 66
•  All metrics should be reported for fair evaluation of design tradeoffs
•  Examples of what can happen if certain metric is omitted:
•  Without the accuracy given for a specific dataset and task, one could run a
simple DNN and claim low power, high throughput, and low cost – however,
the processor might not be usable for a meaningful task
•  Without reporting the off-chip bandwidth, one could build a processor with
only multipliers and claim low cost, high throughput, high accuracy, and low
chip power – however, when evaluating system power, the off-chip memory
access would be substantial
•  Are results measured or simulated? On what test data?
Comprehensive Coverage
Copyright © 2018 Massachusetts Institute of Technology 67
The evaluation process for whether a DNN system is a viable solution for a given
application might go as follows:
1.  Accuracy determines if it can perform the given task
2.  Latency and throughput determine if it can run fast enough and in real-time
3.  Energy and power consumption will primarily dictate the form factor of the
device where the processing can operate
4.  Cost, which is primarily dictated by the chip area, determines how much one
would pay for this solution
Evaluation Process
Copyright © 2018 Massachusetts Institute of Technology 68
•  DNNs are a critical component in the AI revolution, delivering record-breaking accuracy on many important AI tasks for a wide range of applications; however, this comes at the cost of high computational complexity
•  Efficient processing of DNNs is an important area of research with many promising opportunities for innovation at various levels of hardware design, including algorithm co-design
•  When considering different DNN solutions, it is important to evaluate with the appropriate workload in terms of both input and model, and recognize that they are evolving rapidly
•  It's important to consider a comprehensive set of metrics when evaluating different DNN solutions: accuracy, speed, energy, and cost
Summary
Acknowledgements: This work is funded by the DARPA YFA grant, MIT Center
for Integrated Circuits & Systems, and gifts from Intel, Nvidia and Google.
Copyright © 2018 Massachusetts Institute of Technology 69
•  Overview Paper
•  V. Sze, Y.-H. Chen, T-J. Yang, J. Emer, “Efficient Processing of Deep Neural
Networks: A Tutorial and Survey”, Proceedings of the IEEE, 2017
https://arxiv.org/pdf/1703.09039.pdf
•  More info about Eyeriss and Tutorial on DNN Architectures
http://eyeriss.mit.edu
•  MIT Professional Education Course on “Designing Efficient Deep Learning
Systems” http://professional-education.mit.edu/deeplearning
References
For updates on Eyerissv2, Eyexam, NetAdapt, etc.
or join EEMS news mailing list
Copyright © 2018 Massachusetts Institute of Technology 70
•  A. Suleiman*, Y.-H. Chen*, J. Emer, V. Sze, "Towards Closing the Energy Gap Between HOG and CNN
Features for Embedded Vision," IEEE International Symposium of Circuits and Systems (ISCAS), May 2017.
•  V. Sze, Y.-H. Chen, J. Emer, A. Suleiman, Z. Zhang, "Hardware for Machine Learning: Challenges and
Opportunities," IEEE Custom Integrated Circuits Conference (CICC), Invited Paper, May 2017.
•  Y.-H. Chen, T. Krishna, J. Emer, V. Sze, "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep
Convolutional Neural Networks," IEEE Journal of Solid State Circuits (JSSC), ISSCC Special Issue, Vol. 52,
No. 1, pp. 127-138, January 2017.
•  Y.-H. Chen, J. Emer, V. Sze, "Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional
Neural Networks," International Symposium on Computer Architecture (ISCA), pp. 367-379, June 2016.
•  Y.-H. Chen, T. Krishna, J. Emer, V. Sze, "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep
Convolutional Neural Networks," IEEE International Conference on Solid-State Circuits (ISSCC), pp.
262-264, February 2016.
References
Copyright © 2018 Massachusetts Institute of Technology 71
•  T.-J. Yang, A. Howard, B. Chen, X. Zhang, A. Go, V. Sze, H. Adam, "NetAdapt: Platform-Aware Neural
Network Adaptation for Mobile Applications," arXiv, April 2018.
•  Y.-H. Chen*, T.-J. Yang*, J. Emer, V. Sze, "Understanding the Limitations of Existing Energy-Efficient
Design Approaches for Deep Neural Networks," SysML Conference, February 2018.
•  V. Sze, T.-J. Yang, Y.-H. Chen, J. Emer, "Efficient Processing of Deep Neural Networks: A Tutorial and
Survey," Proceedings of the IEEE, vol. 105, no. 12, pp. 2295-2329, December 2017.
•  T.-J. Yang, Y.-H. Chen, J. Emer, V. Sze, "A Method to Estimate the Energy Consumption of Deep Neural
Networks," Asilomar Conference on Signals, Systems and Computers, Invited Paper, October 2017.
•  T.-J. Yang, Y.-H. Chen, V. Sze, "Designing Energy-Efficient Convolutional Neural Networks using Energy-
Aware Pruning," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
•  Y.-H. Chen, J. Emer, V. Sze, "Using Dataflow to Optimize Energy Efficiency of Deep Neural Network
Accelerators," IEEE Micro's Top Picks from the Computer Architecture Conferences, May/June 2017.
References
Copyright © 2018 Massachusetts Institute of Technology 72
•  M. Courbariaux, Y. Bengio, and J.-P. David, “Binaryconnect: Training deep neural networks with binary
weights during propagations,” in NIPS, 2015.
•  F. Li and B. Liu, “Ternary weight networks,” in NIPS Workshop on Efficient Methods for Deep Neural
Networks, 2016.
•  M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet Classification Using Binary
Convolutional Neural Networks," in ECCV, 2016.
•  E. H. Lee, D. Miyashita, E. Chai, B. Murmann, and S. S. Wong, "LogNet: Energy-Efficient Neural
Networks Using Logarithmic Computations," in ICASSP, 2017.
•  F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer, “SqueezeNet: AlexNet-
level accuracy with 50x fewer parameters and <1MB model size,” ICLR , 2017
•  A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam,
“Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:
1704.04861, 2017.
References
Copyright © 2018 Massachusetts Institute of Technology 73
•  A. Lavin and S. Gray, “Fast algorithms for convolutional neural networks,” in CVPR, 2016.
•  Y. LeCun, J. S. Denker, and S. A. Solla, “Optimal Brain Damage,” in NIPS, 1990.
•  S. Han, J. Pool, J. Tran, and W. J. Dally, “Learning both weights and connections for efficient neural
networks,” in NIPS, 2015.
References

Más contenido relacionado

La actualidad más candente

"Getting More from Your Datasets: Data Augmentation, Annotation and Generativ...
"Getting More from Your Datasets: Data Augmentation, Annotation and Generativ..."Getting More from Your Datasets: Data Augmentation, Annotation and Generativ...
"Getting More from Your Datasets: Data Augmentation, Annotation and Generativ...
Edge AI and Vision Alliance
 
Backpropagation And Gradient Descent In Neural Networks | Neural Network Tuto...
Backpropagation And Gradient Descent In Neural Networks | Neural Network Tuto...Backpropagation And Gradient Descent In Neural Networks | Neural Network Tuto...
Backpropagation And Gradient Descent In Neural Networks | Neural Network Tuto...
Simplilearn
 
Lec14: Evaluation Framework for Medical Image Segmentation
Lec14: Evaluation Framework for Medical Image SegmentationLec14: Evaluation Framework for Medical Image Segmentation
Lec14: Evaluation Framework for Medical Image Segmentation
Ulaş Bağcı
 

La actualidad más candente (20)

Grovers Algorithm
Grovers Algorithm Grovers Algorithm
Grovers Algorithm
 
"Getting More from Your Datasets: Data Augmentation, Annotation and Generativ...
"Getting More from Your Datasets: Data Augmentation, Annotation and Generativ..."Getting More from Your Datasets: Data Augmentation, Annotation and Generativ...
"Getting More from Your Datasets: Data Augmentation, Annotation and Generativ...
 
Resnet
ResnetResnet
Resnet
 
"Quantizing Deep Networks for Efficient Inference at the Edge," a Presentatio...
"Quantizing Deep Networks for Efficient Inference at the Edge," a Presentatio..."Quantizing Deep Networks for Efficient Inference at the Edge," a Presentatio...
"Quantizing Deep Networks for Efficient Inference at the Edge," a Presentatio...
 
Deep learning seminar report
Deep learning seminar reportDeep learning seminar report
Deep learning seminar report
 
07 regularization
07 regularization07 regularization
07 regularization
 
Transfer Learning: An overview
Transfer Learning: An overviewTransfer Learning: An overview
Transfer Learning: An overview
 
CNN and its applications by ketaki
CNN and its applications by ketakiCNN and its applications by ketaki
CNN and its applications by ketaki
 
GPU and Deep learning best practices
GPU and Deep learning best practicesGPU and Deep learning best practices
GPU and Deep learning best practices
 
CNNs: from the Basics to Recent Advances
CNNs: from the Basics to Recent AdvancesCNNs: from the Basics to Recent Advances
CNNs: from the Basics to Recent Advances
 
Convolutional Neural Network Models - Deep Learning
Convolutional Neural Network Models - Deep LearningConvolutional Neural Network Models - Deep Learning
Convolutional Neural Network Models - Deep Learning
 
Computer Vision - Single View
Computer Vision - Single ViewComputer Vision - Single View
Computer Vision - Single View
 
Artificial Neural Network seminar presentation using ppt.
Artificial Neural Network seminar presentation using ppt.Artificial Neural Network seminar presentation using ppt.
Artificial Neural Network seminar presentation using ppt.
 
Understanding Convolutional Neural Networks
Understanding Convolutional Neural NetworksUnderstanding Convolutional Neural Networks
Understanding Convolutional Neural Networks
 
Training Neural Networks
Training Neural NetworksTraining Neural Networks
Training Neural Networks
 
Soft computing01
Soft computing01Soft computing01
Soft computing01
 
Backpropagation And Gradient Descent In Neural Networks | Neural Network Tuto...
Backpropagation And Gradient Descent In Neural Networks | Neural Network Tuto...Backpropagation And Gradient Descent In Neural Networks | Neural Network Tuto...
Backpropagation And Gradient Descent In Neural Networks | Neural Network Tuto...
 
Lec14: Evaluation Framework for Medical Image Segmentation
Lec14: Evaluation Framework for Medical Image SegmentationLec14: Evaluation Framework for Medical Image Segmentation
Lec14: Evaluation Framework for Medical Image Segmentation
 
Computational Intelligence and Applications
Computational Intelligence and ApplicationsComputational Intelligence and Applications
Computational Intelligence and Applications
 
Artificial Neural Networks Lect3: Neural Network Learning rules
Artificial Neural Networks Lect3: Neural Network Learning rulesArtificial Neural Networks Lect3: Neural Network Learning rules
Artificial Neural Networks Lect3: Neural Network Learning rules
 

Similar a "Approaches for Energy Efficient Implementation of Deep Neural Networks," a Presentation from MIT

Presentation of Eco-efficient Cloud Computing Framework for Higher Learning I...
Presentation of Eco-efficient Cloud Computing Framework for Higher Learning I...Presentation of Eco-efficient Cloud Computing Framework for Higher Learning I...
Presentation of Eco-efficient Cloud Computing Framework for Higher Learning I...
rodrickmero
 
(Im2col)accelerating deep neural networks on low power heterogeneous architec...
(Im2col)accelerating deep neural networks on low power heterogeneous architec...(Im2col)accelerating deep neural networks on low power heterogeneous architec...
(Im2col)accelerating deep neural networks on low power heterogeneous architec...
Bomm Kim
 
Super-linear speedup for real-time condition monitoring using image processi...
Super-linear speedup for real-time condition monitoring using  image processi...Super-linear speedup for real-time condition monitoring using  image processi...
Super-linear speedup for real-time condition monitoring using image processi...
IJECEIAES
 
A Parallel, Energy Efficient Hardware Architecture for the merAligner on FPGA...
A Parallel, Energy Efficient Hardware Architecture for the merAligner on FPGA...A Parallel, Energy Efficient Hardware Architecture for the merAligner on FPGA...
A Parallel, Energy Efficient Hardware Architecture for the merAligner on FPGA...
NECST Lab @ Politecnico di Milano
 

Similar a "Approaches for Energy Efficient Implementation of Deep Neural Networks," a Presentation from MIT (20)

Panel: NRP Science Impacts​
Panel: NRP Science Impacts​Panel: NRP Science Impacts​
Panel: NRP Science Impacts​
 
TeraGrid Communication and Computation
TeraGrid Communication and ComputationTeraGrid Communication and Computation
TeraGrid Communication and Computation
 
RECAP: The Simulation Approach
RECAP: The Simulation ApproachRECAP: The Simulation Approach
RECAP: The Simulation Approach
 
Tutorial-on-DNN-09A-Co-design-Sparsity.pdf
Tutorial-on-DNN-09A-Co-design-Sparsity.pdfTutorial-on-DNN-09A-Co-design-Sparsity.pdf
Tutorial-on-DNN-09A-Co-design-Sparsity.pdf
 
Cloud Computing,雲端運算-中研院網格計畫主持人林誠謙
Cloud Computing,雲端運算-中研院網格計畫主持人林誠謙Cloud Computing,雲端運算-中研院網格計畫主持人林誠謙
Cloud Computing,雲端運算-中研院網格計畫主持人林誠謙
 
Architectural Optimizations for High Performance and Energy Efficient Smith-W...
Architectural Optimizations for High Performance and Energy Efficient Smith-W...Architectural Optimizations for High Performance and Energy Efficient Smith-W...
Architectural Optimizations for High Performance and Energy Efficient Smith-W...
 
Presentation of Eco-efficient Cloud Computing Framework for Higher Learning I...
Presentation of Eco-efficient Cloud Computing Framework for Higher Learning I...Presentation of Eco-efficient Cloud Computing Framework for Higher Learning I...
Presentation of Eco-efficient Cloud Computing Framework for Higher Learning I...
 
陸永祥/全球網路攝影機帶來的機會與挑戰
陸永祥/全球網路攝影機帶來的機會與挑戰陸永祥/全球網路攝影機帶來的機會與挑戰
陸永祥/全球網路攝影機帶來的機會與挑戰
 
Low Power High-Performance Computing on the BeagleBoard Platform
Low Power High-Performance Computing on the BeagleBoard PlatformLow Power High-Performance Computing on the BeagleBoard Platform
Low Power High-Performance Computing on the BeagleBoard Platform
 
High Performance & High Throughput Computing - EUDAT Summer School (Giuseppe ...
High Performance & High Throughput Computing - EUDAT Summer School (Giuseppe ...High Performance & High Throughput Computing - EUDAT Summer School (Giuseppe ...
High Performance & High Throughput Computing - EUDAT Summer School (Giuseppe ...
 
(Im2col)accelerating deep neural networks on low power heterogeneous architec...
(Im2col)accelerating deep neural networks on low power heterogeneous architec...(Im2col)accelerating deep neural networks on low power heterogeneous architec...
(Im2col)accelerating deep neural networks on low power heterogeneous architec...
 
Super-linear speedup for real-time condition monitoring using image processi...
Super-linear speedup for real-time condition monitoring using  image processi...Super-linear speedup for real-time condition monitoring using  image processi...
Super-linear speedup for real-time condition monitoring using image processi...
 
Hassan - Condor _48_x_36
Hassan - Condor _48_x_36Hassan - Condor _48_x_36
Hassan - Condor _48_x_36
 
Brain in the Cloud: Machine Learning on OpenStack & Kubernetes Done Right - E...
Brain in the Cloud: Machine Learning on OpenStack & Kubernetes Done Right - E...Brain in the Cloud: Machine Learning on OpenStack & Kubernetes Done Right - E...
Brain in the Cloud: Machine Learning on OpenStack & Kubernetes Done Right - E...
 
rerngvit_phd_seminar
rerngvit_phd_seminarrerngvit_phd_seminar
rerngvit_phd_seminar
 
A Parallel, Energy Efficient Hardware Architecture for the merAligner on FPGA...
A Parallel, Energy Efficient Hardware Architecture for the merAligner on FPGA...A Parallel, Energy Efficient Hardware Architecture for the merAligner on FPGA...
A Parallel, Energy Efficient Hardware Architecture for the merAligner on FPGA...
 
Deep Learning Initiative @ NECSTLab
Deep Learning Initiative @ NECSTLabDeep Learning Initiative @ NECSTLab
Deep Learning Initiative @ NECSTLab
 
Anegdotic Maxeler (Romania)
  Anegdotic Maxeler (Romania)  Anegdotic Maxeler (Romania)
Anegdotic Maxeler (Romania)
 
Networking Challenges for the Next Decade
Networking Challenges for the Next DecadeNetworking Challenges for the Next Decade
Networking Challenges for the Next Decade
 
Traffic Sign Recognition System
Traffic Sign Recognition SystemTraffic Sign Recognition System
Traffic Sign Recognition System
 

Más de Edge AI and Vision Alliance

“Learning Compact DNN Models for Embedded Vision,” a Presentation from the Un...
“Learning Compact DNN Models for Embedded Vision,” a Presentation from the Un...“Learning Compact DNN Models for Embedded Vision,” a Presentation from the Un...
“Learning Compact DNN Models for Embedded Vision,” a Presentation from the Un...
Edge AI and Vision Alliance
 
“Selecting Tools for Developing, Monitoring and Maintaining ML Models,” a Pre...
“Selecting Tools for Developing, Monitoring and Maintaining ML Models,” a Pre...“Selecting Tools for Developing, Monitoring and Maintaining ML Models,” a Pre...
“Selecting Tools for Developing, Monitoring and Maintaining ML Models,” a Pre...
Edge AI and Vision Alliance
 
“Vision-language Representations for Robotics,” a Presentation from the Unive...
“Vision-language Representations for Robotics,” a Presentation from the Unive...“Vision-language Representations for Robotics,” a Presentation from the Unive...
“Vision-language Representations for Robotics,” a Presentation from the Unive...
Edge AI and Vision Alliance
 
“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...
“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...
“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...
Edge AI and Vision Alliance
 
“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...
“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...
“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...
Edge AI and Vision Alliance
 
“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...
“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...
“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...
Edge AI and Vision Alliance
 
“Updating the Edge ML Development Process,” a Presentation from Samsara
“Updating the Edge ML Development Process,” a Presentation from Samsara“Updating the Edge ML Development Process,” a Presentation from Samsara
“Updating the Edge ML Development Process,” a Presentation from Samsara
Edge AI and Vision Alliance
 
“Developing an Embedded Vision AI-powered Fitness System,” a Presentation fro...
“Developing an Embedded Vision AI-powered Fitness System,” a Presentation fro...“Developing an Embedded Vision AI-powered Fitness System,” a Presentation fro...
“Developing an Embedded Vision AI-powered Fitness System,” a Presentation fro...
Edge AI and Vision Alliance
 

Más de Edge AI and Vision Alliance (20)

“Learning Compact DNN Models for Embedded Vision,” a Presentation from the Un...
“Learning Compact DNN Models for Embedded Vision,” a Presentation from the Un...“Learning Compact DNN Models for Embedded Vision,” a Presentation from the Un...
“Learning Compact DNN Models for Embedded Vision,” a Presentation from the Un...
 
“Introduction to Computer Vision with CNNs,” a Presentation from Mohammad Hag...
“Introduction to Computer Vision with CNNs,” a Presentation from Mohammad Hag...“Introduction to Computer Vision with CNNs,” a Presentation from Mohammad Hag...
“Introduction to Computer Vision with CNNs,” a Presentation from Mohammad Hag...
 
“Selecting Tools for Developing, Monitoring and Maintaining ML Models,” a Pre...
“Selecting Tools for Developing, Monitoring and Maintaining ML Models,” a Pre...“Selecting Tools for Developing, Monitoring and Maintaining ML Models,” a Pre...
“Selecting Tools for Developing, Monitoring and Maintaining ML Models,” a Pre...
 
“Building Accelerated GStreamer Applications for Video and Audio AI,” a Prese...
“Building Accelerated GStreamer Applications for Video and Audio AI,” a Prese...“Building Accelerated GStreamer Applications for Video and Audio AI,” a Prese...
“Building Accelerated GStreamer Applications for Video and Audio AI,” a Prese...
 
“Understanding, Selecting and Optimizing Object Detectors for Edge Applicatio...
“Understanding, Selecting and Optimizing Object Detectors for Edge Applicatio...“Understanding, Selecting and Optimizing Object Detectors for Edge Applicatio...
“Understanding, Selecting and Optimizing Object Detectors for Edge Applicatio...
 
“Introduction to Modern LiDAR for Machine Perception,” a Presentation from th...
“Introduction to Modern LiDAR for Machine Perception,” a Presentation from th...“Introduction to Modern LiDAR for Machine Perception,” a Presentation from th...
“Introduction to Modern LiDAR for Machine Perception,” a Presentation from th...
 
“Vision-language Representations for Robotics,” a Presentation from the Unive...
“Vision-language Representations for Robotics,” a Presentation from the Unive...“Vision-language Representations for Robotics,” a Presentation from the Unive...
“Vision-language Representations for Robotics,” a Presentation from the Unive...
 
“ADAS and AV Sensors: What’s Winning and Why?,” a Presentation from TechInsights
“ADAS and AV Sensors: What’s Winning and Why?,” a Presentation from TechInsights“ADAS and AV Sensors: What’s Winning and Why?,” a Presentation from TechInsights
“ADAS and AV Sensors: What’s Winning and Why?,” a Presentation from TechInsights
 
“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...
“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...
“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...
 
“Detecting Data Drift in Image Classification Neural Networks,” a Presentatio...
“Detecting Data Drift in Image Classification Neural Networks,” a Presentatio...“Detecting Data Drift in Image Classification Neural Networks,” a Presentatio...
“Detecting Data Drift in Image Classification Neural Networks,” a Presentatio...
 
“Deep Neural Network Training: Diagnosing Problems and Implementing Solutions...
“Deep Neural Network Training: Diagnosing Problems and Implementing Solutions...“Deep Neural Network Training: Diagnosing Problems and Implementing Solutions...
“Deep Neural Network Training: Diagnosing Problems and Implementing Solutions...
 
“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...
“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...
“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...
 
“A Computer Vision System for Autonomous Satellite Maneuvering,” a Presentati...
“A Computer Vision System for Autonomous Satellite Maneuvering,” a Presentati...“A Computer Vision System for Autonomous Satellite Maneuvering,” a Presentati...
“A Computer Vision System for Autonomous Satellite Maneuvering,” a Presentati...
 
“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...
“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...
“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...
 
“Sensor Fusion Techniques for Accurate Perception of Objects in the Environme...
“Sensor Fusion Techniques for Accurate Perception of Objects in the Environme...“Sensor Fusion Techniques for Accurate Perception of Objects in the Environme...
“Sensor Fusion Techniques for Accurate Perception of Objects in the Environme...
 
“Updating the Edge ML Development Process,” a Presentation from Samsara
“Updating the Edge ML Development Process,” a Presentation from Samsara“Updating the Edge ML Development Process,” a Presentation from Samsara
“Updating the Edge ML Development Process,” a Presentation from Samsara
 
“Combating Bias in Production Computer Vision Systems,” a Presentation from R...
“Combating Bias in Production Computer Vision Systems,” a Presentation from R...“Combating Bias in Production Computer Vision Systems,” a Presentation from R...
“Combating Bias in Production Computer Vision Systems,” a Presentation from R...
 
“Developing an Embedded Vision AI-powered Fitness System,” a Presentation fro...
“Developing an Embedded Vision AI-powered Fitness System,” a Presentation fro...“Developing an Embedded Vision AI-powered Fitness System,” a Presentation fro...
“Developing an Embedded Vision AI-powered Fitness System,” a Presentation fro...
 
“Navigating the Evolving Venture Capital Landscape for Edge AI Start-ups,” a ...
“Navigating the Evolving Venture Capital Landscape for Edge AI Start-ups,” a ...“Navigating the Evolving Venture Capital Landscape for Edge AI Start-ups,” a ...
“Navigating the Evolving Venture Capital Landscape for Edge AI Start-ups,” a ...
 
“Advanced Presence Sensing: What It Means for the Smart Home,” a Presentation...
“Advanced Presence Sensing: What It Means for the Smart Home,” a Presentation...“Advanced Presence Sensing: What It Means for the Smart Home,” a Presentation...
“Advanced Presence Sensing: What It Means for the Smart Home,” a Presentation...
 

Último

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 

Último (20)

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 

"Approaches for Energy Efficient Implementation of Deep Neural Networks," a Presentation from MIT

  • 1. Copyright © 2018 Massachusetts Institute of Technology 1 Vivienne Sze May 23, 2018 Approaches for Energy Efficient Implementation of Deep Neural Networks In collaboration with Yu-Hsin Chen, Joel Emer, Tien-Ju Yang
  • 2. Copyright © 2018 Massachusetts Institute of Technology 2 Video is the Biggest Big Data Need energy-efficient pixel processing! Over 70% of today’s Internet traffic is video Over 300 hours of video uploaded to YouTube every minute Over 500 million hours of video surveillance collected every day Energy limited due to ba1ery capacity Power limited due to heat dissipa8on
  • 3. Copyright © 2018 Massachusetts Institute of Technology 3 Increased Accuracy with Deep Learning Deep Learning requires significantly more computa5on than previous approaches 0 5 10 15 20 25 30 2010 2011 2012 2013 2014 2015 Human ImageNet Top 5 Classifica3on Error (%) Large error reduc+on due to Deep Learning Hand-cra?ed feature- based designs Deep Learning- based designs [O. Russakovsky et al., IJCV, 2015]
  • 4. Copyright © 2018 Massachusetts Institute of Technology 4 Deep Convolutional Neural Networks Classes FC Layers Modern deep CNN: up to 1000 CONV layers CONV Layer CONV Layer Low-level Features High-level Features
  • 5. Copyright © 2018 Massachusetts Institute of Technology 5 CONV Layer CONV Layer Low-level Features High-level Features Classes FC Layers 1 – 3 layers Deep Convolutional Neural Networks
  • 6. Copyright © 2018 Massachusetts Institute of Technology 6 Deep Convolutional Neural Networks Classes CONV Layer CONV Layer FC Layers Convolutions account for more than 90% of overall computation, dominating runtime and energy consumption
  • 7. Copyright © 2018 Massachusetts Institute of Technology 7 High-Dimensional CNN Convolution R S H a plane of input activations a.k.a. input feature map (fmap) filter (weights) W
  • 8. Copyright © 2018 Massachusetts Institute of Technology 8 High-Dimensional CNN Convolution R filter (weights) S E F Partial Sum (psum) Accumulation input fmap output fmap Element-wise Multiplication H W an output activation
  • 9. Copyright © 2018 Massachusetts Institute of Technology 9 High-Dimensional CNN Convolution H R filter (weights) S E Sliding Window Processing input fmap an output activation output fmap W F
  • 10. Copyright © 2018 Massachusetts Institute of Technology 10 High-Dimensional CNN Convolution … E output fmap …… many filters (M) Many Output Channels (M) M …R S 1 R S … …… C … M H input fmap … …… … C … C …… … W F
  • 11. Copyright © 2018 Massachusetts Institute of Technology 11 High-Dimensional CNN Convolution … M … Many Input fmaps (N) Many Output fmaps (N) …R S R S … …… C … C …… … filters … E F …… H …… C … H W … …… … C … … E …… 1 1 N N W F Image batch size: 1 – 256 (N)
  • 12. Copyright © 2018 Massachusetts Institute of Technology 12 Large Size with Varying Shapes Layer Filter Size (R) # Filters (M) # Channels (C) Stride 1 11x11 96 3 4 2 5x5 256 48 1 3 3x3 384 256 1 4 3x3 384 192 1 5 3x3 256 192 1 AlexNet Convolu-onal Layer Configura-ons [Krizhevsky, NIPS 2012] 34k Params 307k Params 885k Params Layer 1 Layer 2 Layer 3 105M MACs 224M MACs 150M MACs
  • 13. Copyright © 2018 Massachusetts Institute of Technology 13 Popular DNNs •  LeNet (1998) •  AlexNet (2012) •  OverFeat (2013) •  VGGNet (2014) •  GoogleNet (2014) •  ResNet (2015) 0 2 4 6 8 10 12 14 16 18 2012 2013 2014 2015 Human Accuracy(Top5error) [O. Russakovsky et al., IJCV 2015] AlexNet OverFeat GoogLeNet ResNet Clarifai VGGNet ImageNet: Large Scale Visual RecogniFon Challenge (ILSVRC)
  • 14. Copyright © 2018 Massachusetts Institute of Technology 14 Popular DNNs.
Metrics LeNet-5 AlexNet VGG-16 GoogLeNet (v1) ResNet-50
Top-5 error n/a 16.4 7.4 6.7 5.3
Input Size 28x28 227x227 224x224 224x224 224x224
# of CONV Layers 2 5 16 21 (depth) 49
# of Weights (CONV) 2.6k 2.3M 14.7M 6.0M 23.5M
# of MACs (CONV) 283k 666M 15.3G 1.43G 3.86G
# of FC Layers 2 3 3 1 1
# of Weights (FC) 58k 58.6M 124M 1M 2M
# of MACs (FC) 58k 58.6M 124M 1M 2M
Total Weights 60k 61M 138M 7M 25.5M
Total MACs 341k 724M 15.5G 1.43G 3.9G
CONV Layers increasingly important!
  • 15. Copyright © 2018 Massachusetts Institute of Technology 15 Training versus Inference Training (determine weights) Weights Large Datasets Inference (use weights)
  • 16. Copyright © 2018 Massachusetts Institute of Technology 16 Key Metrics • Accuracy • Well defined dataset, DNN Model and task • Programmability • Support various DNN Models with different filter weights • Energy/Power • Energy per operation and DRAM Bandwidth • Throughput/Latency • GOPS, frame rate, delay, batch size • Cost • Area (memory and logic size). ImageNet, DRAM, Chip, Computer Vision, Speech Recognition. [Sze et al., CICC 2017]
  • 17. Copyright © 2018 Massachusetts Institute of Technology 17 GPUs and CPUs Targeting Deep Learning Xeon Phi “optimized for deep learning” Intel Knights Landing (2016) Intel Knights Mills (2017) Nvidia PASCAL GP100 (2016) Nvidia VOLTA GV100 (2017) Use matrix multiplication libraries on CPUs and GPUs
  • 18. Copyright © 2018 Massachusetts Institute of Technology 18 Accelerate Matrix Multiplication •  Implementation: Matrix Multiplication (GEMM) •  CPU: OpenBLAS, Intel MKL, etc •  GPU: cuBLAS, cuDNN, etc •  Optimized by tiling to storage hierarchy
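The libraries above get much of their speed from tiling (blocking) the GEMM so that sub-blocks of the matrices stay resident in each level of the storage hierarchy. A minimal single-level-blocking sketch (the tile size is an arbitrary illustrative choice, not any library's actual implementation):

```python
import numpy as np

def tiled_matmul(A, B, tile=64):
    """C = A @ B computed block by block so each working set can stay in cache."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))
    for i0 in range(0, M, tile):
        for j0 in range(0, N, tile):
            for k0 in range(0, K, tile):
                # One (tile x tile) block product; the A and B sub-blocks are
                # reused many times while they are resident in fast storage.
                C[i0:i0 + tile, j0:j0 + tile] += (
                    A[i0:i0 + tile, k0:k0 + tile] @ B[k0:k0 + tile, j0:j0 + tile])
    return C

A, B = np.random.rand(200, 300), np.random.rand(300, 150)
print(np.allclose(tiled_matmul(A, B), A @ B))   # True
```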
  • 19. Copyright © 2018 Massachusetts Institute of Technology 19 Map DNN to a Matrix Multiplication • Convert to matrix mult. using the Toeplitz Matrix. Convolution: 2x2 filter [1 2; 3 4] * 3x3 input fmap [1 2 3; 4 5 6; 7 8 9] = 2x2 output fmap. Matrix Mult: flattened filter [1 2 3 4] × Toeplitz matrix [1 2 4 5; 2 3 5 6; 4 5 7 8; 5 6 8 9] (one input patch per column, w/ redundant data) = flattened output fmap. Data is repeated. Goal: Reduced number of operations to increase throughput.
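This Toeplitz construction is what DNN libraries commonly implement as im2col: each filter-sized input patch becomes one column (with data duplicated across overlapping patches), so the whole convolution becomes a single matrix multiplication. A small sketch reproducing the 2x2-filter, 3x3-fmap example above:

```python
import numpy as np

def im2col_conv2d(fmap, filt):
    """Convolution (stride 1, no padding) as filter-vector x Toeplitz-matrix."""
    H, W = fmap.shape
    R, S = filt.shape
    E, F = H - R + 1, W - S + 1
    # One column per R x S input patch; overlapping patches duplicate data.
    cols = np.stack([fmap[e:e + R, f:f + S].ravel()
                     for e in range(E) for f in range(F)], axis=1)
    return (filt.ravel() @ cols).reshape(E, F)

fmap = np.arange(1, 10).reshape(3, 3)   # [[1 2 3] [4 5 6] [7 8 9]]
filt = np.arange(1, 5).reshape(2, 2)    # [[1 2] [3 4]]
print(im2col_conv2d(fmap, filt))        # [[37 47] [67 77]]
```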
  • 20. Copyright © 2018 Massachusetts Institute of Technology 20 Computation Transformations • Goal: Bitwise same result, but reduce number of operations • Focuses mostly on compute
  • 21. Copyright © 2018 Massachusetts Institute of Technology 21 Analogy: Gauss’s Multiplication Algorithm 4 multiplications + 3 additions 3 multiplications + 5 additions Reduce number of multiplications, but increase number of additions
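The analogy is presumably Gauss's trick for complex multiplication, which trades one multiplication for extra additions (an assumption here, since the slide shows only the operation counts). A quick numerical check:

```python
def complex_mul_naive(a, b, c, d):
    # (a + bi)(c + di) with 4 multiplications
    return a * c - b * d, a * d + b * c

def complex_mul_gauss(a, b, c, d):
    # Gauss: 3 multiplications, 5 additions/subtractions
    k1 = c * (a + b)
    k2 = a * (d - c)
    k3 = b * (c + d)
    return k1 - k3, k1 + k2

print(complex_mul_naive(3, 4, 5, 6))   # (-9, 38)
print(complex_mul_gauss(3, 4, 5, 6))   # (-9, 38): same result, one fewer multiply
```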
  • 22. Copyright © 2018 Massachusetts Institute of Technology 22 Reduce Operations in Matrix Multiplication • Winograd [Lavin, CVPR 2016] – Pro: 2.25x speed up for 3x3 filter – Con: Specialized processing depending on filter size • Fast Fourier Transform [Mathieu, ICLR 2014] – Pro: Direct convolution O(No^2 Nf^2) reduced to O(No^2 log2 No) – Con: Increased storage requirements • Strassen [Cong, ICANN 2014] – Pro: O(N^3) to O(N^2.807) – Con: Numerical stability
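As a sanity check on the FFT route, the sketch below verifies that pointwise multiplication in the frequency domain reproduces direct convolution; in floating point the results are numerically close rather than bitwise identical.

```python
import numpy as np

def fft_conv2d_valid(fmap, filt):
    """'Valid' 2-D linear convolution via pointwise multiplication in the frequency domain."""
    H, W = fmap.shape
    R, S = filt.shape
    size = (H + R - 1, W + S - 1)                       # zero-pad to avoid wrap-around
    full = np.fft.irfft2(np.fft.rfft2(fmap, size) * np.fft.rfft2(filt, size), size)
    return full[R - 1:H, S - 1:W]                       # crop to the 'valid' region

rng = np.random.default_rng(0)
fmap, filt = rng.standard_normal((8, 8)), rng.standard_normal((3, 3))
# Direct 'valid' convolution (note the filter flip: true convolution, not correlation).
direct = np.array([[np.sum(fmap[i:i + 3, j:j + 3] * filt[::-1, ::-1])
                    for j in range(6)] for i in range(6)])
print(np.allclose(fft_conv2d_valid(fmap, filt), direct))   # True (numerically close)
```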
  • 23. Copyright © 2018 Massachusetts Institute of Technology 23 cuDNN: Speed up with Transformations Source: Nvidia
  • 24. Copyright © 2018 Massachusetts Institute of Technology 24 Designing Specialized Hardware (Accelerators) for DNNs
  • 25. Copyright © 2018 Massachusetts Institute of Technology 25 Properties We Can Leverage • Operations exhibit high parallelism → high throughput possible • Memory Access is the Bottleneck: each MAC* requires memory reads for the filter weight, image pixel, and partial sum, plus a memory write for the updated partial sum; in the worst case these all go to DRAM, and a DRAM access costs roughly 200x the energy of the MAC itself (1x). (* multiply-and-accumulate)
  • 26. Copyright © 2018 Massachusetts Institute of Technology 26 Properties We Can Leverage • Operations exhibit high parallelism → high throughput possible • Input data reuse opportunities (up to 500x) → exploit low-cost memory: Convolutional Reuse (pixels, weights) within one filter/image pair, Image Reuse (pixels) across filters, and Filter Reuse (weights) across images.
  • 27. Copyright © 2018 Massachusetts Institute of Technology 27 Highly-Parallel Compute Paradigms. Temporal Architecture (SIMD/SIMT): centralized control, register file, and memory hierarchy feeding an array of ALUs. Spatial Architecture (Dataflow Processing): memory hierarchy feeding an array of ALUs that can pass data directly to one another.
  • 28. Copyright © 2018 Massachusetts Institute of Technology 28 Advantages of Spatial Architecture (Dataflow Processing) over Temporal Architecture (SIMD/SIMT): Efficient Data Reuse through distributed local storage (each Processing Element (PE) has its own control and a 0.5 – 1.0 kB Reg File) and Inter-PE Communication (sharing among regions of PEs).
  • 29. Copyright © 2018 Massachusetts Institute of Technology 29 Data Movement is Expensive. Maximize data reuse at low-cost levels of the hierarchy. Normalized energy cost* of fetching data to run a MAC: ALU 1× (reference); RF (0.5 – 1.0 kB) → ALU 1×; PE → ALU 2×; Global Buffer (100 – 500 kB) → ALU 6×; DRAM → ALU 200×. NoC: 200 – 1000 PEs. * measured from a commercial 65nm process
  • 30. Copyright © 2018 Massachusetts Institute of Technology 30 Weight Stationary (WS): weights (W0 … W7) stay resident in the PE register files while activations and psums stream through from the Global Buffer. • Minimize weight read energy consumption − maximize convolutional and filter reuse of weights • Examples: [nn-X (NeuFlow), CVPRW 2014] [Park, ISSCC 2015] [Origami, GLSVLSI 2015] [Google TPU, ISCA 2017]
  • 31. Copyright © 2018 Massachusetts Institute of Technology 31 Output Stationary (OS): partial sums (P0 … P7) stay resident in the PEs while pixels and weights stream through from the Global Buffer. • Minimize partial sum R/W energy consumption − maximize local accumulation • Examples: [Gupta, ICML 2015] [ShiDianNao, ISCA 2015] [ENVISION, ISSCC 2017] [Thinker, JSSC 2017]
  • 32. Copyright © 2018 Massachusetts Institute of Technology 32 No Local Reuse (NLR): pixels, weights, and psums all move between the PEs and a large shared Global Buffer. • Use a large global buffer as shared storage − Reduce DRAM access energy consumption • Examples: [DianNao, ASPLOS 2014] [DaDianNao, MICRO 2014] [Zhang, FPGA 2015]
  • 33. Copyright © 2018 Massachusetts Institute of Technology 33 Row Stationary Dataflow: each PE computes a 1-D convolution of one filter row with one input fmap row, and the PEs are arranged so that output Row 1 = (filter Row 1 * input Row 1 in PE 1) + (filter Row 2 * input Row 2 in PE 2) + (filter Row 3 * input Row 3 in PE 3); output Row 2 uses input Rows 2–4 in PEs 4–6; output Row 3 uses input Rows 3–5 in PEs 7–9. Optimize for overall energy efficiency instead of for only a certain data type.
  • 34. Copyright © 2018 Massachusetts Institute of Technology 34 Dataflow Comparison: CONV Layers. [Chart: Normalized Energy/MAC for DNN dataflows WS, OSA, OSB, OSC, NLR, RS, broken down into psums, weights, and pixels] RS optimizes for the best overall energy efficiency, resulting in 1.4× – 2.5× lower energy than the other dataflows.
  • 35. Copyright © 2018 Massachusetts Institute of Technology 35 Eyeriss Deep CNN Accelerator [Chen et al., ISSCC 2016]: a 14×12 spatial PE array (168 PEs) with a 108KB Global Buffer SRAM, compression/decompression (Comp/Decomp) and ReLU blocks, a 64-bit link to off-chip DRAM for filters, input images, output images, and psums, and separate link and core clocks. Fabricated in a 65nm process on a 4000 µm × 4000 µm die. Runs AlexNet @ 35 fps while consuming 278mW; >10x more energy efficient than a mobile GPU.
  • 36. Copyright © 2018 Massachusetts Institute of Technology 36 Features: Energy versus Accuracy. [Chart: Energy/Pixel (nJ, log scale) versus Accuracy (Average Precision) for HOG1, AlexNet2, and VGG-162, with video compression as a reference point; energy grows exponentially for a linear gain in accuracy] Measured in 65nm*: 1. [Suleiman, VLSI 2016] 2. [Chen, ISSCC 2016]. * Only feature extraction; does not include data augmentation, ensemble, and classification energy, etc. Measured on the VOC 2007 dataset: 1. DPM v5 [Girshick, 2012] 2. Fast R-CNN [Girshick, CVPR 2015]. [Suleiman et al., ISCAS 2017]
  • 37. Copyright © 2018 Massachusetts Institute of Technology 37 Designing Efficient DNN Models
  • 38. Copyright © 2018 Massachusetts Institute of Technology 38 Approaches • Reduce size of operands for storage/compute • Floating point → Fixed point • Bit-width reduction • Non-linear quantization • Reduce number of operations for storage/compute • Exploit Activation Statistics (Compression) • Network Pruning • Compact Network Architectures
  • 39. Copyright © 2018 Massachusetts Institute of Technology 39 Commercial Products using 8-bit Integer Nvidia’s Pascal (2016) Google’s TPU (2016)
  • 40. Copyright © 2018 Massachusetts Institute of Technology 40 Reduced Precision in Research • Reduce number of bits • Binary Nets [Courbariaux, NIPS 2015] • Reduce number of unique weights • Ternary Weight Nets [Li, arXiv 2016] • XNOR-Net [Rastegari, ECCV 2016] • Non-Linear Quantization • LogNet [Lee, ICASSP 2017] (log domain quantization)
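To make the two quantization families concrete, here is a small sketch of uniform (fixed-point style) quantization next to log-domain quantization; the bit-widths and scaling scheme are illustrative choices, not those of the cited works.

```python
import numpy as np

def quantize_linear(w, bits=8):
    """Uniform quantization: scale to integer codes, then dequantize."""
    scale = np.max(np.abs(w)) / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

def quantize_log(w, bits=4):
    """Log-domain quantization: keep the sign, snap the magnitude to a power of two."""
    sign = np.sign(w)
    exp = np.clip(np.round(np.log2(np.abs(w) + 1e-12)),
                  -2 ** (bits - 1), 2 ** (bits - 1) - 1)
    return sign * 2.0 ** exp

w = np.random.default_rng(0).standard_normal(5)
print(w)
print(quantize_linear(w))   # close to w at 8-bit resolution
print(quantize_log(w))      # coarser: magnitudes are powers of two (multiplies become shifts)
```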
  • 41. Copyright © 2018 Massachusetts Institute of Technology 41 Sparsity in Feature Map: many zeros in output fmaps after ReLU. Example: ReLU([9 -1 -3; 1 -5 5; -2 6 -1]) = [9 0 0; 1 0 5; 0 6 0]. [Chart: # of non-zero activations, normalized to # of activations, for CONV layers 1–5]
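The effect the slide measures is easy to reproduce in miniature: after ReLU a large fraction of activations are exactly zero. A tiny sketch on random data (purely illustrative; real networks show the layer-dependent sparsity in the chart):

```python
import numpy as np

pre_activation = np.random.default_rng(0).standard_normal((64, 13, 13))  # fake CONV output
post_relu = np.maximum(pre_activation, 0)
print(f"{np.mean(post_relu == 0):.0%} of activations are zero after ReLU")  # ~50% here
```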
  • 42. Copyright © 2018 Massachusetts Institute of Technology 42 Exploit Sparsity. Method 1: skip memory access and computation. A zero-detection (== 0) flag stored in a zero buffer enables zero data skipping: the scratch pad and register file see no R/W and the datapath does no switching when the activation is zero; 45% energy savings. [Chen et al., ISSCC 2016]
  • 43. Copyright © 2018 Massachusetts Institute of Technology 43 Exploit Sparsity. Method 2: compress data to reduce storage and data movement. [Chart: DRAM access (MB) for AlexNet CONV layers 1–5, uncompressed fmaps + weights vs. RLE-compressed fmaps + weights; compression ratios of 1.2×, 1.4×, 1.7×, 1.8×, 1.9× for layers 1–5] [Chen et al., ISSCC 2016]
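Method 2 compresses the mostly-zero fmaps before they move to DRAM. The chip uses a run-length scheme; the sketch below shows a generic zero run-length code, not the exact on-chip format, applied to the post-ReLU fmap from slide 41.

```python
def rle_encode(x):
    """Encode a 1-D activation stream as (zero_run_length, nonzero_value) pairs."""
    pairs, run = [], 0
    for v in x:
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    if run:
        pairs.append((run, 0))          # trailing zeros with no value following
    return pairs

def rle_decode(pairs):
    out = []
    for run, v in pairs:
        out.extend([0] * run)
        if v != 0:
            out.append(v)
    return out

x = [9, 0, 0, 1, 0, 5, 0, 6, 0]         # post-ReLU fmap from slide 41, row by row
enc = rle_encode(x)
print(enc)                               # [(0, 9), (2, 1), (1, 5), (1, 6), (1, 0)]
assert rle_decode(enc) == x
```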
  • 44. Copyright © 2018 Massachusetts Institute of Technology 44 Pruning – Make Weights Sparse (prune, then retrain, iteratively). Optimal Brain Damage [LeCun et al., NIPS 1989]. Prune DNN based on magnitude of weights [Han et al., NIPS 2015]. Example: AlexNet Weight Reduction: CONV layers 2.7x, FC layers 9.9x; Overall Reduction: Weights 9x, MACs 3x
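A minimal sketch of the magnitude-based pruning step described above (the pruning fraction is an illustrative parameter, and the retraining that follows each pruning round is left as a comment):

```python
import numpy as np

def prune_by_magnitude(weights, fraction=0.5):
    """Zero out the smallest-magnitude weights; returns pruned weights and the mask."""
    threshold = np.quantile(np.abs(weights), fraction)
    mask = np.abs(weights) > threshold
    return weights * mask, mask

W = np.random.default_rng(0).standard_normal((4, 4))
W_pruned, mask = prune_by_magnitude(W, fraction=0.5)
print(f"{1 - mask.mean():.0%} of weights removed")
# In practice: retrain with the mask fixed, then prune again (iterative prune/retrain loop).
```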
  • 45. Copyright © 2018 Massachusetts Institute of Technology 45 Network Architecture Design: build network with a series of small filters. Decompose a 5x5 filter into two 3x3 filters applied sequentially (VGG-16), or into separable 5x1 and 1x5 filters applied sequentially (GoogleNet / Inception v3).
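The savings from building with small filters is easy to tally per output position and per input-output channel pair: a 5x5 filter costs 25 weights/MACs, two stacked 3x3 filters cost 18 (same 5x5 receptive field, though not an identical function in general), and separable 5x1 + 1x5 filters cost 10.

```python
# Weights / MACs per output position for one input-output channel pair.
full_5x5      = 5 * 5            # 25
two_3x3       = 3 * 3 + 3 * 3    # 18: same 5x5 receptive field, applied sequentially
separable_5x5 = 5 * 1 + 1 * 5    # 10: 5x1 followed by 1x5 (Inception v3-style)
print(full_5x5, two_3x3, separable_5x5)   # 25 18 10
```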
  • 46. Copyright © 2018 Massachusetts Institute of Technology 46 1x1 Bottleneck in Popular DNN Models: 1x1 convolutions compress (and later expand) the number of channels in GoogleNet, ResNet, and SqueezeNet.
  • 47. Copyright © 2018 Massachusetts Institute of Technology 47 Understanding the Limitations of Existing Energy-Efficient Design Approaches for Deep Neural Networks [Y.-H. Chen et al., SysML Conference, February 2018]
  • 48. Copyright © 2018 Massachusetts Institute of Technology 48 Energy-Efficient Processing of DNNs. V. Sze, Y.-H. Chen, T-J. Yang, J. Emer, “Efficient Processing of Deep Neural Networks: A Tutorial and Survey,” Proceedings of the IEEE, Dec. 2017. A significant amount of algorithm and hardware research on energy-efficient processing of DNNs; we identified various limitations of existing approaches. http://eyeriss.mit.edu/tutorial.html
  • 49. Copyright © 2018 Massachusetts Institute of Technology 49 Design of Efficient DNN Algorithms • Popular efficient DNN algorithm approaches: Network Pruning and Compact Network Architectures (Examples: SqueezeNet, MobileNet); also reduced precision • Focus on reducing number of MACs and weights • Does it translate to energy savings?
  • 50. Copyright © 2018 Massachusetts Institute of Technology 50 Energy-Evaluation Methodology: from the CNN shape configuration (# of channels, # of filters, etc.) and the CNN weights and input data (e.g., [0.3, 0, -0.4, 0.7, 0, 0, 0.1, …]), a memory access optimization step produces the # of accesses at each memory level (1 … n) and a calculation step produces the # of MACs; combined with the hardware energy costs of each MAC and memory access, these give Edata and Ecomp and thus the per-layer (L1, L2, L3, …) and total CNN energy consumption. [Yang et al., CVPR 2017] Energy estimation tool available at http://eyeriss.mit.edu
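In spirit, the methodology weights the number of MACs and the number of accesses at each memory level by their hardware energy costs and sums them (E = Ecomp + Edata). A toy version of that bookkeeping, using the normalized costs from slide 29 and made-up access counts:

```python
# Toy energy estimate: E_total = E_comp + E_data.
# Normalized per-access energy costs (relative to one MAC), from slide 29.
COST = {"mac": 1.0, "rf": 1.0, "noc": 2.0, "buffer": 6.0, "dram": 200.0}

def estimate_energy(num_macs, accesses):
    """accesses maps memory level -> access count (the counts below are made up)."""
    e_comp = num_macs * COST["mac"]
    e_data = sum(COST[level] * count for level, count in accesses.items())
    return e_comp + e_data

energy = estimate_energy(
    num_macs=105e6,                                    # e.g., AlexNet CONV layer 1
    accesses={"rf": 300e6, "noc": 50e6, "buffer": 20e6, "dram": 5e6})
print(f"normalized energy: {energy:.3g} MAC-equivalents")
```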
  • 51. Copyright © 2018 Massachusetts Institute of Technology 51 Key Observations • Number of weights alone is not a good metric for energy • All data types should be considered. Energy Consumption of GoogLeNet: Output Feature Map 43%, Input Feature Map 25%, Weights 22%, Computation 10%. [Yang et al., CVPR 2017]
  • 52. Copyright © 2018 Massachusetts Institute of Technology 52 Energy Consumption of Existing DNNs. [Chart: Top-5 Accuracy vs. Normalized Energy Consumption (log scale) for the original AlexNet, SqueezeNet, GoogLeNet, ResNet-50, and VGG-16] Deeper CNNs with fewer weights do not necessarily consume less energy than shallower CNNs with more weights. [Yang et al., CVPR 2017] (v1.0; batch sizes between 44 and 48)
  • 53. Copyright © 2018 Massachusetts Institute of Technology 53 Magnitude-based Weight Pruning [Han et al., NIPS 2015]: reduce the number of weights by removing small-magnitude weights. [Chart: Top-5 Accuracy vs. Normalized Energy Consumption for the original DNNs (AlexNet, SqueezeNet, GoogLeNet, ResNet-50, VGG-16) and for magnitude-pruned AlexNet and SqueezeNet [6]] (v1.0)
  • 54. Copyright © 2018 Massachusetts Institute of Technology 54 Energy-Aware Pruning: directly target energy and incorporate it into the optimization of DNNs to provide greater energy savings. [Chart: Top-5 Accuracy vs. Normalized Energy Consumption for the original DNNs, magnitude-based pruning [6], and energy-aware pruning (this work); 1.74x] [Yang et al., CVPR 2017] (v1.0)
  • 55. Copyright © 2018 Massachusetts Institute of Technology 55 NetAdapt: Platform-Aware DNN Adaptation • Automatically adapt a DNN to a mobile platform to reach a target latency or energy budget • Use empirical measurements to guide optimization (avoid modeling of tool chain or platform architecture). NetAdapt takes a pretrained network and a budget (e.g., Latency 3.8, Energy 10.5), generates network proposals (A … Z), obtains empirical measurements for each proposal on the target platform (e.g., Latency 15.6 … 14.3, Energy 41 … 46), and outputs an adapted network. [Yang et al., arXiv 2018] In collaboration with Google’s Mobile Vision Team
  • 56. Copyright © 2018 Massachusetts Institute of Technology 56 Latency vs. Accuracy Tradeoff with NetAdapt •  NetAdapt boosts the real inference speed of MobileNet by 1.7x with higher accuracy +0.3% accuracy 1.7x faster +0.3% accuracy 1.6x faster *Tested on the ImageNet dataset and a Google Pixel 1 CPU
  • 57. Copyright © 2018 Massachusetts Institute of Technology 57 Many Efficient DNN Design Approaches: Network Pruning, Compact Network Architectures, Reduced Precision (32-bit float → 8-bit fixed → binary). No guarantee that the DNN algorithm designer will use a given approach. Need flexible hardware!
  • 58. Copyright © 2018 Massachusetts Institute of Technology 58 Existing DNN Architectures • Specialized DNN hardware often relies on certain properties of the DNN in order to achieve high energy-efficiency • Example: reduce memory access by amortizing across the MAC array (weight reuse and activation reuse between the weight memory, activation memory, and MAC array)
  • 59. Copyright © 2018 Massachusetts Institute of Technology 59 Limitation of Existing DNN Architectures • Example: reuse depends on # of channels and feature map/batch size • Not efficient across all network architectures (e.g., compact DNNs). A MAC array with spatial accumulation is dimensioned by the number of filters (output channels) and the number of input channels; one with temporal accumulation by the number of filters (output channels) and the feature map or batch size.
  • 60. Copyright © 2018 Massachusetts Institute of Technology 60 Eyexam: Understanding Sources of Inefficiencies in DNN Accelerators. A systematic way to evaluate how each architectural decision affects performance (throughput) for a given DNN workload; tightens the roofline model of performance (MAC/cycle) vs. operational intensity (MAC/data) from the theoretical peak performance. Step 1: maximum workload parallelism; Step 2: maximum dataflow parallelism; Step 3: # of active PEs under a finite PE array size (number of PEs); Step 4: # of active PEs under fixed PE array dimensions (peak performance); Step 5: # of active PEs under fixed storage capacity (workload operational intensity); Step 6: lower active PE utilization due to insufficient average BW; Step 7: lower active PE utilization due to insufficient instantaneous BW (slope = BW to only active PEs). [Chen et al., In Submission]
  • 61. Copyright © 2018 Massachusetts Institute of Technology 61 Eyeriss v2, building on Eyeriss (v1)'s on-chip buffer + spatial PE array [Chen et al., ISSCC 2016, ISCA 2016]. To efficiently support: • Wide range of filter shapes • Large and Compact • Different layers • e.g., CONV and FC • Wide range of sparsity • Dense and Sparse
  • 62. Copyright © 2018 Massachusetts Institute of Technology 62 Benchmarking Metrics for DNN Hardware How can we compare designs? V. Sze, Y.-H. Chen, T-J. Yang, J. Emer, “Efficient Processing of Deep Neural Networks: A Tutorial and Survey,” Proceedings of the IEEE, Dec. 2017
  • 63. Copyright © 2018 Massachusetts Institute of Technology 63 •  Accuracy •  Quality of result for a given task •  Throughput •  Analytics on high volume data •  Real-time performance (e.g., video at 30 fps) •  Latency •  For interactive applications (e.g., autonomous navigation) •  Energy and Power •  Edge and embedded devices have limited battery capacity •  Data centers have stringent power ceilings due to cooling costs •  Hardware Cost •  $$$ Metrics for DNN Hardware
  • 64. Copyright © 2018 Massachusetts Institute of Technology 64 •  Accuracy •  Difficulty of dataset and/or task should be considered •  Throughput •  Number of cores (include utilization along with peak performance) •  Runtime for running specific DNN models •  Latency •  Include batch size used in evaluation •  Energy and Power •  Power consumption for running specific DNN models •  Include external memory access •  Hardware Cost •  On-chip storage, number of cores, chip area + process technology Specifications to Evaluate Metrics
  • 65. Copyright © 2018 Massachusetts Institute of Technology 65 Example: Metrics of Eyeriss Chip.
Metric Units Input
Name of CNN Model Text AlexNet
Top-5 error (classification on ImageNet) # 19.8
Supported Layers All CONV
Bits per weight # 16
Bits per input activation # 16
Batch Size # 4
Runtime ms 115.3
Power mW 278
Off-chip Access per Image Inference MBytes 3.85
Number of Images Tested # 100
ASIC Specs Input
Process Technology 65nm LP TSMC (1.0V)
Total Core Area (mm2) 12.25
Total On-Chip Memory (kB) 192
Number of Multipliers 168
Clock Frequency (MHz) 200
Core area (mm2) / multiplier 0.073
On-Chip memory (kB) / multiplier 1.14
Measured or Simulated Measured
  • 66. Copyright © 2018 Massachusetts Institute of Technology 66 Comprehensive Coverage • All metrics should be reported for fair evaluation of design tradeoffs • Examples of what can happen if a certain metric is omitted: • Without the accuracy given for a specific dataset and task, one could run a simple DNN and claim low power, high throughput, and low cost – however, the processor might not be usable for a meaningful task • Without reporting the off-chip bandwidth, one could build a processor with only multipliers and claim low cost, high throughput, high accuracy, and low chip power – however, when evaluating system power, the off-chip memory access would be substantial • Are results measured or simulated? On what test data?
  • 67. Copyright © 2018 Massachusetts Institute of Technology 67 The evaluation process for whether a DNN system is a viable solution for a given application might go as follows: 1.  Accuracy determines if it can perform the given task 2.  Latency and throughput determine if it can run fast enough and in real-time 3.  Energy and power consumption will primarily dictate the form factor of the device where the processing can operate 4.  Cost, which is primarily dictated by the chip area, determines how much one would pay for this solution Evaluation Process
  • 68. Copyright © 2018 Massachusetts Institute of Technology 68 Summary • DNNs are a critical component in the AI revolution, delivering record-breaking accuracy on many important AI tasks for a wide range of applications; however, this comes at the cost of high computational complexity • Efficient processing of DNNs is an important area of research with many promising opportunities for innovation at various levels of hardware design, including algorithm co-design • When considering different DNN solutions it is important to evaluate with the appropriate workload in terms of both input and model, and recognize that they are evolving rapidly • It’s important to consider a comprehensive set of metrics when evaluating different DNN solutions: accuracy, speed, energy, and cost. Acknowledgements: This work is funded by the DARPA YFA grant, MIT Center for Integrated Circuits & Systems, and gifts from Intel, Nvidia and Google.
  • 69. Copyright © 2018 Massachusetts Institute of Technology 69 References • Overview Paper • V. Sze, Y.-H. Chen, T-J. Yang, J. Emer, “Efficient Processing of Deep Neural Networks: A Tutorial and Survey”, Proceedings of the IEEE, 2017 https://arxiv.org/pdf/1703.09039.pdf • More info about Eyeriss and Tutorial on DNN Architectures http://eyeriss.mit.edu • MIT Professional Education Course on “Designing Efficient Deep Learning Systems” http://professional-education.mit.edu/deeplearning • For updates on Eyeriss v2, Eyexam, NetAdapt, etc., see http://eyeriss.mit.edu or join the EEMS news mailing list
  • 70. Copyright © 2018 Massachusetts Institute of Technology 70 •  A. Suleiman*, Y.-H. Chen*, J. Emer, V. Sze, "Towards Closing the Energy Gap Between HOG and CNN Features for Embedded Vision," IEEE International Symposium of Circuits and Systems (ISCAS), May 2017. •  V. Sze, Y.-H. Chen, J. Emer, A. Suleiman, Z. Zhang, "Hardware for Machine Learning: Challenges and Opportunities," IEEE Custom Integrated Circuits Conference (CICC), Invited Paper, May 2017. •  Y.-H. Chen, T. Krishna, J. Emer, V. Sze, "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks," IEEE Journal of Solid State Circuits (JSSC), ISSCC Special Issue, Vol. 52, No. 1, pp. 127-138, January 2017. •  Y.-H. Chen, J. Emer, V. Sze, "Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks," International Symposium on Computer Architecture (ISCA), pp. 367-379, June 2016. •  Y.-H. Chen, T. Krishna, J. Emer, V. Sze, "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks," IEEE International Conference on Solid-State Circuits (ISSCC), pp. 262-264, February 2016. References
  • 71. Copyright © 2018 Massachusetts Institute of Technology 71 •  T.-J. Yang, A. Howard, B. Chen, X. Zhang, A. Go, V. Sze, H. Adam, "NetAdapt: Platform-Aware Neural Network Adaptation for Mobile Applications," arXiv, April 2018. •  Y.-H. Chen*, T.-J. Yang*, J. Emer, V. Sze, "Understanding the Limitations of Existing Energy-Efficient Design Approaches for Deep Neural Networks," SysML Conference, February 2018. •  V. Sze, T.-J. Yang, Y.-H. Chen, J. Emer, "Efficient Processing of Deep Neural Networks: A Tutorial and Survey," Proceedings of the IEEE, vol. 105, no. 12, pp. 2295-2329, December 2017. •  T.-J. Yang, Y.-H. Chen, J. Emer, V. Sze, "A Method to Estimate the Energy Consumption of Deep Neural Networks," Asilomar Conference on Signals, Systems and Computers, Invited Paper, October 2017. •  T.-J. Yang, Y.-H. Chen, V. Sze, "Designing Energy-Efficient Convolutional Neural Networks using Energy- Aware Pruning," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017. •  Y.-H. Chen, J. Emer, V. Sze, "Using Dataflow to Optimize Energy Efficiency of Deep Neural Network Accelerators," IEEE Micro's Top Picks from the Computer Architecture Conferences, May/June 2017. References
  • 72. Copyright © 2018 Massachusetts Institute of Technology 72 • M. Courbariaux, Y. Bengio, and J.-P. David, “Binaryconnect: Training deep neural networks with binary weights during propagations,” in NIPS, 2015. • F. Li and B. Liu, “Ternary weight networks,” in NIPS Workshop on Efficient Methods for Deep Neural Networks, 2016. • M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks,” in ECCV, 2016 • E. H. Lee, D. Miyashita, E. Chai, B. Murmann, and S. S. Wong, “LogNet: Energy-Efficient Neural Networks Using Logarithmic Computations,” in ICASSP, 2017. • F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer, “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size,” ICLR, 2017 • A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017. References
  • 73. Copyright © 2018 Massachusetts Institute of Technology 73 •  A. Lavin and S. Gray, “Fast algorithms for convolutional neural networks,” in CVPR, 2016. •  Y. LeCun, J. S. Denker, and S. A. Solla, “Optimal Brain Damage,” in NIPS, 1990. •  S. Han, J. Pool, J. Tran, and W. J. Dally, “Learning both weights and connections for efficient neural networks,” in NIPS, 2015. References