"Tailoring Convolutional Neural Networks for Low-Cost, Low-Power Implementation," a Presentation From Synopsys

Copyright © 2015 Synopsys Inc.
Bruno Lavigueur
12 May 2015
Tailoring CNNs for Low-cost,
Low-power Implementations

• Embedded vision subsystem, built from many silicon-proven IPs
• DesignWare: ARC HS processor, AXI, DMA, Memory Compiler, …
• HAPS FPGA-based rapid prototyping system
Synopsys at a Glance
• >5,300 Masters/PhD degrees
• >2,300 IP designers
• >1,500 applications engineers
• >$2.2B FY14 revenue
• 32% of revenue on R&D
• >9,300 employees

• Convolutional Neural Network (CNN)
• Wide range of detection and classification
possible
• The majority of the published CNN graphs
are not tailored for embedded
• Memory requirements
• Number of floating point operations (# of MAC)
• Yet CNNs have nice properties for parallelization on embedded devices
• Regular processing, feed-forward dataflow, no data-dependent
computation
• Key questions
• Can the size and complexity of the graph be reduced with minimal
impact on detection rates?
• Number of layers, connectivity, size of convolution
• What is the impact of moving from floating to fixed point?
CNN on Embedded Devices

How CNN Works (Once Trained)
• Multiple feature extraction layers
• Progressive refinement process
• Each successive layer extracts more complex features (higher level)
• Last layer performs classification
• Same computation (neuron) replicated multiple times
[Figure: input image → Layer 1 (low-level feature extraction, pooling & downsampling) → Layer 2 (mid-level features, partially connected) → Layer 3 (high-level features) → fully connected classification]

• Each layer of convolutions extracts progressively higher-level features
• Subsampling / max pooling to “zoom out” and detect bigger objects
with smaller convolutions
• Non-linear function on each neuron to activate it
Visualising a CNN
[Figure: output samples for Layer 1, Layer 2, Layer 3 and Layer 4]

• Convolution of multiple inputs together
• Fixed kernel size
• Optional subsampling
• 1, 2, 4x
• Optional max-pooling
• Very regular, repetitive computation
• Dominated by MAC
• Deterministic
• Non-linear activation function (sigmoid, hyperbolic tangent, rectifier)
CNN Computation
[Figure: M inputs I0 … IM-1 (XI * YI) are convolved with Z kernels (K * K) with associated weights, then passed through the activation (tanh, ReLU) to give N outputs O0 … ON-1 (XO * YO)]
$O_j = \mathrm{act}(B_j + (I_v \otimes K_w) + \dots)$, where $\otimes$ is the convolution
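The per-output computation on this slide (bias, plus a sum of convolutions, then an activation) can be sketched in plain Python. This is a minimal illustration, not the presenters' implementation: it assumes every input feeds every output (the slides also mention partially connected layers), stride 1, no padding, and no kernel flip. The names `conv2d_valid`, `cnn_layer` and `relu` are made up for this example.

```python
import math


def conv2d_valid(image, kernel):
    """Slide a K x K kernel over a 2-D map (stride 1, no padding, no kernel flip)."""
    k = len(kernel)
    rows = len(image) - k + 1
    cols = len(image[0]) - k + 1
    out = [[0.0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            acc = 0.0
            for dy in range(k):
                for dx in range(k):
                    acc += image[y + dy][x + dx] * kernel[dy][dx]  # one MAC
            out[y][x] = acc
    return out


def relu(value):
    """Rectifier activation."""
    return max(0.0, value)


def cnn_layer(inputs, kernels, biases, act=relu):
    """Compute N output maps from M input maps.

    inputs:  list of M 2-D maps
    kernels: kernels[j][v] is the K x K kernel linking input v to output j
             (assumption: fully connected inputs-to-outputs)
    biases:  list of N bias values
    """
    outputs = []
    for j, bias in enumerate(biases):
        partial = [conv2d_valid(inputs[v], kernels[j][v]) for v in range(len(inputs))]
        rows, cols = len(partial[0]), len(partial[0][0])
        outputs.append(
            [[act(bias + sum(p[y][x] for p in partial)) for x in range(cols)]
             for y in range(rows)]
        )
    return outputs


# Tiny example: M = 2 inputs of 3 x 3, N = 1 output, K = 2.
example_inputs = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[1, 0, 0], [0, 1, 0], [0, 0, 1]],
]
example_kernels = [[[[1, 0], [0, 1]], [[1, 1], [1, 1]]]]
example_out = cnn_layer(example_inputs, example_kernels, biases=[-10.0])
example_out_tanh = cnn_layer(example_inputs, example_kernels, biases=[-10.0], act=math.tanh)
```

The four nested loops in `conv2d_valid` show why the work is dominated by MACs and is fully deterministic: the loop bounds depend only on the map and kernel sizes, never on the data.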

• Given the nature of the algorithm,
there are many ways to accelerate
CNNs including:
• Vector / SIMD unit
• Systolic array / Streaming
• GPU
• Performance / Power / Area trade-offs will vary
• Depending on the architecture
• In all cases the main limitations will be
• Amount of closely coupled memory available
• Maximum number of Giga-MAC/s that can be sustained
• I/O bandwidth required & available
• Optimized data movement, efficient streaming
Moving Towards Embedded CNN
[Figure: EV Processor block diagram: RISC CPU with four 32-bit cores and a CNN Engine (array of PEs), linked by an interconnect to shared memory and DMA]

Moving CNN to Embedded Systems
• Graph complexity
• Number of layers (depth)
• Size of the convolution filters
• Number of connections between the layers
[Figure: graph from Input through Layer 1 to Layer 4, annotated with compute requirements, ALU width/cost, memory size, data precision and # coefficients; an inset shows a small image convolved with a filter and passed through an activation to produce a feature map]
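How the graph parameters drive compute and memory can be illustrated with a rough cost model. This is a sketch under stated assumptions, not a formula from the slides: it assumes every input map connects to every output map, one bias per output, and stride 1. The function name `conv_layer_cost` and the layer sizes in the example are hypothetical.

```python
def conv_layer_cost(m_inputs, n_outputs, k, out_w, out_h, bits_per_weight):
    """Rough cost of one convolution layer (assumption: all inputs feed all outputs)."""
    weights = m_inputs * n_outputs * k * k
    coefficients = weights + n_outputs          # weights plus one bias per output map
    macs = weights * out_w * out_h              # every weight is used once per output pixel
    weight_bytes = coefficients * bits_per_weight / 8
    return {"coefficients": coefficients, "macs": macs, "weight_bytes": weight_bytes}


# Hypothetical layer: 3 inputs, 16 outputs, 5x5 kernels, 28x28 output maps.
float_cost = conv_layer_cost(3, 16, 5, 28, 28, bits_per_weight=32)
fixed_cost = conv_layer_cost(3, 16, 5, 28, 28, bits_per_weight=16)
```

In this model, fewer layers or connections reduce both MACs and coefficients, smaller filters reduce both quadratically in K, and narrower data precision shrinks weight memory without changing the MAC count.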

• Starting point:
• MulticoreWare generated ~10 million faces/non-faces from over 200
Hollywood and Bollywood full-length movies
• Trained CNN to detect faces in those movies
Example of a Big & Small CNN Application
Metric         AlexNet-like        Embedded version
Weight space   400 MB              0.5 MB
Layers         10 (7 Cv + 3 FC)    5 (3 Cv + 2 FC)
Compute        200x                1x
Bandwidth      400x                1x
F1-score       .963                .905
Accuracy       .993                .981
VGA 30 FPS     4800 GOPS           24 GOPS
• Cv: Convolution layers (partially connected)
• FC: Fully connected layers
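The "VGA 30 FPS" row can be turned into a per-pixel budget with a little arithmetic. The sketch below assumes VGA means 640 x 480 pixels, which the slides do not state. The helper name `ops_per_pixel` is made up for this example.

```python
VGA_PIXELS = 640 * 480   # assumption: VGA = 640 x 480
FPS = 30


def ops_per_pixel(gops):
    """Operations available per input pixel at a sustained rate given in GOPS."""
    return gops * 1e9 / (VGA_PIXELS * FPS)


big = ops_per_pixel(4800)    # AlexNet-like graph
small = ops_per_pixel(24)    # embedded graph
ratio = big / small          # matches the 200x "Compute" row
```

Under that assumption, the embedded graph spends roughly 2,600 operations per input pixel.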

• Using standard open source projects to train networks with floating
point and GPU acceleration to explore network topology
• Cuda-convnet, Caffe, Theano
• Didn’t worry initially about numerical precision, as the literature has shown
CNNs are robust to precision
• From scratch: Small networks can be trained very fast
• Enables lots of shots on goal:
• Using scripting and many GPUs
• Number of network layers, convolutions, subsampling & pooling
• Explored huge space and quickly converged on a graph with good learning
• From an existing graph: Also worked backwards from high accuracy
large graph
• Iteratively reduced it and retrained the best ones
• Ended up with similar networks in both cases
Reducing Complexity of the Graph

• Improve F-1 score with classic techniques such as
• Data Normalization
• Hard negative mining (boosting)
• Annealing the learning rate
• Data Augmentation: Flip, Random Cropping, color space, …
• Moved initial system from F1 of ~.74 to ~.90
• Once the graph topology and training are satisfactory, look at the impact of
moving to fixed point
• The tests below use 31,437 positive and 263,145 negative samples
Training Optimizations
                 Initial    Optimized
True positive    19706      27093
False positive   1769       1335
False negative   11731      4344
F-1 Score        0.7449     0.9051
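The F-1 scores in the table follow from the true positive, false positive and false negative counts using the standard definition (harmonic mean of precision and recall). A short sketch, with the function name `f1_score` chosen for this example:

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


initial = f1_score(tp=19706, fp=1769, fn=11731)
optimized = f1_score(tp=27093, fp=1335, fn=4344)
```

In both columns, true positives plus false negatives add up to the 31,437 positive samples. Most of the gain comes from recall: 11,731 missed faces drop to 4,344.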

• Compare output of every layer with reference floating
point version
• Differences may grow after each layer
• Detection threshold might need to be tweaked to
achieve similar results
Moving to Fixed Point: Empirical Approach
[Figure: worked fixed-point example]
Image (greyscale, 8 bit pixels):
200  64   1
150  50   1
  1  10 220
Filter (convert to fixed-point based on range, e.g. 16 bit (Q2S13)):
 4  0
 0 -1
Convolution result in the accumulator (make sure the accumulator is wide enough, e.g. 32 bit (signed)):
750 255
590 -20
After the non-linear function (ReLU):
750 255
590   0
Feature map after shift + saturate:
255 127
255   0
• Shift-right values to avoid overflow, x = max(0, x) >> N
• Choose ‘N’ according to dynamic range of ‘x’ values
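The numbers in this figure can be reproduced with a few lines of integer arithmetic. This is a minimal sketch, not the presenters' code. The shift N = 1 and the saturation limit of 255 are inferred from the figure's values (for example 255 >> 1 = 127), and the function names are made up for this example.

```python
def conv2d_valid_int(image, kernel):
    """Integer 2-D convolution (stride 1, no padding, no kernel flip)."""
    k = len(kernel)
    rows, cols = len(image) - k + 1, len(image[0]) - k + 1
    acc = [[0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            for dy in range(k):
                for dx in range(k):
                    acc[y][x] += image[y + dy][x + dx] * kernel[dy][dx]
    return acc


def fits_signed(acc, bits=32):
    """Check that every accumulator value fits a signed integer of the given width."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return all(lo <= v <= hi for row in acc for v in row)


def relu_shift_saturate(acc, shift, max_value=255):
    """x = max(0, x) >> N, then clamp to the output range."""
    return [[min(max(0, v) >> shift, max_value) for v in row] for row in acc]


image = [[200, 64, 1], [150, 50, 1], [1, 10, 220]]   # 8-bit greyscale pixels
kernel = [[4, 0], [0, -1]]                           # filter values from the figure
acc = conv2d_valid_int(image, kernel)
feature_map = relu_shift_saturate(acc, shift=1)      # N = 1 is inferred, see lead-in
```

Python integers never overflow, so `fits_signed` stands in for the check that a hardware accumulator of a given width would be wide enough.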

• FDDB: Face Detection Data Set and Benchmark
• Results shown for the embedded small & fixed point graph
• Localization can be improved with pre/post processing
• Impacts scores
• Not done here
Results For Face Detection Application
Type                 F-1
Best (CascadeCNN)    0.91
Middle 10 average    0.85
Embedded – 40%       0.84
Embedded – 50%       0.82
Embedded rows: fixed point, 8 bit

• Design time configurable
• Number of CNN Processing Elements (2 to 8)
• Streaming interconnection network configured for number of cores
• Runtime reconfigurable
• Flexible point-to-point connections between all cores
• CNN-optimized instruction set
• Convolutions, MAC, LUT, …
• Micro-DMA & stream interface for data movement
• Programmable
• Using the generated C compiler
• Each CNN PE has a local data & program memory
Low-cost, Low-power, Flexible CNN Subsystem
[Figure: subsystem block diagram: 32-bit RISC multiprocessor (four 32-bit RISC cores with sync), interconnect, DMA, shared data memory (DMem), and a CNN Engine with PE 1 … PE 8 on a reconfigurable streaming interconnect]

Mapping Example and Performance
[Figure: 4 PE, 5 FIFO configuration. Layers L1 to L4 mapped onto four PEs (L1&4, L2, L3a, L3b), linked by FIFOs and attached to the subsystem interconnect]
• Input image read only once
• 30 cycles average to do 8 convolutions of 5x5 in parallel
• Including all data movement & contention
• Over 85% MAC resource utilization (8 MACs / CNN PE)
• ~15 mW per PE at 28 nm HPM
• With memory & interconnect
• Mapping on 4 processing elements
• Smaller layers merged together

Demonstrator
[Figure: demonstrator block diagram. ARC EV52 Processor (RISC multi-core with Core 1 and Core 2, CNN Engine with PE 1 … PE 8, memories (MEM), shared data memory, DMA, AXI subsystem interconnect) connected through an AXI interconnect to DDR and an ARC HS core, on a HAPS 70-S12 prototyping system; a workstation with a webcam is attached over AXI 2 UMRBus; the CNN graph is labelled in the diagram]
ARC HS Core
• Read in frame
• Pyramid (scaling)
• Non-max suppression
• Softmax
• Display the result
Host application streaming video frames to DDR over UMR-bus and back
HAPS 70-S12 Prototyping System clocked at 50 MHz (10% of real-time)
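Two of the post-processing steps named on this slide, softmax and non-max suppression, can be sketched as follows. The slides do not say which variants the demonstrator used. This example assumes a numerically stable softmax and greedy IoU-based suppression with a hypothetical threshold of 0.5, and all function names are made up for the illustration.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities (max subtracted for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, iou_threshold=0.5):
    """Greedy NMS: keep the best-scoring box, drop boxes that overlap it too much."""
    kept = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        if all(iou(box, kept_box) <= iou_threshold for _, kept_box in kept):
            kept.append((score, box))
    return kept


# Hypothetical detections: two overlapping boxes on one face, one box elsewhere.
detections = [
    (0.8, (1, 1, 11, 11)),
    (0.9, (0, 0, 10, 10)),
    (0.7, (20, 20, 30, 30)),
]
faces = non_max_suppression(detections)
probabilities = softmax([2.0, 0.0])   # e.g. face / non-face scores
```

Suppression is needed because a detector that evaluates many overlapping positions and scales typically fires several times on the same face.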

• CNN compute requirements can be dramatically reduced with a small impact
on the detection rates
• Works well when the number of object classes to detect is kept small
• Offline training is the critical step to obtain good performance
• Specialized and programmable hardware can be used to efficiently
implement many different CNN graphs
• Low power and area
• Some pre- and post-processing is needed to have a complete and useful
application
• CNN accelerator coupled with quad-core RISC cluster
• Useful to couple CNN with other processing steps to improve performance
• Shrinking the image when it doesn’t impact detection rates
• Sliding a detection window on an image
• Region of interest
Lessons Learned
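Shrinking the image and sliding a detection window over it interact directly: each halving of the image cuts the number of window positions by roughly four. The sketch below counts positions over an image pyramid. The 32-pixel window, stride of 8 and scale factor of 0.5 are hypothetical values chosen for illustration, not figures from the presentation, and the function names are made up.

```python
def pyramid_sizes(width, height, scale=0.5, min_size=32):
    """Image sizes of a pyramid, shrinking until the window no longer fits."""
    sizes = []
    w, h = width, height
    while w >= min_size and h >= min_size:
        sizes.append((w, h))
        w, h = int(w * scale), int(h * scale)
    return sizes


def window_positions(width, height, window=32, stride=8):
    """Top-left corners of every detection window that fits inside the image."""
    return [
        (x, y)
        for y in range(0, height - window + 1, stride)
        for x in range(0, width - window + 1, stride)
    ]


levels = pyramid_sizes(640, 480)
windows_per_level = [len(window_positions(w, h)) for w, h in levels]
```

Each window position is one evaluation of the CNN graph, so the per-level counts give a direct estimate of how much compute a smaller input or a region of interest saves.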

• Selected CNN papers
• Embedded facial image processing with Convolutional Neural
Networks
• http://liris.cnrs.fr/Documents/Liris-6072.pdf
• Memory-Centric Accelerator Design for Convolutional Neural
Networks
• http://parse.ele.tue.nl/system/attachments/58/original/iccdMP17.pdf?1381908921
• CNN tutorial & courses
• Stanford CNN course
• http://cs231n.github.io/
• Neural network intro and visualization
• http://colah.github.io/
• Synopsys DesignWare Embedded Vision Processors
• http://www.synopsys.com/ev
• More information and demo available at the Technology Showcase
(Mission City Ballroom, Tables 3 & 4)
Resources