https://telecombcn-dl.github.io/2018-dlai/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both algorithmic and computational perspectives.
Convolutional Neural Networks - Veronica Vilaplana - UPC Barcelona 2018
1. [course site]
Verónica Vilaplana (veronica.vilaplana@upc.edu)
Associate Professor, Universitat Politècnica de Catalunya (Technical University of Catalonia)
Convolutional Neural Networks
Day 4 Lecture 1
#DLUPC
4. Neural networks for visual data
• Example: image recognition
• Given some input image, identify which object it contains
[figure: a sunflower image from the Caltech101 dataset; image size 150x112, so each neuron would be connected to 16800 inputs]
5. Neural networks for visual data
• We can design neural networks that are specifically adapted for such problems
• must deal with very high-dimensional inputs
• 150 x 112 pixels = 16800 inputs, or 3 x 16800 if RGB pixels
• can exploit the 2D topology of pixels (or 3D for video data)
• can build in invariance to certain variations we can expect
• translations, illumination, etc.
• Convolutional networks are a specialized kind of neural network for processing data that has a known, grid-like topology. They leverage these ideas:
• local connectivity
• parameter sharing
• pooling / subsampling of hidden units
Slide credit: H. Larochelle
6. Convolutional neural networks: local connectivity
• First idea: use a local connectivity of hidden units
• each hidden unit is connected only to a subregion (patch) of the input image: its receptive field
• it is connected to all channels
• 1 if greyscale image
• 3 (R, G, B) for color image
• ...
• Solves the following problems:
• a fully connected hidden layer would have an unmanageable number of parameters
• computing the linear activations of the hidden units would be very expensive
Slide credit: H. Larochelle
7. Convolutional neural networks: parameter sharing
• Second idea: share the matrix of parameters across certain units
• units organized into the same "feature map" share parameters
• hidden units within a feature map cover different positions in the image
• Solves the following problems:
• reduces even more the number of parameters
• will extract the same features at every position (features are "equivariant")
W_ij is the matrix connecting the ith input channel with the jth feature map.
Slide credit: H. Larochelle
8. Convolutional neural networks: parameter sharing
• Each feature map forms a 2D grid of features
• it can be computed with a discrete convolution of a kernel matrix k_ij, which is the hidden weights matrix W_ij with its rows and columns flipped
[figure: convolutions of an input image producing feature maps]
$y_j = f\Big(\sum_i k_{ij} * x_i\Big)$
where $x_i$ is the ith channel of the input and $y_j$ the jth feature map of the hidden layer.
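In code, the feature-map computation above looks as follows. This is a minimal NumPy/SciPy sketch (my own illustration, not code from the course); cross-correlating with W_ij is the same as convolving with the flipped kernel k_ij described on the slide, which is how deep learning frameworks actually implement "convolution".

```python
import numpy as np
from scipy.signal import correlate2d

def feature_map(x, W, b, f=np.tanh):
    """Compute one feature map y_j = f(sum_i k_ij * x_i + b).

    x: input of shape (C, H, W), one 2D array per channel
    W: weights of shape (C, kH, kW), one kernel per input channel
    b: scalar bias
    correlate2d(x_i, W_i) == convolution of x_i with the flipped kernel k_i.
    """
    out = sum(correlate2d(x[i], W[i], mode="valid") for i in range(x.shape[0]))
    return f(out + b)

x = np.random.randn(3, 150, 112)   # RGB input from the earlier example
W = np.random.randn(3, 5, 5)       # one 5x5 kernel per channel
y = feature_map(x, W, 0.0)
print(y.shape)                     # (146, 108)
```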
9. Convolutional neural networks
• Convolution as feature extraction: applying a filterbank
• but the filters are learned
[figure: input image and the resulting feature maps]
10. Convolutional neural networks: pooling and subsampling
• Third idea: pool hidden units in the same neighborhood
• pooling is performed in non-overlapping neighborhoods (subsampling)
• an alternative to "max" pooling is "average" pooling
• pooling reduces dimensionality and provides invariance to small local changes
Max pooling: $y_i(j,k) = \max_{p,q} x_i(j+p, k+q)$
11. Convolutional Neural Networks
• Convolutional neural networks alternate between convolutional layers (followed by a nonlinearity) and pooling layers (basic architecture)
• For recognition: the output layer is a regular, fully connected layer with softmax nonlinearity
• the output provides an estimate of the conditional probability of each class
• The network is trained by stochastic gradient descent (and variants)
• backpropagation is used similarly as in a fully connected network
• training learns the weights of the filters
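To make the basic architecture and training step concrete, here is a minimal sketch, assuming PyTorch (the lecture does not prescribe a framework, and the layer sizes here are arbitrary):

```python
import torch
import torch.nn as nn

# Basic architecture: conv + nonlinearity, pooling, then a fully
# connected layer whose softmax output estimates class probabilities.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 5 * 5, 10),      # logits; softmax is folded into the loss
)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()     # log-softmax + negative log-likelihood

x = torch.randn(8, 3, 32, 32)       # a dummy batch of 32x32 RGB images
y = torch.randint(0, 10, (8,))
loss = loss_fn(model(x), y)
loss.backward()                     # backpropagation, as in a fully connected net
opt.step()                          # one stochastic gradient descent update
```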
12. Convolutional Neural Networks
CNN = learning hierarchical representations with increasing levels of abstraction.
End-to-end training: joint optimization of features and classifier.
Fig. credit: DLBook
13. Example: LeNet-5
• LeCun et al., 1998
MNIST digit classification problem: handwritten digits, 60,000 training examples, 10,000 test samples, 10 classes, 28x28 grayscale images.
Conv filters were 5x5, applied at stride 1. Sigmoid or tanh nonlinearity.
Subsampling (average pooling) layers were 2x2, applied at stride 2.
Fully connected layers at the end, i.e. the architecture is [CONV-POOL-CONV-POOL-FC-FC].
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, 1998.
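A LeNet-style [CONV-POOL-CONV-POOL-FC-FC] network can be sketched in a few lines, again assuming PyTorch. This is an approximation for illustration, not LeCun's exact network (LeNet-5 has a particular connectivity between its second pooling and convolutional stages, and an extra FC stage):

```python
import torch.nn as nn

# LeNet-style sketch: 5x5 convs at stride 1, tanh nonlinearities,
# 2x2 average pooling at stride 2, fully connected layers at the end.
lenet = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.Tanh(),  # 28x28 -> 28x28x6
    nn.AvgPool2d(kernel_size=2, stride=2),                 # -> 14x14x6
    nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(),            # -> 10x10x16
    nn.AvgPool2d(kernel_size=2, stride=2),                 # -> 5x5x16
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120), nn.Tanh(),                 # FC
    nn.Linear(120, 10),                                    # 10 digit classes
)
```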
15. Convolutional Neural Networks
[figure: a regular 3-layer neural network vs. a ConvNet with 3 layers (input layer, hidden layer 1, hidden layer 2, output layer)]
• In ConvNets inputs are 'images' (the architecture is constrained)
• A ConvNet arranges neurons in three dimensions (width, height, depth)
• Every layer transforms a 3D input volume to a 3D output volume of neuron activations
17. Convolutional layer
• Convolution on a volume: a 32x32x3 input (width 32, height 32, depth 3) and a 5x5x3 filter
Filters always extend the full depth of the input volume.
Convolve the filter with the input, i.e. slide it over the input spatially, computing dot products.
18. Convolutional layer
• Convolution on a volume: 32x32x3 input, 5x5x3 filter w
Convolve (slide) over all spatial locations, producing a 28x28x1 activation map (or feature map).
Each number is the result of the dot product between the filter and a small 5x5x3 patch of the input: a 5x5x3 = 75-dimensional dot product plus a bias, $w^T x + b$.
19. Convolutional layer
• 32x32x3 input, 5x5x3 filter: consider a second filter
Convolving (sliding) it over all spatial locations gives a second 28x28x1 activation map.
20. Convolutional layer
If we have six 5x5x3 filters, we get 6 separate activation maps (each 28x28). We stack the maps up to get a new volume of size 28x28x6.
So applying a filterbank to an input (3D matrix) yields a cube-like output, a 3D matrix in which each slice is the output of convolution with one filter.
21. Convolutional layer
A ConvNet is a sequence of convolutional layers, interspersed with activation functions and pooling layers (and a small number of fully connected layers).
We add more layers of filters. We apply filters (convolutions) to the output volume of the previous layer. The result of each convolution is a slice in the new volume.
Example: 32x32x3 input → CONV + ReLU (6 filters 5x5x3) → 28x28x6 → CONV + ReLU (10 filters 5x5x6) → 24x24x10 → CONV + ReLU → ...
Slide credit: Stanford cs231_2017
22. Example: filters and activation maps
Example: a CNN trained for image recognition on the CIFAR dataset.
The network learns filters that activate when they see some specific type of feature at some spatial position in the input.
Stacking the activation maps for all filters along the depth dimension forms the full output volume.
http://cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.html
23. Convolutional layer
Hyperparameters: number of filters, filter spatial extent F, stride S, padding P.
Stride is the number of pixels by which we slide the kernel over the input matrix. A larger stride produces smaller feature maps.
stride 1: 7x7 input (spatially), a 3x3 filter: 5x5 output
[figure: the 3x3 filter sliding over the 7x7 input one pixel at a time]
24. Convolutional layer
Hyperparameters: number of filters, filter spatial extent F, stride S, padding P.
Stride is the number of pixels by which we slide the kernel over the input matrix. A larger stride produces smaller feature maps.
stride 2: 7x7 input (spatially), a 3x3 filter: 3x3 output
[figure: the 3x3 filter sliding over the 7x7 input two pixels at a time]
25. Convolutional layer
Hyperparameters: number of filters, filter spatial extent F, stride S, padding P.
No padding (P=0). Output size: (N-F)/S + 1
e.g. N=7, F=3:
stride 1: (7-3)/1+1 = 5
stride 2: (7-3)/2+1 = 3
stride 3: (7-3)/3+1 = 2.33, not applicable
Zero-padding in the border: pad the input volume with zeros around the border so that the input and output width and height are the same.
Padding P. Output size: (N-F+2P)/S + 1
e.g. N=7, F=3, S=1, pad with a 1-pixel border: output size 7x7
In general, CONV layers use stride 1, FxF filters, and zero-padding with P=(F-1)/2 to preserve size spatially.
[figure: a 7x7 input surrounded by a border of zeros]
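The output-size rule can be captured in a small helper (an illustration; the function name is my own):

```python
def conv_output_size(n, f, s=1, p=0):
    """Spatial output size of a conv layer: (N - F + 2P) / S + 1.
    Raises if the filter does not tile the input evenly."""
    if (n - f + 2 * p) % s != 0:
        raise ValueError("stride does not fit: output size is not an integer")
    return (n - f + 2 * p) // s + 1

print(conv_output_size(7, 3, s=1))        # 5
print(conv_output_size(7, 3, s=2))        # 3
# conv_output_size(7, 3, s=3) raises: (7-3)/3+1 = 2.33, not applicable
print(conv_output_size(7, 3, s=1, p=1))   # 7, size preserved with P=(F-1)/2
```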
26. 1x1 convolutions
1x1 convolution layers are used to reduce dimensionality (the number of feature maps).
Example: a 32x32x128 input volume convolved with 64 1x1 filters gives a 32x32x64 output; each filter has size 1x1x128 and performs a 128-dimensional dot product.
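Assuming PyTorch, the 128 → 64 reduction above is just a Conv2d with kernel_size=1:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 128, 32, 32)             # 32x32x128 input (channels first)
reduce = nn.Conv2d(128, 64, kernel_size=1)  # 64 filters of size 1x1x128
print(reduce(x).shape)                      # torch.Size([1, 64, 32, 32])
# Each output value is a 128-dimensional dot product (plus bias) across channels.
```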
27. Example: size, parameters
Input volume: 32x32x3; 10 5x5 filters with stride 1, padding 2.
Output volume size: (N + 2P - F)/S + 1 = (32 + 2*2 - 5)/1 + 1 = 32 spatially, so 32x32x10.
Number of parameters in this layer: each filter has 5x5x3 + 1 = 76 params (+1 for the bias) -> 76x10 = 760 parameters.
Slide credit: Stanford cs231_2017
28. Summary: conv layer
To summarize, the Conv Layer:
• Accepts a volume of size W1 x H1 x D1
• Requires four hyperparameters:
• number of filters K
• kernel size F
• stride S
• amount of zero padding P
• Produces a volume of size W2 x H2 x D2, where:
• W2 = (W1 - F + 2P)/S + 1
• H2 = (H1 - F + 2P)/S + 1
• D2 = K
• With parameter sharing, it introduces F·F·D1 weights per filter, for a total of (F·F·D1)·K weights and K biases
• In the output volume, the dth depth slice (of size W2 x H2) is the result of performing a valid convolution of the dth filter over the input volume with a stride of S, and then offset by the dth bias
Common settings: K = powers of 2 (32, 64, 128, 256); F=3, S=1, P=1; F=5, S=1, P=2; F=5, S=2, P = whatever fits; F=1, S=1, P=0
Slide credit: Stanford cs231_2017
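A short sketch in plain Python that applies these formulas and reproduces the 32x32x10 output and 760-parameter example from the previous slide (the helper name is my own):

```python
def conv_layer_stats(w1, h1, d1, k, f, s, p):
    """Output volume and parameter count for a conv layer (with sharing)."""
    w2 = (w1 - f + 2 * p) // s + 1
    h2 = (h1 - f + 2 * p) // s + 1
    weights = f * f * d1 * k           # F·F·D1 weights per filter, K filters
    biases = k                         # one bias per filter
    return (w2, h2, k), weights + biases

shape, n_params = conv_layer_stats(32, 32, 3, k=10, f=5, s=1, p=2)
print(shape, n_params)                 # (32, 32, 10) 760
```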
29. Activation functions
• Desirable properties: mostly smooth, continuous, differentiable, fairly linear
Sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$
tanh: $\tanh x = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
ReLU: $\max(0, x)$
Leaky ReLU: $\max(0.1x, x)$
Maxout: $\max(w_1^T x + b_1, w_2^T x + b_2)$
ELU
31. Pooling layer
• Makes the representations smaller and more manageable for later layers
• Useful to get invariance to small local changes
• Operates over each activation map independently
[figure: pooling downsamples a 180x120x64 volume to 90x60x64]
32. Pooling layer
• Max pooling
• Other pooling functions: average pooling
Example: max pooling on a single depth slice with a 2x2 filter and stride 2:
input:              output:
4 1 5 2 2 6         4 9 6
1 2 9 0 2 4         3 6 3
2 2 6 4 0 2
3 1 0 3 3 1
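The worked example above can be checked with a few lines, assuming PyTorch:

```python
import torch
import torch.nn as nn

x = torch.tensor([[4., 1., 5., 2., 2., 6.],
                  [1., 2., 9., 0., 2., 4.],
                  [2., 2., 6., 4., 0., 2.],
                  [3., 1., 0., 3., 3., 1.]]).reshape(1, 1, 4, 6)
pool = nn.MaxPool2d(kernel_size=2, stride=2)   # non-overlapping 2x2 windows
print(pool(x).squeeze())
# tensor([[4., 9., 6.],
#         [3., 6., 3.]])
```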
34. Summary: pooling layer
To summarize, the pooling layer:
• Accepts a volume of size W1 x H1 x D1
• Requires two hyperparameters:
• spatial extent F
• stride S
• Produces a volume of size W2 x H2 x D2, where:
• W2 = (W1 - F)/S + 1
• H2 = (H1 - F)/S + 1
• D2 = D1
• Introduces zero parameters, since it computes a fixed function of the input
• Common settings: F=2, S=2 or F=3, S=2
Pros:
- reduces the number of inputs to the next layer, allowing us to have more feature maps
- invariant to small translations of the input
Cons:
- after several layers of pooling we have lost information about the precise position of things
35. Fully connected layer
• At the end it is common to add one or more fully (or densely) connected layers
• Every neuron in the previous layer is connected to every neuron in the next layer (as in regular neural networks). Activation is computed as a matrix multiplication plus a bias
• At the output, softmax activation for classification
[figure: the output of the last convolutional layer is flattened to a single vector, which is input to a fully connected layer with 4 possible outputs; connections and weights not shown]
36. Fully connected layers and convolutional layers
• A convolutional layer can be implemented as a fully connected layer
• The weight matrix is a large matrix that is mostly zero except for certain blocks (due to local connectivity)
Example: I is a 4x4 input image, vectorized to 16x1 (Iv); Yv is the 4x1 output image (later reshaped to 2x2); h is a 3x3 kernel with weights w00 ... w22; C is the 4x16 weight matrix built from h.
$Y = I * h$, equivalently $Y_V = C \cdot I_V$
[figure: the vectorized image (x0 ... x15) multiplied by the sparse matrix C gives y0 ... y3]
37. Fully connected layers and convolutional layers
$Y = I * h$, equivalently $Y_V = C \cdot I_V$
I is a 4x4 input image, vectorized to 16x1 (Iv); Yv is the 4x1 output image (later reshaped to 2x2); h is a 3x3 kernel; C is the 4x16 weight matrix.
38. Fully connected layers and convolutional layers
• Fully connected layers can also be viewed as convolutions with kernels that cover the entire input region
Example: a fully connected layer with K=1024 neurons and input volume 32x32x512 can be expressed as a convolutional layer with K=1024 filters, F=32 (kernel size), P=0, S=1.
The filter size is exactly the size of the input volume: each 32x32x512 filter produces a 1x1x1 output, so with 1024 filters the output is 1x1x1024.
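This equivalence can be verified numerically; a sketch assuming PyTorch, where the FC weight matrix is reshaped into 1024 kernels of size 32x32x512:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 512, 32, 32)                # input volume 32x32x512

fc = nn.Linear(32 * 32 * 512, 1024)            # K=1024 fully connected neurons
conv = nn.Conv2d(512, 1024, kernel_size=32)    # K=1024 filters, F=32, P=0, S=1

# Copy the FC weights into the conv kernels: each 1x1 conv output is the
# same dot product over the whole input volume as one FC neuron computes.
conv.weight.data = fc.weight.data.view(1024, 512, 32, 32)
conv.bias.data = fc.bias.data

print(torch.allclose(fc(x.flatten(1)), conv(x).flatten(1), atol=1e-3))  # True
```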
39. Batch normalization layer
• As learning progresses, the distribution of the layer inputs changes due to parameter updates (internal covariate shift)
• This can result in most inputs being in the non-linear regime of the activation function, slowing down learning
• Batch normalization is a technique to reduce this effect
• Explicitly force the layer activations to have zero mean and unit variance w.r.t. running batch estimates:
$\hat{x}^{(k)} = \frac{x^{(k)} - E[x^{(k)}]}{\sqrt{Var[x^{(k)}]}}$
• Adds a learnable scale and bias term to allow the network to still use the nonlinearity:
$y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}$
Typical placement: FC/Conv → Batch norm → ReLU → FC/Conv → Batch norm → ReLU
Ioffe and Szegedy, 2015, "Batch normalization: accelerating deep network training by reducing internal covariate shift"
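The two equations translate directly into code; a minimal NumPy sketch of the training-time forward pass (my own illustration; the running estimates used at inference are omitted):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature k over the mini-batch, then rescale and shift.
    x: (batch, features); gamma, beta: (features,) learnable parameters."""
    mean = x.mean(axis=0)                      # E[x^(k)] over the mini-batch
    var = x.var(axis=0)                        # Var[x^(k)]
    x_hat = (x - mean) / np.sqrt(var + eps)    # zero mean, unit variance
    return gamma * x_hat + beta                # y^(k) = gamma^(k) x_hat + beta^(k)

x = np.random.randn(64, 100) * 3 + 5
y = batch_norm(x, gamma=np.ones(100), beta=np.zeros(100))
print(y.mean(), y.std())                       # ~0 and ~1
```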
40. Upsampling layers: recovering spatial shape
• Motivation: semantic segmentation. Make predictions for all pixels at once
Problem: convolutions at the original image resolution would be very expensive.
Slide credit: Stanford cs231_2017
41. Upsampling layers: recovering spatial shape
• Motivation: semantic segmentation. Make predictions for all pixels at once
Design a network as a sequence of convolutional layers, with downsampling and upsampling.
Other applications: super-resolution, flow estimation, generative modeling.
Slide credit: Stanford cs231_2017
42. Learnable upsampling
• Recall: 3x3 convolution, stride 1, pad 1
Slide credit: Stanford cs231_2017
43. Learnable upsampling
• Recall: 3x3 convolution, stride 2, pad 1
The filter moves 2 pixels in the input for every one pixel in the output: the stride gives the ratio between movement in the input and the output.
Slide credit: Stanford cs231_2017
44. Learnable upsampling: transposed convolution
• 3x3 transposed convolution, stride 2, pad 1
The filter moves 2 pixels in the output for every one pixel in the input: the stride gives the ratio between movement in the output and the input. Sum where the outputs overlap.
Various names: transposed convolution, backward strided convolution, fractionally strided convolution, upconvolution, "deconvolution".
Slide credit: Stanford cs231_2017
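Assuming PyTorch, a 3x3 transposed convolution with stride 2 that exactly doubles the spatial size looks like this (output_padding=1 resolves the output-size ambiguity):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 8, 8)                      # low-resolution feature map
up = nn.ConvTranspose2d(64, 32, kernel_size=3,
                        stride=2, padding=1, output_padding=1)
print(up(x).shape)                                # torch.Size([1, 32, 16, 16])
# Each input pixel "paints" a scaled copy of the learned 3x3 filter into the
# output, moving 2 output pixels per input pixel; overlaps are summed.
```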
48. AlexNet (2012)
• Similar framework to LeNet:
• 8 layers (5 convolutional, 3 fully connected)
• Max pooling, ReLU nonlinearities
• 650,000 units, 60 million parameters
• trained on two GPUs (half of the kernels on each GPU) for a week
• data augmentation
• dropout regularization
A. Krizhevsky, I. Sutskever, and G. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012
49. AlexNet (2012)
Full AlexNet architecture, 8 layers:
[227x227x3] INPUT
[55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0
[27x27x96] MAX POOL1: 3x3 filters at stride 2
[27x27x96] NORM1: normalization layer
[27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2
[13x13x256] MAX POOL2: 3x3 filters at stride 2
[13x13x256] NORM2: normalization layer
[13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1
[13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1
[13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1
[6x6x256] MAX POOL3: 3x3 filters at stride 2
[4096] FC6: 4096 neurons
[4096] FC7: 4096 neurons
[1000] FC8: 1000 neurons (class scores)
Details/retrospectives:
- first use of ReLU
- used Norm layers (no longer common)
- heavy data augmentation
- dropout 0.5
- batch size 128
- SGD with momentum 0.9
- learning rate 1e-2, reduced by 10 manually when validation accuracy plateaus
- L2 weight decay 5e-4
ILSVRC 2012 winner. Ensemble of 7 CNNs: 18.2% -> 15.4% error.
50. AlexNet (2012)
• Visualization of the 96 11x11 filters learned by the first layer
A. Krizhevsky, I. Sutskever, and G. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012
51. VGGNet-16 (2014)
Visual Geometry Group, Univ. of Oxford.
Sequence of deeper nets trained progressively. Large receptive fields replaced by stacks of 3x3 convolutions.
Only 3x3 CONV at stride 1, pad 1 and 2x2 MAX POOL at stride 2. 16-19 layers.
Shows that depth is a critical component for good performance.
TOTAL memory: 24M activations * 4 bytes ≈ 93 MB / image (forward only; roughly x2 for the backward pass); most memory is in the early CONV layers.
TOTAL params: 138M parameters; most parameters are in the late FC layers.
K. Simonyan and A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, ICLR 2015
52. GoogLeNet (2014)
Motivation:
• The most straightforward way of improving the performance of deep neural networks is by increasing their size, both depth and width
• Increasing the network size has two drawbacks:
• a larger number of parameters -> prone to overfitting
• dramatically increased use of computational resources
• Goal: increase the depth and width while keeping the computational budget constant
Compared to AlexNet: 12x fewer parameters (5M vs 60M), 2x more compute, 6.67% top-5 error (vs 16.4%), 22 layers.
C. Szegedy et al., Going deeper with convolutions, CVPR 2015
53. GoogLeNet (2014)
The Inception module
• Apply parallel operations on the input from the previous layer:
• multiple kernel sizes for convolution (1x1, 3x3, 5x5)
• a pooling operation
• Concatenate all filter outputs together depth-wise
• Use 1x1 convolutions for dimensionality reduction before the expensive convolutions
[figure: geometry of the Inception module]
Conv ops (multiply counts):
1x1 conv, 128: 28x28x128 x 1x1x256
1x1 conv, 64: 28x28x64 x 1x1x256
3x3 conv, 192: 28x28x192 x 3x3x64
1x1 conv, 64: 28x28x64 x 1x1x256
5x5 conv, 96: 28x28x96 x 5x5x64
1x1 conv, 64: 28x28x64 x 1x1x256
Total: 358M ops. Without the 1x1 reductions: 854M ops.
55. GoogLeNet (2014)
• Auxiliary classifiers
• features produced by the layers in the middle of the network should be very discriminative
• auxiliary classifiers connected to these intermediate layers were expected to strengthen discrimination in the lower stages of the network
• during training, their loss gets added to the total loss of the network with a discount weight (the losses of the auxiliary classifiers are weighted by 0.3)
• at inference time, the auxiliary classifiers are discarded
...and no fully connected layers needed!
[figure: GoogLeNet architecture with auxiliary classifiers; convolution, pooling and softmax blocks]
56. ResNet (2015)
Motivation
• Stacking more layers does not mean better performance
• with the network depth increasing, accuracy gets saturated and then degrades rapidly
• such degradation is not caused by overfitting
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, Deep Residual Learning for Image Recognition, CVPR 2016 (Best Paper)
57. ResNet (2015)
Residual block
• Hypothesis: the problem is an optimization problem; deeper models are harder to optimize
• The deeper model should be able to perform at least as well as the shallower model
• by construction: set the added layers to be identity mappings and copy the other layers from the learned shallower model
• Solution: use network layers to fit a residual mapping instead of directly trying to fit the desired underlying mapping
[figure: plain net vs. residual net]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, Deep Residual Learning for Image Recognition, CVPR 2016 (Best Paper)
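A residual block then computes y = F(x) + x; a minimal sketch assuming PyTorch, simplified from the paper's basic block (the stride and projection-shortcut variants are omitted):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = F(x) + x: two 3x3 convs fit the residual F(x) = H(x) - x."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        return self.relu(out + x)   # identity shortcut: the layers only have
                                    # to learn the residual, not the mapping

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)   # torch.Size([1, 64, 56, 56])
```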
58. ResNet (2015)
• Similar to GoogLeNet, use a bottleneck layer to improve efficiency
• Directly performing 3x3 convolutions with 256 feature maps at input and output: 256 x 256 x 3 x 3 ≈ 600K operations (per spatial position)
• Using 1x1 convolutions to reduce 256 to 64 feature maps, followed by 3x3 convolutions, followed by 1x1 convolutions to expand back to 256 maps:
256 x 64 x 1 x 1 ≈ 16K
64 x 64 x 3 x 3 ≈ 36K
64 x 256 x 1 x 1 ≈ 16K
Total: ≈ 70K
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, Deep Residual Learning for Image Recognition, CVPR 2016 (Best Paper)
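The bottleneck branch as a sketch, again assuming PyTorch (the identity shortcut that is summed with this branch is omitted for brevity); the three convolutions correspond to the ~16K, ~36K and ~16K multiply counts above:

```python
import torch.nn as nn

# 1x1 reduce (256->64, ~16K mults/position), 3x3 on the reduced maps
# (64->64, ~36K), 1x1 expand (64->256, ~16K): ~70K in total, vs ~600K
# for a direct 3x3 convolution on 256 input and output maps.
bottleneck = nn.Sequential(
    nn.Conv2d(256, 64, kernel_size=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 256, kernel_size=1),
)
```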
59. ResNet (2015)
ILSVRC 2015 winner (3.6% top-5 error).
MSRA swept the ILSVRC & COCO 2015 competitions:
- ImageNet Classification: "ultra deep", 152 layers
- ImageNet Detection: 16% better than 2nd
- ImageNet Localization: 27% better than 2nd
- COCO Detection: 11% better than 2nd
- COCO Segmentation: 12% better than 2nd
2-3 weeks of training on an 8-GPU machine; at runtime, faster than a VGGNet (even though it has 8x more layers)!
60. ILSVRC 2012-2015 summary

Team                                       Year  Place  Error (top-5)  External data
SuperVision - Toronto (AlexNet, 7 layers)  2012  -      16.4%          no
SuperVision                                2012  1st    15.3%          ImageNet 22k
Clarifai - NYU (7 layers)                  2013  -      11.7%          no
Clarifai                                   2013  1st    11.2%          ImageNet 22k
VGG - Oxford (16 layers)                   2014  2nd    7.32%          no
GoogLeNet (19 layers)                      2014  1st    6.67%          no
ResNet (152 layers)                        2015  1st    3.57%          -
Human expert*                              -     -      5.1%           -

http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/
61. Summary
• Convolutional neural networks are a specialized kind of neural network for processing data that has a known, grid-like topology
• CNNs leverage these ideas:
• local connectivity
• parameter sharing
• pooling / subsampling of hidden units
• Layers: convolutional, non-linear activation, pooling, upsampling, batch normalization
• Architectures for object recognition in images:
• LeNet: pioneer net for digit recognition
• AlexNet: smaller compute, still memory heavy, lower accuracy
• VGG: highest memory, most operations
• GoogLeNet: most efficient
• ResNet: moderate efficiency depending on model, better accuracy
• Inception-v4: hybrid of ResNet and Inception, highest accuracy