https://telecombcn-dl.github.io/2018-dlai/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both algorithmic and computational perspectives.
Convolutional Neural Networks - Veronica Vilaplana - UPC Barcelona 2018
1. [course site]
Verónica Vilaplana (veronica.vilaplana@upc.edu)
Associate Professor, Universitat Politècnica de Catalunya (Technical University of Catalonia)
Convolutional Neural Networks
Day 4 Lecture 1
#DLUPC
4. Neural networks for visual data
• Example: image recognition
• Given some input image, identify which object it contains
[figure: a sunflower image from the Caltech101 dataset; image size 150x112, so each neuron would be connected to 16800 inputs]
5. Neural networks for visual data
• We can design neural networks that are specifically adapted for such problems
• must deal with very high-dimensional inputs
• 150 x 112 pixels = 16800 inputs, or 3 x 16800 if RGB pixels
• can exploit the 2D topology of pixels (or 3D for video data)
• can build in invariance to certain variations we can expect
• translations, illumination, etc.
• Convolutional networks are a specialized kind of neural network for processing data that has a known, grid-like topology. They leverage these ideas:
• local connectivity
• parameter sharing
• pooling / subsampling of hidden units
Slide credit: H. Larochelle
6. Convolutional neural networks: local connectivity
• First idea: use a local connectivity of hidden units
• each hidden unit is connected only to a subregion (patch) of the input image: its receptive field
• it is connected to all channels
• 1 if greyscale image
• 3 (R, G, B) for color image
• ...
• Solves the following problems:
• a fully connected hidden layer would have an unmanageable number of parameters
• computing the linear activations of the hidden units would be very expensive
Slide credit: H. Larochelle
7. Convolutional neural networks: parameter sharing
• Second idea: share the matrix of parameters across certain units
• units organized into the same "feature map" share parameters
• hidden units within a feature map cover different positions in the image
• Solves the following problems:
• reduces even more the number of parameters
• will extract the same features at every position (features are "equivariant")
W_ij is the matrix connecting the ith input channel with the jth feature map.
Slide credit: H. Larochelle
8. Convolutional neural networks: parameter sharing
• Each feature map forms a 2D grid of features
• it can be computed with a discrete convolution of a kernel matrix k_ij, which is the hidden weights matrix W_ij with its rows and columns flipped
[figure: convolutions of an input image producing feature maps]
$y_j = f\Big(\sum_i k_{ij} * x_i\Big)$
where $x_i$ is the ith channel of the input and $y_j$ the jth feature map of the hidden layer.
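In code, the feature-map computation above looks as follows. This is a minimal NumPy/SciPy sketch (my own illustration, not code from the course); cross-correlating with W_ij is the same as convolving with the flipped kernel k_ij described on the slide, which is how deep learning frameworks actually implement "convolution".

```python
import numpy as np
from scipy.signal import correlate2d

def feature_map(x, W, b, f=np.tanh):
    """Compute one feature map y_j = f(sum_i k_ij * x_i + b).

    x: input of shape (C, H, W), one 2D array per channel
    W: weights of shape (C, kH, kW), one kernel per input channel
    b: scalar bias
    correlate2d(x_i, W_i) == convolution of x_i with the flipped kernel k_i.
    """
    out = sum(correlate2d(x[i], W[i], mode="valid") for i in range(x.shape[0]))
    return f(out + b)

x = np.random.randn(3, 150, 112)   # RGB input from the earlier example
W = np.random.randn(3, 5, 5)       # one 5x5 kernel per channel
y = feature_map(x, W, 0.0)
print(y.shape)                     # (146, 108)
```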
9. Convolutional neural networks
• Convolution as feature extraction: applying a filterbank
• but the filters are learned
[figure: input image and the resulting feature maps]
10. Convolutional neural networks: pooling and subsampling
• Third idea: pool hidden units in the same neighborhood
• pooling is performed in non-overlapping neighborhoods (subsampling)
• an alternative to "max" pooling is "average" pooling
• pooling reduces dimensionality and provides invariance to small local changes
Max pooling: $y_i(j,k) = \max_{p,q} x_i(j+p, k+q)$
11. Convolutional Neural Networks
• Convolutional neural networks alternate between convolutional layers (followed by a nonlinearity) and pooling layers (basic architecture)
• For recognition: the output layer is a regular, fully connected layer with softmax nonlinearity
• the output provides an estimate of the conditional probability of each class
• The network is trained by stochastic gradient descent (and variants)
• backpropagation is used similarly as in a fully connected network
• training learns the weights of the filters
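To make the basic architecture and training step concrete, here is a minimal sketch, assuming PyTorch (the lecture does not prescribe a framework, and the layer sizes here are arbitrary):

```python
import torch
import torch.nn as nn

# Basic architecture: conv + nonlinearity, pooling, then a fully
# connected layer whose softmax output estimates class probabilities.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 5 * 5, 10),      # logits; softmax is folded into the loss
)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()     # log-softmax + negative log-likelihood

x = torch.randn(8, 3, 32, 32)       # a dummy batch of 32x32 RGB images
y = torch.randint(0, 10, (8,))
loss = loss_fn(model(x), y)
loss.backward()                     # backpropagation, as in a fully connected net
opt.step()                          # one stochastic gradient descent update
```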
12. Convolutional Neural Networks
CNN = learning hierarchical representations with increasing levels of abstraction.
End-to-end training: joint optimization of features and classifier.
Fig. credit: DLBook
13. Example: LeNet-5
• LeCun et al., 1998
MNIST digit classification problem: handwritten digits, 60,000 training examples, 10,000 test samples, 10 classes, 28x28 grayscale images.
Conv filters were 5x5, applied at stride 1. Sigmoid or tanh nonlinearity.
Subsampling (average pooling) layers were 2x2, applied at stride 2.
Fully connected layers at the end, i.e. the architecture is [CONV-POOL-CONV-POOL-FC-FC].
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, 1998.
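A LeNet-style [CONV-POOL-CONV-POOL-FC-FC] network can be sketched in a few lines, again assuming PyTorch. This is an approximation for illustration, not LeCun's exact network (LeNet-5 has a particular connectivity between its second pooling and convolutional stages, and an extra FC stage):

```python
import torch.nn as nn

# LeNet-style sketch: 5x5 convs at stride 1, tanh nonlinearities,
# 2x2 average pooling at stride 2, fully connected layers at the end.
lenet = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.Tanh(),  # 28x28 -> 28x28x6
    nn.AvgPool2d(kernel_size=2, stride=2),                 # -> 14x14x6
    nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(),            # -> 10x10x16
    nn.AvgPool2d(kernel_size=2, stride=2),                 # -> 5x5x16
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120), nn.Tanh(),                 # FC
    nn.Linear(120, 10),                                    # 10 digit classes
)
```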
15. Convolutional Neural Networks
[figure: a regular 3-layer neural network vs. a ConvNet with 3 layers (input layer, hidden layer 1, hidden layer 2, output layer)]
• In ConvNets inputs are 'images' (the architecture is constrained)
• A ConvNet arranges neurons in three dimensions (width, height, depth)
• Every layer transforms a 3D input volume to a 3D output volume of neuron activations
17. Convolutional layer
• Convolution on a volume: a 32x32x3 input (width 32, height 32, depth 3) and a 5x5x3 filter
Filters always extend the full depth of the input volume.
Convolve the filter with the input, i.e. slide it over the input spatially, computing dot products.
18. Convolutional layer
• Convolution on a volume: 32x32x3 input, 5x5x3 filter w
Convolve (slide) over all spatial locations, producing a 28x28x1 activation map (or feature map).
Each number is the result of the dot product between the filter and a small 5x5x3 patch of the input: a 5x5x3 = 75-dimensional dot product plus a bias, $w^T x + b$.
19. Convolutional layer
• 32x32x3 input, 5x5x3 filter: consider a second filter
Convolving (sliding) it over all spatial locations gives a second 28x28x1 activation map.
20. Convolutional layer
If we have six 5x5x3 filters, we get 6 separate activation maps (each 28x28). We stack the maps up to get a new volume of size 28x28x6.
So applying a filterbank to an input (3D matrix) yields a cube-like output, a 3D matrix in which each slice is the output of convolution with one filter.
21. Convolutional layer
A ConvNet is a sequence of convolutional layers, interspersed with activation functions and pooling layers (and a small number of fully connected layers).
We add more layers of filters. We apply filters (convolutions) to the output volume of the previous layer. The result of each convolution is a slice in the new volume.
Example: 32x32x3 input → CONV + ReLU (6 filters 5x5x3) → 28x28x6 → CONV + ReLU (10 filters 5x5x6) → 24x24x10 → CONV + ReLU → ...
Slide credit: Stanford cs231_2017
22. Example: filters and activation maps
Example: a CNN trained for image recognition on the CIFAR dataset.
The network learns filters that activate when they see some specific type of feature at some spatial position in the input.
Stacking the activation maps for all filters along the depth dimension forms the full output volume.
http://cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.html
23. Convolutional layer
Hyperparameters: number of filters, filter spatial extent F, stride S, padding P.
Stride is the number of pixels by which we slide the kernel over the input matrix. A larger stride produces smaller feature maps.
stride 1: 7x7 input (spatially), a 3x3 filter: 5x5 output
[figure: the 3x3 filter sliding over the 7x7 input one pixel at a time]
24. Convolutional layer
Hyperparameters: number of filters, filter spatial extent F, stride S, padding P.
Stride is the number of pixels by which we slide the kernel over the input matrix. A larger stride produces smaller feature maps.
stride 2: 7x7 input (spatially), a 3x3 filter: 3x3 output
[figure: the 3x3 filter sliding over the 7x7 input two pixels at a time]
25. Convolutional layer
Hyperparameters: number of filters, filter spatial extent F, stride S, padding P.
No padding (P=0). Output size: (N-F)/S + 1
e.g. N=7, F=3:
stride 1: (7-3)/1+1 = 5
stride 2: (7-3)/2+1 = 3
stride 3: (7-3)/3+1 = 2.33, not applicable
Zero-padding in the border: pad the input volume with zeros around the border so that the input and output width and height are the same.
Padding P. Output size: (N-F+2P)/S + 1
e.g. N=7, F=3, S=1, pad with a 1-pixel border: output size 7x7
In general, CONV layers use stride 1, FxF filters, and zero-padding with P=(F-1)/2 to preserve size spatially.
[figure: a 7x7 input surrounded by a border of zeros]
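The output-size rule can be captured in a small helper (an illustration; the function name is my own):

```python
def conv_output_size(n, f, s=1, p=0):
    """Spatial output size of a conv layer: (N - F + 2P) / S + 1.
    Raises if the filter does not tile the input evenly."""
    if (n - f + 2 * p) % s != 0:
        raise ValueError("stride does not fit: output size is not an integer")
    return (n - f + 2 * p) // s + 1

print(conv_output_size(7, 3, s=1))        # 5
print(conv_output_size(7, 3, s=2))        # 3
# conv_output_size(7, 3, s=3) raises: (7-3)/3+1 = 2.33, not applicable
print(conv_output_size(7, 3, s=1, p=1))   # 7, size preserved with P=(F-1)/2
```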
26. 1x1 convolutions
1x1 convolution layers are used to reduce dimensionality (the number of feature maps).
Example: a 32x32x128 input volume convolved with 64 1x1 filters gives a 32x32x64 output; each filter has size 1x1x128 and performs a 128-dimensional dot product.
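Assuming PyTorch, the 128 → 64 reduction above is just a Conv2d with kernel_size=1:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 128, 32, 32)             # 32x32x128 input (channels first)
reduce = nn.Conv2d(128, 64, kernel_size=1)  # 64 filters of size 1x1x128
print(reduce(x).shape)                      # torch.Size([1, 64, 32, 32])
# Each output value is a 128-dimensional dot product (plus bias) across channels.
```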
27. Example: size, parameters
Input volume: 32x32x3; 10 5x5 filters with stride 1, padding 2.
Output volume size: (N + 2P - F)/S + 1 = (32 + 2*2 - 5)/1 + 1 = 32 spatially, so 32x32x10.
Number of parameters in this layer: each filter has 5x5x3 + 1 = 76 params (+1 for the bias) -> 76x10 = 760 parameters.
Slide credit: Stanford cs231_2017
28. Summary: conv layer
To summarize, the Conv Layer:
• Accepts a volume of size W1 x H1 x D1
• Requires four hyperparameters:
• number of filters K
• kernel size F
• stride S
• amount of zero padding P
• Produces a volume of size W2 x H2 x D2, where:
• W2 = (W1 - F + 2P)/S + 1
• H2 = (H1 - F + 2P)/S + 1
• D2 = K
• With parameter sharing, it introduces F·F·D1 weights per filter, for a total of (F·F·D1)·K weights and K biases
• In the output volume, the dth depth slice (of size W2 x H2) is the result of performing a valid convolution of the dth filter over the input volume with a stride of S, and then offset by the dth bias
Common settings: K = powers of 2 (32, 64, 128, 256); F=3, S=1, P=1; F=5, S=1, P=2; F=5, S=2, P = whatever fits; F=1, S=1, P=0
Slide credit: Stanford cs231_2017
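A short sketch in plain Python that applies these formulas and reproduces the 32x32x10 output and 760-parameter example from the previous slide (the helper name is my own):

```python
def conv_layer_stats(w1, h1, d1, k, f, s, p):
    """Output volume and parameter count for a conv layer (with sharing)."""
    w2 = (w1 - f + 2 * p) // s + 1
    h2 = (h1 - f + 2 * p) // s + 1
    weights = f * f * d1 * k           # F·F·D1 weights per filter, K filters
    biases = k                         # one bias per filter
    return (w2, h2, k), weights + biases

shape, n_params = conv_layer_stats(32, 32, 3, k=10, f=5, s=1, p=2)
print(shape, n_params)                 # (32, 32, 10) 760
```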
29. Activation functions
• Desirable properties: mostly smooth, continuous, differentiable, fairly linear
Sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$
tanh: $\tanh x = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
ReLU: $\max(0, x)$
Leaky ReLU: $\max(0.1x, x)$
Maxout: $\max(w_1^T x + b_1, w_2^T x + b_2)$
ELU
31. Pooling layer
• Makes the representations smaller and more manageable for later layers
• Useful to get invariance to small local changes
• Operates over each activation map independently
[figure: pooling downsamples a 180x120x64 volume to 90x60x64]
32. Pooling layer
• Max pooling
• Other pooling functions: average pooling
Example: max pooling on a single depth slice with a 2x2 filter and stride 2:
input:              output:
4 1 5 2 2 6         4 9 6
1 2 9 0 2 4         3 6 3
2 2 6 4 0 2
3 1 0 3 3 1
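The worked example above can be checked with a few lines, assuming PyTorch:

```python
import torch
import torch.nn as nn

x = torch.tensor([[4., 1., 5., 2., 2., 6.],
                  [1., 2., 9., 0., 2., 4.],
                  [2., 2., 6., 4., 0., 2.],
                  [3., 1., 0., 3., 3., 1.]]).reshape(1, 1, 4, 6)
pool = nn.MaxPool2d(kernel_size=2, stride=2)   # non-overlapping 2x2 windows
print(pool(x).squeeze())
# tensor([[4., 9., 6.],
#         [3., 6., 3.]])
```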
34. Summary: pooling layer
To summarize, the pooling layer:
• Accepts a volume of size W1 x H1 x D1
• Requires two hyperparameters:
• spatial extent F
• stride S
• Produces a volume of size W2 x H2 x D2, where:
• W2 = (W1 - F)/S + 1
• H2 = (H1 - F)/S + 1
• D2 = D1
• Introduces zero parameters, since it computes a fixed function of the input
• Common settings: F=2, S=2 or F=3, S=2
Pros:
- reduces the number of inputs to the next layer, allowing us to have more feature maps
- invariant to small translations of the input
Cons:
- after several layers of pooling we have lost information about the precise position of things
35. Fully connected layer
• At the end it is common to add one or more fully (or densely) connected layers
• Every neuron in the previous layer is connected to every neuron in the next layer (as in regular neural networks). Activation is computed as a matrix multiplication plus a bias
• At the output, softmax activation for classification
[figure: the output of the last convolutional layer is flattened to a single vector, which is input to a fully connected layer with 4 possible outputs; connections and weights not shown]
36. Fully connected layers and convolutional layers
• A convolutional layer can be implemented as a fully connected layer
• The weight matrix is a large matrix that is mostly zero except for certain blocks (due to local connectivity)
Example: I is a 4x4 input image, vectorized to 16x1 (Iv); Yv is the 4x1 output image (later reshaped to 2x2); h is a 3x3 kernel with weights w00 ... w22; C is the 4x16 weight matrix built from h.
$Y = I * h$, equivalently $Y_V = C \cdot I_V$
[figure: the vectorized image (x0 ... x15) multiplied by the sparse matrix C gives y0 ... y3]
37. Fully connected layers and convolutional layers
$Y = I * h$, equivalently $Y_V = C \cdot I_V$
I is a 4x4 input image, vectorized to 16x1 (Iv); Yv is the 4x1 output image (later reshaped to 2x2); h is a 3x3 kernel; C is the 4x16 weight matrix.
38. Fully connected layers and convolutional layers
• Fully connected layers can also be viewed as convolutions with kernels that cover the entire input region
Example: a fully connected layer with K=1024 neurons and input volume 32x32x512 can be expressed as a convolutional layer with K=1024 filters, F=32 (kernel size), P=0, S=1.
The filter size is exactly the size of the input volume: each 32x32x512 filter produces a 1x1x1 output, so with 1024 filters the output is 1x1x1024.
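This equivalence can be verified numerically; a sketch assuming PyTorch, where the FC weight matrix is reshaped into 1024 kernels of size 32x32x512:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 512, 32, 32)                # input volume 32x32x512

fc = nn.Linear(32 * 32 * 512, 1024)            # K=1024 fully connected neurons
conv = nn.Conv2d(512, 1024, kernel_size=32)    # K=1024 filters, F=32, P=0, S=1

# Copy the FC weights into the conv kernels: each 1x1 conv output is the
# same dot product over the whole input volume as one FC neuron computes.
conv.weight.data = fc.weight.data.view(1024, 512, 32, 32)
conv.bias.data = fc.bias.data

print(torch.allclose(fc(x.flatten(1)), conv(x).flatten(1), atol=1e-3))  # True
```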
39. Batch normalization layer
• As learning progresses, the distribution of the layer inputs changes due to parameter updates (internal covariate shift)
• This can result in most inputs being in the non-linear regime of the activation function, slowing down learning
• Batch normalization is a technique to reduce this effect
• Explicitly force the layer activations to have zero mean and unit variance w.r.t. running batch estimates:
$\hat{x}^{(k)} = \frac{x^{(k)} - E[x^{(k)}]}{\sqrt{Var[x^{(k)}]}}$
• Adds a learnable scale and bias term to allow the network to still use the nonlinearity:
$y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}$
Typical placement: FC/Conv → Batch norm → ReLU → FC/Conv → Batch norm → ReLU
Ioffe and Szegedy, 2015, "Batch normalization: accelerating deep network training by reducing internal covariate shift"
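The two equations translate directly into code; a minimal NumPy sketch of the training-time forward pass (my own illustration; the running estimates used at inference are omitted):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature k over the mini-batch, then rescale and shift.
    x: (batch, features); gamma, beta: (features,) learnable parameters."""
    mean = x.mean(axis=0)                      # E[x^(k)] over the mini-batch
    var = x.var(axis=0)                        # Var[x^(k)]
    x_hat = (x - mean) / np.sqrt(var + eps)    # zero mean, unit variance
    return gamma * x_hat + beta                # y^(k) = gamma^(k) x_hat + beta^(k)

x = np.random.randn(64, 100) * 3 + 5
y = batch_norm(x, gamma=np.ones(100), beta=np.zeros(100))
print(y.mean(), y.std())                       # ~0 and ~1
```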
40. Upsampling layers: recovering spatial shape
• Motivation: semantic segmentation. Make predictions for all pixels at once
Problem: convolutions at the original image resolution would be very expensive.
Slide credit: Stanford cs231_2017
41. Upsampling layers: recovering spatial shape
• Motivation: semantic segmentation. Make predictions for all pixels at once
Design a network as a sequence of convolutional layers, with downsampling and upsampling.
Other applications: super-resolution, flow estimation, generative modeling.
Slide credit: Stanford cs231_2017
42. Learnable upsampling
• Recall: 3x3 convolution, stride 1, pad 1
Slide credit: Stanford cs231_2017
43. Learnable upsampling
• Recall: 3x3 convolution, stride 2, pad 1
The filter moves 2 pixels in the input for every one pixel in the output: the stride gives the ratio between movement in the input and the output.
Slide credit: Stanford cs231_2017
44. Learnable upsampling: transposed convolution
• 3x3 transposed convolution, stride 2, pad 1
The filter moves 2 pixels in the output for every one pixel in the input: the stride gives the ratio between movement in the output and the input. Sum where the outputs overlap.
Various names: transposed convolution, backward strided convolution, fractionally strided convolution, upconvolution, "deconvolution".
Slide credit: Stanford cs231_2017
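Assuming PyTorch, a 3x3 transposed convolution with stride 2 that exactly doubles the spatial size looks like this (output_padding=1 resolves the output-size ambiguity):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 8, 8)                      # low-resolution feature map
up = nn.ConvTranspose2d(64, 32, kernel_size=3,
                        stride=2, padding=1, output_padding=1)
print(up(x).shape)                                # torch.Size([1, 32, 16, 16])
# Each input pixel "paints" a scaled copy of the learned 3x3 filter into the
# output, moving 2 output pixels per input pixel; overlaps are summed.
```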
48. AlexNet (2012)
• Similar framework to LeNet:
• 8 layers (5 convolutional, 3 fully connected)
• Max pooling, ReLU nonlinearities
• 650,000 units, 60 million parameters
• trained on two GPUs (half of the kernels on each GPU) for a week
• data augmentation
• dropout regularization
A. Krizhevsky, I. Sutskever, and G. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012
49. AlexNet (2012)
Full AlexNet architecture, 8 layers:
[227x227x3] INPUT
[55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0
[27x27x96] MAX POOL1: 3x3 filters at stride 2
[27x27x96] NORM1: normalization layer
[27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2
[13x13x256] MAX POOL2: 3x3 filters at stride 2
[13x13x256] NORM2: normalization layer
[13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1
[13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1
[13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1
[6x6x256] MAX POOL3: 3x3 filters at stride 2
[4096] FC6: 4096 neurons
[4096] FC7: 4096 neurons
[1000] FC8: 1000 neurons (class scores)
Details/retrospectives:
- first use of ReLU
- used Norm layers (no longer common)
- heavy data augmentation
- dropout 0.5
- batch size 128
- SGD with momentum 0.9
- learning rate 1e-2, reduced by 10 manually when validation accuracy plateaus
- L2 weight decay 5e-4
ILSVRC 2012 winner. Ensemble of 7 CNNs: 18.2% -> 15.4% error.
50. AlexNet (2012)
• Visualization of the 96 11x11 filters learned by the first layer
A. Krizhevsky, I. Sutskever, and G. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012
51. VGGNet-16 (2014)
Visual Geometry Group, Univ. of Oxford.
Sequence of deeper nets trained progressively. Large receptive fields replaced by stacks of 3x3 convolutions.
Only 3x3 CONV at stride 1, pad 1 and 2x2 MAX POOL at stride 2. 16-19 layers.
Shows that depth is a critical component for good performance.
TOTAL memory: 24M activations * 4 bytes ≈ 93 MB / image (forward only; roughly x2 for the backward pass); most memory is in the early CONV layers.
TOTAL params: 138M parameters; most parameters are in the late FC layers.
K. Simonyan and A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, ICLR 2015
52. GoogLeNet (2014)
Motivation:
• The most straightforward way of improving the performance of deep neural networks is by increasing their size, both depth and width
• Increasing the network size has two drawbacks:
• a larger number of parameters -> prone to overfitting
• dramatically increased use of computational resources
• Goal: increase the depth and width while keeping the computational budget constant
Compared to AlexNet: 12x fewer parameters (5M vs 60M), 2x more compute, 6.67% top-5 error (vs 16.4%), 22 layers.
C. Szegedy et al., Going deeper with convolutions, CVPR 2015
53. GoogLeNet (2014)
The Inception module
• Apply parallel operations on the input from the previous layer:
• multiple kernel sizes for convolution (1x1, 3x3, 5x5)
• a pooling operation
• Concatenate all filter outputs together depth-wise
• Use 1x1 convolutions for dimensionality reduction before the expensive convolutions
[figure: geometry of the Inception module]
Conv ops (multiply counts):
1x1 conv, 128: 28x28x128 x 1x1x256
1x1 conv, 64: 28x28x64 x 1x1x256
3x3 conv, 192: 28x28x192 x 3x3x64
1x1 conv, 64: 28x28x64 x 1x1x256
5x5 conv, 96: 28x28x96 x 5x5x64
1x1 conv, 64: 28x28x64 x 1x1x256
Total: 358M ops. Without the 1x1 reductions: 854M ops.
55. GoogLeNet (2014)
• Auxiliary classifiers
• features produced by the layers in the middle of the network should be very discriminative
• auxiliary classifiers connected to these intermediate layers were expected to strengthen discrimination in the lower stages of the network
• during training, their loss gets added to the total loss of the network with a discount weight (the losses of the auxiliary classifiers are weighted by 0.3)
• at inference time, the auxiliary classifiers are discarded
...and no fully connected layers needed!
[figure: GoogLeNet architecture with auxiliary classifiers; convolution, pooling and softmax blocks]
56. ResNet (2015)
Motivation
• Stacking more layers does not mean better performance
• with the network depth increasing, accuracy gets saturated and then degrades rapidly
• such degradation is not caused by overfitting
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, Deep Residual Learning for Image Recognition, CVPR 2016 (Best Paper)
57. ResNet (2015)
Residual block
• Hypothesis: the problem is an optimization problem; deeper models are harder to optimize
• The deeper model should be able to perform at least as well as the shallower model
• by construction: set the added layers to be identity mappings and copy the other layers from the learned shallower model
• Solution: use network layers to fit a residual mapping instead of directly trying to fit the desired underlying mapping
[figure: plain net vs. residual net]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, Deep Residual Learning for Image Recognition, CVPR 2016 (Best Paper)
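A residual block then computes y = F(x) + x; a minimal sketch assuming PyTorch, simplified from the paper's basic block (the stride and projection-shortcut variants are omitted):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = F(x) + x: two 3x3 convs fit the residual F(x) = H(x) - x."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        return self.relu(out + x)   # identity shortcut: the layers only have
                                    # to learn the residual, not the mapping

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)   # torch.Size([1, 64, 56, 56])
```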
58. ResNet (2015)
• Similar to GoogLeNet, use a bottleneck layer to improve efficiency
• Directly performing 3x3 convolutions with 256 feature maps at input and output: 256 x 256 x 3 x 3 ≈ 600K operations (per spatial position)
• Using 1x1 convolutions to reduce 256 to 64 feature maps, followed by 3x3 convolutions, followed by 1x1 convolutions to expand back to 256 maps:
256 x 64 x 1 x 1 ≈ 16K
64 x 64 x 3 x 3 ≈ 36K
64 x 256 x 1 x 1 ≈ 16K
Total: ≈ 70K
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, Deep Residual Learning for Image Recognition, CVPR 2016 (Best Paper)
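The bottleneck branch as a sketch, again assuming PyTorch (the identity shortcut that is summed with this branch is omitted for brevity); the three convolutions correspond to the ~16K, ~36K and ~16K multiply counts above:

```python
import torch.nn as nn

# 1x1 reduce (256->64, ~16K mults/position), 3x3 on the reduced maps
# (64->64, ~36K), 1x1 expand (64->256, ~16K): ~70K in total, vs ~600K
# for a direct 3x3 convolution on 256 input and output maps.
bottleneck = nn.Sequential(
    nn.Conv2d(256, 64, kernel_size=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 256, kernel_size=1),
)
```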
59. ResNet (2015)
ILSVRC 2015 winner (3.6% top-5 error).
MSRA swept the ILSVRC & COCO 2015 competitions:
- ImageNet Classification: "ultra deep", 152 layers
- ImageNet Detection: 16% better than 2nd
- ImageNet Localization: 27% better than 2nd
- COCO Detection: 11% better than 2nd
- COCO Segmentation: 12% better than 2nd
2-3 weeks of training on an 8-GPU machine; at runtime, faster than a VGGNet (even though it has 8x more layers)!
60. ILSVRC 2012-2015 summary

Team                                       Year  Place  Error (top-5)  External data
SuperVision - Toronto (AlexNet, 7 layers)  2012  -      16.4%          no
SuperVision                                2012  1st    15.3%          ImageNet 22k
Clarifai - NYU (7 layers)                  2013  -      11.7%          no
Clarifai                                   2013  1st    11.2%          ImageNet 22k
VGG - Oxford (16 layers)                   2014  2nd    7.32%          no
GoogLeNet (19 layers)                      2014  1st    6.67%          no
ResNet (152 layers)                        2015  1st    3.57%          -
Human expert*                              -     -      5.1%           -

http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/
61. Summary
• Convolutional neural networks are a specialized kind of neural network for processing data that has a known, grid-like topology
• CNNs leverage these ideas:
• local connectivity
• parameter sharing
• pooling / subsampling of hidden units
• Layers: convolutional, non-linear activation, pooling, upsampling, batch normalization
• Architectures for object recognition in images:
• LeNet: pioneer net for digit recognition
• AlexNet: smaller compute, still memory heavy, lower accuracy
• VGG: highest memory, most operations
• GoogLeNet: most efficient
• ResNet: moderate efficiency depending on model, better accuracy
• Inception-v4: hybrid of ResNet and Inception, highest accuracy