5. OVERVIEW
1. Intro to style transfer
2. Convolutional Neural Networks
3. Gatys - A Neural Algorithm of Artistic Style
4. Improvements
7. Image courtesy: Matthieu Cord, Deep CNN and Weak Supervision Learning for Visual Recognition, https://blog.heuritech.com/2016/02/29/a-brief-report-of-the-heuritech-deep-learning-meetup-5/
8. HOW DOES A CNN WORK?
9. Convolution Layer
[Figure: a 32x32x3 image and a 5x5x3 filter.]
Convolve the filter with the image, i.e. "slide over the image spatially, computing dot products".
Slide courtesy: Johnson, CS231n Lecture 7, http://web.stanford.edu/class/cs20si/lectures/slides_06.pdf
10. Convolution Layer
[Figure: a 32x32x3 image and a 5x5x3 filter spanning all 3 channels.]
Convolve the filter with the image, i.e. "slide over the image spatially, computing dot products". Filters always extend the full depth of the input volume.
Slide courtesy: Johnson, CS231n Lecture 7, http://web.stanford.edu/class/cs20si/lectures/slides_06.pdf
11. Convolution Layer
[Figure: the 5x5x3 filter at one position on the 32x32x3 image.]
Each position yields 1 number: the result of taking a dot product between the filter and a small 5x5x3 chunk of the image (i.e. a 5*5*3 = 75-dimensional dot product + bias).
Slide courtesy: Johnson, CS231n Lecture 7, http://web.stanford.edu/class/cs20si/lectures/slides_06.pdf
13. Convolution Layer
[Figure: convolving (sliding) the 5x5x3 filter over all spatial locations of the 32x32x3 image yields a 28x28x1 activation map; a second (green) filter yields a second map.]
Slide courtesy: Johnson, CS231n Lecture 7, http://web.stanford.edu/class/cs20si/lectures/slides_06.pdf
14. Convolution Layer
[Figure: six 28x28 activation maps stacked into a 28x28x6 output volume.]
For example, if we had 6 5x5 filters, we'd get 6 separate activation maps. We stack these up to get a "new image" of size 28x28x6! (See the sketch below.)
Slide courtesy: Johnson, CS231n Lecture 7, http://web.stanford.edu/class/cs20si/lectures/slides_06.pdf
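To make the sliding-window picture concrete, here is a minimal NumPy sketch of the layer just described. It is illustrative only: real frameworks vectorize this and add stride and padding options.

```python
# Slide a 5x5x3 filter over a 32x32x3 image: each position gives one
# number (a 75-dimensional dot product + bias); six filters stack into
# a 28x28x6 output volume.
import numpy as np

def conv_layer(image, filters, biases):
    """image: (H, W, C); filters: (K, k, k, C); biases: (K,)"""
    H, W, C = image.shape
    K, k, _, _ = filters.shape
    out = np.zeros((H - k + 1, W - k + 1, K))
    for f in range(K):                        # one activation map per filter
        for i in range(H - k + 1):
            for j in range(W - k + 1):
                chunk = image[i:i + k, j:j + k, :]                      # 5x5x3 chunk
                out[i, j, f] = np.sum(chunk * filters[f]) + biases[f]   # dot product + bias
    return out

image = np.random.rand(32, 32, 3)
filters = np.random.randn(6, 5, 5, 3)
print(conv_layer(image, filters, np.zeros(6)).shape)  # (28, 28, 6)
```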
20. RECONSTRUCTING CONTENT
➤ Given an image, how can we find a new one with the same content?
➤ Find a content distance measure between images
➤ Start from a random noise image
➤ Minimize the distance through iteration
Image courtesy: D. Ulyanov, https://bayesgroup.github.io/bmml_sem/2016/style.pdf
21. CONTENT DISTANCE MEASURE
1. Load a pre-trained CNN (e.g. VGG19)
2. Pass image #1 through the net
3. Save the activation maps from the conv layers
4. Pass image #2 through the net
5. Save the activation maps from the conv layers
6. Calculate the Euclidean distance between the activation maps from images #1 and #2 and sum over all layers (see the sketch below):

$L_{\mathrm{content}}(x, \hat{x}) = \frac{1}{2} \sum_l w_l \left( A^l(x) - A^l(\hat{x}) \right)^2$

Image courtesy: Gatys et al., Texture Synthesis Using Convolutional Neural Networks, https://arxiv.org/pdf/1505.07376.pdf
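A minimal PyTorch sketch of the six steps above. The layer indices, the uniform weights w_l, and the pretrained-model loading call are illustrative assumptions (newer torchvision versions select weights differently):

```python
import torch
from torchvision.models import vgg19

cnn = vgg19(pretrained=True).features.eval()   # step 1: pre-trained CNN
for p in cnn.parameters():
    p.requires_grad_(False)                    # the net itself is never trained

CONTENT_LAYERS = {0, 5, 10, 19}  # assumed conv-layer indices in vgg19.features

def activation_maps(x):
    """Steps 2-5: pass an image through the net, saving conv activations."""
    acts = []
    for i, layer in enumerate(cnn):
        x = layer(x)
        if i in CONTENT_LAYERS:
            acts.append(x)
    return acts

def content_loss(x, x_hat, weights=None):
    """Step 6: L_content = 1/2 * sum_l w_l * ||A^l(x) - A^l(x_hat)||^2."""
    A, A_hat = activation_maps(x), activation_maps(x_hat)
    weights = weights or [1.0] * len(A)
    return 0.5 * sum(w * ((a - ah) ** 2).sum()
                     for w, a, ah in zip(weights, A, A_hat))
```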
22. RECONSTRUCTING CONTENT
➤ Start from a random image
➤ Update it using gradient descent (see the sketch below)

$L_{\mathrm{content}}(x, \hat{x}) = \frac{1}{2} \sum_l w_l \left( A^l(x) - A^l(\hat{x}) \right)^2$

$\hat{x}_{t+1} = \hat{x}_t - \varepsilon \, \frac{\partial L_{\mathrm{content}}}{\partial \hat{x}}$
Image courtesy: D. Ulyanov, https://bayesgroup.github.io/bmml_sem/2016/style.pdf
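Continuing the previous sketch, the update rule can be implemented by optimizing the pixels of the image directly. Adam is used here in place of plain gradient descent with step size ε, a common practical substitution:

```python
# Optimize the image, not the network weights.
x = torch.rand(1, 3, 224, 224)                           # stand-in for the content image
x_hat = torch.rand(1, 3, 224, 224, requires_grad=True)   # random starting image
optimizer = torch.optim.Adam([x_hat], lr=1e-2)

for t in range(500):
    optimizer.zero_grad()
    loss = content_loss(x, x_hat)   # content_loss from the previous sketch
    loss.backward()                 # gives dL_content / d x_hat
    optimizer.step()                # x_hat_{t+1} = x_hat_t - eps * grad
```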
24. FEATURE INVERSION
Reconstructions from intermediate layers: higher layers are less sensitive to changes in color, texture, and shape.
Mahendran and Vedaldi, "Understanding Deep Image Representations by Inverting Them", CVPR 2015
Slide courtesy: Johnson, http://web.stanford.edu/class/cs20si/lectures/slides_06.pdf
25. FEATURE INVERSION
Reconstructions from the representation after the last pooling layer (immediately before the first fully connected layer).
Mahendran and Vedaldi, "Understanding Deep Image Representations by Inverting Them", CVPR 2015
Slide courtesy: Johnson, http://web.stanford.edu/class/cs20si/lectures/slides_06.pdf
34. MATHEMATICAL SIDE NOTE
The style loss is a special case of the squared Maximum Mean Discrepancy (MMD) with the second-order polynomial kernel $k(x, \hat{x}) = (x^T \hat{x})^2$:

$L_{\mathrm{style}}(x, \hat{x}) = \frac{1}{2} \sum_l w_l \left( G^l(x) - G^l(\hat{x}) \right)^2$

$L^l_{\mathrm{style}} = \frac{1}{Z^l_k} \, \mathrm{MMD}^2\!\left( A^l(x), A^l(\hat{x}) \right) = \frac{1}{Z^l_k} \sum_{i=1}^{M_l} \sum_{j=1}^{M_l} \left( k(A^l_{:,i}, A^l_{:,j}) + k(\hat{A}^l_{:,i}, \hat{A}^l_{:,j}) - 2\, k(A^l_{:,i}, \hat{A}^l_{:,j}) \right)$

where $\mathrm{MMD}^2 = \left\| \mathbb{E}[\phi(A^l(x))] - \mathbb{E}[\phi(A^l(\hat{x}))] \right\|^2$ for the feature map $\phi$ of the kernel $k$ (a numerical check follows below).

Further reading: Li et al., Demystifying Neural Style Transfer, https://arxiv.org/abs/1701.01036
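A small numerical check of this equivalence (a sketch; shapes are arbitrary): at a single layer, the unnormalized Gram loss equals the biased squared MMD with the second-order polynomial kernel, up to the 1/2, w_l, and Z normalizers:

```python
import torch

def gram(A):
    """A: (C, M) activations; C channels at M spatial positions."""
    return A @ A.t()

def gram_loss(A, A_hat):
    """Unnormalized squared Frobenius distance between Gram matrices."""
    return ((gram(A) - gram(A_hat)) ** 2).sum()

def mmd2_poly(A, A_hat):
    """Biased MMD^2 with k(x, y) = (x^T y)^2, written as the double sum
    over column pairs from the slide."""
    kxx = (A.t() @ A) ** 2          # k(A_:,i, A_:,j) for all i, j
    kyy = (A_hat.t() @ A_hat) ** 2
    kxy = (A.t() @ A_hat) ** 2
    return kxx.sum() + kyy.sum() - 2 * kxy.sum()

A = torch.randn(16, 100, dtype=torch.float64)
A_hat = torch.randn(16, 100, dtype=torch.float64)
assert torch.allclose(gram_loss(A, A_hat), mmd2_poly(A, A_hat))
```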
40. PERCEPTUAL LOSSES FOR REAL-TIME STYLE TRANSFER AND SUPER-RESOLUTION
➤ Train a network to do the optimization (see the sketch below)
➤ + Fast
➤ - One network per style
➤ - Quantitatively slightly worse than the optimization-based results
Image courtesy: Johnson et al., Perceptual Losses for Real-Time Style Transfer and Super-Resolution, https://arxiv.org/abs/1603.08155
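A rough sketch of the idea, not Johnson et al.'s actual architecture: train a feed-forward transformation network against the same content and style losses, so stylization becomes a single forward pass. It reuses content_loss, activation_maps, and gram_loss from the earlier sketches; the toy net, random batches, and the style weight are placeholders.

```python
import torch
import torch.nn as nn

# Toy stand-in for the image transformation network (the real one uses
# downsampling convs, residual blocks, and upsampling convs).
transform_net = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1),
)
opt = torch.optim.Adam(transform_net.parameters(), lr=1e-3)
style_image = torch.rand(1, 3, 256, 256)   # one fixed style per trained network

for step in range(1000):                   # stand-in for a content-image dataloader
    content = torch.rand(1, 3, 256, 256)
    stylized = transform_net(content)
    # perceptual loss: content term vs. the input, style term vs. the style image
    style_term = sum(
        gram_loss(a.squeeze(0).flatten(1), b.squeeze(0).flatten(1))
        for a, b in zip(activation_maps(stylized), activation_maps(style_image)))
    loss = content_loss(content, stylized) + 1e-6 * style_term  # assumed weighting
    opt.zero_grad(); loss.backward(); opt.step()
# At test time, stylization is a single forward pass: transform_net(new_image)
```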
41. ARBITRARY STYLE TRANSFER IN REAL-TIME WITH ADAPTIVE INSTANCE NORMALIZATION
Image courtesy: Huang et al., Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization, https://arxiv.org/abs/1703.06868
$\mathrm{AdaIN}(x_c, x_s) = \sigma(x_s) \left( \frac{x_c - \mu(x_c)}{\sigma(x_c)} \right) + \mu(x_s)$
➤ Align the mean and variance of the activation maps (see the sketch below)
➤ + Fast (15 fps, 512x512px)
➤ + One net, arbitrary style
➤ - Quantitatively slightly worse than the optimization-based results
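A direct translation of the AdaIN formula above into PyTorch (a sketch: per-channel mean and std over the spatial dimensions; eps is an assumed numerical stabilizer):

```python
import torch

def adain(xc, xs, eps=1e-5):
    """xc, xs: (N, C, H, W) content / style activation maps."""
    mu_c = xc.mean(dim=(2, 3), keepdim=True)
    mu_s = xs.mean(dim=(2, 3), keepdim=True)
    sd_c = xc.std(dim=(2, 3), keepdim=True) + eps
    sd_s = xs.std(dim=(2, 3), keepdim=True)
    # normalize content statistics, then impose the style statistics
    return sd_s * (xc - mu_c) / sd_c + mu_s
```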