nips report
Artistic Style Learning

Bowen Sun
Department of Computing Science
Simon Fraser University
Burnaby, B.C., Canada V5A 1S6
Bsa58@sfu.ca

Zixuan Wang
Department of Computing Science
Simon Fraser University
zwa72@sfu.ca

ABSTRACT
In this project, we implemented a neural network approach to artistic style learning. Beyond the implementation, we also modified parts of the algorithm. The algorithm is based on the VGG convolutional neural network and is accelerated with the NVIDIA CUDA computing platform.
1 Introduction
Artistic style is an interesting topic in neuroscience, and how the human brain perceives the content and style of a painting is still unknown. However, in the recent paper “A Neural Algorithm of Artistic Style” [1], the authors offer a path toward understanding how the brain might do this, using the VGG convolutional neural network [4]. Their approach can be divided into two parts: 1) content reconstruction and 2) style representation. For content reconstruction, they define a loss function that depends on the output of the convolutional layers. For style representation, at some of the convolutional layers they define a Gram matrix, which represents the correlations between different filters at the corresponding layer, and a loss function on these Gram matrices [3]. By mixing the two loss functions and computing the derivative with respect to the image being generated, they use gradient descent to iteratively improve the image and produce the final result. In our project, we implemented the algorithm from this paper and modified parts of it. The rest of this report is organized as follows. Section 2 introduces the artistic style algorithm and the VGG network, together with the NVIDIA CUDA platform and our implementation details. Section 3 presents our experimental results. Section 4 gives our conclusion, and Section 5 lists our team members’ contributions.
2 Approach
This section describes our approach in detail. The results were generated with a 19-layer VGG network, a convolutional neural network that performs very well on object recognition tasks. We used the outputs of 13 convolutional layers and 4 pooling layers to define loss functions for content and style; these are the error functions minimized by gradient descent for content reconstruction and style representation. For image synthesis, we replaced the original rectified linear activation function with the softplus function, since its output is similar to that of the rectifier but its derivative is smoother. In addition, according to [1], using average pooling instead of max pooling usually gives better results when generating new pictures.
For the recombination of content and style, we used an error function defined as a linear combination of the content loss and the style loss. As shown in Section 3.3, running gradient descent on this combined loss with different ratios of the content and style coefficients produces different mixtures.
2.1 VGG 19-layer network
The network used in this project is the VGG 19-layer network, a convolutional neural network that rivals human performance on object recognition in the ImageNet challenge [2] and is described in detail in [4]. The original network contains 16 convolutional layers, each followed by a rectified linear (ReLU) activation function, together with 5 max pooling layers, 3 fully connected layers, and 1 softmax layer. The padding, the filter-bank weights, and the sizes and strides of the filters and pools are available online. The layers used in our project are shown in Figure 1.
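Figure 1: Layers of the VGG network used in this project, in order: Input image, Conv1_1, Conv1_2, Average pool 1, Conv2_1, Conv2_2, Average pool 2, Conv3_1, Conv3_2, Conv3_3, Conv3_4, Average pool 3, Conv4_1, Conv4_2, Conv4_3, Conv4_4, Average pool 4, Conv5_1.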
Since the picture is reconstructed from the output of a high-level convolutional layer by gradient descent, the smoothness of the derivative is important. The softplus function, $\log(1 + e^z)$, has an output similar to that of the rectification function, but its derivative is the smooth logistic sigmoid, whereas the derivative of ReLU jumps from 0 to 1 at zero. Figure 2 compares ReLU and softplus.
Figure 2: Comparison of the ReLU and softplus activation functions.
For the same reason, the max pooling layers are replaced with average pooling layers with the same pool size and stride, as in [1].
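
The activation swap described above can be illustrated numerically. The following is a minimal sketch in Python with NumPy, not the MatConvNet code used in the project; the function names and sample inputs are chosen here for illustration only. It evaluates ReLU and softplus, together with their derivatives, at a few points around zero.

```python
import numpy as np


def relu(z):
    """Rectified linear unit: max(0, z)."""
    return np.maximum(z, 0.0)


def relu_grad(z):
    """Derivative of ReLU: a step from 0 to 1 at z = 0."""
    return (z > 0).astype(float)


def softplus(z):
    """Softplus log(1 + exp(z)), computed in a numerically stable form."""
    return np.logaddexp(0.0, z)


def softplus_grad(z):
    """Derivative of softplus: the logistic sigmoid, smooth everywhere."""
    return 0.5 * (1.0 + np.tanh(0.5 * z))


# Illustrative sample points (an assumption, not values from the report).
z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print("relu         ", relu(z))
print("softplus     ", np.round(softplus(z), 3))
print("relu_grad    ", relu_grad(z))
print("softplus_grad", np.round(softplus_grad(z), 3))
```

Far from zero the two activations nearly coincide, while near zero the softplus derivative changes gradually instead of jumping, which is the property the reconstruction relies on.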
2.2 Content reconstruction
In general, the output of the filter bank of each convolutional layer can be seen as the filter response to a given input picture. These outputs are called feature maps; each feature map is a 2-D matrix encoding the response of one filter to the output of the preceding layer. The number of feature maps equals the number of filters in the layer. To avoid a crowd of subscripts, suppose we reshape each 2-D feature map into a 1-D vector. A layer $l$ with $N_l$ different filters then has $N_l$ feature vectors, each of size $M_l$, where $M_l$ is the product of the width and height of the feature map. The response of a convolutional layer $l$ can therefore be stored in a matrix $F^l \in \mathbb{R}^{N_l \times M_l}$, where $F^l_{ij}$ is the value at position $j$ of the feature vector generated by filter $i$ at layer $l$. Let $\vec{p}$ denote the original picture and $P^l$ its feature representation at layer $l$, and let $\vec{x}$ denote the picture in which we want to reconstruct the content of $\vec{p}$, with feature representation $F^l$. The content loss at layer $l$ is defined as follows:
$$L_{content}(\vec{p}, \vec{x}, l) = \frac{1}{2} \sum_{i,j} \left(F^l_{ij} - P^l_{ij}\right)^2$$
The corresponding derivative is:
$$\frac{\partial L_{content}}{\partial F^l_{ij}} = \left(F^l - P^l\right)_{ij}$$
Back-propagating this derivative from layer $l$ to the input layer gives the gradient of the loss with respect to the input picture $\vec{x}$, with which we can perform gradient descent to update $\vec{x}$ until it has the same content as $\vec{p}$.
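
The content loss and its derivative with respect to the feature matrix can be checked numerically. The sketch below is in Python with NumPy and is not the project's MATLAB code; the matrix sizes and the random stand-ins for $F^l$ and $P^l$ are assumptions made only for illustration. It compares the analytic derivative with a central finite difference for one entry.

```python
import numpy as np


def content_loss(F, P):
    """L_content = 1/2 * sum_ij (F_ij - P_ij)^2 for one layer."""
    return 0.5 * np.sum((F - P) ** 2)


def content_loss_grad(F, P):
    """Derivative of L_content with respect to F: (F - P)_ij."""
    return F - P


rng = np.random.default_rng(0)
N_l, M_l = 3, 5                      # hypothetical: 3 filters, 5 positions
P = rng.standard_normal((N_l, M_l))  # stand-in for the features of photo p
F = rng.standard_normal((N_l, M_l))  # stand-in for the features of image x

# Central finite-difference check of one entry of the gradient.
i, j, h = 1, 2, 1e-6
F_plus, F_minus = F.copy(), F.copy()
F_plus[i, j] += h
F_minus[i, j] -= h
numeric = (content_loss(F_plus, P) - content_loss(F_minus, P)) / (2 * h)
analytic = content_loss_grad(F, P)[i, j]
print(abs(numeric - analytic) < 1e-6)
```

In the full method this derivative is only the starting point: it still has to be back-propagated through the network to reach the pixels of $\vec{x}$.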
2.3 Style representation
To obtain style information, we used a feature space designed to capture texture information, introduced in [3]. In other words, we use texture transfer to achieve the style representation. To measure the correlations between different filter outputs in the same layer, we compute the Gram matrix $G^l$, where $G^l_{ij}$ is the inner product of the reshaped feature vectors $i$ and $j$ at layer $l$:
$$G^l_{ij} = \sum_k F^l_{ik} F^l_{jk}$$
Let $\vec{a}$ denote the picture whose style we want to extract and $A^l$ its Gram matrix at layer $l$, and let $E_l$ denote the style error at layer $l$, weighted by $w_l$. The style loss and the derivative of $E_l$ with respect to the activations of layer $l$ are defined as follows:
$$L_{style}(\vec{a}, \vec{x}) = \sum_{l=0}^{L} w_l E_l$$
$$E_l = \frac{1}{4 N_l^2 M_l^2} \sum_{i,j} \left(G^l_{ij} - A^l_{ij}\right)^2$$
$$\frac{\partial E_l}{\partial F^l_{ij}} = \frac{1}{N_l^2 M_l^2} \left((F^l)^T (G^l - A^l)\right)_{ji}$$
For the style representation, we used a linear combination of the errors at different layers to improve the result, as in [3].
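
The Gram matrix, the per-layer style error, and its derivative can be written compactly with matrix products. The following Python/NumPy sketch is an illustration under assumed sizes and random inputs, not the project's MatConvNet implementation. It verifies the derivative formula above against a central finite difference.

```python
import numpy as np


def gram(F):
    """Gram matrix G_ij = sum_k F_ik F_jk of an (N_l, M_l) feature matrix."""
    return F @ F.T


def style_error(F, A):
    """E_l = 1 / (4 N_l^2 M_l^2) * sum_ij (G_ij - A_ij)^2."""
    N_l, M_l = F.shape
    return np.sum((gram(F) - A) ** 2) / (4.0 * N_l**2 * M_l**2)


def style_error_grad(F, A):
    """dE_l/dF_ij = 1 / (N_l^2 M_l^2) * (F^T (G - A))_ji."""
    N_l, M_l = F.shape
    # F^T (G - A) has shape (M_l, N_l); the transpose picks out entry (j, i).
    return (F.T @ (gram(F) - A)).T / (N_l**2 * M_l**2)


rng = np.random.default_rng(0)
N_l, M_l = 3, 5                             # hypothetical layer size
A = gram(rng.standard_normal((N_l, M_l)))   # stand-in Gram matrix of the style image
F = rng.standard_normal((N_l, M_l))         # stand-in features of the generated image

# Central finite-difference check of one entry of the gradient.
i, j, h = 0, 3, 1e-5
F_plus, F_minus = F.copy(), F.copy()
F_plus[i, j] += h
F_minus[i, j] -= h
numeric = (style_error(F_plus, A) - style_error(F_minus, A)) / (2 * h)
analytic = style_error_grad(F, A)[i, j]
print(abs(numeric - analytic) < 1e-7)
```

The derivative formula relies on $G^l - A^l$ being symmetric, which holds because both are Gram matrices.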
2.4 Recombination of the content and style
To generate a new picture that combines the content of picture $\vec{p}$ with the style of painting $\vec{a}$, we used gradient descent to minimize the total loss function, defined as follows:
$$L_{total}(\vec{p}, \vec{a}, \vec{x}) = \alpha L_{content}(\vec{p}, \vec{x}) + \beta L_{style}(\vec{a}, \vec{x})$$
The ratio of the coefficients $\alpha$ and $\beta$ controls the mixture: the larger the ratio $\beta / \alpha$, the more the generated picture leans toward the style.
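
The role of the ratio $\beta / \alpha$ can be explored on a toy problem. The sketch below, in Python with NumPy, makes a strong simplifying assumption: the feature matrix of the image is the image itself, so no network or back-propagation is involved. The step-halving rule, the function names, and the random inputs are also choices made here for illustration and are not taken from the project.

```python
import numpy as np


def gram(F):
    """Gram matrix of an (N, M) feature matrix."""
    return F @ F.T


def total_loss_and_grad(X, P, A, alpha, beta):
    """Return alpha * L_content + beta * L_style and its gradient in X.

    Toy assumption: the features of the image are the image itself.
    """
    N, M = X.shape
    diff_c = X - P
    diff_s = gram(X) - A
    loss = alpha * 0.5 * np.sum(diff_c**2) + beta * np.sum(diff_s**2) / (4.0 * N**2 * M**2)
    grad = alpha * diff_c + beta * (diff_s @ X) / (N**2 * M**2)
    return loss, grad


def synthesize(P, A, alpha, beta, steps=200):
    """Gradient descent with step halving, so the loss never increases."""
    X = P.copy()
    loss, grad = total_loss_and_grad(X, P, A, alpha, beta)
    step = 1.0
    for _ in range(steps):
        candidate = X - step * grad
        new_loss, new_grad = total_loss_and_grad(candidate, P, A, alpha, beta)
        if new_loss < loss:
            X, loss, grad = candidate, new_loss, new_grad
        else:
            step *= 0.5  # step too large: shrink it and try again
    return X, loss


rng = np.random.default_rng(0)
P = rng.standard_normal((2, 4))        # stand-in for the photo's features
A = gram(rng.standard_normal((2, 4)))  # stand-in for the painting's Gram matrix

for ratio in (1e2, 1e3, 1e4):          # beta / alpha, with alpha fixed at 1
    X, loss = synthesize(P, A, alpha=1.0, beta=ratio)
    start_loss, _ = total_loss_and_grad(P, P, A, 1.0, ratio)
    content = 0.5 * np.sum((X - P) ** 2)
    style = np.sum((gram(X) - A) ** 2) / (4.0 * X.shape[0] ** 2 * X.shape[1] ** 2)
    print(f"beta/alpha={ratio:g}  loss {start_loss:.4f} -> {loss:.4f}  "
          f"content={content:.4f}  style={style:.6f}")
```

Starting from the photo itself means the content term is zero at the first step, so any decrease in the total loss comes from trading some content fidelity for a closer style match.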
2.5 Development Environment
In this project, we mainly used MatConvNet, a MATLAB toolbox that implements convolutional neural networks for computer vision. The most time-consuming operation in this project is the computation of the derivatives during gradient descent, so we installed CUDA, a parallel computing platform created by NVIDIA, which dramatically improves computing performance by using the power of the GPU. The versions of the software and tools we used are as follows:
MATLAB 2014a
CUDA 6.5
MatConvNet 1.0-beta16
Visual Studio 2013
3 Experiment
This section describes our experiments and their results. Compared with the results generated as in [1], the mixture effect is slightly better when the softplus function is used in place of the rectification function. Because the derivative of the rectified function is a step function, the picture generated in the same way as [1] contains many white holes. With the smoother softplus function, the generated picture contains only small wave-like artifacts.
3.1 Content Reconstruction
When convolutional neural networks are trained on object recognition, they develop a representation of the image that makes object information increasingly explicit along the processing hierarchy [3]. The input image is therefore transformed into filter responses that increasingly capture the actual content of the image rather than its detailed pixel values.
For the content reconstruction experiment, we took the photo shown in Figure 3 and tried to reconstruct its content using the loss function on the error at (a) Conv1_1, (b) Conv2_1, (c) Conv3_1, (d) Conv4_1, and (e) Conv5_1. Our results are as follows:
Figure 3: Photo reconstruction. Panels: the original photo and reconstructions from (a) Conv1_1, (b) Conv2_1, (c) Conv3_1, (d) Conv4_1, (e) Conv5_1.
As the results show, reconstructions from higher-level layers mainly capture the arrangement of objects and the contours of items in the picture rather than exact pixel values, in contrast to reconstructions from lower-level layers.
3.2 Style Representation
For style representation, we used the feature space that computes the correlations between different filter responses. It produces a texturized version of the painting that captures the use and local structure of color rather than the content.
For the style representation experiment, we used Van Gogh's famous artwork The Starry Night and ran gradient descent on the loss function defined on (a) Conv1_1; (b) Conv1_1 and Conv2_1; (c) Conv1_1, Conv2_1, and Conv3_1; (d) Conv1_1, Conv2_1, Conv3_1, and Conv4_1; (e) Conv1_1, Conv2_1, Conv3_1, Conv4_1, and Conv5_1. Each layer's error is weighted by 1 divided by the number of layers included in the total loss function. Our results are as follows:
Figure 4: Style representation. Panels: the style image and results using the layers up to (a) Conv1_1, (b) Conv2_1, (c) Conv3_1, (d) Conv4_1, (e) Conv5_1.
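
The equal weighting used in this experiment can be written down directly. The short Python sketch below builds the weights $w_l$ for the five layer subsets (a) to (e); the layer names come from the text, while the function and variable names are hypothetical.

```python
STYLE_LAYERS = ["Conv1_1", "Conv2_1", "Conv3_1", "Conv4_1", "Conv5_1"]


def style_weights(num_layers):
    """Equal weights w_l = 1 / num_layers for the first num_layers style layers."""
    active = STYLE_LAYERS[:num_layers]
    return {name: 1.0 / len(active) for name in active}


# Configurations (a) through (e) use the first 1, 2, ..., 5 layers.
for label, k in zip("abcde", range(1, 6)):
    print(label, style_weights(k))
```

With this choice the weights always sum to 1, so adding layers changes which features contribute without changing the overall scale of the style loss weights.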
3.3 Recombination of the content and style
The content and style of an image cannot be completely separated, and usually no picture perfectly matches both the content of one source and the style of another. To recombine content and style from different pictures, we used gradient descent to minimize the joint loss function, which is built from separate loss terms for content and style. We can therefore shift the emphasis between content and style by changing the ratio $\beta / \alpha$. In the experiment, we rendered a photo taken at the SFU Burnaby campus in the style of (a) Der Schrei, (b) The Shipwreck of the Minotaur, and (c) High mountains, with ratios of 100, 1000, and 10000.
Figure 5: Experiment results with different styles and different ratios. Column labels: Photo, Style, 10, 1000, 10000.
4 Conclusion
In this project, we implemented an artistic style learning algorithm and improved its results by replacing the ReLU activation function in the original VGG convolutional neural network with the softplus activation function.
5 Contribution
Set up the platform: Zixuan Wang.
Derived the forward and backward propagation formulas: Bowen Sun.
Wrote and implemented the prototype derivative function: Bowen Sun.
Improved the derivative function: Zixuan Wang.
References
[1] Gatys, L. A., Ecker, A. S., and Bethge, M. "A neural algorithm of artistic style." arXiv preprint arXiv:1508.06576 (2015).
[2] Russakovsky, O., Deng, J., Su, H., et al. "ImageNet large scale visual recognition challenge." International Journal of Computer Vision, 2014: 1-42.
[3] Gatys, L. A., Ecker, A. S., and Bethge, M. "Texture synthesis and the controlled generation of natural stimuli using convolutional neural networks." arXiv:1505.07376 [cs, q-bio] (2015).
[4] Simonyan, K., and Zisserman, A. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).