Artistic Style Learning

Bowen Sun
Department of Computing Science
Simon Fraser University
Burnaby, B.C., Canada V5A 1S6
bsa58@sfu.ca

Zixuan Wang
Department of Computing Science
Simon Fraser University
zwa72@sfu.ca
ABSTRACT

In our project, we implemented a neural-network approach to artistic style learning. Beyond the implementation itself, we also improved the algorithm by replacing some of its components. The algorithm is based on the VGG convolutional neural network and is accelerated with the NVIDIA CUDA computing platform.
1 Introduction

Artistic style is an interesting topic in neuroscience, and how the human brain perceives a painting's content and style is still unknown. However, in the recent paper "A Neural Algorithm of Artistic Style" [1], the researchers gave us a pathway to understanding how the brain might do this, using the VGG convolutional neural network [4]. Their intuition can be divided into two parts: 1) content reconstruction and 2) style representation. For content reconstruction, they defined a loss function that depends on the outputs of the convolutional layers. For style representation, at some of the convolutional layers they also defined a Gram matrix, which represents the correlations between the different filters at the corresponding layer, and defined a loss function over these Gram matrices [3]. By mixing these two loss functions and computing their derivative with respect to the image being generated, they can use gradient descent to iteratively improve the image and produce the final result. In our project, we mainly implemented the algorithm from this paper and modified parts of it. This report is divided into five parts: the first part introduces the artistic style algorithm, the VGG neural network, and the NVIDIA CUDA platform, as well as our implementation details; the second part describes our approach; the third part presents our experimental results; the fourth part gives our conclusion; and the final part lists our team members' contributions.
2 Approach

In this section, the detailed approach of this project is introduced. The results were generated with the help of a 19-layer VGG network, a convolutional neural network with excellent performance on object recognition tasks. We used the outputs of its first 13 convolutional layers and 4 pooling layers to define loss functions for content and for style, which serve as the target error functions in gradient descent for content reconstruction and style representation. For image synthesis, we replaced the original rectified non-linear activation function with the softplus function, since it produces similar output to the rectified non-linearity yet has a smoother derivative. Also, according to [1], using average pooling instead of max pooling usually gives better results when generating new pictures.

For the recombination of content and style, we used an error function defined as a linear combination of the content loss and the style loss. As shown in Section 3, by running gradient descent we generated different mixture effects with different ratios of the coefficients of the content and style loss functions.
2.1 VGG 19-layer network

The network we used for this project is the VGG 19-layer network, a convolutional neural network that rivals human performance on object recognition in the ImageNet challenge [2] and is described in detail in [4]. The original network contains 16 convolutional layers, each followed by a rectified non-linear activation function, plus 5 max pooling layers, 3 fully connected layers, and 1 softmax layer. The padding, the filter-bank weights, and the sizes and strides of the filters and pools are publicly available online. The layers our project uses are shown in Figure 1:
Figure 1: The layers used in our project: Input image → Conv1_1 → Conv1_2 → Average pool 1 → Conv2_1 → Conv2_2 → Average pool 2 → Conv3_1 → Conv3_2 → Conv3_3 → Conv3_4 → Average pool 3 → Conv4_1 → Conv4_2 → Conv4_3 → Conv4_4 → Average pool 4 → Conv5_1.
Since the method used to reconstruct the picture from the output of a high-level convolutional layer is gradient descent, the smoothness of the derivative is quite important. As Figure 2 shows, the softplus function has a similar output to the rectification function, yet a smoother derivative. The difference between ReLU and softplus is shown below:
Figure 2: Comparison of the ReLU and softplus activation functions and their derivatives.
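To make the comparison concrete, here is a small MATLAB sketch of our own (not part of the network code) that evaluates both activations and their derivatives; all names are illustrative:

```matlab
% Illustrative MATLAB sketch: ReLU vs. softplus and their derivatives.
x = linspace(-5, 5, 1000);

relu      = max(x, 0);          % rectified linear unit
relu_grad = double(x > 0);      % step-shaped derivative, discontinuous at 0

softplus      = log(1 + exp(x));      % smooth approximation of ReLU
softplus_grad = 1 ./ (1 + exp(-x));   % logistic sigmoid: smooth everywhere

% Plot the activations and their derivatives side by side.
subplot(1, 2, 1); plot(x, relu, x, softplus); legend('ReLU', 'softplus');
subplot(1, 2, 2); plot(x, relu_grad, x, softplus_grad); legend('ReLU''', 'softplus''');
```

The step-shaped ReLU derivative is what makes the gradient flow abrupt; the sigmoid-shaped softplus derivative changes smoothly, which is the property the synthesis relies on.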
For the same reason, the max pooling layers are replaced with average pooling layers of the same pool size and stride, following [1]; a small sketch of this substitution is given below.
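As an illustration (our own toy example, not the MatConvNet layer code), 2x2 average pooling with stride 2 on a single-channel feature map can be written as:

```matlab
% Illustrative sketch: 2x2 average pooling vs. max pooling, stride 2.
A = rand(8, 8);                         % a toy single-channel feature map

% Average pooling: mean filter followed by subsampling at the pool stride.
avg = conv2(A, ones(2) / 4, 'valid');   % 2x2 mean over every window
avg = avg(1:2:end, 1:2:end);            % keep non-overlapping windows (stride 2)

% Max pooling (what the original VGG uses), for comparison:
mx = max(max(reshape(A, 2, 4, 2, 4), [], 1), [], 3);
mx = squeeze(mx);                       % 4x4 matrix of block maxima
```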
2.2 Content reconstruction

Generally, the outputs of the filter banks of each convolutional layer can be seen as the filter responses to a given input picture. These outputs are called feature maps; each feature map is a 2-D matrix encoding the response of a given filter to the output of the previous layer, and the number of feature maps equals the number of filters in the layer. To avoid a confusing crowd of subscripts, suppose we reshape each 2-D feature map into a 1-D vector. A layer $l$ with $N_l$ different filters then has $N_l$ feature maps, or vectors, of size $M_l$, where $M_l$ is the product of the width and height of the feature map and hence the length of the reshaped feature vector. The responses of a convolutional layer $l$ can therefore be stored in a matrix $F^l \in \mathbb{R}^{N_l \times M_l}$, where $F^l_{ij}$ is the value at position $j$ of the feature vector generated by filter $i$ at layer $l$. Let $\vec{p}$ and $P^l$ denote the content picture and its activation at layer $l$, and let $\vec{x}$ denote the picture on which we want to reconstruct the content of $\vec{p}$. The content loss at layer $l$ is defined as follows:

$$L_{\text{content}}(\vec{p}, \vec{x}, l) = \frac{1}{2} \sum_{i,j} \left( F^l_{ij} - P^l_{ij} \right)^2$$

And the corresponding derivative is the following:

$$\frac{\partial L_{\text{content}}}{\partial F^l_{ij}} = \left( F^l - P^l \right)_{ij}$$

Then, by backpropagating from layer $l$ to the input layer, we obtain the gradient of the loss with respect to the input picture $\vec{x}$, with which we can perform gradient descent to train the picture $\vec{x}$ to have the same content as $\vec{p}$.
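The following is a minimal MATLAB sketch of these two formulas, assuming `F` and `P` hold the reshaped $N_l \times M_l$ responses for the generated and content pictures at layer $l$ (all names and sizes here are illustrative, not our production code):

```matlab
% Illustrative sketch of the content loss and its gradient at one layer.
% F, P: N_l x M_l response matrices for the generated picture x and the
% content picture p, obtained by reshaping each feature map into a row.
Nl = 64; Ml = 1024;                        % toy sizes for demonstration
F = randn(Nl, Ml); P = randn(Nl, Ml);

L_content = 0.5 * sum((F(:) - P(:)).^2);   % (1/2) * sum_ij (F_ij - P_ij)^2
dL_dF     = F - P;                         % elementwise gradient w.r.t. F
% Backpropagating dL_dF from layer l to the input (e.g. through the
% network's backward pass) yields the gradient with respect to x.
```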
2.3 Style representation

To obtain style information, we used a feature space designed to capture texture information, which is introduced in [3]. In effect, what we have done here is use texture transfer to achieve the style representation. To capture the correlations between the outputs of different filters in the same layer, we compute the Gram matrix $G^l$, where $G^l_{ij}$ is the inner product of the reshaped feature vectors of filters $i$ and $j$:

$$G^l_{ij} = \sum_k F^l_{ik} F^l_{jk}$$

Let $\vec{a}$ denote the picture whose style we want to extract, $A^l$ its Gram matrix at layer $l$, and $E_l$ the style error at layer $l$. The style loss and the derivative of $E_l$ with respect to the activations of layer $l$ are then defined as follows:

$$L_{\text{style}}(\vec{a}, \vec{x}) = \sum_{l=0}^{L} w_l E_l$$

$$E_l = \frac{1}{4 N_l^2 M_l^2} \sum_{i,j} \left( G^l_{ij} - A^l_{ij} \right)^2$$

$$\frac{\partial E_l}{\partial F^l_{ij}} = \frac{1}{N_l^2 M_l^2} \left( \left( F^l \right)^{\mathsf T} \left( G^l - A^l \right) \right)_{ji}$$

For the style representation, we used a linear combination of the errors at different layers to improve the result, as in [3].
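A matching MATLAB sketch for the style terms at a single layer, under the same illustrative `F` convention as above (the stand-in matrix `A` plays the role of the style picture's precomputed Gram matrix):

```matlab
% Illustrative sketch of the style terms at one layer.
% F: N_l x M_l responses of the generated picture; A: Gram matrix of the
% style picture a at the same layer, computed in the same way as G.
Nl = 64; Ml = 1024;
F = randn(Nl, Ml);
A = randn(Nl); A = A * A';               % stand-in for the style Gram matrix

G = F * F';                              % G_ij = sum_k F_ik * F_jk
El     = sum(sum((G - A).^2)) / (4 * Nl^2 * Ml^2);
dEl_dF = (F' * (G - A))' / (Nl^2 * Ml^2);   % (j,i) indexing transposed back

% The total style loss is the weighted sum over the chosen layers:
% L_style = sum_l w_l * E_l, with w_l = 1 / (number of layers used).
```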
2.4 Recombination of the content and style

To generate a new picture that combines the content of picture $\vec{p}$ with the style of painting $\vec{a}$, we simply used gradient descent to minimize the total loss function defined as follows:

$$L_{\text{total}}(\vec{p}, \vec{a}, \vec{x}) = \alpha\, L_{\text{content}}(\vec{p}, \vec{x}) + \beta\, L_{\text{style}}(\vec{a}, \vec{x})$$

The ratio of the coefficients $\alpha$ and $\beta$ controls the mixture effect: as one would expect, the generated picture leans more toward the style as the ratio $\beta / \alpha$ grows.
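At a high level, the synthesis loop looks like the following sketch; `grad_content` and `grad_style` are hypothetical placeholders for the gradients backpropagated through the VGG network, and all values shown are illustrative:

```matlab
% Illustrative sketch of the synthesis loop (placeholder names; the real
% gradients come from backpropagation through the VGG network).
alpha = 1; beta = 1000;                 % beta/alpha controls the mixture
x  = randn(224, 224, 3);                % start from a white-noise picture
lr = 1;                                 % gradient descent step size

for iter = 1:500
    % grad_content, grad_style: hypothetical functions returning the
    % gradients of L_content and L_style with respect to x, obtained by
    % backpropagating dL/dF from the chosen layers to the input.
    g = alpha * grad_content(x) + beta * grad_style(x);
    x = x - lr * g;                     % plain gradient descent update
end
```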
2.5 Development environment

In this project, we mainly used a MATLAB toolbox called MatConvNet, which implements convolutional neural networks for computer vision. The most time-consuming operation in this project is the computation of the derivatives during gradient descent, so we installed CUDA, a parallel computing platform invented by NVIDIA, which dramatically improved computing performance by harnessing the power of the GPU. The software and tool versions we used in this project are the following:

MATLAB 2014a
CUDA 6.5
MatConvNet 1.0-beta16
Visual Studio 2013
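As a hedged sketch of how this stack is typically set up (the exact compile options depend on the local CUDA installation; `vl_compilenn`, `vl_simplenn_move`, and `vl_simplenn` are MatConvNet's documented entry points):

```matlab
% One-time compilation of MatConvNet with CUDA support (run from the
% MatConvNet root directory; options depend on the local installation).
run matlab/vl_setupnn;
vl_compilenn('enableGpu', true);

% Move the network and data to the GPU before evaluation.
net = vl_simplenn_move(net, 'gpu');     % net: a loaded VGG-19 model
im  = gpuArray(single(im));             % im: preprocessed input picture
res = vl_simplenn(net, im);             % forward pass runs on the GPU
```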
3 Experiment

In this section, the detailed experimental setup and results are presented. Compared to the results generated by [1], the mixture effect is slightly better when the softplus function is used in place of the rectification function. Due to the step-shaped derivative of the rectified function, pictures generated exactly as in [1] contain many white holes; since we used the smoother softplus function, our generated pictures contain only small ripples.
3.1 Content reconstruction

When convolutional neural networks are trained on object recognition, they develop a representation of the image that makes object information increasingly explicit along the processing hierarchy [3]. The input image is therefore transformed into filter responses that increasingly capture the actual content of the image rather than its detailed pixel values.

For the content reconstruction experiment, we took a photo and reconstructed its content using the loss function on the error of (a) Conv1_1, (b) Conv2_1, (c) Conv3_1, (d) Conv4_1, and (e) Conv5_1. Our results are shown in Figure 3.
Figure 3: Photo reconstruction using (a) Conv1_1, (b) Conv2_1, (c) Conv3_1, (d) Conv4_1, and (e) Conv5_1.
As the results show, reconstructions from higher-level outputs mainly capture the arrangement of objects and the contours of items in the picture rather than exact pixel values, compared to reconstructions from lower-level outputs.
3.2 Style representation

For style representation, we used the feature space computing the correlations of different filter responses, which produces a texturized picture of the painting that captures its use and localized structure of color rather than its content.

For the style representation experiment, we used Van Gogh's famous artwork The Starry Night and computed style representations using the gradient of the loss function on (a) Conv1_1; (b) Conv1_1 and Conv2_1; (c) Conv1_1 through Conv3_1; (d) Conv1_1 through Conv4_1; (e) Conv1_1 through Conv5_1. The weight for each layer's error is 1 divided by the number of layers included in the total loss. Our results are shown in Figure 4.
Figure 4: Style representation using (a) Conv1_1; (b) Conv1_1, Conv2_1; (c) Conv1_1 through Conv3_1; (d) Conv1_1 through Conv4_1; (e) Conv1_1 through Conv5_1.
3.3 Recombination of the content and style

The content and style cannot be completely separated, and there is usually no picture that perfectly matches the content and style from different sources. To recombine content and style from different pictures, we used gradient descent to minimize the joint loss function defined on the well-separated content and style losses. We can therefore regulate the emphasis on content or style through the ratio $\beta / \alpha$. In the experiment, we rendered a photo taken at the SFU Burnaby campus in the style of (a) Der Schrei, (b) The Shipwreck of the Minotaur, and (c) High Mountains, with ratios of 100, 1000, and 10000.
Figure 5: Experiment results with different styles and different ratios.
4 Conclusion

In this project, we implemented an artistic style learning algorithm and also improved its performance by replacing the ReLU activation function in the original VGG convolutional neural network with the softplus activation function.
5 Contribution

Setting up the platform: Zixuan Wang.
Deriving the forward and backward propagation formulas: Bowen Sun.
Writing and implementing the prototype derivative function: Bowen Sun.
Improving the derivative function: Zixuan Wang.
References

[1] Gatys, Leon A., Alexander S. Ecker, and Matthias Bethge. "A neural algorithm of artistic style." arXiv preprint arXiv:1508.06576 (2015).
[2] Russakovsky, Olga, Jia Deng, Hao Su, et al. "ImageNet large scale visual recognition challenge." International Journal of Computer Vision (2014): 1-42.
[3] Gatys, Leon A., Alexander S. Ecker, and Matthias Bethge. "Texture synthesis and the controlled generation of natural stimuli using convolutional neural networks." arXiv preprint arXiv:1505.07376 (2015).
[4] Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).