
Style Transfer

The example below is inspired by the original tutorial on TensorFlow, as well as by this blog post. Another good example of Style Transfer using the CNTK framework is here. Here is the original paper on Artistic Style Transfer.

The main ideas behind style transfer are the following:

  • Starting from white noise, we try to optimize the current image $x$ to minimize some loss function
  • The loss function consists of three components: $\mathcal{L}(x) = \alpha\mathcal{L}_c(x,i) + \beta\mathcal{L}_s(x,s) + \gamma\mathcal{L}_t(x)$
    • $\mathcal{L}_c$ - content loss - shows how close the current image $x$ is to the original image $i$
    • $\mathcal{L}_s$ - style loss - shows how close the current image $x$ is to the style image $s$
    • $\mathcal{L}_t$ - total variation loss (we will only add it at the very end of our example) - makes sure that the resulting image is smooth, i.e. it penalizes the mean squared difference between neighbouring pixels of the image $x$

Those loss functions have to be designed in a clever way, so that, for example, the style loss reflects similarity of the images' styles rather than of their actual content. To achieve this, we will compare some deeper feature layers of a CNN that looks at the image.

Let's start by loading a couple of images:

[1]
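(The download code is not shown; a minimal sketch of what it could look like, using `urllib` instead of `curl` - the URLs below are placeholders, not the originals:)

```python
# Download a content image and a style image.
# The URLs are hypothetical - substitute any two images you like.
import urllib.request

content_url = "https://example.com/content.jpg"  # placeholder URL
style_url = "https://example.com/style.jpg"      # placeholder URL

urllib.request.urlretrieve(content_url, "image.jpg")
urllib.request.urlretrieve(style_url, "style.jpg")
```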
(curl progress output from downloading the content and style images)
[30]

Let's load those images and resize them to $512\times512$. Also, we will generate the resulting image img_result as a random array.

[31]
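(A minimal sketch of the loading step, assuming the file names image.jpg and style.jpg from the download sketch above, and using PIL and NumPy:)

```python
import numpy as np
from PIL import Image

img_size = 512

def load_image(fn):
    # Resize to 512x512 and scale pixel values to [0,1]
    img = Image.open(fn).convert('RGB').resize((img_size, img_size))
    return np.array(img, dtype=np.float32) / 255.0

img_content = load_image("image.jpg")
img_style = load_image("style.jpg")

# The image we will optimize, initialized with uniform random noise
img_result = np.random.uniform(size=(img_size, img_size, 3)).astype(np.float32)
```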
Output

To calculate style loss and content loss, we need to work in the feature space extracted by a CNN. We can use different CNN architectures, but for simplicity we will choose VGG-16, pre-trained on ImageNet.

[32]
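(A sketch of loading the model with Keras; include_top=False drops the fully-connected classifier head, and the weights stay frozen because we optimize the image, not the network:)

```python
import tensorflow as tf

# VGG-16 convolutional base pre-trained on ImageNet, without the classifier head
vgg = tf.keras.applications.VGG16(include_top=False, weights='imagenet')
vgg.trainable = False  # we adjust the image, not the network weights
```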

Let's have a look at the model architecture:

[33]
Model: "vgg16"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_3 (InputLayer)        [(None, None, None, 3)]   0         
                                                                 
 block1_conv1 (Conv2D)       (None, None, None, 64)    1792      
                                                                 
 block1_conv2 (Conv2D)       (None, None, None, 64)    36928     
                                                                 
 block1_pool (MaxPooling2D)  (None, None, None, 64)    0         
                                                                 
 block2_conv1 (Conv2D)       (None, None, None, 128)   73856     
                                                                 
 block2_conv2 (Conv2D)       (None, None, None, 128)   147584    
                                                                 
 block2_pool (MaxPooling2D)  (None, None, None, 128)   0         
                                                                 
 block3_conv1 (Conv2D)       (None, None, None, 256)   295168    
                                                                 
 block3_conv2 (Conv2D)       (None, None, None, 256)   590080    
                                                                 
 block3_conv3 (Conv2D)       (None, None, None, 256)   590080    
                                                                 
 block3_pool (MaxPooling2D)  (None, None, None, 256)   0         
                                                                 
 block4_conv1 (Conv2D)       (None, None, None, 512)   1180160   
                                                                 
 block4_conv2 (Conv2D)       (None, None, None, 512)   2359808   
                                                                 
 block4_conv3 (Conv2D)       (None, None, None, 512)   2359808   
                                                                 
 block4_pool (MaxPooling2D)  (None, None, None, 512)   0         
                                                                 
 block5_conv1 (Conv2D)       (None, None, None, 512)   2359808   
                                                                 
 block5_conv2 (Conv2D)       (None, None, None, 512)   2359808   
                                                                 
 block5_conv3 (Conv2D)       (None, None, None, 512)   2359808   
                                                                 
 block5_pool (MaxPooling2D)  (None, None, None, 512)   0         
                                                                 
=================================================================
Total params: 14,714,688
Trainable params: 0
Non-trainable params: 14,714,688
_________________________________________________________________

Let's define a function that will allow us to extract intermediate features from the VGG network:

[34]
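(One possible implementation: a Keras sub-model that maps the input image to the activations of the requested layers:)

```python
def layer_extractor(layers):
    # Return a model that outputs the activations of the given VGG layers
    outputs = [vgg.get_layer(name).output for name in layers]
    return tf.keras.Model(inputs=vgg.input, outputs=outputs)
```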

Content Loss

Content loss shows how close our current image $x$ is to the original image. It looks at intermediate feature layers of the CNN and computes the squared error. Content loss on layer $l$ is defined as

$$\mathcal{L}_c = {1\over2}\sum_{i,j} \left(F_{ij}^{(l)}-P_{ij}^{(l)}\right)^2$$

where $F^{(l)}$ and $P^{(l)}$ are the features at layer $l$.

[35]
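(A direct translation of the formula above into TensorFlow:)

```python
def content_loss(current, target):
    # 1/2 * sum of squared differences between feature maps
    return 0.5 * tf.reduce_sum(tf.square(current - target))
```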

Now we will implement the main trick of style transfer: optimization. We will start with a random image, and then use a TensorFlow optimizer to adjust the image to minimize the content loss.

Important: in our case, all computations are performed using the GPU-aware TensorFlow framework, which allows this code to run much more efficiently on a GPU.

[63]
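(A minimal sketch of such an optimization loop; the content layer block4_conv2, the learning rate, and the step count are all assumptions, not the notebook's original values:)

```python
# Sub-model returning the features of one layer (layer choice is an assumption)
content_model = tf.keras.Model(vgg.input, vgg.get_layer('block4_conv2').output)

def preprocess(img):
    # VGG expects images in [0,255], BGR, centered on the ImageNet mean
    return tf.keras.applications.vgg16.preprocess_input(img * 255.0)

# Target features of the content image, computed once
target = content_model(preprocess(img_content[np.newaxis, ...]))

img = tf.Variable(img_result[np.newaxis, ...])
opt = tf.keras.optimizers.Adam(learning_rate=0.01)

for step in range(300):
    with tf.GradientTape() as tape:
        loss = content_loss(content_model(preprocess(img)), target)
    grads = tape.gradient(loss, img)
    opt.apply_gradients([(grads, img)])
    img.assign(tf.clip_by_value(img, 0.0, 1.0))  # keep pixel values valid
```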
Output

Exercise: Try experimenting with different layers in the network and see what happens. You can also try optimizing for several layers together, but you would have to change the code for content_loss a bit.

Style Loss

Style loss is the main idea behind Style Transfer. We compare not the actual features, but their Gram matrices, which are defined as $G = A\times A^T$.

The Gram matrix is similar to a correlation matrix: it shows how some filters depend on the others. Style loss is computed as a sum of losses from different layers, which are often taken with weighting coefficients.

The total loss function for style transfer is a sum of the content loss and the style loss.

[64]
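(A possible implementation; with the feature map flattened to shape (h·w, c), $A^T A$ gives the c×c matrix of filter co-activations. The normalization by the number of spatial positions is an assumption:)

```python
def gram_matrix(features):
    # Channel-by-channel co-activation matrix of one feature map
    _, h, w, c = features.shape
    a = tf.reshape(features, (-1, c))
    return tf.matmul(a, a, transpose_a=True) / tf.cast(h * w, tf.float32)

def style_loss(current, target):
    # Squared distance between Gram matrices
    return tf.reduce_sum(tf.square(gram_matrix(current) - gram_matrix(target)))
```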

Putting it all together

We will define a total_loss function that calculates the combined loss, and run the optimization:

[65]
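(A sketch of how the pieces could be combined; the layer names and weights here are assumptions, not the notebook's original values:)

```python
content_layers = ['block4_conv2']
style_layers = ['block1_conv2', 'block2_conv2', 'block3_conv3',
                'block4_conv3', 'block5_conv3']
model = layer_extractor(content_layers + style_layers)

content_weight, style_weight = 1.0, 1e-3  # assumed coefficients

def total_loss(img, content_targets, style_targets):
    feats = model(preprocess(img))
    c_feats = feats[:len(content_layers)]
    s_feats = feats[len(content_layers):]
    loss = content_weight * tf.add_n(
        [content_loss(f, t) for f, t in zip(c_feats, content_targets)])
    loss += style_weight * tf.add_n(
        [style_loss(f, t) for f, t in zip(s_feats, style_targets)])
    return loss
```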
Output

The code below performs the actual optimization of the loss. Keep in mind that even with a GPU the optimization takes a significant amount of time. You can run the cell below several times to improve the result.
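(A sketch of such a loop, reusing total_loss from above; the learning rate and step count are assumptions:)

```python
# Pre-compute target features once
content_targets = model(preprocess(img_content[np.newaxis, ...]))[:len(content_layers)]
style_targets = model(preprocess(img_style[np.newaxis, ...]))[len(content_layers):]

img = tf.Variable(img_result[np.newaxis, ...])
opt = tf.keras.optimizers.Adam(learning_rate=0.01)

for step in range(1000):
    with tf.GradientTape() as tape:
        loss = total_loss(img, content_targets, style_targets)
    grads = tape.gradient(loss, img)
    opt.apply_gradients([(grads, img)])
    img.assign(tf.clip_by_value(img, 0.0, 1.0))
```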

Add variation loss

Variation loss allows us to make the image less noisy by minimizing the differences between neighbouring pixels.

We will also start the optimization from the original content image, which allows us to keep more content details in the image without complicating the content loss function. We will add some noise to it, though.

[95]
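(A sketch of both changes, using TensorFlow's built-in tf.image.total_variation; the tv weight and noise scale are assumptions:)

```python
tv_weight = 30.0  # assumed coefficient

def total_loss_tv(img, content_targets, style_targets):
    # Combined loss plus a smoothness penalty on neighbouring pixels
    return (total_loss(img, content_targets, style_targets)
            + tv_weight * tf.reduce_sum(tf.image.total_variation(img)))

# Start from the content image with a little noise added
start = img_content + np.random.normal(scale=0.1, size=img_content.shape)
img = tf.Variable(np.clip(start, 0.0, 1.0)[np.newaxis, ...].astype(np.float32))
opt = tf.keras.optimizers.Adam(learning_rate=0.01)

for step in range(1000):
    with tf.GradientTape() as tape:
        loss = total_loss_tv(img, content_targets, style_targets)
    grads = tape.gradient(loss, img)
    opt.apply_gradients([(grads, img)])
    img.assign(tf.clip_by_value(img, 0.0, 1.0))
```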
Output