Texture synthesis and artistic style transfer

In this homework you are to implement A Neural Algorithm of Artistic Style. It is an extension of the Texture Synthesis Using Convolutional Neural Networks method.

The core of the method is VGG and constrained optimization. The constraints are of two types: content and style. Given a content image C and a style image S, we want to generate an image X with the content of C and the style (whatever that really means) of S.

We want to design a loss function for the optimization process. As [1] and [2] show, an input image is easily invertible from the outputs of intermediate layers. This motivates the idea of making the intermediate representation $F_X$ of X close to the representation $F_C$ of C.

$$ L_{content} = || F_X - F_C || \rightarrow \min_X $$
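The content term can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the feature maps are taken to be already extracted as arrays of shape (C, H, W), the norm is taken to be the squared L2 norm, and the name `content_loss` is hypothetical (the notebook's own code may normalize differently).

```python
import numpy as np

def content_loss(feats_x, feats_c):
    """Squared L2 distance between two feature maps of shape (C, H, W)."""
    diff = feats_x - feats_c
    return float(np.sum(diff ** 2))

rng = np.random.default_rng(0)
F_C = rng.standard_normal((4, 8, 8))   # stand-in for the features of the content image
F_X = F_C.copy()

print(content_loss(F_X, F_C))          # identical features give zero loss
F_X[0, 0, 0] += 1.0                    # perturb a single activation
print(content_loss(F_X, F_C))          # the loss grows with the perturbation
```

Because the comparison is element-wise, moving an object to a different location changes this loss, which is exactly the spatial sensitivity the content term relies on.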

Note that the representation $F$ preserves spatial information. Idea: let us discard it, so that we still know which objects are in the picture but can no longer recover their location. Style can be thought of as something independent of content, something we are left with once the content is removed. L. Gatys suggests discarding spatial information by computing correlations between the feature maps $F$. If $F$ has dimensions CxWxH (channels x width x height), then the correlation matrix is CxC, and look, there are no spatial dimensions left. So the style term is responsible for matching these correlation (Gram) matrices.

$$ L_{style} = || Gram(F_X) - Gram(F_S) || \rightarrow \min_X $$
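A minimal NumPy sketch of the Gram matrix and the style term, assuming features of shape (C, H, W) and a squared Frobenius norm. The names `gram_matrix` and `style_loss` and the division by the number of spatial positions are assumptions made for illustration, not necessarily what the notebook does. The example also checks the key property: shuffling pixel positions does not change the Gram matrix.

```python
import numpy as np

def gram_matrix(feats):
    """Map features of shape (C, H, W) to a (C, C) Gram matrix."""
    c = feats.shape[0]
    flat = feats.reshape(c, -1)             # (C, H*W): one row per channel
    return flat @ flat.T / flat.shape[1]    # average over spatial positions

def style_loss(feats_x, feats_s):
    """Squared Frobenius distance between the two Gram matrices."""
    diff = gram_matrix(feats_x) - gram_matrix(feats_s)
    return float(np.sum(diff ** 2))

rng = np.random.default_rng(0)
F_S = rng.standard_normal((3, 5, 7))        # stand-in for style image features

# Shuffle spatial positions (same permutation for every channel):
# the feature map changes, the Gram matrix does not.
perm = rng.permutation(5 * 7)
F_shuffled = F_S.reshape(3, -1)[:, perm].reshape(3, 5, 7)

print(gram_matrix(F_S).shape)
print(np.allclose(gram_matrix(F_S), gram_matrix(F_shuffled)))
```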

And finally we combine the two.

$$ L = \alpha L_{content} + \beta L_{style} \rightarrow \min_X $$
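To see how the combined objective behaves under iterative optimization, here is a toy sketch. It makes a strong simplifying assumption: the flattened feature map of shape (C, N) is itself the optimization variable, so there is no VGG and no image; in the real method the gradient is propagated further back to the pixels of X. Plain gradient descent stands in for L-BFGS, both norms are squared, and all names are hypothetical.

```python
import numpy as np

def gram(flat):
    """Gram matrix of features flattened to shape (C, N), N = H*W."""
    return flat @ flat.T / flat.shape[1]

def total_loss_and_grad(f_x, f_c, gram_s, alpha, beta):
    """Return alpha*L_content + beta*L_style and its gradient w.r.t. f_x."""
    n = f_x.shape[1]
    d_content = f_x - f_c
    d_style = gram(f_x) - gram_s
    loss = alpha * np.sum(d_content ** 2) + beta * np.sum(d_style ** 2)
    # d/dF ||F - F_C||^2           = 2 (F - F_C)
    # d/dF ||F F^T / N - G_S||^2   = (4 / N) (F F^T / N - G_S) F
    grad = 2.0 * alpha * d_content + (4.0 * beta / n) * (d_style @ f_x)
    return float(loss), grad

rng = np.random.default_rng(0)
f_c = rng.standard_normal((4, 64))            # toy content features
gram_s = gram(rng.standard_normal((4, 64)))   # toy style Gram matrix
f_x = rng.standard_normal((4, 64))            # random starting point

alpha, beta, step = 1.0, 1.0, 0.05
losses = []
for _ in range(200):
    loss, grad = total_loss_and_grad(f_x, f_c, gram_s, alpha, beta)
    losses.append(loss)
    f_x -= step * grad                        # plain gradient descent step

print(losses[0], losses[-1])
```

Setting `alpha = 0` leaves only the style term, which corresponds to pure texture synthesis.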

Read the paper and the code for details on which layers the features $F$ are taken from.

A little bit of history behind this texture generation method

Actually, the idea comes from the 90s, when mathematical models of textures were developed [3]. They defined a probabilistic model for texture generation, based on the idea that two images are samples of the same texture iff their statistics match. The statistics used are histograms of the given texture $I$ filtered with a number of filters: $\{hist(F_i * I), \quad i = 1,\dots, k\}$. Any image with the same statistics is considered a sample of texture $I$. The main drawback was that Gibbs sampling was employed (which is very slow). [4] suggested exactly the scheme we use now: starting from a random image, adjust its statistics iteratively so that they match the desired ones.

Now, what has changed: the filters. [4] used a carefully crafted set of filters, whereas now we use non-linear filters based on a neural network. We still use the idea of matching statistics, but the statistics have improved.

[1] A. Mahendran, A. Vedaldi. Understanding Deep Image Representations by Inverting Them

[2] A. Dosovitskiy, T. Brox. Inverting Visual Representations with Convolutional Networks

[3] Zhu et al., 1997. Filters, Random Fields and Maximum Entropy (FRAME): Towards a Unified Theory for Texture Modeling

[4] Portilla & Simoncelli, 2000. A Parametric Texture Model Based on Joint Statistics of Complex Wavelet Coefficients


To spare you technical problems, you may use complete code for the method. Your task will be to play around with it.

First part

Common mandatory part:

  • Generate your favourite texture (please do not use Starry Night). All you need to do is set the content weight to 0.
  • Stylize your favourite photo with your favourite style (hopefully something interesting).
  • Give an explanation for matching Gram matrices. What does it mean to minimize the distance between them in terms of random variables? Assume a true distribution $P$ and a model distribution $Q$. What class does $Q$ belong to when matching Gram matrices? Show that $KL (P || Q)$ is minimized when the Gram matrices are matched. In other words, you need to come up with a $Q$ such that the $KL$ divergence is minimized when the model's Gram matrix is equal to the target Gram matrix. If you do not understand the question, please spend more time on it. If you still want a hint after all, here is a Telegram bot for you (send /hint to it).

Second part

We give you two options for the second part.

First one (if you are lazy or do not have a GPU, do just this):

  • Implement Mean and Covariance matching functions instead of $Gram$ matching. That is:
    • Mean is a vector of size C which contains the means over the feature maps
    • Covariance matrix is a Gram matrix of $Feats-mean$
  • What is $Q$ now?
  • Generate a texture and stylize with the $mean$ loss only, and with the $mean$ + $Covariance$ loss. Plot the results side by side (3 textures and 3 stylized images). What do you think? Actually, the $Gram$ matrix, the $Mean$, or the $Mean$ + $Covariance$ matrix can each be thought of as a texture descriptor. Does the $mean$ encoding have enough parameters to represent textures?
  • OR come up with your own method to remove spatial information instead of the above.
  • Bonus: you can mix several styles, averaging their representations. It can be fun. Some examples are here.
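The two descriptors from the first option can be sketched in NumPy as follows. This is an illustration only, assuming features of shape (C, H, W); the function names are hypothetical, and the homework itself expects the equivalent to be written inside the notebook's framework so that gradients can flow to the image.

```python
import numpy as np

def channel_mean(feats):
    """Per-channel mean over spatial positions: (C, H, W) -> (C,)."""
    return feats.reshape(feats.shape[0], -1).mean(axis=1)

def channel_covariance(feats):
    """Gram matrix of the mean-centred features: (C, H, W) -> (C, C)."""
    flat = feats.reshape(feats.shape[0], -1)
    centred = flat - flat.mean(axis=1, keepdims=True)
    return centred @ centred.T / flat.shape[1]

def mean_cov_loss(feats_x, feats_s, use_cov=True):
    """Squared distance between means, optionally plus covariances."""
    loss = np.sum((channel_mean(feats_x) - channel_mean(feats_s)) ** 2)
    if use_cov:
        d_cov = channel_covariance(feats_x) - channel_covariance(feats_s)
        loss += np.sum(d_cov ** 2)
    return float(loss)

rng = np.random.default_rng(0)
F = rng.standard_normal((3, 6, 6)) + 1.0     # toy features with non-zero mean

flat = F.reshape(3, -1)
gram = flat @ flat.T / flat.shape[1]
mu, cov = channel_mean(F), channel_covariance(F)

# Sanity checks: agreement with NumPy's biased covariance estimate, and
# the identity Gram = Covariance + mean * mean^T.
print(np.allclose(cov, np.cov(flat, bias=True)))
print(np.allclose(gram, cov + np.outer(mu, mu)))
```

The mean has only C parameters while the covariance has on the order of C^2, which is worth keeping in mind when comparing the generated textures.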

Second one (hardcore):

  • Substitute the Gram matrices with a discriminator, as in a GAN. That is, by matching Gram matrices you match distributions, and a discriminator is designed to match distributions too. The $Q$ we have defined is probably too weak or too restrictive. A neural-network-based discriminator should be more flexible in this sense.
    • The procedure will be a little unusual: we will optimize a NN inside the optimization loop w.r.t. the image.
    • You need to define a pixel-level discriminator (at each layer you have $WH$ objects, each with $C$ features). Basically, it should decide whether a pixel came from the style image or from the current image $X$.
    • So the process goes like this:
      • At each image optimization iteration, update D (you do not actually need minibatch updates here; you can simulate fully-connected layers with 1x1 convolutions and softmax with sigmoids). You will need to find a trade-off for how long and how frequent the updates should be.
      • Then propagate the gradient just as in a GAN when optimizing $G$, i.e. swap the labels (another strategy is in here, eq. 4).
      • Let L-BFGS (or whatever; Adam will probably be more stable) update $X$.
    • The discriminator architecture is up to you. It is better to start with logistic regression, which should emulate $Mean$ + $Cov$ matching (shouldn't it?).
    • I have only tried this myself without the content loss.
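The alternating procedure above can be sketched on toy data. Everything here is an assumption made for illustration: the discriminator is plain logistic regression applied to each pixel (equivalent to a 1x1 convolution with a sigmoid), the "image" is represented directly by its feature pixels rather than by an actual image passed through VGG, plain gradient steps replace L-BFGS or Adam, and the step sizes and iteration counts are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def to_pixels(feats):
    """(C, H, W) -> (H*W, C): each pixel becomes one object with C features."""
    return feats.reshape(feats.shape[0], -1).T

def update_discriminator(w, b, px_style, px_x, lr=0.1, n_steps=20):
    """A few full-batch gradient steps on the logistic loss
    (label 1 = style pixel, label 0 = pixel of the current image X)."""
    data = np.concatenate([px_style, px_x])
    labels = np.concatenate([np.ones(len(px_style)), np.zeros(len(px_x))])
    for _ in range(n_steps):
        p = sigmoid(data @ w + b)
        w = w - lr * (data.T @ (p - labels)) / len(data)
        b = b - lr * np.mean(p - labels)
    return w, b

def generator_grad(w, b, px_x):
    """Gradient of sum_i -log D(x_i) w.r.t. every pixel x_i (swapped labels)."""
    p = sigmoid(px_x @ w + b)
    return -(1.0 - p)[:, None] * w[None, :]

def mean_gap(a, c):
    return float(np.linalg.norm(a.mean(axis=0) - c.mean(axis=0)))

rng = np.random.default_rng(0)
px_style = to_pixels(rng.standard_normal((3, 8, 8)) + 2.0)  # toy style features
px_x = to_pixels(rng.standard_normal((3, 8, 8)))            # toy features of X

gap_before = mean_gap(px_x, px_style)
w, b = np.zeros(3), 0.0
for _ in range(40):
    w, b = update_discriminator(w, b, px_style, px_x)   # train D for a while
    px_x = px_x - 0.1 * generator_grad(w, b, px_x)      # then take one step on X
gap_after = mean_gap(px_x, px_style)

print(gap_before, gap_after)
```

The discriminator is warm-started from the previous iteration instead of being retrained from scratch, which is one possible answer to the trade-off between update length and update frequency.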

Do everything in this notebook; I need your code as well as the generated images.


  • In case you do not have a GPU, you need to replace the line:

    from lasagne.layers.dnn import Conv2DDNNLayer as ConvLayer

    with:

    from lasagne.layers import Conv2DLayer as ConvLayer

  • If you do not have a GPU, resize your images to 256x256 at most. Even at this resolution it may take an hour. You can decrease the number of iterations if it takes too long.