In [0]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Neural Style Transfer With Eager Execution & Keras

Notebook originally contributed by: github.com/raskutti

Overview

This notebook demonstrates neural style transfer, a technique for rendering one image in the style of another. The algorithm is outlined in more detail in this paper. The material here is heavily based on the excellent work in this article by Raymond Yuan and this notebook by Francois Chollet.

Neural style transfer is an optimization technique that produces a generated image conveying the content of one image (the content image) in the style of another (the style image). Content images are generally object-specific, for example a portrait, while style images are generally background-like, for example scenery.

As with many deep learning algorithms, neural style transfer defines a loss function and then minimizes it. Suppose we have a function $C$ to measure content and a function $S$ to measure style, as well as distance measures for content and for style, denoted by $L_c$ and $L_s$ respectively. Then the loss function $L$ is as below, where $k$ is the content image, $m$ is the style image, and $n$ is the generated output image (the variable to minimize over).

$$ L(n) = L_c(C(k), C(n)) + L_s(S(m), S(n)) $$
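
In practice the two terms are weighted against each other, which is what the compute_loss function later in this notebook does with its content_weight and style_weight arguments. Writing the content and style weights as $\alpha$ and $\beta$ respectively ($\alpha$ and $\beta$ are introduced here purely as notation), the loss we actually minimize is

$$ L(n) = \alpha \, L_c(C(k), C(n)) + \beta \, L_s(S(m), S(n)) $$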

The example below takes a graffiti drawing of Eminem as the content image and a Julia set fractal as the style image. The generated image renders the same portrait of Eminem in the style of the fractal.

Original image by geishaboy500

Download Images


In [0]:
import tensorflow as tf
assert tf.__version__.startswith('2')
import numpy as np
import matplotlib.pyplot as plt

In [0]:
from tensorflow.keras.preprocessing.image import load_img, img_to_array

In [0]:
!wget --quiet -O 'eminem.jpg' https://upload.wikimedia.org/wikipedia/commons/f/f1/Southsea_Skatepark_Graff_%287%29_%283874828505%29.jpg
!wget --quiet -O 'fractal.jpg' https://upload.wikimedia.org/wikipedia/commons/1/17/Julia_set_%28highres_01%29.jpg

!ls

We can now display the content image and the style image side by side.


In [0]:
plt.figure(figsize = (12, 6))

plt.subplot(1, 2, 1)
plt.imshow(load_img('eminem.jpg'))
plt.axis('off')

plt.subplot(1, 2, 2)
plt.imshow(load_img('fractal.jpg'))
plt.axis('off')

plt.show()

Processing Images


In [0]:
from tensorflow.keras.applications import vgg19

Let's create methods that will allow us to load and preprocess our images easily. We perform the same preprocessing steps that the VGG training process expects: VGG networks are trained on images with each channel normalized by mean = [103.939, 116.779, 123.68] and with channels in BGR order.


In [0]:
def preprocess_img(img_path):
  
  # Set the proportions of the image.
  
  width, height = load_img(img_path).size
  img_height = 500
  img_width = int(width * img_height / height)
  
  img = load_img(img_path, target_size=(img_height, img_width))
  img = img_to_array(img)
  img = np.expand_dims(img, axis=0)
  img = vgg19.preprocess_input(img)
  
  return img

In order to view the outputs of our optimization, we need to perform the inverse of the preprocessing step. Furthermore, since our optimized image may take values anywhere between $- \infty$ and $\infty$, we must clip them to the 0-255 range.


In [0]:
def deprocess_img(processed_img):

  x = processed_img.copy()
  
  if len(x.shape) == 4:
    x = np.squeeze(x, 0)

  if len(x.shape) != 3:
    raise ValueError('Invalid input to deprocessing image')
  
  # Perform the inverse of the preprocessing step.
  x[:, :, 0] += 103.939
  x[:, :, 1] += 116.779
  x[:, :, 2] += 123.68
  x = x[:, :, ::-1]

  x = np.clip(x, 0, 255).astype('uint8')
  return x

Define Content & Style Layers

In order to get both the content and style representations of our image, we will look at some intermediate layers within our model. As we go deeper into the model, these intermediate layers represent higher and higher order features. In this case, we are using the network architecture VGG19, a pretrained image classification network. These intermediate layers are necessary to define the representation of content and style from our images. For an input image, we will try to match the corresponding style and content target representations at these intermediate layers.

So, why do these intermediate outputs within our pretrained image classification network allow us to define style and content representations? At a high level, this phenomenon can be explained by the fact that in order for a network to perform image classification (which our network has been trained to do), it must understand the image. This involves taking the raw image as input pixels and building an internal representation via functions that turn the raw image pixels into an understanding of the features present within the image.

This is also partly why convolutional neural networks are able to generalize well. They are able to capture the invariances and defining features within classes (e.g. cats vs dogs) that are agnostic to background noise and other nuisances. Therefore, somewhere between where the raw image is input and the classification label is output, the model serves as a complex feature extractor. So by accessing intermediate layers, we’re able to describe the content and style of input images.
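
If you want to see which layer names are available to pull from, a quick check is to list them from the pretrained network directly. This is only an optional sketch; it assumes the ImageNet weights for VGG19 can be downloaded, just as the model-building code later in this notebook assumes.


In [0]:
# Optional: list the layer names of VGG19 so we know which ones we can
# reference below. Uses the vgg19 module imported earlier; the pretrained
# ImageNet weights are downloaded the first time this runs.
vgg = vgg19.VGG19(include_top=False, weights='imagenet')
print([layer.name for layer in vgg.layers])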

We’ll pull the following intermediate layers from our network.


In [0]:
content_layers = [
    'block5_conv2',
]

num_content_layers = len(content_layers)

style_layers = [
    'block1_conv1',
    'block2_conv1',
    'block3_conv1', 
    'block4_conv1', 
    'block5_conv1',
]

num_style_layers = len(style_layers)

Build The Model


In [0]:
from tensorflow.keras import models

VGG19 is a relatively simple model (compared with ResNet, Inception, etc.), and its feature maps tend to work better for style transfer than those of deeper, more complex architectures. We feed our input tensor to the model, then extract the feature maps (and subsequently the content and style representations) of the content, style, and generated images.

In order to access the intermediate layers corresponding to our style and content feature maps, we use the Keras functional API. With this API, defining a model simply involves defining the input and output, i.e. model = Model(inputs, outputs).


In [0]:
def get_model():
  """Creates a model with access to intermediate layers. 
  
  These layers will then be used to create a new model that will take the
  content image and return the outputs from these intermediate layers from the
  VGG model. 
  
  Returns:
    A keras model that takes image inputs and outputs the style and content
    intermediate layers.
  """

  vgg = vgg19.VGG19(include_top=False, weights='imagenet')
  vgg.trainable = False
 
  style_outputs = [vgg.get_layer(name).output for name in style_layers]
  content_outputs = [vgg.get_layer(name).output for name in content_layers]
  model_outputs = style_outputs + content_outputs

  return models.Model(vgg.input, model_outputs)

Define Content & Style Loss Functions

Content Loss

The function that defines content loss will take both the desired content image and our base input image. These images are passed to the network, and will return the intermediate layer outputs from our model. Then, the loss simply takes the Euclidean distance between the two intermediate representations of those images.

More formally, let $N$ be a pre-trained deep convolutional neural network. Let $X$ be any image; then $N(X)$ is the network fed by $X$. Let $A^l(k) \in N(k)$ and $B^l(n) \in N(n)$ denote the respective intermediate feature representations of the network at layer $l$ for inputs $k$ and $n$. Then we describe the content loss $L_c^l$ formally as below, where the sum runs over the entries of the layer-$l$ feature maps.

$$L_c^l(k, n) = \sum_{i,j} (A^l_{ij}(k) - B^l_{ij}(n))^2$$

We can use backpropagation to minimize the content loss, changing the initial image until it produces, at a given layer, a response similar to that of the original content image.


In [0]:
def compute_content_loss(base_content, target):
  return tf.reduce_mean(tf.square(base_content - target))

Style Loss

Computing style loss is a bit more involved, but follows the same principle, this time feeding our network the base input image and the style image. However, instead of comparing the raw intermediate outputs of the base input image and the style image, we instead compare the Gram matrices of the two outputs.

Mathematically, we describe the style representation of an image as the correlation between different filter responses given by the Gram matrix $G^l$, where $G^l_{ij}$ is the inner product (and represents the correlation) between the vectorized feature map $i$ and $j$ in layer $l$.
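
Concretely, writing $F^l$ for the matrix whose $i$-th row is the vectorized $i$-th feature map of layer $l$ ($F^l$ is notation introduced here only for this equation), the Gram matrix entries are

$$G^l_{ij} = \sum_k F^l_{ik} F^l_{jk}$$

which is what the gram_matrix function below computes, up to division by the number of spatial positions.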

To generate a style for our base input image, we perform gradient descent from the content image to transform it into an image that matches the style representation of the style image. We do so by minimizing the mean squared distance between the feature correlations (Gram matrices) of the style image and the input image. The contribution $E_l$ of each layer $l$ to the total style loss is described by

$$E_l(m, n) = \frac{1}{4C_l^2D_l^2} \sum_{i,j}(G^l_{ij}(m) - G^l_{ij}(n))^2$$

where $C_l$ is the number of feature maps, each of size $D_l = \textrm{height} \cdot \textrm{width}$.

Thus, the total style loss $L_s$ across each layer $l$ is

$$L_s(m, n) = \sum_l w_l E_l(m, n)$$

where we weight the contribution of each layer's loss by some factor $w_l$. In our case, we weight each layer equally, so $w_l = w \ \forall \ l$.


In [0]:
def gram_matrix(input_tensor):

  channels = int(input_tensor.shape[-1])
  a = tf.reshape(input_tensor, [-1, channels])
  n = tf.shape(a)[0]
  gram = tf.matmul(a, a, transpose_a=True)
  return gram / tf.cast(n, tf.float32)

def compute_style_loss(base_style, gram_target):

  height, width, channels = base_style.get_shape().as_list()
  gram_style = gram_matrix(base_style)
  
  return tf.reduce_mean(tf.square(gram_style - gram_target))

Apply Neural Style Transfer

We use the Adam optimizer to minimize our loss, iteratively updating our output image. In order to do this, we must know how to calculate our loss and its gradients. The L-BFGS optimization algorithm is often recommended for this task but is not used in this tutorial, as Adam allows us to demonstrate the autograd/gradient tape functionality with custom training loops, as per eager best practices.

We’ll define a little helper function that loads our content and style images and feeds them forward through our network, which then outputs the content and style feature representations from our model.


In [0]:
def feature_representations(model, content_path, style_path):
  """Helper function to compute our content and style feature representations.

  This function will simply load and preprocess both the content and style 
  images from their path. Then it will feed them through the network to obtain
  the outputs of the intermediate layers. 
  
  Arguments:
    model: the model that we are using
    content_path: the path to the content image
    style_path: the path to the style image
    
  Returns:
    The style and content features.
  """

  content_img = preprocess_img(content_path)
  style_img = preprocess_img(style_path)

  style_outputs = model(style_img)
  content_outputs = model(content_img)
  
  
  style_features = [
      style_layer[0] for style_layer in style_outputs[:num_style_layers]]
  content_features = [
      content_layer[0] for content_layer in content_outputs[num_style_layers:]]

  return style_features, content_features

Computing Loss & Gradients

Here we use tf.GradientTape to compute the gradients. It lets us take advantage of automatic differentiation by tracing operations so they can be differentiated later: the tape records the operations during the forward pass and can then compute the gradient of our loss function with respect to our input image for the backward pass.
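
Before wiring the tape into the style transfer loss, here is a toy sketch of the mechanism on its own (the variable and values are made up purely for illustration):


In [0]:
# Minimal tf.GradientTape example: the tape records y = x ** 2 during the
# forward pass, then returns dy/dx = 2x evaluated at the recorded x.
x = tf.Variable(3.0)
with tf.GradientTape() as tape:
  y = x ** 2
print(tape.gradient(y, x).numpy())  # 6.0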


In [0]:
def compute_loss(
    model, loss_weights, init_img, gram_style_features, content_features):
  """Computes the total loss.
  
  Arguments:
    model: the model that will give us access to the intermediate layers
    loss_weights: the weights of each loss function's contribution
      (style weight and content weight)
    init_img: the initial base image, that is updated according to the
      optimization process
    gram_style_features: precomputed gram matrices corresponding to the 
      defined style layers of interest
    content_features: precomputed outputs from defined content layers of
      interest
      
  Returns:
    The total loss, style loss, and content loss.
  """
  style_weight, content_weight = loss_weights
  
  # Feed our init image through our model. This will give us the content and 
  # style representations at our desired layers.
  model_outputs = model(init_img)
  
  style_output_features = model_outputs[:num_style_layers]
  content_output_features = model_outputs[num_style_layers:]
  
  style_loss, content_loss = 0, 0

  # Accumulate style losses from all layers. All weights are equal.
  style_layer_weight = 1.0 / num_style_layers

  for target_style, generated_style in zip(
      gram_style_features, style_output_features):
    style_loss += style_layer_weight * compute_style_loss(
        generated_style[0], target_style)
    
  # Accumulate content losses from all layers. All weights are equal.
  content_layer_weight = 1.0 / num_content_layers
  for target_content, generated_content in zip(
      content_features, content_output_features):
    content_loss += content_layer_weight * compute_content_loss(
        generated_content[0], target_content)
  
  style_loss *= style_weight
  content_loss *= content_weight

  total_loss = style_loss + content_loss 

  return total_loss, style_loss, content_loss

In [0]:
def compute_gradients(cfg):
  with tf.GradientTape() as tape: 
    all_loss = compute_loss(**cfg)

  total_loss = all_loss[0]

  return tape.gradient(total_loss, cfg['init_img']), all_loss

Optimization Loop


In [0]:
import time
import IPython
from PIL import Image
import IPython.display

We now combine all the functions above into this optimization loop. While this looks like a lot of code, a significant portion of it is dedicated to displaying generated images and reporting loss and time.


In [0]:
def run_style_transfer(content_path, style_path, n_iterations=1000,
                       content_weight=1e4, style_weight=1e-4,
                       display_iterations=True):
  """Run the neural style transfer algorithm.
  
  Arguments:
    content_path: the filename of the target content image
    style_path: the filename of the reference style image
    content_weight: the weight for the content features, where higher means the
      generated image will put heavier emphasis on content (default 1e4)
    style_weight: the weight for the style features, where higher means the
      generated image will put heavier emphasis on style (default 1e-4)
    n_iterations: the number of optimization iterations (default 1000)
    display_iterations: whether to display intermediate iterations of the
      generated images (default True)
    
  Returns:
    The final generated image and the total loss for that image.
  """

  model = get_model() 
  
  # We don't need to (or want to) train any layers of our model, so we set
  # trainable to False for each layer.
  for layer in model.layers:
    layer.trainable = False
  
  style_features, content_features = feature_representations(
      model, content_path, style_path)

  gram_style_features = [
      gram_matrix(style_feature) for style_feature in style_features
  ]
  
  init_img = preprocess_img(content_path)
  init_img = tf.Variable(init_img, dtype=tf.float32)

  # The optimizer params are somewhat arbitrary.
  # See tensorflow.org/api_docs/python/tf/keras/optimizers/Adam#__init__
  opt = tf.keras.optimizers.Adam(learning_rate=5, beta_1=0.99, epsilon=1e-1)
  
  # Store the result that minimizes loss as the best one.
  best_loss, best_img = float('inf'), None
  
  # Create a nice config 
  loss_weights = (style_weight, content_weight)
  cfg = {
      'model':               model,
      'loss_weights':        loss_weights,
      'init_img':            init_img,
      'gram_style_features': gram_style_features,
      'content_features':    content_features
  }

  start_time = time.time()
  global_start = time.time()
  
  norm_means = np.array([103.939, 116.779, 123.68])
  min_vals = -norm_means
  max_vals = 255 - norm_means   

  imgs = []
  for i in range(n_iterations):
    
    gradients, all_loss = compute_gradients(cfg)
    total_loss, style_loss, content_loss = all_loss
    opt.apply_gradients([(gradients, init_img)])
    clipped = tf.clip_by_value(init_img, min_vals, max_vals)
    init_img.assign(clipped)
    end_time = time.time() 
    
    # Update best loss and best image from total loss. 
    if total_loss < best_loss:
      best_loss = total_loss
      best_img = deprocess_img(init_img.numpy())
      
    if display_iterations:
      
      n_rows, n_cols = 2, 5
      display_interval = n_iterations / (n_rows * n_cols)
  
      if i % display_interval == 0:
        start_time = time.time()

        plot_img = deprocess_img(init_img.numpy())
        imgs.append(plot_img)

        IPython.display.clear_output(wait=True)
        IPython.display.display_png(Image.fromarray(plot_img))

        print('Iteration: {}'.format(i))        
        print('Total loss: {:.4e}, ' 
              'style loss: {:.4e}, '
              'content loss: {:.4e}, '
              'time: {:.4f}s'.format(total_loss, style_loss, content_loss,
                                     time.time() - start_time))

  if display_iterations:
    IPython.display.clear_output(wait=True)

    plt.figure(figsize=(14, 4))

    for i, img in enumerate(imgs):
      plt.subplot(n_rows, n_cols, i + 1)
      plt.imshow(img)
      plt.axis('off')
    
    print('Total time: {:.4f}s'.format(time.time() - global_start))
      
  return best_img, best_loss

In [0]:
best_img, best_loss = run_style_transfer('eminem.jpg', 'fractal.jpg')
print(best_loss.numpy())

Let's visualize the final generated image.


In [0]:
plt.figure(figsize=(10, 10))

plt.imshow(best_img)
plt.axis('off')

plt.show()

Let's display all three images side by side.


In [0]:
plt.figure(figsize=(20, 60))

plt.subplot(1, 3, 1)
plt.imshow(load_img('eminem.jpg'))
plt.axis('off')
plt.title('Content Image', fontdict = {'fontsize' : 40})

plt.subplot(1, 3, 2)
plt.imshow(load_img('fractal.jpg'))
plt.axis('off')
plt.title('Style Image', fontdict = {'fontsize' : 40})

plt.subplot(1, 3, 3)
plt.imshow(best_img)
plt.axis('off')
plt.title('Generated Image', fontdict = {'fontsize' : 40})

plt.show()

Another Example

Now let's see what Times Square would look like when painted by Monet!

Original image by Rafi B. from Somewhere in Texas :)


In [0]:
!wget --quiet -O 'times_square.jpg' https://upload.wikimedia.org/wikipedia/commons/9/9c/Times_square_at_night.jpg
!wget --quiet -O 'water_lilies.jpg' https://upload.wikimedia.org/wikipedia/commons/5/5d/Monet_Water_Lilies_1916.jpg
  
!ls

In [0]:
best_img, _ = run_style_transfer('times_square.jpg', 'water_lilies.jpg',
                                 display_iterations=False)

In [0]:
plt.figure(figsize=(20, 60))

plt.subplot(1, 3, 1)
plt.imshow(load_img('times_square.jpg'))
plt.axis('off')
plt.title('Content Image', fontdict = {'fontsize' : 40})

plt.subplot(1, 3, 2)
plt.imshow(load_img('water_lilies.jpg'))
plt.axis('off')
plt.title('Style Image', fontdict = {'fontsize' : 40})

plt.subplot(1, 3, 3)
plt.imshow(best_img)
plt.axis('off')
plt.title('Generated Image', fontdict = {'fontsize' : 40})

plt.show()

We can also tweak the content_weight and style_weight parameters of run_style_transfer to change the final generated image. The higher the content_weight parameter, the more content-heavy the generated image will be, and the higher the style_weight parameter, the more style-heavy the generated image will be.

Note that increasing the content_weight will have a similar effect to decreasing the style_weight, and vice versa.


In [0]:
style_heavy_img, _ = run_style_transfer('times_square.jpg', 'water_lilies.jpg',
                                        style_weight=1,
                                        display_iterations=False)

In [0]:
plt.imshow(style_heavy_img)
plt.axis('off')
plt.show()

In [0]:
content_heavy_img, _ = run_style_transfer('times_square.jpg', 'water_lilies.jpg',
                                          content_weight=1e8,
                                          display_iterations=False)

In [0]:
plt.imshow(content_heavy_img)
plt.axis('off')
plt.show()