Learn to transfer the artistic style of an image using neural style transfer

Diwakar Basnet
9 min read

Learning about neural style transfer is a great place to start if you want to deepen your understanding of Convolutional Neural Networks (CNNs).

In this article, we will implement neural style transfer in Python using the VGG19 model and apply the style of one image to another. The code follows the mathematics of the paper "A Neural Algorithm of Artistic Style".

Let's start!

What is neural style transfer?

Neural style transfer is a technique that takes two input images, a content image and a style reference image, and produces an output image called the stylized image. This output has the same content as the content image and a style similar to the style image.

The architecture of the VGG-19 model

We will be using this model for our style transfer so let's understand what happens in its architecture first.

Photo by Clifford K. Yang on ResearchGate

Extracting content

  • Along the processing hierarchy of the network, the input image is transformed into representations that increasingly capture the actual content of the image rather than its detailed pixel values.

  • We refer to the feature responses in the higher layers of the network as the content representation.

  • The second convolutional layer of the fifth block (conv5_2) of the pre-trained VGG-19 network is used as the content extractor.

Extracting style

  • To obtain a representation of the style of an input image, we use correlations between the different filter responses across the different parts of the image. We obtain a stationary, multi-scale representation of the input image, which captures its texture information but not the global arrangement.

  • The matrix of these correlations between feature maps is known as the Gram matrix.

  • The layers used for the calculation of the Gram matrix are conv1_1, conv2_1, conv3_1, conv4_1, and conv5_1 (see the sketch after this list for the corresponding Keras layer names).
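
The paper uses the convX_Y naming; in the Keras implementation of VGG-19 the same layers are called blockX_convY (for example, conv5_2 corresponds to block5_conv2). If you want to confirm the available layer names yourself, here is a small exploratory sketch (not required for the rest of the code):

# Exploratory sketch: print the convolutional layer names of Keras' VGG19
from keras.applications import vgg19

vgg = vgg19.VGG19(weights="imagenet", include_top=False)
for layer in vgg.layers:
  if "conv" in layer.name:
    print(layer.name)  # block1_conv1, block1_conv2, ..., block5_conv4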

Style transfer using VGG-19 in TensorFlow

Preparing environment

Note: If you already know how to enable GPU in Google Colab, you can skip this part 👌.

In my case, I am programming in Google Colab so that I can run the training on a GPU for free. Using a GPU reduces the training time by a huge amount.

To enable GPU on Colab you have to:

  1. Go to "Change runtime type" under the Runtime menu.

  2. Select GPU as the hardware accelerator.

    With that, we have access to a GPU. Now we have to check that TensorFlow can see it by running the following code.

# Check that TensorFlow can see the GPU
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

Now TensorFlow operations in this Colab session will run on the GPU.

Importing libraries and images

import numpy as np
import matplotlib.pyplot as plt
import os

%matplotlib inline

from tensorflow import keras
from keras import optimizers
from keras.optimizers import schedules

import PIL
import cv2

For content and style images, I mounted my Google Drive and then used the file paths of the images stored there.

# Load content and style images
content_path = '/content/drive/MyDrive/Images/chicago.jpg'
style_path = '/content/drive/MyDrive/Images/great-wave.jpg'

# Read the image files into numpy arrays
content = plt.imread(content_path)
style = plt.imread(style_path)

# Display the images
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(15,15))
ax1.imshow(content)
ax1.set_title('Content image')
ax2.imshow(style)
ax2.set_title('Style image')
plt.show()

Gram matrix and loss functions

Gram matrix

To get the correlation of all the channels with respect to each other, we calculate the Gram matrix. We will use the Gram matrix to measure the degree of correlation between channels, which later acts as a measure of style itself.

In simple words, a Gram matrix is the matrix obtained by multiplying a matrix by its own transpose. Here, each feature map is flattened into a vector and the vectors are stacked into a matrix; the dot product of this matrix with its transpose gives the Gram matrix.

def gram_matrix(input_tensor):
  input_tensor = tf.transpose(input_tensor, (2, 0, 1))
  features = tf.reshape(input_tensor, [tf.shape(input_tensor)[0], -1])
  gram = tf.matmul(features, tf.transpose(features)) 
  return gram
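
As a quick illustration (a hypothetical example, not from the original post), the Gram matrix of a feature map with C channels is a C × C matrix, regardless of the spatial size of the feature map:

# Hypothetical example: a (height, width, channels) feature map gives a (channels, channels) Gram matrix
dummy_features = tf.random.uniform([32, 32, 64])
print(gram_matrix(dummy_features).shape)  # (64, 64)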

Now that we have the Gram matrix, we can calculate the style cost, which measures how similar the channel correlations (i.e. the styles) of two images are within a layer.

Style cost

To calculate the style cost, we compute the Gram matrices of both the style image and the generated (combination) image and measure how far apart they are. For a given layer l, the cost is

style_cost = Σ (G − A)² / (4 · N² · M²)

where "G" and "A" are the Gram matrices of the style and combination features at layer l, and the denominator is a normalization term (in the function below, N is taken as 3 channels and M as the image size img_nrows × img_ncols). This cost function will later be summed over several layers to compute the total style loss.

def style_cost(style, combination):
  G = gram_matrix(style)
  A = gram_matrix(combination)
  channels = 3
  size = img_nrows * img_ncols
  return tf.reduce_sum(tf.square(G-A)) / (4.0 * (channels ** 2)*(size ** 2))
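
Note that style_cost expects the feature maps of a single image with shape (height, width, channels), not a batch. As a hypothetical sanity check (the img_nrows and img_ncols values below are placeholders; they are set properly in the training section), the style cost of a feature map compared with itself is zero:

# Hypothetical sanity check: identical feature maps give zero style cost
img_nrows, img_ncols = 400, 533  # placeholder values; computed from the content image later
feature_map = tf.random.uniform([28, 28, 256])
print(float(style_cost(feature_map, feature_map)))  # prints 0.0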

Content cost

The content loss function is much simpler than the style function: images with similar content tend to produce similar activations in the deeper layers of the network, so we can compare the deep-layer activations of the content image and the generated image directly.

Where "p" is the input content image, "x" is generated combined image and "l" is the layer whose activation we are going to use to compute loss. In my function below there won't be any necessity to send an activation layer.

def content_cost(content, combination):
  return tf.reduce_sum(tf.square(combination - content)) / 2

Loading the model

# Loading VGG19 model
from keras.applications import vgg19
from keras.utils import plot_model

model = vgg19.VGG19(weights="imagenet", include_top=False)

The "include_top" parameter is set to false because we are using our model for feature extraction so we won't need the classifier part. And we are using the weights used by VGG19 on imagenet dataset.

Calculation of loss function

First, let us create a feature extractor from a pre-trained model.

from keras.models import Model
outputs_dict = dict([(layer.name, layer.output) for layer in model.layers])
feature_extractor = Model(inputs=model.inputs, outputs=outputs_dict)

The first line creates a Python dictionary that maps the name of each layer in the pre-trained model to its output tensor. The second line creates a new model named feature_extractor, which takes the same inputs as the original model (model.inputs) but outputs that dictionary of layer outputs. The resulting feature_extractor can then be used to extract intermediate features from any input image.
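
Because the outputs are a dictionary, calling feature_extractor on a batch of images returns a dictionary mapping each layer name to its activations. A small illustrative call (the input size here is arbitrary, since VGG19 without the top is fully convolutional):

# Illustrative call: the extractor returns a dict of layer name -> activation tensor
dummy_batch = tf.zeros([1, 224, 224, 3])
features = feature_extractor(dummy_batch)
print(features["block5_conv2"].shape)  # (1, 14, 14, 512)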

Now, we are going to define which layers we are going to use to calculate the loss function of the style and which layer we are going to use to calculate the loss function of the content.

style_layers = [
    "block1_conv1",
    "block2_conv1",
    "block3_conv1",
    "block4_conv1",
    "block5_conv1",
]

content_layer = "block5_conv2"

content_weight = 2.5e-8
style_weight = 1e-6

def loss_function(combination_image, content_image, style_image):
  # 1. Combine all the images in the same tensor
  input_tensor = tf.concat(
      [content_image, style_image, combination_image],
      axis = 0
  )
  # 2. Get the values in all the layers for the three images
  features = feature_extractor(input_tensor)
  # 3. Initialize the loss
  loss = tf.zeros(shape=())
  # 4. Extract the content layers + content loss
  layer_features = features[content_layer]
  content_image_features = layer_features[0, :, :, :]
  combination_features = layer_features[2, :, :, :]

  loss = loss + content_weight * content_cost(
      content_image_features, combination_features
  )
  # 5. Extract the style layers + style loss
  for layer_name in style_layers:
    layer_features = features[layer_name]
    style_features = layer_features[1, :, :, :]
    combination_features = layer_features[2, :, :, :]
    sl = style_cost(style_features, combination_features)
    loss += (style_weight / len(style_layers)) * sl

  return loss

Learning of the Neural Style Transfer network

We will now write another function which will:

  • Calculate the gradients of the loss function we just defined.

  • Use these gradients to update the target image.

With GradientTape, we can take advantage of automatic differentiation, which can calculate the gradients of a function based on its composition. We will also use the tf.function decorator to speed up the operations.

When a function is decorated with @tf.function, TensorFlow will convert the Python function into a TensorFlow graph, which can be executed much more efficiently than the original Python code.

@tf.function
def compute_loss_and_grads(combination_image, content_image, style_image):
  with tf.GradientTape() as tape:
    loss = loss_function(combination_image, content_image, style_image)
  grads = tape.gradient(loss, combination_image)
  return loss, grads

With this we have the learning phase done.

Image processing and generation

Preprocess image

Preprocessing consists of converting the images into the format that our network expects: resized, batched, and with the VGG-specific channel preprocessing applied.

def preprocess_image(image_path):
  # Util function to open, resize and format images into appropriate tensors
  img = keras.preprocessing.image.load_img(
      image_path, target_size=(img_nrows, img_ncols)
  )
  img = keras.preprocessing.image.img_to_array(img)
  img = np.expand_dims(img, axis=0)
  img = vgg19.preprocess_input(img)
  return tf.convert_to_tensor(img)

Deprocess image

To deprocess the images, we follow roughly the reverse of the preprocessing: add the ImageNet channel means back, convert from BGR to RGB, and clip to valid pixel values.

# Deprocess image
def deprocess_image(x):
  # Reshape to (height, width, channels)
  x = x.reshape((img_nrows, img_ncols, 3))
  # Add back the ImageNet mean pixel values removed by vgg19.preprocess_input (BGR order)
  x[:, :, 0] += 103.939
  x[:, :, 1] += 116.779
  x[:, :, 2] += 123.68
  # Convert BGR to RGB
  x = x[:, :, ::-1]
  # Make sure the values are between 0 and 255
  x = np.clip(x, 0, 255).astype("uint8")
  return x
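
As a quick round-trip check (a hypothetical example; img_nrows and img_ncols are placeholder values here, since they are properly computed from the content image in the training section below), deprocessing a preprocessed image should give back something very close to the resized original:

# Hypothetical round-trip check: preprocess then deprocess should roughly recover the resized image
img_nrows, img_ncols = 400, 533  # placeholder values; computed from the content image later
recovered = deprocess_image(preprocess_image(content_path).numpy())
plt.imshow(recovered)
plt.show()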

Training our Neural Style Transfer network

Now that we have all the functions ready, creating the training loop is quite simple. But first, we will create a small helper function that converts the current combination image back into a displayable image and shows it.

# Function to generate an image
def show_result(iteration):
  img = deprocess_image(combination_image.numpy())
  img = np.array(img, dtype=np.uint8)
  if np.ndim(img)>3:
    assert img.shape[0] == 1
    img = img[0]
  img = PIL.Image.fromarray(img)
  plt.imshow(np.array(img))
  plt.show()

Now that we have everything prepared, let's code the main training loop of our neural style transfer network. We will use the Stochastic Gradient Descent (SGD) optimizer, an iterative method for optimizing an objective function with suitable smoothness properties, together with an exponentially decaying learning rate. Other optimizers can be used as well, but with the Adam optimizer I couldn't get results as good as with SGD, so I stuck with SGD.

from keras.optimizers import SGD
width, height = tf.keras.utils.load_img(content_path).size
img_nrows = 400
img_ncols = int(width * img_nrows / height)

optimizer = SGD(
    keras.optimizers.schedules.ExponentialDecay(
        initial_learning_rate=100.0, decay_steps=100, decay_rate=0.96
    )
)

content_image = preprocess_image(content_path)
style_image = preprocess_image(style_path)
combination_image = tf.Variable(preprocess_image(content_path))

iterations = 4000

for i in range(1, iterations + 1):
  loss, grads = compute_loss_and_grads(
      combination_image, content_image, style_image
  )
  grads_and_vars = [(grads, combination_image)]
  optimizer.apply_gradients(grads_and_vars)
  if i % 500 == 0:
    print("Iteration %d: loss=%.2f" % (i, loss))
  if i == iterations:
    show_result(i)
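
If you want to try the Adam optimizer yourself for comparison, the only change is the optimizer definition. The learning rate below is just a rough starting point, not a tuned value, and as mentioned above I got better results with SGD:

# Optional: swap in the Adam optimizer instead of SGD
# (the learning rate here is only a rough starting point and needs tuning)
from keras.optimizers import Adam
optimizer = Adam(learning_rate=1.0)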

Conclusion

Our style transfer implementation ends here. You can play with the hyper-parameters, and you can also try other networks, such as AlexNet, to experiment with different effects. This was my code implementation project as part of understanding CNNs. For the full Colab notebook, you can look here. I hope this article helped you.

Do you have any suggestions for improving my code? Let me know 😊.

See you later!

References

  • Gatys, L. A., Ecker, A. S., & Bethge, M. (2015). A Neural Algorithm of Artistic Style. arXiv:1508.06576.

Written by

Diwakar Basnet

I am a bachelor's student pursuing a degree in Computer Science. Passionate about artificial intelligence and video games. Interested in topics such as machine learning, deep learning, and game development. I watch anime and play video games as a hobby.