Neural Style Transfer


What is Neural Style Transfer?

Feed a content image into a neural network, and it outputs that same content rendered in a new style.

What are deep Conv Nets learning?

A ConvNet is like a series of filters that look at images to identify patterns. The first layer of the network focuses on very basic features, such as edges or colors. Imagine you are looking at a painting and trying to find straight lines or specific colors: this is what the first layer does, looking for simple shapes and colors in small parts of the image.

As you move deeper into the network, the layers start to recognize more complex patterns. For example, the second layer might look for combinations of edges to form shapes, like circles or squares. By the time you reach the deeper layers, the network is identifying entire objects, like dogs or cars, by recognizing intricate details and textures. It’s like going from noticing just the brush strokes in a painting to understanding the whole scene depicted in it!
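To make this concrete, here is a minimal sketch (assuming PyTorch and torchvision are available; the layer indices are illustrative) of pulling out activations at different depths of a pretrained VGG network, which is the kind of data such layer visualizations are built from:

```python
import torch
from torchvision import models

# Pretrained VGG-19 feature extractor (downloads ImageNet weights).
vgg = models.vgg19(weights="IMAGENET1K_V1").features.eval()

image = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed input image

# Tap a few depths: shallow layers tend to respond to edges and colors,
# deeper ones to textures and object parts. The indices are illustrative.
taps = {1: "shallow", 10: "middle", 28: "deep"}
activations = {}

x = image
with torch.no_grad():
    for idx, layer in enumerate(vgg):
        x = layer(x)
        if idx in taps:
            activations[taps[idx]] = x

for name, act in activations.items():
    print(name, tuple(act.shape))  # (batch, channels, height, width)
```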

Visualizing deep layers: layers 1, 3, and 5 (figures omitted).

Cost Function

Neural Style Transfer is a deep learning technique that allows us to generate a new image by blending the content of one image with the style of another. For example, imagine you have a photo of your friend and a famous painting—Neural Style Transfer can create a new image that looks like your friend's photo painted in the same artistic style as that painting.

The process relies on a cost function, which acts like a score to evaluate how well the generated image captures both the content and the style. This cost function has two components: one measures how closely the new image matches the original photo (content), and the other checks how well it reflects the artistic features like brush strokes, textures, and colors from the painting (style).

\(J(G) = \alpha J_{content}(C, G) + \beta J_{style}(S, G)\)

Where:

  • J(G): The total cost function for the generated image G.

  • J_content(C, G): The content cost function that measures how similar the generated image G is to the content image C.

  • J_style(S, G): The style cost function that measures how similar the generated image G is to the style image S.

  • α: A hyperparameter that controls the weight of the content cost.

  • β: A hyperparameter that controls the weight of the style cost.

By adjusting the balance between these two components using hyperparameters, we can control how much the final image leans toward preserving the structure of the photo versus mimicking the painting’s look. Through iterative optimization, the image is gradually refined—much like tweaking a recipe until the final dish has just the right taste and presentation.

The essential purpose of Neural Style Transfer is to:

  1. Minimize the difference between the content image (C) and the generated image (G): This ensures that the generated image retains the important content features of the original image.

  2. Minimize the difference between the style image (S) and the generated image (G): This ensures that the generated image reflects the artistic style of the style image, as the optimization sketch below illustrates.
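In code, this joint minimization is typically run as gradient descent on the pixels of the generated image itself. A minimal sketch, assuming PyTorch and hypothetical `content_cost` / `style_cost` helpers that run both images through a pre-trained ConvNet and compare activations (per-layer versions are sketched in the next two sections):

```python
import torch

alpha, beta = 10.0, 40.0   # illustrative weights for content vs. style

# C and S are preprocessed content/style image tensors of shape (1, 3, H, W).
G = C.clone().requires_grad_(True)          # start G from the content image
optimizer = torch.optim.Adam([G], lr=0.02)

for step in range(2000):
    optimizer.zero_grad()
    J = alpha * content_cost(C, G) + beta * style_cost(S, G)
    J.backward()     # gradients flow into the pixels of G, not network weights
    optimizer.step()
```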

Content cost function

  • Say you use hidden layer \(l\) to compute the content cost.

  • Use a pre-trained ConvNet (e.g., the VGG network).

  • Feed the content image \(C\) and the generated image \(G\) into the pre-trained ConvNet separately.

  • Let \(a^{[l](C)}\) and \(a^{[l](G)}\) be the activations of layer \(l\) on the two images.

  • If these two activations are similar, the images have similar content, which the formula below makes precise.
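Concretely, the content cost is the (scaled) squared distance between those two activations:

$$J_{\text{content}}(C, G) = \frac{1}{2} \left\| a^{[l](C)} - a^{[l](G)} \right\|^2$$

A minimal sketch in code, operating directly on activations that have already been extracted (e.g., with the VGG snippet above):

```python
import torch

def content_cost(a_C: torch.Tensor, a_G: torch.Tensor) -> torch.Tensor:
    # a_C, a_G: layer-l activations of shape (1, n_c, n_h, n_w) for the
    # content and generated images, taken from the same pre-trained ConvNet.
    return 0.5 * ((a_C - a_G) ** 2).sum()
```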

Style cost function

Style is defined as the correlation between activations across different channels in a specific layer of a convolutional neural network.

  • Channel activations: each channel in a layer captures different features of the input image.

  • Correlation measurement: the style is quantified by calculating the correlation between the activations of different channels. High correlation indicates that certain features tend to occur together, while low correlation suggests they do not.

  • Style matrix: let \(a^{[l]}_{i,j,k}\) = the activation at position \((i,j,k)\) in layer \(l\), where:

    • \(i\): the row index (height) of a position in the activation map; it indicates the vertical position.
    • \(j\): the column index (width) of a position in the activation map; it indicates the horizontal position.
    • \(k\): the channel index. While \(i\) and \(j\) index the spatial dimensions (height and width) of the activation map, \(k\) selects one of the multiple channels.
    • \(k'\): the index of a paired channel that is being compared with channel \(k\) in the style matrix.
    • Activation map: a 3D tensor with dimensions height \(n_h\), width \(n_w\), and number of channels \(n_c\); each channel captures different features of the input image.
    • \(S\): the style image.
    • \(G^{[l]}\): the style matrix itself, which captures the correlations between the activations of different channels in layer \(l\). (Note the overloaded notation: plain \(G\) also denotes the generated image, so \(G^{[l](S)}\) and \(G^{[l](G)}\) below are the style matrices computed on \(S\) and on the generated image \(G\), respectively.)

$$G^{[l](S)}_{k,k'} = \sum_{i=1}^{n_h^{[l]}} \sum_{j=1}^{n_w^{[l]}} a^{[l](S)}_{i,j,k} \cdot a^{[l](S)}_{i,j,k'}$$

$$G^{[l](G)}_{k,k'} = \sum_{i=1}^{n_h^{[l]}} \sum_{j=1}^{n_w^{[l]}} a^{[l](G)}_{i,j,k} \cdot a^{[l](G)}_{i,j,k'}$$

$$J_{\text{style}}^{[l]}(S, G) = \frac{1}{\left(2 n_h^{[l]} n_w^{[l]} n_c^{[l]}\right)^2} \left\| G^{[l](S)} - G^{[l](G)} \right\|_F^2 = \frac{1}{\left(2 n_h^{[l]} n_w^{[l]} n_c^{[l]}\right)^2} \sum_{k} \sum_{k'} \left( G^{[l](S)}_{k,k'} - G^{[l](G)}_{k,k'} \right)^2$$
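A minimal sketch of both computations (assuming PyTorch; the normalization constant matches the formula above):

```python
import torch

def gram_matrix(a: torch.Tensor) -> torch.Tensor:
    # a: layer-l activations of shape (1, n_c, n_h, n_w).
    _, n_c, n_h, n_w = a.shape
    f = a.view(n_c, n_h * n_w)   # unroll: one row per channel
    return f @ f.t()             # (n_c, n_c): sums products over all (i, j)

def layer_style_cost(a_S: torch.Tensor, a_G: torch.Tensor) -> torch.Tensor:
    # a_S, a_G: layer-l activations of the style and generated images.
    _, n_c, n_h, n_w = a_S.shape
    G_S, G_G = gram_matrix(a_S), gram_matrix(a_G)
    norm = (2.0 * n_h * n_w * n_c) ** 2
    return ((G_S - G_G) ** 2).sum() / norm
```

Summing `layer_style_cost` over several layers with weights \({\lambda}^{[l]}\) gives the overall style cost: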

$$J_{\text{style}}(S, G) = \sum_{l}{\lambda}^{[l]}J_{\text{style}}^{[l]}(S, G)$$

$$J(G) = \alpha J_{\text{content}}(C,G)+ {\beta} J_{\text{style}}(S, G)$$

Technically, I've been using the term "correlation" to convey intuition, but this is really the unnormalized cross-covariance: we're not subtracting out the means, and the elements are just multiplied together directly.

1D and 3D Generalizations

Most of the discussion has focused on images, i.e., 2D data, because images are so pervasive.

1D: An EKG (electrocardiogram) is a time series showing the voltage at each instant in time.

A ConvNet can be applied to this 1D data using 1D filters.

3D: A CT scan consists of slices through the human body, giving data with height, width, and depth dimensions.

A ConvNet can likewise be applied to 3D data using 3D filters, as sketched below.
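A minimal sketch of both cases (assuming PyTorch; the shapes and channel counts are illustrative):

```python
import torch
from torch import nn

# 1D: e.g. an EKG with one input channel and 1,000 time steps.
ekg = torch.randn(1, 1, 1000)       # (batch, channels, time)
conv1d = nn.Conv1d(in_channels=1, out_channels=16, kernel_size=5)
print(conv1d(ekg).shape)            # torch.Size([1, 16, 996])

# 3D: e.g. a CT volume with depth, height, and width.
ct = torch.randn(1, 1, 64, 64, 64)  # (batch, channels, depth, height, width)
conv3d = nn.Conv3d(in_channels=1, out_channels=16, kernel_size=3)
print(conv3d(ct).shape)             # torch.Size([1, 16, 62, 62, 62])
```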
