Algorithms, Algorithms, Algorithms

Olivier Hinds

About My Blog

I have decided to start a blog, in order to improve both my writing skills and my ability to research topics I enjoy, and to share my learning experiences with the world. I have chosen to post quarterly; however, I may include some special releases when I have extra time. If you have any feedback or advice, please reach out via email. Enjoy!

The JPEG Algorithm

JPEG (Joint Photographic Experts Group) is one of the oldest image compression standards still in everyday use. It is a method of lossy compression that takes an image and reduces its file size by removing the parts of the image that we “do not see” as well. This process throws away a lot of information that we would barely perceive anyway. The reason I wanted to take a look at the JPEG algorithm is the sheer number of interesting inner workings it has.

The JPEG Logo

How JPEG Works

As hinted at above, the JPEG algorithm is quite complex and has many parts. Here is a breakdown of the parts and what they do; I will go into more detail further on.

  1. Colour Space Transformation

  2. Downsampling

  3. Block Splitting

  4. Discrete Cosine Transform

  5. Quantisation

  6. Entropy coding

So, what do these mean exactly?

Colour Space Transformation

As you probably know, almost all images online are stored in the RGB colour space, meaning each pixel has a Red, Green, and Blue component. These values range from 0 to 255, meaning 8 bits are used to store each component. The JPEG algorithm takes advantage of how differently humans perceive luminance and chrominance, and therefore works in the Y'CbCr colour space.

Note: There are plenty of colour spaces out there; Y'CbCr just happens to be the digital counterpart of YUV (an analogue space often used in video formats). There are other, arguably better colour spaces, like YCoCg, which is computationally cheaper and lossless in conversion.

You can find more information about YCoCg on Wikipedia.

After converting our original image into Y'CbCr, we have three components: one luma (Y', luminance) value and two chroma (chrominance) values, Cb and Cr.

Y'CbCr allows for compression without much of an effect on the quality we can perceive.
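
As a quick sketch of this first step (assuming OpenCV is available; input.png is a placeholder filename, and note that OpenCV orders the chroma channels as Cr, Cb):

import cv2

# OpenCV loads images in BGR channel order; 'input.png' is a placeholder
bgr = cv2.imread('input.png')

# Convert BGR -> Y'CrCb (OpenCV puts the Cr channel before Cb)
ycrcb = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb)

# Split into the luma plane and the two chroma planes
y, cr, cb = cv2.split(ycrcb)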

Downsampling

In the human eye, we have two different types of light receptors: rods and cones. These are distributed as follows:

Name  | Quantity    | Function
------|-------------|------------------------------------------------------
Rods  | 120 million | Perceive light well, but do not perceive colour.
Cones | 5 million   | Capable of perceiving colour, at higher light levels.

Ahh… we have far more rod cells than cone cells. This means that humans perceive and respond to light (brightness) much more than they do to colour.

The JPEG algorithm exploits this by keeping the Y' component of the image as true to the original as possible, and performing a “chroma subsample”, or downsample, on the other two components, Cb and Cr.

This essentially reduces the number of “samples” in the image, by replacing each group of neighbouring chroma pixels with a single value interpolated from the surrounding pixels.

Chroma Subsampling in action, with compression ratios on the right.
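
A minimal sketch of 4:2:0 subsampling (the most common scheme, which halves the chroma resolution in both directions) could look like this, reusing the cb and cr planes from the earlier snippet and assuming even dimensions:

import numpy as np

# 4:2:0 subsampling: average each 2x2 block of chroma samples into one value
def subsample_420(plane):
    plane = plane.astype(np.float32)
    return (plane[0::2, 0::2] + plane[1::2, 0::2] +
            plane[0::2, 1::2] + plane[1::2, 1::2]) / 4

cb_small = subsample_420(cb)  # half the width, half the height
cr_small = subsample_420(cr)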

Block Splitting

Next, we must split the image into 8×8 blocks, the smallest unit of pixels the codec processes. This is called the Minimum Coded Unit (or MCU).

Potential errors:

  • Our image's dimensions might not be divisible by 8 (say a 20×20 image), in which case we have to pad it out to 24×24 by filling the extra pixels with black or some other form of dummy data.

We perform block splitting to make our next step easier.
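
Here is one way the padding and splitting might look in NumPy (a sketch; real encoders usually replicate the edge pixels rather than filling with a constant):

import numpy as np

def split_into_blocks(channel, block=8, fill=0):
    h, w = channel.shape
    # Pad the bottom/right edges so both dimensions are multiples of 8
    pad_h = (block - h % block) % block
    pad_w = (block - w % block) % block
    padded = np.pad(channel, ((0, pad_h), (0, pad_w)),
                    mode='constant', constant_values=fill)
    # Carve the padded channel into a flat list of 8x8 blocks
    return [padded[r:r + block, c:c + block]
            for r in range(0, padded.shape[0], block)
            for c in range(0, padded.shape[1], block)]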

Discrete Cosine Transform

Here is where the maths comes in… We need to convert each block into what's known as the frequency domain.

Essentially, this represents the block in terms of the different spatial frequencies that make it up.

The specific type of frequency-domain conversion used on a JPEG image is a “normalised, two-dimensional type-II DCT” (DCT-II). It concentrates most of each block's energy into a handful of low-frequency coefficients, which lets us compress the image effectively whilst preserving the perceived quality as much as possible.

The DCT generates an array of coefficients that condense the image or “concentrate” the colour signals into a more compact form.

Here is an example of an 8×8 MCU from an image:

$$\begin{bmatrix} 57 & 242 & 81 & 104 & 141 & 68 & 195 & 160 \\ 193 & 160 & 17 & 73 & 25 & 28 & 22 & 212 \\ 131 & 9 & 222 & 218 & 38 & 165 & 136 & 245 \\ 89 & 105 & 136 & 146 & 97 & 208 & 20 & 185 \\ 190 & 185 & 82 & 85 & 135 & 132 & 5 & 203 \\ 45 & 56 & 199 & 121 & 37 & 121 & 222 & 170 \\ 103 & 25 & 56 & 223 & 48 & 185 & 200 & 54 \\ 230 & 233 & 228 & 34 & 59 & 179 & 216 & 182 \\ \end{bmatrix}$$

The general formula for the normalised two-dimensional DCT-II transform is given below:

$$C(u, v) = \frac{2}{N} \, \alpha(u) \, \alpha(v) \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} f(x, y) \cos \left[\frac{(2x+1)u\pi}{2N} \right] \cos \left[\frac{(2y+1)v\pi}{2N} \right]$$

Here, C(u, v) are the DCT coefficients, N is the side length of the input block, f(x, y) is the input pixel value at position (x, y), and α(k) is a normalisation factor equal to 1/√2 when k = 0 and 1 otherwise.

Here is the formula specifically for our needs, with N = 8 substituted in:

$$\ G_{u,v} = \frac{1}{4} \alpha(u) \alpha(v) \sum_{x=0}^7 \sum_{y=0}^7 g_{x,y} \cos \left[\frac{(2x+1)u\pi}{16} \right] \cos \left[\frac{(2y+1)v\pi}{16} \right]$$

(Sourced from Wikipedia)
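
To check my understanding, here is a naive, direct translation of the formula into Python (an O(N⁴) sketch, far slower than any real implementation, but it agrees with the library result below to within floating-point error):

import numpy as np

# Naive DCT-II over an 8x8 block, straight from the formula above
def dct2_naive(g):
    N = 8
    alpha = lambda k: 1 / np.sqrt(2) if k == 0 else 1.0
    G = np.zeros((N, N))
    for u in range(N):
        for v in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (g[x, y]
                              * np.cos((2 * x + 1) * u * np.pi / (2 * N))
                              * np.cos((2 * y + 1) * v * np.pi / (2 * N)))
            G[u, v] = (2 / N) * alpha(u) * alpha(v) * total
    return G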

I then performed a DCT-II transform on the array, this time using OpenCV's built-in dct routine:

import numpy as np
from cv2 import dct

# Defining the array
arr = np.array([[57, 242, 81, 104, 141, 68, 195, 160],
                [193, 160, 17, 73, 25, 28, 22, 212],
                [131, 9, 222, 218, 38, 165, 136, 245],
                [89, 105, 136, 146, 97, 208, 20, 185],
                [190, 185, 82, 85, 135, 132, 5, 203],
                [45, 56, 199, 121, 37, 121, 222, 170],
                [103, 25, 56, 223, 48, 185, 200, 54],
                [230, 233, 228, 34, 59, 179, 216, 182]])

# Performing the DCT-II operation
transformed = dct(arr.astype(np.float32))
# Performs DCT-II when a 2D Array is given as an input
print(transformed) # Output the Array

Which output the following:

[[ 1.02137500e+03 -5.65788231e+01  1.36130768e+02 -8.51628418e+01
  -1.31250000e+01  2.36200790e+01  7.09291763e+01 -9.58326950e+01]
 [-6.04876060e+01  1.25421925e+01  2.49134140e+01 -6.41897964e+01
   1.11876762e+02 -1.08628700e+02 -4.90311699e+01 -3.04388466e+01]
 [ 3.17391701e+01  2.47170944e+01  8.54532471e+01  3.01764469e+01
  -9.49937439e+01  3.09210529e+01 -1.49793243e+02  5.07656403e+01]
 [-7.07679596e+01 -3.27083321e+01 -3.58156166e+01  8.65335751e+00
  -1.79999619e+01 -3.43061638e+01 -1.44502090e+02  2.80039177e+01]
 [ 8.16250000e+01  8.02951202e+01  3.73274345e+01  4.74395065e+01
  -4.13750000e+01 -1.92576111e+02 -3.23739052e+01  2.41051579e+01]
 [-2.06443876e-01 -1.07081596e+02 -1.84854126e+02  8.52923870e+00
  -4.46033020e+01  1.14809464e+02 -1.85137615e+01 -4.04136505e+01]
 [ 1.10922417e+02 -7.52456436e+01 -1.05432310e+01 -1.13657776e+02
  -1.08804749e+02  6.40518036e+01  3.75467567e+01  9.63051147e+01]
 [ 3.90549622e+01  8.39657307e+00 -2.65451221e+01  3.23340263e+01
   5.26742401e+01  3.13878918e+01 -7.73442078e+00 -1.60050240e+01]]

Note: In the full JPEG pipeline, each pixel value is first level-shifted from the range 0 to 255 down to -128 to 127 (by subtracting 128) before the DCT is applied. I have skipped that step here, which is why the DC coefficient in the top-left corner is so large.

This process is also reversible. With a slight modification to the program, we yield an almost perfect result:

import numpy as np
from cv2 import dct, DCT_INVERSE

np.set_printoptions(linewidth=500)

# dct-res.csv holds the results of the previous DCT operation
coeffs = np.genfromtxt('dct-res.csv', delimiter=',')

# Perform an inverse DCT-II operation (without shadowing the dct function)
recovered = dct(coeffs.astype(np.float32), flags=DCT_INVERSE)
print(recovered)
print("-" * 20)
print(recovered.astype(int))

Output:

[[ 57.000008 242.        80.99999  104.       141.        67.99999  195.       160.      ]
 [193.       159.99998   16.999998  73.00001   25.000006  27.999985  22.000008 211.99998 ]
 [131.         9.00001  222.       218.        38.00001  165.00002  136.       245.      ]
 [ 89.00001  105.00001  136.       146.        96.999985 208.        19.999987 185.      ]
 [190.       185.        82.        85.00003  134.99998  132.         4.999998 203.      ]
 [ 44.999996  56.000004 199.       121.        37.00001  121.000015 222.       170.      ]
 [103.00001   25.000002  56.       223.        48.       185.       199.99998   54.00001 ]
 [230.       233.       228.        34.        59.       179.       216.       182.      ]]
--------------------
[[ 57 242  80 104 141  67 195 160]
 [193 159  16  73  25  27  22 211]
 [131   9 222 218  38 165 136 245]
 [ 89 105 136 146  96 208  19 185]
 [190 185  82  85 134 132   4 203]
 [ 44  56 199 121  37 121 222 170]
 [103  25  56 223  48 185 199  54]
 [230 233 228  34  59 179 216 182]]

As you can see, there are a couple of off-by-one values in the integer output, but not in the floating-point one. The residual errors come from floating-point arithmetic, and astype(int) truncates towards zero rather than rounding, so 80.99999 becomes 80; using np.round() first would recover the original block exactly.

Quantisation

Here is where the real magic happens: the actual compression. In this step, we divide each coefficient in the frequency domain by a constant defined in our quantisation table.

We then round these values to the nearest integer.

This quantisation table can be adjusted depending on how much of the original information is desired.

Even if the quantisation table is entirely 1s (100% quality), there will still be some data reduction in the other stages of the algorithm. Here is an example of a quantisation table taken from the JPEG standard (50% quality).

$$Q=\begin{bmatrix} 16 & 11 & 10 & 16 & 24 & 40 & 51 & 61 \\ 12 & 12 & 14 & 19 & 26 & 58 & 60 & 55 \\ 14 & 13 & 16 & 24 & 40 & 57 & 69 & 56 \\ 14 & 17 & 22 & 29 & 51 & 87 & 80 & 62 \\ 18 & 22 & 37 & 56 & 68 & 109 & 103 & 77 \\ 24 & 35 & 55 & 64 & 81 & 104 & 113 & 92 \\ 49 & 64 & 78 & 87 & 103 & 121 & 120 & 101 \\ 72 & 92 & 95 & 98 & 112 & 100 & 103 & 99 \end{bmatrix}$$
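
For the code below, this table can be transcribed directly into a NumPy array:

import numpy as np

# The 50%-quality luminance quantisation table from the JPEG standard
Q = np.array([[16, 11, 10, 16, 24, 40, 51, 61],
              [12, 12, 14, 19, 26, 58, 60, 55],
              [14, 13, 16, 24, 40, 57, 69, 56],
              [14, 17, 22, 29, 51, 87, 80, 62],
              [18, 22, 37, 56, 68, 109, 103, 77],
              [24, 35, 55, 64, 81, 104, 113, 92],
              [49, 64, 78, 87, 103, 121, 120, 101],
              [72, 92, 95, 98, 112, 100, 103, 99]])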

We simply round each value, after dividing it by its quantisation coefficient:

quantized_arr = np.round(np.divide(transformed, Q))

Which yields these results:

[[64. -5. 14. -5. -1.  1.  1. -2.]
 [-5.  1.  2. -3.  4. -2. -1. -1.]
 [ 2.  2.  5.  1. -2.  1. -2.  1.]
 [-5. -2. -2.  0. -0. -0. -2.  0.]
 [ 5.  4.  1.  1. -1. -2. -0.  0.]
 [-0. -3. -3.  0. -1.  1. -0. -0.]
 [ 2. -1. -0. -1. -1.  1.  0.  1.]
 [ 1.  0. -0.  0.  0.  0. -0. -0.]]

The values closer to the upper-left corner of the matrix are more significant, and values approach 0 towards the bottom-right.

This is due to the nature of the DCT: coarser (low-frequency) information ends up in the upper left, while the fine detail that humans can barely perceive ends up towards the bottom right, where quantisation crushes it to zero. Here are the DCT basis functions below:

The 8×8 DCT-II basis functions (image taken from Wikipedia)

Entropy Coding

This step converts our matrix into a one-dimensional sequence of integers, by passing over the array across its diagonals from top-left to bottom-right (in a “zig-zag” pattern).

This is because the non-zero values cluster at the top-left of the matrix, so the zig-zag ordering groups the long runs of zeros together at the end of the sequence, making the run-length encoding that follows far more effective.
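
One compact way to build the zig-zag ordering is to sort the indices by anti-diagonal, alternating direction on each one (a sketch; real encoders just use a precomputed lookup table):

import numpy as np

def zigzag(block):
    # Visit the block along its anti-diagonals, alternating direction, so
    # low-frequency coefficients come first and the zeros bunch up at the end
    n = block.shape[0]
    order = sorted(((x, y) for x in range(n) for y in range(n)),
                   key=lambda p: (p[0] + p[1],
                                  p[0] if (p[0] + p[1]) % 2 else -p[0]))
    return [block[x, y] for x, y in order]

print(zigzag(quantized_arr)[:8])  # [64.0, -5.0, -5.0, 2.0, 1.0, 14.0, -5.0, 2.0]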

Run Length Encoding (RLE)

Essentially, we take our 1D array of data, and instead of writing values like 0, 0, 0, we write them like 0[x3] to save space. This plays a significant part in the result and is often used on all sorts of data. In fact, I will be learning about RLE in my EPQ project to come. (It is related to Cellular Automata and is a file format used to store patterns. Learn more here)
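
As a toy illustration of the idea (not the exact JPEG symbol format, which pairs each zero-run length with the size category of the next non-zero coefficient), a simple run-length encoder might look like this:

def rle(values):
    # Collapse runs of repeated values into [value, count] pairs
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

print(rle([5, 0, 0, 0, 0, 2]))  # [[5, 1], [0, 4], [2, 1]]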

Huffman Encoding

This is a separate encoding algorithm which assigns shorter bit codes to the more frequent values in our data, and longer codes to the less frequent ones. This results in an even higher degree of compression. (This is quite a complex topic, and if you want to learn more about this and other algorithms, you can do so with this book by Thomas H. Cormen et al. (pg. 431))
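
A minimal sketch of the idea using Python's heapq (real JPEG files use predefined or optimised canonical Huffman tables, not this ad-hoc construction):

import heapq
from collections import Counter

def huffman_codes(data):
    # One heap entry per symbol: [frequency, tiebreaker, [symbol, code], ...]
    heap = [[freq, i, [sym, '']]
            for i, (sym, freq) in enumerate(Counter(data).items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)  # least frequent subtree
        hi = heapq.heappop(heap)  # next least frequent subtree
        # Prefix '0' onto every code in the low branch, '1' onto the high one
        for pair in lo[2:]:
            pair[1] = '0' + pair[1]
        for pair in hi[2:]:
            pair[1] = '1' + pair[1]
        heapq.heappush(heap, [lo[0] + hi[0], count] + lo[2:] + hi[2:])
        count += 1
    return dict(heap[0][2:])

print(huffman_codes([0, 0, 0, 0, 5, 5, -2, 7]))
# The frequent value (0) gets a short code; rare values get longer ones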

Decoding Our JPEG

In order to decode our JPEG, we can simply follow all of these steps in reverse. We start by passing the data through a Huffman decoder and then a run-length decoder. Once we have done this, we can convert our 1D array of values back into a 2D block by “unravelling” the zig-zag pattern. We know the dimensions of the image because they are stored in its metadata (along with other important information, such as which quality/quantisation table was used).

Our matrix is then multiplied element-wise by the quantisation table, using a Hadamard product, denoted as:

$$\text{Image} \odot Q$$

This is (almost) the exact reverse of the quantisation step from earlier,

quantized_arr = np.round(np.divide(transformed, Q))

which performed Hadamard division with rounding:

$$\text{Quantised result} = \text{round}(\text{DCT} \oslash Q)$$

The rounding is the one step that cannot be undone, which is why JPEG is lossy: dequantising only recovers an approximation of the original coefficients.
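
Putting the dequantisation and the inverse DCT together (a sketch, reusing Q, quantized_arr, and the cv2 routines from earlier):

import numpy as np
from cv2 import dct, DCT_INVERSE

# The Hadamard product undoes the division, up to the rounding loss
dequantised = quantized_arr * Q

# The inverse DCT takes us from the frequency domain back to pixel values
reconstructed = dct(dequantised.astype(np.float32), flags=DCT_INVERSE)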

Once we have done this for every block of every channel and reassembled the planes, all we need to do is convert our image back into the RGB colour space, with a simple Y'CbCr to RGB conversion. This can be done using OpenCV and the cv2.COLOR_YCrCb2RGB conversion code (here ycrcb_image stands for the reassembled three-channel image):

rgb = cv2.cvtColor(ycrcb_image.astype(np.uint8), cv2.COLOR_YCrCb2RGB)

And we should have our image.

Optimisations

There are plenty of ways to optimise this. One of the more notable optimisations was performed by the developers of libjpeg-turbo, who accelerated the encoding and decoding of JPEGs using Single Instruction, Multiple Data (SIMD) instructions, as well as optimised Huffman coding routines.

With this, it can perform 2-6x as fast as libjpeg. You can find the GitHub repository here.

That’s All

Thank you very much for reading.

I recommend taking a look at all the above links, as well as this PDF with an explanation of all the technical terms.

Expect to see blog posts quarterly, with more exciting topics on the way. As 2023 goes on, I may expand these articles, so do check back if you're interested.

I’m open to any feedback you may have. Please do not hesitate to get in contact via email if you have any queries.


Written by

Olivier Hinds

I'm a passionate Software Developer with a keen interest in becoming a Mechanical Engineer. My journey in software development has been a crucial stepping stone towards this goal, enhancing my skills and preparing me for the exciting challenges that lie ahead.