We know that the images and videos we see around us are often used to interact with AI for multiple forms of analysis and research. In fact, AI has advanced so much that we can even generate such images and videos through simple prompts.

But have you ever wondered about the very first step in the process? We all know that all AI can work upon is numbers. Then how are these images and videos uses as input for the model? The process is called Image feature extraction and involves multiple mathematical shenanigans which will amaze you (you might’ve seen them beforein your linear algebra class).

Necessary Modules

Before we start officially, let’s import the modules which will be needed.

os – to load the dataset.
numpy and scipy – for matrix handling and calculations
opencv – to read the image (Don’t worry! We won’t be using it more than once)

Loading the Dataset

The AT&T dataset is generally named as ‘archive’ and the directories inside it are named as ‘s1’, ‘s2’ and so on. There are 40 such directories, along with a readme file. In every directory, we have 10 images of the face of a person from different angles named ‘1.pgm’, ‘2.pgm’ and so on.

So we will need to traverse all directories while also ignoring the readme file and read all ten images of each directory. For that:

import os

path = os.path.join(os.getcwd(), 'archive')
sub_list = [f for f in os.listdir(path) if f[1:].isdigit()]

where the if f[1:].isdigit() checks if the second letter in the name of the directory is a digit which helps us exempt the readme file and covers all the required directories.

def natural_key(item):
    prefix = item[0]
    num = int(item[1:])
    return (prefix,num)

sub_list.sort(key=natural_key)

This part of the code is optional but it makes it a little bit tidy if the filenames are sorted. Now we have the list of all the directories inside the ‘archive’ folder. You can verify if all the data is loaded by checking the length of this array which must be 40. Let’s declare separate arrays for training and testing data and their labels.


train_data = []
test_data = []
train_labels = []
test_labels = []
split_ratio = 0.7

target_size = (92, 112)

Now, we need to traverse every directory, and read the images inside it. But while doing this, it will be efficient to split the data into train and test here only. This is because doing it after all the dataset is traversed will not make a stratified sample and will nullify all effort.

For example, let’s say we were to split the dataset into a 75:25 ratio, if we do that after all the dataset is compiled, it will allot first 30 classes to the train dataset and the next ten to the test dataset.

To obtain a stratified sampling, we have to split the the files for every directory, so it makes sense to do that in this loop only.

import cv2 as cv 

for subject in sub_list:
    pathname = os.path.join(path, subject)
    files = [f for f in os.listdir(pathname) if os.path.isfile(os.path.join(pathname, f))]


    split_index = int(len(files) * split_ratio)

    for i, file in enumerate(files):
        full_path = os.path.join(pathname, file)
        img = cv.imread(full_path, cv.IMREAD_GRAYSCALE)
        img_resized = cv.resize(img, target_size, interpolation=cv.INTER_AREA)
        img_flattened = img_resized.flatten()

        if i < split_index:
            train_data.append(img_flattened)
            train_labels.append(subject)
        else:
            test_data.append(img_flattened)
            test_labels.append(subject)

The code might look superfluous with the isfile() verification and image resizing because the dataset is consistent with size and file types but it is still a better approach to do so. That is because in general scenarios, we have to detect the object (face here) from the provided image and then select that portion for recognition so resizing the image for consistency makes sense. However, we don’t have to do that here because the dataset is well prepared and consistent in image sizes.

import numpy as np 

train_data = np.array(train_data, dtype=np.float32)
test_data = np.array(test_data, dtype=np.float32)
train_labels = np.array(train_labels, dtype='<U')
test_labels = np.array(test_labels, dtype='<U')

This part converts the arrays into numpy arrays for faster computation.

Now that the dataset is properly loaded, let’s begin the interesting part. To convert these images into numbers we use a technique called PCA or Principal component analysis. In this, we :

Convert each image to a 1-D vector.
Calculate mean face by averaging all training images.
Subtract mean face from each image to get normalised faces.
Compute covariance matrix.
Calculate eigenvectors and eigenvalues.
Keep top K eigenvectors (typically 50-100) to form eigenspace.
Project all training images onto eigenspace to get feature vectors

The concept behind this is that eigenvectors are vectors that maintain their direction when transformed by a matrix, only being scaled by eigenvalues. In PCA, we use the covariance matrix of our data. Covariance measures how features vary together.

Eigenvectors of covariance matrix point in directions of maximum variance. Larger the variance, more the information it holds.

But if you actually try to implement this, you will find it very time consuming. In general cases, image datasets are much larger than this one. So to work-around this problem, we use SVD.

Singular Value Decomposition

SVD stands for Singular Value Decomposition. It is a technique in which we can represent (or decompose) a given matrix as a multiplication of three matrices :

A = U*S*Vt

Here, U is the m x m matrix of orthonormal eigenvectors of AA⊤. Vt is the orthonormal eigenvectors of A⊤A and S is the diagonal matrix with r elements equal to the root of the positive eigenvalues of A⊤A or AA⊤ (because both matrices have the same positive eigenvalues).

If this looks very sudden and out of the blue then don’t worry. Let’s take a look at its proof.

We know that for any matrix A (which is m x n) A⊤A will be symmetric. Also, A⊤A is a Positive Semi-Definite matrix, i.e, all its eigenvalues will be non-negative. We can easily prove this using the quadratic form of the matrix.

Since A⊤A is symmetric, it has:

An orthonormal basis of eigenvectors {vi}
Real eigenvalues λi ≥ 0 (since it's positive semidefinite)

For any eigenvector vi of the matrix A⊤A:

$$\begin{align} A^TAv_i &= \lambda_i v_i \\ &= (\sigma_i)^2v_i \end{align}$$

Here, we define σi as the square root of the i-th eigenvalue, which means:

$$\begin{align} ||Av_i||^2 &= v_i^TA^TAv_i \\ &= v_i^T(\sigma_i)^2v_i \\ &= \sigma_i^2 \hspace{4em} [\space since \space v_i \space is \space orthonormal \space] \end{align}$$

Now the crucial deductive step: Avi represents how A transforms vi. We know:

Its length is √λi (let's call this σi)

These transformed vectors {Avi} must form an orthogonal set because:

$$\begin{align} (Av_i)^T(Av_j) &= v_i^T(A^TA)v_j \\ &= \lambda_i v_i^T v_j \\ &= 0 \hspace{3em} for \space i \space \neq \space j \end{align}$$

Therefore, {Avi/σi} forms an orthonormal set of vectors. These are our ui vectors:

$$u_i = \frac{Av_i}{\sigma_i}$$

From this construction: Avi = σiui Which, when written in matrix form, gives us: AV = UΣ Therefore, A = UΣV⊤. Hence Proved.

Still not satisfied? You can watch this video to have a more intuitive understanding of the workings of SVD through its geometric interpretation.

Preprocessing the Data


mean_face = np.mean(train_data, axis=0)
mean_face_img = mean_face.reshape(target_size)
mean_face_img = cv.normalize(mean_face_img, None, 0, 255, cv.NORM_MINMAX)
mean_face_img = mean_face_img.astype(np.uint8)

normalized_images = train_data - mean_face

Now you can do all the calculations which we discussed before, but to save time and improve calculation speed and accuracy, we will use scipy.

import scipy as sp

U, S, Vt = sp.linalg.svd(normalized_images, full_matrices=False)

eigenvalues = (S**2) / (normalized_images.shape[0] - 1)
eigenvectors = Vt.T

This gives us the eigenvalues and eigenvectors. Now we can fetch out the top 100 eigenvectors sorted in descending based on the magnitude of their corresponding eigenvalues using np.argsort().

These top 100 eigenvectors will form the eigenspace upon which we will project the normalised images to form the training features.

sorted_indices = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[sorted_indices]
eigenvectors = eigenvectors[:, sorted_indices]

eigenspace = eigenvectors[:, :100]

train_features = np.dot(normalized_images, eigenspace)

Let’s do the same preprocessing for test data as well.

normalized_test_images = test_data - mean_face
test_features = np.dot(normalized_test_images, eigenspace)

Preparing the model

Finally, it’s time to put the data in a model and find for ourselves how we have fared. For the sake of simplicity, we will use the KNN algorithm.

K-Nearest Neighbours (KNN) is a simple and effective supervised learning algorithm used for classification and regression tasks. It operates by identifying the k-nearest data points (neighbours) to a given query point in the feature space and predicts the output based on the properties of these neighbours. For classification, it uses majority voting among the neighbours' labels, and for regression, it averages their values.

That’s also why the k-value is chosen to be always odd. One specific advantage of using KNN in our case is that it take no training time since all the decisions are taken while predicting the classes. This might not suit larger datasets but in our case it reduces the complexity of our code since we don’t have to setup a separate pipeline.

from collections import Counter

def knn_predict(train_features, train_labels, test_feature, k=3):
    distances = [np.linalg.norm(train_feature - test_feature) for train_feature in train_features]
    k_indices = np.argsort(distances)[:k]
    k_nearest_labels = train_labels[k_indices]
    most_common_label = Counter(k_nearest_labels).most_common(1)[0][0]
    return most_common_label

It’s all done. All that’s left is to use the model to get our predictions and test its accuracy.

predictions = []
for test_feature in test_features:
    predicted_label = knn_predict(train_features, train_labels, test_feature, k=3)
    predictions.append(predicted_label)
predictions = np.array(predictions)

correct = sum(pred == true for pred, true in zip(predictions, test_labels))
accuracy = correct / len(test_labels)
print(f"Manual Accuracy: {accuracy * 100:.2f}%")

In our case, we get an accuracy of around 96% for k=3. If this seems too high to be true, you can use scikit-learn’s k-cross validation for verifying your results. For our example, the results dropped to 92% after cross validation for 5 iterations, but hey, that’s still great for a scratch implementation.

Singular Value Decomposition for Feature Extraction