What is face recognition?

Two sorts of problems:

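Face verification: a one-to-one problem. Given an input image and a claimed name/ID, decide whether the image matches that person.
Face recognition: a one-to-many problem. Given an input image, decide which of the many persons in the database it matches, if any.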

Face verification is a one-shot learning problem.

One Shot Learning

One-shot learning means learning from a single example to recognize the person again.

That is, the database holds only one image of a person, yet the system should still recognize that person when they appear in front of the camera.

The traditional CNN with a SoftMax output layer struggles with the introduction of new classes. For example, when a new employee joins the team, the model typically requires retraining and an adjustment to the SoftMax output to accommodate the new identity.

So the problem should be reformulated as learning a “similarity” function that compares two images by measuring the degree of difference between them.

If the difference between two images is less than a chosen threshold, we treat them as the same person; otherwise we treat them as different persons.
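
A minimal sketch of the thresholding rule above, assuming each image has already been turned into a feature vector. The names `degree_of_difference` and `is_same_person`, the sample vectors, and the threshold value `tau` are illustrative assumptions, not part of these notes.

```python
import numpy as np

def degree_of_difference(x1, x2):
    """Squared Euclidean distance between two feature vectors."""
    diff = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    return float(np.dot(diff, diff))

def is_same_person(x1, x2, tau=0.5):
    """Declare a match when the difference is below the threshold tau (assumed value)."""
    return degree_of_difference(x1, x2) < tau

# Made-up feature vectors for illustration only.
stored = np.array([0.1, 0.9, 0.3])       # the single image on file
camera = np.array([0.12, 0.88, 0.31])    # new capture of the same person
stranger = np.array([0.9, 0.1, 0.7])     # someone else

print(is_same_person(stored, camera))    # small difference, below tau
print(is_same_person(stored, stranger))  # large difference, above tau
```

Adding a new person only requires storing one more feature vector; nothing has to be retrained, which is the advantage over the softmax approach.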

Siamese Network

Compare two photos by passing each one through the same Convolutional Neural Network (CNN), with shared weights, to extract their feature representations (embeddings). Then measure the distance between the two embeddings: a small distance means the images are similar.
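
The idea can be sketched with a toy stand-in for the CNN. Here `embed` is a hypothetical one-layer network (linear map plus tanh), used only to show that both inputs go through the same weights `W`; a real system would use a deep CNN on image pixels.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))  # one weight matrix shared by both branches

def embed(x):
    """Toy stand-in for the CNN: linear map + tanh, then L2-normalize."""
    z = np.tanh(W @ np.asarray(x, dtype=float))
    return z / np.linalg.norm(z)

def siamese_distance(x1, x2):
    """Encode both inputs with the same embed() and compare the embeddings."""
    e1, e2 = embed(x1), embed(x2)
    return float(np.sum((e1 - e2) ** 2))

# Random vectors standing in for flattened images.
img_a = rng.normal(size=8)
img_b = img_a + 0.01 * rng.normal(size=8)  # slightly perturbed copy of img_a
img_c = rng.normal(size=8)                 # unrelated input

print(siamese_distance(img_a, img_b))
print(siamese_distance(img_a, img_c))
```

Because the weights are shared, the distance is symmetric and an image always has distance zero to itself.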

Triplet Loss

A triplet is a set of three images used in training a neural network, specifically for tasks like face recognition. The triplet consists of:

Anchor (A): the reference image
Positive (P): an image of the same person as the anchor
Negative (N): an image of a different person

In a triplet, the Anchor (A) and the Positive (P) images are of the same person. This ensures that the neural network learns to encode similar features for the same individual.

In contrast, the Negative (N) image is of a different person. This setup helps the model learn to differentiate between individuals effectively.

The goal is to ensure: distance(A, P) + margin < distance(A, N)

The Triplet Loss function is

$$\mathcal{L}(A, P, N) = \max \left( 0, \, \|f(A) - f(P)\|^2 - \|f(A) - f(N)\|^2 + \alpha \right)$$

Explanation:

$f(\cdot)$: embedding function (usually a neural network)
$A$: anchor sample
$P$: positive sample (same class as anchor)
$N$: negative sample (different class)
$\alpha$: margin; a hyperparameter you choose, typically a small positive value like 0.1, 0.2, or 1.0
$\|\cdot\|^2$: squared Euclidean distance between embeddings

The model is penalized (the loss is non-zero) only when:

$$\|f(A) - f(N)\|^2 < \|f(A) - f(P)\|^2 + \alpha$$

This loss encourages $f(A)$ to be closer to $f(P)$ than to $f(N)$ by at least the margin $\alpha$, measured in squared distance.
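
A small worked example of the triplet loss formula, written with NumPy. The embedding values and the choice `alpha=0.2` are made up for illustration; in practice the embeddings come from the network $f(\cdot)$.

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """max(0, ||f(A)-f(P)||^2 - ||f(A)-f(N)||^2 + alpha) for one triplet."""
    d_ap = np.sum((f_a - f_p) ** 2)  # squared distance anchor-positive
    d_an = np.sum((f_a - f_n) ** 2)  # squared distance anchor-negative
    return float(max(0.0, d_ap - d_an + alpha))

f_a = np.array([1.0, 0.0])
f_p = np.array([0.9, 0.1])        # close to the anchor, d_ap = 0.02
f_n_easy = np.array([-1.0, 0.0])  # far from the anchor, d_an = 4.0
f_n_hard = np.array([0.8, 0.2])   # almost as close as the positive, d_an = 0.08

print(triplet_loss(f_a, f_p, f_n_easy))  # margin satisfied, loss is 0.0
print(triplet_loss(f_a, f_p, f_n_hard))  # margin violated, loss is about 0.14
```

With the easy negative the margin is already satisfied, so the triplet contributes nothing; with the hard negative, 0.02 - 0.08 + 0.2 leaves a positive loss.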

Cost

$$\mathcal{J} = \sum_{i=1}^{m} \mathcal{L}(A^{(i)}, P^{(i)}, N^{(i)})$$

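Training set: for example, 10k pictures of 1k persons, about 10 pictures per person on average. Several pictures per person are needed so that Anchor/Positive pairs can be formed.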
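
The cost $\mathcal{J}$ above sums the per-triplet loss over $m$ triplets. A vectorized sketch, assuming the embeddings of the $m$ anchors, positives, and negatives are stacked row by row; the function name `triplet_cost` and the sample values are illustrative.

```python
import numpy as np

def triplet_cost(F_a, F_p, F_n, alpha=0.2):
    """Sum of per-triplet losses; each argument has shape (m, embedding_dim)."""
    d_ap = np.sum((F_a - F_p) ** 2, axis=1)  # one squared distance per triplet
    d_an = np.sum((F_a - F_n) ** 2, axis=1)
    return float(np.sum(np.maximum(0.0, d_ap - d_an + alpha)))

# Three made-up triplets (m = 3), embedding dimension 2.
F_a = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
F_p = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # identical to anchors, d_ap = 0
F_n = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 1.0]])  # d_an = 1, 0, 4

print(triplet_cost(F_a, F_p, F_n))  # only the second triplet violates the margin
```

Here only the second triplet contributes (its negative coincides with the anchor), so the cost equals the margin, 0.2.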

Choosing triplets A,P,N

The principle is to select hard triplets: triplets that are difficult for the model to get right.

In a hard triplet, distance(A, P) is close to distance(A, N): the Negative (N) looks almost as similar to the Anchor (A) as the Positive (P) does, making it challenging for the model to distinguish between them.

Selecting triplets at random is not recommended, because for most random triplets the goal distance(A, P) + margin < distance(A, N) is satisfied easily and the network learns little from them.
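
One simple way to pick a hard negative, sketched under the assumption that embeddings for a pool of candidate negatives are already available: choose the candidate closest to the anchor. The helper `pick_hard_negative` and the sample values are hypothetical, and this is only one of several possible selection strategies.

```python
import numpy as np

def pick_hard_negative(f_a, f_p, negatives, alpha=0.2):
    """Return the index of the negative closest to the anchor, plus all losses."""
    d_ap = np.sum((f_a - f_p) ** 2)
    d_an = np.sum((negatives - f_a) ** 2, axis=1)  # distance to every candidate
    losses = np.maximum(0.0, d_ap - d_an + alpha)  # loss each candidate would give
    return int(np.argmin(d_an)), losses

f_a = np.array([1.0, 0.0])
f_p = np.array([0.9, 0.1])
negatives = np.array([
    [-1.0, 0.0],  # easy: far from the anchor
    [0.8, 0.2],   # hard: nearly as close as the positive
    [0.0, 1.0],   # easy
])

hard_index, losses = pick_hard_negative(f_a, f_p, negatives)
print(hard_index)  # index of the closest negative
print(losses)      # easy negatives give zero loss, so they carry no training signal
```

Two of the three candidates already satisfy the margin, which illustrates why random selection wastes most triplets.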

Face Verification and Binary Classification

Learning similarity function

In the context of face recognition, you can also treat the problem as a binary classification task. As in a Siamese Network, two identical neural networks that share the same weights and parameters each encode one image, and a classifier then compares their outputs.

Here's a brief overview:

Triplet Loss: This method learns to differentiate between an anchor, a positive, and a negative example, focusing on minimizing the distance between the anchor and positive while maximizing the distance from the negative.
Binary Classification: Instead of using triplets, you can use pairs of images. The model predicts whether the two images are of the same person (output 1) or different persons (output 0).
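
A sketch of the pair-based classifier, assuming one common formulation: a logistic unit applied to the element-wise absolute difference of the two embeddings. The weights `w` and bias `b` below are hand-picked for illustration; in training they would be learned together with the shared network.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def same_person_probability(f1, f2, w, b):
    """Logistic unit on |f1 - f2|; output near 1 means same person."""
    features = np.abs(f1 - f2)
    return float(sigmoid(np.dot(w, features) + b))

# Hypothetical parameters: negative weights so larger differences lower the score.
w = np.array([-4.0, -4.0, -4.0])
b = 2.0

f_x1 = np.array([1.0, 0.0, 0.0])
f_x2 = np.array([1.0, 0.0, 0.0])  # same embedding as f_x1
f_x3 = np.array([0.0, 1.0, 0.0])  # different embedding

print(same_person_probability(f_x1, f_x2, w, b) > 0.5)  # predicted label 1
print(same_person_probability(f_x1, f_x3, w, b) > 0.5)  # predicted label 0
```

Thresholding the probability at 0.5 gives the 1/0 output described above.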
