Demystifying HOG Feature Extraction: A Beginner-Friendly Guide


Remember how we used SVD to extract features from an image which helped our machine “see”? Today we are going to discuss one more such method.
HOG (Histogram of Oriented Gradients) is a feature descriptor that captures the shape and structure of objects in an image by analyzing the distribution of gradient directions (edges) along the x and y axes.
HOG has proven useful because it works well despite slight variations in lighting and pose, and it is computationally efficient enough for real-time use cases such as person detection and face recognition.
In this article, we’ll break down a Python implementation of Histogram of Oriented Gradients (HOG) feature extraction step by step.
Edge Detection with Prewitt Operator
So as we discussed, the first step is to detect gradients between pixels. The first idea that comes to mind is a direct forward or backward subtraction, which can be written like this:
To find the horizontal change (Gx) at a pixel I(x,y):
$$\begin{align} G_x(x,y) &= I(x+1,y)−I(x,y) \hspace{15mm} (\text{forward difference}) \\ G_x(x,y) &= I(x,y)−I(x-1,y) \hspace{15mm} (\text{backward difference}) \end{align}$$
Similarly, for the vertical change (Gy):
$$\begin{align} G_y(x,y) &= I(x,y+1)−I(x,y) \hspace{15mm} (\text{forward difference}) \\ G_y(x,y) &= I(x,y)−I(x,y-1) \hspace{15mm} (\text{backward difference}) \end{align}$$
This direct subtraction gives us a basic idea of how intensity is changing.
But there is a more robust approximation: the central difference. Instead of comparing a pixel to a single neighbor, the central difference looks at the neighbors on both sides of the pixel.
$$G_x(x,y) = I(x+1,y)−I(x-1,y)$$
By involving two neighbors, it is less susceptible to random fluctuations in a single pixel. In practice, however, we have a much better way of doing this computation efficiently.
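To make this concrete, here is a minimal sketch (using a made-up 1D row of intensities, purely for illustration) comparing the forward, backward, and central differences at a single pixel:

import numpy as np

# A made-up row of pixel intensities (hypothetical values, just for illustration)
row = np.array([10, 12, 15, 40, 42], dtype=float)
x = 2  # look at the middle pixel (value 15)

forward  = row[x + 1] - row[x]      # 40 - 15 = 25
backward = row[x] - row[x - 1]      # 15 - 12 = 3
central  = row[x + 1] - row[x - 1]  # 40 - 12 = 28

print(forward, backward, central)   # 25.0 3.0 28.0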
Filters. Prewitt and Sobel filters to be specific. They are carefully designed to perform a weighted sum of central differences over a small neighborhood (typically 3×3 pixels).
To demonstrate, let's consider the standard Prewitt kernel for horizontal change (often denoted as \(G_x\) and used for vertical edge detection):
$$\begin{bmatrix} -1 & 0 & 1 \\ -1 & 0 & 1 \\ -1 & 0 & 1 \end{bmatrix}$$
When this kernel slides over an image (a process called convolution), for a pixel \(I(x,y)\), it calculates the following:
$$\begin{align} G_x(x,y) = {}& [I(x+1,y-1)-I(x-1,y-1)] \\ &+ [I(x+1,y)-I(x-1,y)] \\ &+ [I(x+1,y+1)-I(x-1,y+1)] \end{align}$$
As you can see, the filter effectively computes central differences for each row in the 3x3 window and then sums them up.
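As a quick sanity check, here is a small sketch (with a made-up 3×3 patch, not from any real image) showing that applying the \(G_x\) kernel by hand gives exactly the sum of the three row-wise central differences:

import numpy as np

# Made-up 3x3 neighborhood around a pixel I(x, y) (hypothetical values)
patch = np.array([[10, 20, 30],
                  [15, 25, 50],
                  [12, 22, 40]], dtype=float)

gx_kernel = np.array([[-1, 0, 1],
                      [-1, 0, 1],
                      [-1, 0, 1]], dtype=float)

# Weighted sum of the patch with the kernel (what the filter computes at this pixel)
kernel_result = np.sum(patch * gx_kernel)

# Sum of the central differences, row by row
central_diff_sum = sum(patch[r, 2] - patch[r, 0] for r in range(3))

print(kernel_result, central_diff_sum)  # both are 83.0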
Note: the key difference between Prewitt and Sobel is the weights in their kernels. Sobel uses [-1, -2, -1] instead of [-1, -1, -1], which gives more importance to the central row/column. This often results in slightly smoother and more accurate gradient estimates, especially in noisy images.
We use Prewitt in our implementation since it is simpler and more intuitive, but because speed is not an issue here, Sobel would give us slightly better results (as you will see at the end).
import cv2 as cv
import numpy as np

def preprocessing_hog(img):
    # Prewitt kernels: kernelx approximates the horizontal gradient Gx,
    # kernely the vertical gradient Gy
    kernelx = np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]], dtype=np.float32)
    kernely = np.array([[-1, -1, -1], [0, 0, 0], [1, 1, 1]], dtype=np.float32)
    # Use a float output depth so negative gradient values are not clipped
    img_prewittx = cv.filter2D(img, cv.CV_32F, kernelx)
    img_prewitty = cv.filter2D(img, cv.CV_32F, kernely)
    grad_magnitudes = np.sqrt(np.square(img_prewittx) + np.square(img_prewitty))
    grad_orientations = np.degrees(np.arctan2(img_prewitty, img_prewittx)) % 180
    return grad_magnitudes, grad_orientations
As explained before, the kernels represent the Prewitt filter; the cv.filter2D function convolves the image with each kernel and returns the corresponding gradient map. Then we use the general formulas for computing magnitude and orientation from the gradients:
$$\begin{align} \left | G \right | &= \sqrt{G_x^2 + G_y^2} \\ \theta &= \operatorname{arctan2}(G_y, G_x) \times \frac{180}{\pi} \pmod{180} \end{align}$$
The \(\frac{180}{\pi}\) multiplication converts the result from radians to degrees (handled by np.degrees()), and the mod 180 ensures that the orientation lies in the range [0, 180) so that we can assign it to a bin later.
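As a quick numeric check of these formulas (a throwaway example, not part of the pipeline), a pixel with \(G_x = 3\) and \(G_y = 4\) gives:

import numpy as np

gx, gy = 3.0, 4.0
magnitude = np.sqrt(gx**2 + gy**2)                  # 5.0
orientation = np.degrees(np.arctan2(gy, gx)) % 180  # ~53.13 degrees
print(magnitude, orientation)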
To make a histogram, we need to accumulate gradients over more than one pixel. In the original implementation, this size is chosen to be 8×8, i.e. a cell of 8×8 pixels is used to calculate a histogram of oriented gradients. The bin width was set to 20 degrees, which evenly divides 180 degrees, so the number of bins is 9 by default.
Calculating Histogram for each cell
First we initialize a 1D NumPy array to store the accumulated, weighted gradient magnitudes for each orientation bin. The bin size, as mentioned before, can be calculated from the number of bins (or you can take the bin size as a parameter and calculate the number of bins from it).
Now we have to do two things.
1. Calculate the primary bin that the current orientation belongs to, by dividing the orientation by the bin size. For example, if orientation is 25 degrees and bin_size is 20, main_bin_index would be int(25/20) = 1.
2. Find the center of that bin, since this value will later be used for interpolation.
# Determine the primary bin index
main_bin_index = int(orientation / bin_size)
# Handle the edge case where orientation is exactly 180 degrees (or upper bound)
# This ensures it falls into the last bin (num_bins - 1)
if main_bin_index == num_bins:
main_bin_index = num_bins - 1
# Calculate the center of the main bin
main_bin_center = bin_size * (main_bin_index + 0.5)
Now we have to perform interpolation, which requires a little attention. The idea is to check whether the orientation is closer to the previous bin or the next one, and then share a portion of the magnitude with that neighboring bin. By doing this interpolation, we make the descriptor more robust to small changes in orientation.
For example, if the orientation is 25 degrees, it lies in bin 1 and is 5 degrees below the bin's centre (30). The weight for the previous bin becomes (orientation - (main_bin_index * bin_size)) / bin_size, which is (25 - 20) / 20 = 0.25.
Therefore, 25% of the gradient magnitude goes to the previous bin and 75% goes to the current bin.
Let's take another example. This time the orientation is 37 degrees, which is above the current bin's center, so the sharing happens with the next bin instead.
The distance is now measured from the starting point of the next bin, which is 40. The weight is ((main_bin_index + 1) * bin_size - orientation) / bin_size, which is (40 - 37) / 20 = 0.15. So 15% of the magnitude goes to the next bin and 85% goes to the current bin.
Due to its closeness, a portion of the magnitude is shared with the next bin (instead of the previous bin, as in the first example).
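To double-check these two examples in code, here is a tiny sketch that mirrors the interpolation logic described above (interpolation_weights is a throwaway helper, not part of the final implementation):

def interpolation_weights(orientation, bin_size=20):
    # Which neighboring bin gets a share, and how much?
    main_bin_index = int(orientation / bin_size)
    main_bin_center = bin_size * (main_bin_index + 0.5)
    if orientation < main_bin_center:
        neighbour = "previous"
        shared = (orientation - main_bin_index * bin_size) / bin_size
    else:
        neighbour = "next"
        shared = ((main_bin_index + 1) * bin_size - orientation) / bin_size
    return main_bin_index, neighbour, shared, 1 - shared

print(interpolation_weights(25))  # (1, 'previous', 0.25, 0.75)
print(interpolation_weights(37))  # (1, 'next', 0.15, 0.85)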
Here is the complete function:
def histogram_calc(grad_magnitudes, grad_orientations, cell_size=8, num_bins=9):
    histogram = np.zeros(num_bins)
    bin_size = 180 / num_bins
    for i in range(cell_size):
        for j in range(cell_size):
            magnitude = grad_magnitudes[i][j]
            orientation = grad_orientations[i][j]
            main_bin_index = int(orientation / bin_size)
            if main_bin_index == num_bins:
                main_bin_index = num_bins - 1
            main_bin_center = bin_size * (main_bin_index + 0.5)
            if orientation < main_bin_center:
                weight_for_previous_bin = (orientation - (main_bin_index * bin_size)) / bin_size
                weight_for_main_bin = 1 - weight_for_previous_bin
                # Assign magnitude portions in the histogram array
                histogram[main_bin_index] += magnitude * weight_for_main_bin
                if main_bin_index > 0:  # Ensure previous bin exists
                    histogram[main_bin_index - 1] += magnitude * weight_for_previous_bin
            else:
                weight_for_next_bin = ((main_bin_index + 1) * bin_size - orientation) / bin_size
                weight_for_main_bin = 1 - weight_for_next_bin
                histogram[main_bin_index] += magnitude * weight_for_main_bin
                if main_bin_index < num_bins - 1:  # Ensure next bin exists
                    histogram[main_bin_index + 1] += magnitude * weight_for_next_bin
    return histogram
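As a quick sanity check of this function (using made-up values, with the function above and NumPy in scope), you can feed it a random 8×8 cell of magnitudes and orientations and confirm the 9-bin output:

import numpy as np

rng = np.random.default_rng(0)        # seeded for reproducibility
cell_mags = rng.random((8, 8))        # made-up gradient magnitudes
cell_oris = rng.random((8, 8)) * 180  # made-up orientations in [0, 180)

hist = histogram_calc(cell_mags, cell_oris)
print(hist.shape)                     # (9,) -- one accumulated value per orientation bin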
Extracting Block level features and Normalization
As discussed, after extracting cell-level histograms, HOG groups cells into blocks of 4 cells (in a 2×2 arrangement), concatenates their histograms, and normalizes the result.
Why normalize? Because:
Lighting changes can make gradients stronger/weaker in different regions.
Normalizing per block ensures features are comparable across different lighting conditions.
def feature_extraction_block(grad_magnitudes, grad_orientations, cell_size=8, block_size=2):
    histograms = [
        histogram_calc(
            grad_magnitudes[i:i+cell_size, j:j+cell_size],
            grad_orientations[i:i+cell_size, j:j+cell_size],
            cell_size)
        for i in range(0, cell_size*block_size, cell_size)
        for j in range(0, cell_size*block_size, cell_size)]
    feature = np.concatenate(histograms)
    feature_norm = np.sqrt(np.sum(feature**2) + 1e-7**2)  # L2 norm (+ small epsilon)
    normalized_feature = feature / feature_norm
    return normalized_feature
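Since a block covers 2×2 cells of 8×8 pixels (a 16×16 patch) and each cell contributes 9 bins, the block feature should have 4 × 9 = 36 values with (near) unit L2 norm. A quick check with made-up gradient maps:

import numpy as np

rng = np.random.default_rng(1)
block_mags = rng.random((16, 16))     # a 16x16 patch = 2x2 cells of 8x8 pixels
block_oris = rng.random((16, 16)) * 180

block_feature = feature_extraction_block(block_mags, block_oris)
print(block_feature.shape)            # (36,) = 4 cells * 9 bins
print(np.linalg.norm(block_feature))  # ~1.0 after L2 normalization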
Now it is time to combine all our spices. First we calculate the number of cells along the x and y axes using integer division. Then we find the number of blocks along both axes using n_blocks = n_cells - block_size + 1. The logic behind this will feel very intuitive if you have read about CNNs before.
The blocks are overlapping with each other. Since you can start your block at position 0, 1, ..., up to (n_cells - block_size), the number of possible starting positions is (n_cells - block_size + 1). To explain this, we will use this image (1D array is taken for simplicity of explanation).
As you can see, the blocks overlap. If they did not, we would have used integer division just like we did for cells. But here every cell can serve as a starting point, except a few of the last ones, where the block would run past the edge.
If the block size had been 3, we could have used every cell except the last two as a starting point. In general, the number of positions we have to leave out is block_size - 1.
Therefore, the number of blocks becomes:
$$n_{\text{blocks}} = n_{\text{cells}} - \text{block\_size} + 1$$
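As a quick check on this formula, here is a tiny sketch (with a made-up 1D example of 5 cells and a block size of 2 cells) that enumerates the valid starting positions:

n_cells = 5        # hypothetical number of cells along one axis
block_size = 2

starts = list(range(n_cells - block_size + 1))
print(starts)       # [0, 1, 2, 3]
print(len(starts))  # 4 blocks = n_cells - block_size + 1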
The rest is easy: we use the feature_extraction_block method we just created and pass it, through indexing, the slice of grad_magnitudes and grad_orientations corresponding to each block.
def extract_hog_features(grad_magnitudes, grad_orientations, cell_size=8, block_size=2):
    height, width = grad_magnitudes.shape
    n_cells_y = height // cell_size
    n_cells_x = width // cell_size
    n_blocks_y = n_cells_y - block_size + 1
    n_blocks_x = n_cells_x - block_size + 1
    block_features = [
        feature_extraction_block(
            grad_magnitudes[y*cell_size:(y+block_size)*cell_size, x*cell_size:(x+block_size)*cell_size],
            grad_orientations[y*cell_size:(y+block_size)*cell_size, x*cell_size:(x+block_size)*cell_size],
            cell_size, block_size)
        for y in range(n_blocks_y)
        for x in range(n_blocks_x)]
    return np.concatenate(block_features)
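Putting everything together, here is a quick usage sketch (with a made-up grayscale image the size of the classic 64×128 HOG detection window, and the functions above in scope) that shows the final descriptor length:

import numpy as np

rng = np.random.default_rng(2)
img = (rng.random((128, 64)) * 255).astype(np.uint8)  # made-up 128x64 (H x W) image

mags, oris = preprocessing_hog(img)
features = extract_hog_features(mags, oris)

# 16x8 cells -> 15x7 block positions -> 105 blocks * 36 values per block
print(features.shape)    # (3780,)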
This concludes our HOG implementation; now it is time to test it on our go-to dataset. The AT&T face dataset (which we used before for testing a similar feature extraction task with SVD) is applicable here as well, since face recognition is one of HOG's use cases.
It's the usual coding from here: load the data, fix a seed, create a split function for the training and testing sets, and apply the same preprocessing to both. I won't bother you with the routine parts since the complete code is available in the linked notebook anyway, so we will discuss the results straight away.
We obtained 92% accuracy with Prewitt filters, and as expected a 3% increase in accuracy with Sobel filters.
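If you want to reproduce the Sobel comparison, the only change is the gradient step. One way to do it (a minimal sketch using OpenCV's built-in cv.Sobel instead of hand-written kernels; preprocessing_hog_sobel is a hypothetical helper name) looks like this:

import cv2 as cv
import numpy as np

def preprocessing_hog_sobel(img):
    # Same structure as preprocessing_hog, but with Sobel gradients
    sobel_x = cv.Sobel(img, cv.CV_32F, 1, 0, ksize=3)  # horizontal gradient Gx
    sobel_y = cv.Sobel(img, cv.CV_32F, 0, 1, ksize=3)  # vertical gradient Gy
    grad_magnitudes = np.sqrt(np.square(sobel_x) + np.square(sobel_y))
    grad_orientations = np.degrees(np.arctan2(sobel_y, sobel_x)) % 180
    return grad_magnitudes, grad_orientations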
Thank you for reading the blog till here.
References
Complete Code: Kaggle Notebook
BONUS: Gamma Correction
If you actually look into the code, you will find that we have created a method for gamma correction and used it as a part of preprocessing. But what is it?
The thing is, the way our eyes perceive light is very different from the way a digital camera does. Digital camera sensors respond to light linearly: if the number of photons from a light source is doubled, the recorded value doubles as well. Human eyes, however, perceive light roughly logarithmically (we notice more differences in dark tones than in bright ones).
So when we store images captured by digital cameras, we apply gamma encoding, which adjusts the brightness and contrast of the image so that it appears more natural to the human eye. This gamma encoding uses the formula:
$$V_{out}=V_{in}^{\large \gamma}$$
where:
\(V_{in}\) = Input pixel value (normalized to [0, 1]).
\(V_{out}\) = Corrected output value.
\(\gamma\) = Gamma value (0.45 for encoding)
But when we perform image processing digitally, we need to convert the image back to its original, linear (camera) form, and for that we need gamma correction.
Gamma correction follows the same formula; we just pick a gamma that reverses the encoding and makes the overall response linear again. Since 1/0.45 ≈ 2.2, that value turns out to be 2.2.
def gammaCorrection(img, gamma=2.2):
    img_float = img.astype(np.float32) / 255.0
    corrected = np.power(img_float, gamma)
    corrected = (corrected * 255).astype(np.uint8)
    return corrected
In this function, we first normalize the image, then apply the power law with the predefined value of 2.2, and then scale it back up. This method is used just before equalizing the image with the equalizeHist() method in the Data Loading section of the code.
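For context, here is a minimal sketch of how this preprocessing chain might be called (face.pgm is a hypothetical filename; the exact loading code in the notebook may differ):

import cv2 as cv

img = cv.imread("face.pgm", cv.IMREAD_GRAYSCALE)  # hypothetical path to a face image
img = gammaCorrection(img)                        # undo the camera's gamma encoding
img = cv.equalizeHist(img)                        # then equalize the contrast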