Detection Algorithms

Yichun Zhao

Object Localization

The process of identifying and locating objects within an image.

  • Detection: identifying whether an object of interest is present in the image

    • Bounding Box: draw a rectangular box around the detected object, determined by parameters

      • (b_x): the x-coordinate of the center of the bounding box

      • (b_y): the y-coordinate of the center of the bounding box

      • (b_h): the height of the bounding box

      • (b_w): the width of the bounding box

Non-max Suppression

Non-Max Suppression (NMS) is a technique used in object detection to eliminate redundant bounding boxes when the same object is detected across multiple grid cells. These overlapping boxes may have varying confidence scores. NMS retains only the bounding box with the highest confidence score (probability) while discarding the others. This process helps in reducing duplicate detections and ensures that each object is represented by a single, most accurate bounding box.

When applying non-max suppression, the process typically involves the following steps:

  1. Sort the Bounding Boxes: First, you sort all the predicted bounding boxes based on their confidence scores in descending order.

  2. Select the Highest Score: You start with the bounding box that has the highest confidence score and keep it.

  3. Calculate IoU: For the remaining boxes, you calculate the Intersection over Union (IoU) with the selected box.

  4. Thresholding: If the IoU of any remaining box with the selected box exceeds a certain threshold (commonly 0.5), you discard that box.

  5. Repeat: You repeat this process for the next highest confidence box until all boxes have been processed.

This method ensures that you retain the most confident detection while eliminating redundant boxes that may overlap significantly.
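The steps above can be sketched in plain Python. This is a minimal sketch, not a production implementation: boxes are assumed to be in `(x1, y1, x2, y2)` corner format, and the function names are my own.

```python
def iou(box_a, box_b):
    # Intersection over Union for boxes given as (x1, y1, x2, y2)
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    # Step 1: sort box indices by confidence score, highest first
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)   # Step 2: keep the highest-scoring remaining box
        keep.append(best)
        # Steps 3-4: discard boxes whose IoU with `best` exceeds the threshold
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    # Step 5: the loop repeats until every box is either kept or discarded
    return keep
```

For example, two heavily overlapping boxes around the same object collapse to the single higher-scoring one, while a distant box is kept.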

Anchor Boxes

Anchor Boxes are predefined bounding boxes of various shapes and sizes used in object detection algorithms to help detect multiple objects within a single grid cell. They provide a reference for predicting object locations and scales, improving the model’s ability to handle objects of different aspect ratios.
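One common way to assign a ground-truth box to an anchor is to compare only their shapes (width and height) by IoU, as if both boxes shared the same center. A toy sketch, where the anchor shapes are hypothetical examples, not values from any particular detector:

```python
def shape_iou(wh_a, wh_b):
    # IoU of two boxes that share the same center; only (width, height) matter
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union

# Hypothetical anchor shapes: square, wide, and tall
ANCHORS = [(1.0, 1.0), (2.0, 0.5), (0.5, 2.0)]

def best_anchor(gt_wh):
    # Pick the anchor whose aspect ratio/scale best matches the ground-truth box
    return max(range(len(ANCHORS)), key=lambda i: shape_iou(gt_wh, ANCHORS[i]))
```

A wide ground-truth box like `(1.8, 0.6)` matches the wide anchor, so the grid cell's "wide" slot is made responsible for predicting it.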

Intersection over Union (IoU)

IoU is a metric used to evaluate the accuracy of an object detector on a particular dataset. It measures the overlap between the predicted bounding box and the ground truth bounding box. The formula for IoU is:

$$\text{IoU} = \frac{\text{Area of Overlap}}{\text{Area of Union}}$$

Where:

  • Area of Overlap is the area covered by both the predicted and ground truth bounding boxes.

  • Area of Union is the total area covered by both bounding boxes combined.

  • If IoU is high (close to 1) → The predicted box and ground truth box almost perfectly overlap.

  • If IoU is low (close to 0) → The predicted box and ground truth box barely overlap or don’t overlap at all.
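The formula can be computed directly from the (b_x, b_y, b_w, b_h) parametrization introduced above, by first converting centers to corners. A minimal sketch:

```python
def iou_centered(box_a, box_b):
    # Boxes given as (b_x, b_y, b_w, b_h): center coordinates plus width/height
    def to_corners(b):
        bx, by, bw, bh = b
        return (bx - bw / 2, by - bh / 2, bx + bw / 2, by + bh / 2)

    ax1, ay1, ax2, ay2 = to_corners(box_a)
    bx1, by1, bx2, by2 = to_corners(box_b)
    # Area of Overlap: clamp to zero when the boxes do not intersect
    overlap_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    overlap_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    overlap = overlap_w * overlap_h
    # Area of Union = sum of areas minus the overlap counted twice
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - overlap
    return overlap / union if union > 0 else 0.0
```

Two identical boxes give IoU = 1, disjoint boxes give 0, and partial overlap lands in between.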

Typical IoU Thresholds in Object Detection:

  • IoU > 0.5 → Often considered a good detection.

  • IoU > 0.7 → Typically used for high-confidence matches in evaluation.

  • IoU < 0.3 → Usually considered a bad prediction (false positive).

In short, a high IoU means the model is locating objects accurately. 🚀

The ground truth bounding box is obtained from manually labeled data in object detection datasets.

YOLO

“You Only Look Once” is a popular algorithm because it achieves high accuracy while also being able to run in real time. This algorithm "only looks once" at the image in the sense that it requires only one forward propagation pass through the network to make predictions. After non-max suppression, it then outputs recognized objects together with the bounding boxes.
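One way to picture that single forward pass is as a grid of cells, each predicting a confidence p_c, the box parameters (b_x, b_y, b_h, b_w), and class probabilities. The sketch below decodes such an output tensor; the 3×3 grid, 8-feature layout, and confidence threshold are illustrative assumptions for this example, not YOLO's actual configuration:

```python
import numpy as np

# Toy layout per grid cell: [p_c, b_x, b_y, b_h, b_w, c1, c2, c3]
GRID = 3       # assume a 3x3 grid of cells
FEATURES = 8   # 1 confidence + 4 box parameters + 3 class scores

def decode_predictions(output, conf_threshold=0.6):
    """Collect (cell, box, class, score) from confident cells in one pass."""
    detections = []
    for row in range(GRID):
        for col in range(GRID):
            p_c = output[row, col, 0]
            if p_c < conf_threshold:
                continue  # cell does not think an object is present
            b_x, b_y, b_h, b_w = output[row, col, 1:5]
            class_probs = output[row, col, 5:]
            cls = int(np.argmax(class_probs))
            score = float(p_c * class_probs[cls])
            detections.append(((row, col), (b_x, b_y, b_h, b_w), cls, score))
    return detections
```

In a full pipeline, non-max suppression would then be applied to these detections to remove duplicates.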

Semantic Segmentation

Semantic segmentation not only detects the object but also assigns each pixel in the image to an object. U-Net is designed for exactly this kind of pixel-level classification.

The motivation for U-Net came from medical image analysis, for instance, segmenting a tumor from an image.

The U-Net architecture has several key features that differentiate it from traditional convolutional neural networks (CNNs):

  1. Encoder-Decoder Structure: U-Net consists of an encoder (downsampling path) and a decoder (upsampling path). The encoder captures context by progressively reducing the spatial dimensions, while the decoder enables precise localization by upsampling the feature maps.

  2. Skip Connections: U-Net employs skip connections that link corresponding layers in the encoder and decoder. This allows the model to retain spatial information lost during downsampling, which is crucial for accurately predicting pixel labels.

  3. Symmetrical Architecture: The U-Net architecture is symmetrical, meaning that the number of feature channels doubles in the downsampling path and halves in the upsampling path. This design helps maintain a balance between context and localization.

  4. Efficient Training: U-Net is particularly effective for training on small datasets, as it can leverage data augmentation and the skip connections to improve performance without requiring a large amount of labeled data.

  5. Output Shape: Unlike traditional CNNs that typically output a single class label for an image, U-Net outputs a segmentation map where each pixel is classified into a specific category, making it ideal for tasks like medical image analysis, where precise localization of structures is critical.
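The encoder-decoder structure, skip connections, and symmetric channel counts can be illustrated with a shape-only sketch in NumPy. The "convolutions" here only model channel counts and spatial sizes, not learned weights, and the layer sizes are illustrative:

```python
import numpy as np

def conv_block(x, out_channels):
    # Stand-in for a pair of 3x3 convolutions: models only the channel change
    n, _, h, w = x.shape
    return np.zeros((n, out_channels, h, w))

def down(x):
    # 2x2 max-pooling stand-in: halves the spatial dimensions
    return x[:, :, ::2, ::2]

def up(x):
    # Transpose-convolution stand-in: doubles the spatial dimensions
    return x.repeat(2, axis=2).repeat(2, axis=3)

def tiny_unet_shapes(x, num_classes=2):
    e1 = conv_block(x, 64)            # encoder level 1
    e2 = conv_block(down(e1), 128)    # encoder level 2: channels double, size halves
    b  = conv_block(down(e2), 256)    # bottleneck
    d2 = np.concatenate([up(b), e2], axis=1)    # skip connection from e2
    d2 = conv_block(d2, 128)                    # channels halve on the way up
    d1 = np.concatenate([up(d2), e1], axis=1)   # skip connection from e1
    d1 = conv_block(d1, 64)
    return conv_block(d1, num_classes)  # per-pixel class scores, full resolution
```

Note how the output keeps the input's spatial size but has one channel per class, matching feature 5 above: a segmentation map rather than a single label.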

Semantic segmentation is called so because it involves assigning a semantic label to every pixel in an image.

Transpose Convolution

In simple terms, a transpose convolution is like taking a small picture and making it bigger while adding some details. Imagine you have a small two-by-two square of colored tiles, and you want to create a larger four-by-four square. Instead of just stretching the small square, you use a special pattern (called a filter) to fill in the larger square with new colors based on the original tiles. This process helps the computer understand and generate images better by learning from the smaller input and creating a more detailed output.
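The tile analogy can be made concrete: each input value stamps a scaled copy of the filter into the output, and overlapping stamps are summed. A minimal NumPy sketch for a single channel with no padding:

```python
import numpy as np

def transpose_conv2d(x, kernel, stride=2):
    """Upsample x by stamping scaled copies of the kernel into the output."""
    in_h, in_w = x.shape
    k = kernel.shape[0]
    # Output size for transpose convolution without padding
    out_h = (in_h - 1) * stride + k
    out_w = (in_w - 1) * stride + k
    out = np.zeros((out_h, out_w))
    for i in range(in_h):
        for j in range(in_w):
            # Each input pixel contributes kernel scaled by its value;
            # where stamps overlap, the contributions are summed
            out[i * stride:i * stride + k, j * stride:j * stride + k] += x[i, j] * kernel
    return out
```

With a 2×2 input, a 2×2 kernel of ones, and stride 2, the stamps do not overlap, so each input tile simply becomes a 2×2 patch of the 4×4 output.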

U-Net Architecture Intuition

U-Net is used for a task called semantic segmentation. This means the network identifies and labels different parts of an image, like finding where a cat is located in a picture.

In simple terms, think of the U-Net as a smart artist. First, it looks at a big picture and starts to understand it by compressing the image into a smaller version, losing some details but keeping the overall idea. Then, it uses a special technique called "transposed convolution" to expand this smaller version back to the original size. But here's the clever part: it also connects the early stages of its understanding directly to the later stages. This way, it combines the big picture with the fine details, allowing it to accurately decide if a certain pixel in the image is part of a cat or not.

U-Net Architecture

The U-Net is designed to take an image as input and produce a segmented output, which means it can identify different parts of the image, like separating a cat from the background. Imagine you have a coloring book page with a drawing of a cat. The U-Net helps you color only the cat while leaving the background white. It does this by using a series of layers that first reduce the image size to capture important features and then gradually rebuild it to the original size while adding details back in. This process is like sculpting a statue: you start with a big block of stone (the original image), chip away at it to find the shape (features), and then polish it to reveal the final artwork (segmented image).
