Neural networks have proven to be great tools for generalizing functions and patterns by learning from data. They have been used extensively in fields ranging from spelling checkers to medical imaging. If you are reading this, you might already have experience with neural networks. But did you know that a single model can be trained to predict multiple outputs? In this blog post, I will guide you step by step through building such a neural network in PyTorch.

What we will build

This article will cover the following steps in understanding and building a multi-task neural network:

Building a CNN with multiple heads to classify images (is_sky and is_not_sky) as well as to predict the AQI based on the image
Designing a loss function for the multi task CNN
Visualizing model performance using various metrics

The full code for the model training along with the final model checkpoint can be found in the aqi-prediction-from-images GitHub repository.

Disclaimer

This article assumes that you have an understanding of deep learning and Convolutional Neural Networks along with how to use PyTorch to build them. The blog post will not go over minor details and PyTorch syntax. If you are unfamiliar with PyTorch, I recommend you go through the Learn the Basics section of the PyTorch documentation.

Introduction to Multi Task Neural Networks

(Image Source: Multi-Task Learning with Deep Neural Networks, medium.com, Kajal Gupta)

Imagine you see an image of a person you have never met before. You can guess their gender as well as their age just by taking a look at that single image. Just like our brains can make multiple observations from a single stimulus, we can also use a single neural network to make multiple predictions from a single input. To do this, we have a number of shared layers at the beginning, which learn the low-level features from our dataset, and then create separate “pathways” for more specific tasks.

This can be easily visualized in the image above. We have shared layers at the beginning, and the network is then split to predict two separate flower characteristics: type and color. Having such shared layers brings multiple advantages, like computational efficiency, fewer neural networks to maintain, and faster predictions. Multi task neural networks are a great tool if you want to make multiple predictions from the same input at the same time.

Our Multi task neural network

In this blog post, we will build a neural network to predict the AQI of a location based on an image of the sky provided by a user. This neural network is going to be part of a larger system that will allow the public and officials to visualize AQI levels in any given location. To make sure that our system does not predict an AQI from unrelated images, we will also need to classify whether the supplied image is of a sky or not.

This task is a perfect fit for a multi task neural network, as we are making two predictions (classification and regression) from the same input. Since we will be working with image data, we will have a shared encoder with convolutional layers, and then we will pass the final feature maps to task-specific dense layers.

Dataset Overview

To perform the AQI prediction from images, we shall be using the “Air Pollution Image Dataset from India and Nepal”, which is available on Kaggle. The dataset is comprehensive and contains 12,240 images from both Nepal and India, along with the City Name, Date, Time, AQI, PM2.5, PM10, O3, CO, SO2 and NO2 levels associated with each image. For our purposes, we only care about the filename and the associated AQI level, so we will focus only on those two columns.

import pandas as pd

# Only need image name and AQI
df = pd.read_csv(CSV_DIR)  # CSV_DIR holds the path to the dataset's CSV file
df = df[["Filename", "AQI"]]
df = df.rename(columns={"Filename": "filename", "AQI": "aqi"})

So, is that all? Well, this data is enough if we just want to predict an AQI value based on the image provided, but we also need to classify whether the provided image is of a sky or not. Our current dataset only has images of the sky, so we cannot use it alone to teach our model what is “not a sky”.

You may choose any image dataset that has a lot of “non sky” images. I personally chose the Flickr30k dataset that I found on Kaggle. The dataset was originally used to train image captioning models, but given that it has a huge variety of different types of images, I found it perfect to use as my “negatives” dataset. To ensure my model is not biased towards predicting “not a sky”, I only used 12,240 of the 31,783 images this dataset provides, which matches the number of sky images. Also, the AQI values for all the negative images were set to -1.

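# Keep as many negatives as there are sky images so the two classes stay balanced
negative_images = negative_images[:len(df)]
negative_df = pd.DataFrame({
    "filename": negative_images,
    "aqi": -1  # placeholder AQI for non-sky images
})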

combined_df = pd.concat([df, negative_df], ignore_index=True)
# Randomly shuffle dataframe
combined_df = combined_df.sample(frac=1, random_state=42).reset_index(drop=True)
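
The loss function later in this post needs a class label for every image in addition to its AQI. One way to derive that label, sketched below, is from the -1 placeholder. The is_sky column name, the example filenames, and the choice of 1 for "sky" are assumptions made for illustration and are not taken from the original training code.

```python
import pandas as pd

# Tiny stand-in for combined_df; filenames and AQI values are made up
combined_df = pd.DataFrame({
    "filename": ["sky_001.jpg", "flickr_001.jpg", "sky_002.jpg"],
    "aqi": [152.0, -1.0, 87.0],
})

# Assumption: 1 = sky, 0 = not sky. Negatives are the rows with the -1 placeholder.
combined_df["is_sky"] = (combined_df["aqi"] != -1).astype(int)

print(combined_df)
```

Keeping the label in its own column means the classification target no longer depends on the placeholder value chosen for the AQI.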

Create the loss function

If we were only creating a single task neural network, we could use off-the-shelf loss functions. But you have to do a bit more work when you are building a multi task neural network. In our case, we want to optimize for both the classification and regression tasks. One simple way to do this is to combine the two losses with appropriate weights. To calculate the regression loss, we will be using MSE loss, and for the classification loss, we shall use Cross Entropy loss.

We will also assign appropriate weights: alpha for the classification loss and beta for the regression loss. Through experimentation, I found that the classification task converges much faster than the regression task, so alpha was set to 0.1 and beta to 0.9, which gives more priority to the regression loss.
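
Written out, the weighted combination described above looks as follows. The symbols for the two loss terms are shorthand introduced here for readability: the first stands for the Cross Entropy loss and the second for the MSE loss.

```latex
\[
\mathcal{L}_{\text{total}}
  = \alpha \, \mathcal{L}_{\text{cls}} + \beta \, \mathcal{L}_{\text{reg}},
\qquad \alpha = 0.1, \quad \beta = 0.9
\]
```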

import torch
import torch.nn as nn

class MultiDomainLoss(nn.Module):
    def __init__(self, alpha=0.1, beta=0.9):
        super().__init__()
        self.classification_loss = nn.CrossEntropyLoss()
        self.regression_loss = nn.MSELoss(reduction='none')
        self.alpha = alpha
        self.beta = beta

    def forward(self, pred, target):
        is_sky_logits, aqi_pred = pred
        is_sky_target, aqi_target = target
        classification_loss = self.classification_loss(is_sky_logits, is_sky_target)
        # Mask is 1.0 where the predicted class matches the target, else 0.0
        _, predicted_classes = torch.max(is_sky_logits, dim=1)
        correct_mask = (predicted_classes == is_sky_target).float()
        # Per-sample squared error (reduction='none' keeps one value per sample)
        per_sample_reg_loss = self.regression_loss(aqi_pred, aqi_target)
        masked_reg_loss = per_sample_reg_loss * correct_mask

        # Calculate average regression loss for correct samples
        correct_count = torch.sum(correct_mask)
        if correct_count > 0:
            regression_loss = torch.sum(masked_reg_loss) / correct_count
        else:
            regression_loss = torch.tensor(0.0, device=is_sky_logits.device)

        return self.alpha * classification_loss + self.beta * regression_loss

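Also, we only calculate the regression loss for samples whose predicted class matches the true class. Misclassified samples are masked out, so the model is not penalized for its AQI guess on an image it has assigned to the wrong class. Note that correctly classified non-sky images still contribute to the regression loss, with the placeholder AQI of -1 as their target.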
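
To see the masking in action, here is a minimal sketch that reimplements only the regression part as a standalone function and runs it on a toy batch. The function name masked_regression_loss and all tensor values are made up for illustration, and the sketch assumes class index 1 means "sky".

```python
import torch

def masked_regression_loss(is_sky_logits, is_sky_target, aqi_pred, aqi_target):
    """Mean squared error over the samples whose class was predicted correctly."""
    predicted_classes = is_sky_logits.argmax(dim=1)
    correct_mask = (predicted_classes == is_sky_target).float()
    per_sample_loss = (aqi_pred - aqi_target) ** 2
    correct_count = correct_mask.sum()
    if correct_count > 0:
        return (per_sample_loss * correct_mask).sum() / correct_count
    # No correctly classified samples in the batch: no regression signal
    return torch.tensor(0.0, device=is_sky_logits.device)

# Toy batch of three samples (assumption: class 1 = sky, class 0 = not sky)
is_sky_logits = torch.tensor([[0.1, 2.0], [1.5, 0.2], [0.3, 0.9]])
is_sky_target = torch.tensor([1, 0, 0])          # third sample is misclassified
aqi_pred = torch.tensor([110.0, 1.0, 50.0])
aqi_target = torch.tensor([100.0, -1.0, -1.0])   # -1 is the non-sky placeholder

loss = masked_regression_loss(is_sky_logits, is_sky_target, aqi_pred, aqi_target)
print(loss.item())  # (100 + 4) / 2 = 52.0; the misclassified sample is ignored
```

The third sample has a large squared error, but because its class was predicted incorrectly it does not affect the result. The second sample is a correctly classified non-sky image, so its error against the -1 placeholder is included.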

Building the model

For the most part, the model is simply your run-of-the-mill convolutional neural network, with an encoder section made of convolution layers and a dense head to make predictions. The only difference is that in our case we will have two separate dense heads.

As seen in the architecture, the encoder layers are shared and we only separate out the final task-specific dense layers. One interesting thing you might notice is the use of two activation functions. GELU is used for image encoding and classification, as these layers have to work with much noisier patterns. ReLU is used for the regression layers, as the output is linear in nature and will always be greater than 0.

The PyTorch implementation of this architecture is given below:

class MultiTaskCNN(nn.Module):
    def __init__(self):
        super().__init__()
        # Shared Encoder
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1),  # [B, 3, H, W] → [B, 16, H, W]
            nn.BatchNorm2d(16),
            nn.GELU(),
            nn.MaxPool2d(2),  # ↓↓ resolution

            nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1),  # [B, 32, H/2, W/2]
            nn.BatchNorm2d(32),
            nn.GELU(),
            nn.MaxPool2d(2),  # ↓↓ resolution again

            nn.Conv2d(32, 32, kernel_size=3, stride=1, padding=1),  # [B, 32, H/4, W/4]
            nn.BatchNorm2d(32),
            nn.GELU(),

            nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1),  # [B, 64, H/4, W/4]
            nn.BatchNorm2d(64),
            nn.GELU(),
            nn.AdaptiveAvgPool2d((1, 1))  # → [B, 64, 1, 1]
        )

        # Classification Head
        self.classifier = nn.Sequential(
            nn.Flatten(),       # [B, 64]
            nn.Linear(64, 32),
            nn.GELU(),
            nn.Dropout(0.5),
            nn.Linear(32, 2)    # binary classification logits
        )

        # Regression Head
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 8),
            nn.ReLU(),
            nn.Linear(8, 1)    # output AQI as a float
        )

    def forward(self, x):
        features = self.encoder(x)

        is_sky_logits = self.classifier(features)
        aqi_pred = self.regressor(features).squeeze(1)  # [B]

        return is_sky_logits, aqi_pred

As seen in the forward method, we supply the same features to the two task-specific heads and return the predictions made by both.
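
A side effect of ending the encoder with nn.AdaptiveAvgPool2d((1, 1)) is that both heads always receive 64 features per image, whatever the input resolution. The short sketch below illustrates this on random tensors. The batch size and spatial sizes are arbitrary values chosen for illustration, not the ones used in training.

```python
import torch
import torch.nn as nn

pool = nn.AdaptiveAvgPool2d((1, 1))
flatten = nn.Flatten()

shapes = []
for height, width in [(64, 64), (30, 45)]:
    # Stand-in for the output of the last conv block: [B, 64, H, W]
    feature_map = torch.randn(2, 64, height, width)
    pooled = flatten(pool(feature_map))  # [B, 64, 1, 1] -> [B, 64]
    shapes.append(tuple(pooled.shape))

print(shapes)  # [(2, 64), (2, 64)]
```

This is why the first nn.Linear layer in each head can be fixed at 64 input features.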

Training and Evaluation

The training hyperparameters were set as follows:

Hyperparameter	Value
Optimizer	Adam
Learning Rate	1e-4
Weight Decay	1e-5
Batch Size	16
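
In PyTorch, the settings in this table could be wired up roughly as shown below. This is only a sketch: the nn.Linear model and the random TensorDataset are stand-ins so the snippet runs on its own, and shuffle=True is an assumption rather than a setting confirmed by the table.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(64, 1)  # stand-in for the multi task CNN

# Optimizer, learning rate and weight decay taken from the table above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)

# Random stand-in data: 40 samples with 64 features each
dataset = TensorDataset(torch.randn(40, 64), torch.randn(40))
loader = DataLoader(dataset, batch_size=16, shuffle=True)

print(len(loader))  # 3 batches: 16 + 16 + 8 samples
```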

The model was set to train for 100 epochs, but since no further improvement was observed, the model from the 93rd epoch was deemed the best and was used. As mentioned before, the classification task converged much more quickly, so the MAE of the regression prediction was used as the criterion for choosing the best model and for early stopping.
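
The selection rule described above boils down to keeping the epoch with the lowest validation MAE. A minimal sketch of that idea in plain Python is given below; the function name best_epoch_by_mae and the MAE history are hypothetical, and the real training loop would save a checkpoint whenever the best value improves.

```python
def best_epoch_by_mae(val_mae_per_epoch):
    """Return (epoch, mae) for the lowest validation MAE; epochs are 1-indexed."""
    best_epoch, best_mae = None, float("inf")
    for epoch, mae in enumerate(val_mae_per_epoch, start=1):
        if mae < best_mae:
            # In a real training loop, this is where the checkpoint would be saved
            best_epoch, best_mae = epoch, mae
    return best_epoch, best_mae

# Made-up validation MAE values for six epochs
history = [41.2, 35.0, 30.8, 31.5, 29.9, 30.4]
best_epoch, best_mae = best_epoch_by_mae(history)
print(best_epoch, best_mae)  # 5 29.9
```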

Also, we closely monitored the accuracy of the classification and MAE of the regression. Both these metrics showed improvement with training.

The final model metrics are as follows:

Metric	Value
Accuracy (Classification)	0.9841
Precision (Classification)	0.9752
Recall (Classification)	0.9930
F1 Score (Classification)	0.9840
Mean Absolute Error	26.2747
Mean Squared Error	1216.0543
Root Mean Squared Error	34.8720

The classification performance of the model is near perfect. This may be due to the simple nature of the classification task, since sky and non-sky images have very distinctive features. In the regression task, the MAE is 26.27, which is ~6% error given the range of AQI in our training dataset. These are strong results given how difficult it is to predict AQI from an image alone.
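
Some of the values in the table are related to each other: the RMSE is the square root of the MSE, and the F1 score is the harmonic mean of precision and recall. The snippet below recomputes both from the reported numbers as a quick consistency check.

```python
import math

# Values copied from the metrics table above
precision = 0.9752
recall = 0.9930
mse = 1216.0543

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
# RMSE is the square root of MSE
rmse = math.sqrt(mse)

print(f"F1:   {f1:.4f}")    # F1:   0.9840
print(f"RMSE: {rmse:.4f}")  # RMSE: 34.8720
```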

The AQI predictions seem to bounce around quite a bit, but this performance is good enough given the herculean task we had at hand.

Conclusion

Congratulations! You have reached the end of this blog post. I hope that this post was helpful to you. If you hope to read more posts like this, consider following me on Hashnode, and joining the newsletter.

If you have any queries or suggestions, please leave a comment or reach out to me directly by sending an email to mukul.development@gmail.com.

Building a Multi-Task Neural Network for Image Based AQI Prediction
