Chapter 2 - Hands on with the World of Multimodality
Why is there a need for multimodality?
Today's digital age produces a diversity of information: text, images, audio, and video. The need for sophisticated technologies that can comprehend and process all of it is critical. The key to meeting this challenge is multimodality, a system's capacity to handle several data types at once. Industries ranging from healthcare to gaming increasingly recognize the importance of multimodal solutions.
Multimodal models offer a number of benefits, including the capacity to analyze data from several sources at once, which broadens the picture and provides richer context and self-learning. This increased comprehension results in higher accuracy, better decision-making, and the capacity to take on tasks that single-data-type models struggle with. To achieve more accurate and contextually rich results, multimodal models can exploit both the complexities of visual content and the subtleties of natural language, which traditional AI models find difficult to handle separately.
What data modalities are there?
Text
Audio
Video
Image
Depth (3D)
Thermal (Infrared Radiation)
Inertial Measurement Unit (IMU)
How various modalities are aligned - Contrastive Learning
Contrastive learning performs exceptionally well in situations where labeled data is nonexistent or very scarce. Since the model is trained to identify patterns without explicit labels, it is especially useful for self-supervised learning. It is not, however, restricted to unlabeled data: contrastive learning can also be used in supervised settings where labels are available, providing a useful alternative to more conventional approaches.
The ideas of anchors, positives, and negatives form the basis of contrastive learning. Consider an anchor as a point of reference—a specific piece of information that you are concentrating on. Positives and the anchor are comparable in that they have similar traits or fall into the same group. Negatives act as contrasts to the anchor and are clearly different from the anchor. The objective is to move negatives farther away from the anchor and positives closer to it in a multidimensional space called the embedding space.
For example, a picture of a cat could serve as an anchor. Positive images might be of other cats, perhaps in various colors or positions, while negative images could be anything unrelated to cats, like a dog. The learning algorithm's job is to adjust these images' representations in the embedding space so that photographs of cats cluster together (positives), while photos of dogs or other unrelated images stay far apart.
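To make this concrete, here is a minimal sketch with made-up four-dimensional embeddings for an anchor (a cat), a positive (another cat), and a negative (a dog). The vectors are invented for illustration, not outputs of a real model:
import torch
import torch.nn.functional as F
# Hypothetical embeddings (values invented for illustration)
anchor   = torch.tensor([0.9, 0.1, 0.0, 0.2])  # a cat image
positive = torch.tensor([0.8, 0.2, 0.1, 0.3])  # another cat image
negative = torch.tensor([0.1, 0.9, 0.7, 0.0])  # a dog image
# Training pulls the positive's similarity towards 1 and pushes the negative's down
print(F.cosine_similarity(anchor, positive, dim=0))  # high (~0.98)
print(F.cosine_similarity(anchor, negative, dim=0))  # low (~0.17)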
One Database for Multimodal Data
Storing multimodal data in embedding form in one database provides more personalization and flexibility to the user and opens up new paths for use cases.
This enables Any-to-Any Search applications: any modality that the multimodal model understands and embeds can be sent in as a query, and the system can return objects of any modality that share conceptual similarities. We will discuss a hands-on way to build an any-to-any search application in an upcoming blog of this series.
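As a toy illustration of the idea (a sketch, assuming every modality has already been embedded into the same vector space; the file names and vectors below are hypothetical):
import numpy as np
# Hypothetical unified store: every object, whatever its modality,
# lives in the same database as an embedding in a shared space
store = {
    "cat_photo.jpg":  np.array([0.9, 0.1, 0.2]),
    "meow_sound.wav": np.array([0.8, 0.2, 0.3]),
    "dog_video.mp4":  np.array([0.1, 0.9, 0.1]),
}
def any_to_any_search(query_vec, store, top_k=2):
    # rank every stored object by cosine similarity to the query
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    ranked = sorted(store.items(), key=lambda kv: cos(query_vec, kv[1]), reverse=True)
    return ranked[:top_k]
# A text query embedded into the same space (vector made up)
text_query = np.array([0.85, 0.15, 0.25])
print(any_to_any_search(text_query, store))  # the cat image and cat audio rank first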
Let's Play with Contrastive Learning
Importing the necessary libraries -
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm
from sklearn.decomposition import PCA
Some libraries for plotting -
import umap
import umap.plot
import plotly.graph_objs as go
import plotly.io as pio
Importing more modules needed for the construction of the model -
import torch
from torch import nn, optim
from torch.utils.data import DataLoader
from torchvision import transforms
# Select a GPU if available (used by the training code below)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
Now, we will set up the MNIST dataset script with the transformations
import pandas as pd
from torch.utils.data import Dataset
import numpy as np
from tqdm import tqdm
class MNISTDataset(Dataset):
    def __init__(self, data_df: pd.DataFrame, transform=None, is_test=False):
        # Initialization of the dataset
        super(MNISTDataset, self).__init__()
        dataset = []
        labels_positive = {}
        labels_negative = {}
        if not is_test:
            # for each label, collect all images with the same label
            for i in list(data_df.label.unique()):
                labels_positive[i] = data_df[data_df.label == i].to_numpy()
            # for each label, collect all images with a different label
            for i in list(data_df.label.unique()):
                labels_negative[i] = data_df[data_df.label != i].to_numpy()
        for i, row in tqdm(data_df.iterrows(), total=len(data_df)):
            data = row.to_numpy()
            if is_test:
                label = -1
                first = data.reshape(28, 28)
                second = -1
                dis = -1
            else:
                # label and image of the index for each row in df
                label = data[0]
                first = data[1:].reshape(28, 28)
                # probability of a same-label image == 0.5
                if np.random.randint(0, 2) == 0:
                    # randomly select an image with the same label
                    second = labels_positive[label][
                        np.random.randint(0, len(labels_positive[label]))
                    ]
                else:
                    # randomly select an image with a different label
                    second = labels_negative[label][
                        np.random.randint(0, len(labels_negative[label]))
                    ]
                # target cosine similarity: 1 for same, 0 for different label
                dis = 1.0 if second[0] == label else 0.0
                second = second[1:].reshape(28, 28)
            # apply the transform on both images
            if transform is not None:
                first = transform(first.astype(np.float32))
                if not is_test:
                    second = transform(second.astype(np.float32))
            # this random pairing is created once and reused in every epoch
            dataset.append((first, second, dis, label))
        self.dataset = dataset
        self.transform = transform
        self.is_test = is_test

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, i):
        return self.dataset[i]
You can download the MNIST training and test datasets from the Kaggle competition linked at the end of this chapter.
df = pd.read_csv('digit/train.csv')
val_count = 1000
default_transform = transforms.Compose([
transforms.ToPILImage(),
transforms.ToTensor(),
transforms.Normalize(0.5, 0.5)
])
# Setting up the training and the validation sets
dataset = MNISTDataset(df.iloc[:-val_count], default_transform)
val_dataset = MNISTDataset(df.iloc[-val_count:], default_transform)
Now, setting up the dataloaders
trainLoader = DataLoader(
    dataset,
    batch_size=16,
    shuffle=True,
    pin_memory=True,
    num_workers=2,
    prefetch_factor=100
)
valLoader = DataLoader(
    val_dataset,
    batch_size=64,
    shuffle=True,
    pin_memory=True,
    num_workers=2,
    prefetch_factor=100
)
The training loader uses a smaller batch size (16) compared to the validation loader (64). This is common because training often benefits from more frequent updates, while validation can process larger batches at once for efficiency.
Both use 2 worker processes (num_workers=2) for parallel data loading, which can speed up the data pipeline. The prefetch_factor=100 is quite high, which means a lot of data is being preloaded. This can be beneficial if you have enough memory, but you might want to adjust this based on your system's capabilities.
Let's visualize the anchors and their respective +/- examples
import matplotlib.pyplot as plt
import numpy as np

def show_images(images, title='', num_display=4):
    fig, axes = plt.subplots(1, num_display, figsize=(12, 3))
    fig.suptitle(title, fontsize=16)
    for i, ax in enumerate(axes):
        img = np.squeeze(images[i])
        ax.imshow(img, cmap='gray')
        ax.axis('off')
    plt.tight_layout()
    plt.show()
iterator = iter(trainLoader)
anchor_images, contrastive_images, distances, labels = next(iterator)
# Convert to numpy and move to CPU if necessary
anchor_images = anchor_images.cpu().numpy()
contrastive_images = contrastive_images.cpu().numpy()
labels = labels.cpu().numpy()
# Display images
show_images(anchor_images, title='Anchor Images')
show_images(contrastive_images, title='+/- Examples')
Result for the previous snippet -
The +/- examples are the contrastive images.
Let's Build the Neural Network
class Network(nn.Module):
    def __init__(self):
        super(Network, self).__init__()
        self.conv1 = nn.Sequential(
            nn.Conv2d(1, 32, 5),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d((2, 2), stride=2),
            nn.Dropout(0.3)
        )
        self.conv2 = nn.Sequential(
            nn.Conv2d(32, 64, 5),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d((2, 2), stride=2),
            nn.Dropout(0.3)
        )
        self.linear1 = nn.Sequential(
            nn.Linear(64 * 4 * 4, 512),
            nn.ReLU(inplace=True),
            nn.Dropout(0.3),
            nn.Linear(512, 64),
        )

    def forward(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        # flatten the feature maps into a single vector per image
        x = x.view(x.size(0), -1)
        x = self.linear1(x)
        return x
The network consists of two convolutional layers followed by a fully connected layer.
Conv1 - 1 input channel → 32 output channels, 5x5 kernel
Conv2 - 32 input channels → 64 output channels, 5x5 kernel
Linear input - 64 x 4 x 4 = 1024 features
Hidden layer - 512 neurons with ReLU activation and Dropout
Output - 64-dimensional embedding
ReLU (Rectified Linear Unit) is used throughout the network, and inplace=True is used for memory efficiency.
This 64-dimensional embedding is suitable for tasks like contrastive learning or similarity-based comparisons.
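A quick shape check confirms the 64 x 4 x 4 arithmetic: after two 5x5 convolutions and two 2x2 max-pools, a 28x28 input becomes 64 feature maps of size 4x4, which the linear layers map to a 64-dimensional embedding. A sketch with random data:
# Sanity check: four random 28x28 grayscale "images" in, four 64-d embeddings out
check_net = Network()
dummy = torch.randn(4, 1, 28, 28)  # (batch, channels, height, width)
print(check_net(dummy).shape)      # expected: torch.Size([4, 64])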
class ContrastiveLoss(nn.Module):
    def __init__(self):
        super(ContrastiveLoss, self).__init__()
        self.similarity = nn.CosineSimilarity(dim=-1, eps=1e-7)

    def forward(self, anchor, contrastive, distance):
        # cosine similarity between the two embedding batches
        score = self.similarity(anchor, contrastive)
        # regress the similarity towards the 0/1 target
        return nn.MSELoss()(score, distance)
This custom loss uses cosine similarity to measure how close the anchor and contrastive embeddings are. Cosine similarity is scale invariant and measures the angle between vectors, which lets the loss pull related (+) pairs together and push unrelated (-) pairs apart. Mean Squared Error (MSE) then compares the computed similarity with the target distance. The loss encourages similar pairs to have high cosine similarity (close to 1) and dissimilar pairs to have low similarity (close to 0).
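A quick, illustrative call with random embeddings (the tensors below are arbitrary, just to show the shapes):
loss_fn = ContrastiveLoss()
anchor_emb      = torch.randn(8, 64)        # batch of 8 anchor embeddings
contrastive_emb = torch.randn(8, 64)        # batch of 8 paired embeddings
target = torch.randint(0, 2, (8,)).float()  # 1.0 = same label, 0.0 = different
print(loss_fn(anchor_emb, contrastive_emb, target))  # a scalar MSE loss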
# assumes net = Network().to(device) has been created
optimizer = optim.Adam(net.parameters(), lr=0.001)
loss_function = ContrastiveLoss()
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.4)
The purpose of the scheduler above is to gradually reduce the learning rate during training, which helps convergence. For the first 7 epochs, the initial learning rate is used. After every 7 epochs, the learning rate becomes 40% of its previous value (gamma=0.4). The advantage of this process is more precise optimization late in training. The step size and gamma values may need to be tuned for the specific dataset used, so trial and error is the practical approach.
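To see the schedule in numbers, here is a small sketch that traces the decay; a dummy parameter stands in for the real network:
# Trace StepLR with step_size=7, gamma=0.4 starting from lr=0.001
params = [torch.nn.Parameter(torch.zeros(1))]
opt = torch.optim.Adam(params, lr=0.001)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=7, gamma=0.4)
for epoch in range(1, 22):
    opt.step()   # placeholder optimizer step
    sched.step()
    if epoch % 7 == 0:
        print(epoch, opt.param_groups[0]['lr'])  # 7: 0.0004, 14: 0.00016, 21: 6.4e-05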
def model(epoch_count=100):
    net = Network().to(device)
    optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
    lrs = []
    losses = []
    best_loss = float('inf')
    best_model = None
    for epoch in range(epoch_count):
        net.train()
        epoch_loss = 0
        batches = 0
        print(f'Epoch {epoch+1}/{epoch_count}')
        lrs.append(optimizer.param_groups[0]['lr'])
        print(f'Learning rate: {lrs[-1]:.6f}')
        for anchor, contrastive, distance, _ in tqdm(trainLoader, desc="Training"):
            batches += 1
            optimizer.zero_grad()
            anchor_out = net(anchor.to(device))
            contrastive_out = net(contrastive.to(device))
            distance = distance.to(torch.float32).to(device)
            loss = loss_function(anchor_out, contrastive_out, distance)
            epoch_loss += loss.item()
            loss.backward()
            optimizer.step()
        avg_loss = epoch_loss / batches
        losses.append(avg_loss)
        scheduler.step()
        print(f'Epoch loss: {avg_loss:.4f}')
        # Save the best model
        if avg_loss < best_loss:
            best_loss = avg_loss
            best_model = net.state_dict()
            print(f'New best model saved with loss: {best_loss:.4f}')
        # Save a checkpoint every 10 epochs
        if (epoch + 1) % 10 == 0:
            checkpoint_path = os.path.join(checkpoint_dir, f'model_epoch_{epoch+1}.pt')
            torch.save(net.state_dict(), checkpoint_path)
            print(f'Checkpoint saved at epoch {epoch+1}')
    # Save the best model at the end of training
    best_model_path = os.path.join(checkpoint_dir, 'best_model.pt')
    torch.save(best_model, best_model_path)
    print(f'Best model saved with loss: {best_loss:.4f}')
    return {
        "net": net,
        "losses": losses,
        "best_loss": best_loss
    }
The model() function initializes a network, optimizer, and learning rate scheduler, then runs for 100 epochs by default, processing batches of data. In each epoch it computes the loss over anchor and contrastive samples, performs backpropagation, and updates the model. It tracks and saves the best-performing model based on the lowest average loss, saves a checkpoint every 10 epochs, and saves the best model at the end of training. It returns the trained network, the loss history, and the best loss achieved. Note that it relies on loss_function, device, and checkpoint_dir being defined outside the function.
import os
# Directory to save the checkpoints
checkpoint_dir = 'checkpoints/'
# Ensure directory exists
if not os.path.exists(checkpoint_dir):
os.makedirs(checkpoint_dir)
# To call the model() function and start the training
training_result = model()
model = training_result["net"]  # rebinds 'model' from the function to the trained network
During training, each epoch takes some time, and you can track the loss as it falls. Setting Google Colab to the T4 GPU configuration saves a lot of time; the time per epoch drops dramatically.
# The losses are accessed from the training_result dictionary with 'losses'
plt.plot(training_result["losses"])
plt.show()
This plot showcases successful model training: rapid initial improvement followed by gradual refinement. The model appears to be learning effectively, with no obvious signs of overfitting or instability in the training process.
outputs = []
labels = []
model.eval()
# loop over the train dataset and get the embedding for each image
with torch.no_grad():
    for first, second, dis, label in tqdm(trainLoader):
        # append the embeddings and labels to their respective lists
        outputs.append(model(first.to(device)).cpu().detach().numpy())
        labels.append(label.numpy())
# convert the lists into numpy arrays
outputs = np.concatenate(outputs)
labels = np.concatenate(labels)
To generate the 64-dimensional embeddings of the MNIST training set
encoded_data = []
labels = []
with torch.no_grad():
    for anchor, _, _, label in tqdm(trainLoader):
        output = model(anchor.to(device))
        encoded_data.extend(output.cpu().numpy())
        labels.extend(label.cpu().numpy())
# both encoded_data and labels are converted to numpy arrays
encoded_data = np.array(encoded_data)
labels = np.array(labels)
The torch.no_grad() context disables gradient computation, which is appropriate for inference. The loop iterates through the trainLoader but only uses the anchor images and labels. For each batch, it passes the anchor images through the model to get embeddings, extends the encoded_data list with the model outputs (embeddings), and extends the labels list with the corresponding labels.
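With encoded_data in hand, a plain nearest-neighbour lookup already behaves like a small similarity search. A sketch reusing the arrays from above:
# Rank all training embeddings by cosine similarity to one query embedding
query = encoded_data[0]
sims = encoded_data @ query / (
    np.linalg.norm(encoded_data, axis=1) * np.linalg.norm(query)
)
nearest = np.argsort(-sims)[1:6]   # skip index 0, the query itself
print(labels[0], labels[nearest])  # the neighbours should share the query's label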
Seaborn Plot for Embedding
The utility function plot_umap uses UMAP to reduce the dimensionality of the encoded data to 2D, using cosine distance as the metric. The resulting 2D embeddings are combined with the labels into a pandas DataFrame, and the function sets up a matplotlib figure with seaborn styling.
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

def plot_umap(encoded_data, labels, random_state=42):
    # Create UMAP embeddings
    mapper = umap.UMAP(random_state=random_state, metric='cosine')
    umap_embeddings = mapper.fit_transform(encoded_data)
    # Create a DataFrame for easy plotting
    df = pd.DataFrame({
        'UMAP1': umap_embeddings[:, 0],
        'UMAP2': umap_embeddings[:, 1],
        'Label': labels
    })
    # Set up the plot style
    plt.figure(figsize=(12, 10))
    sns.set_style("whitegrid")
    sns.set_palette("deep")
    # Create the scatter plot
    sns_plot = sns.scatterplot(
        data=df,
        x='UMAP1',
        y='UMAP2',
        hue='Label',
        palette='deep',
        legend='full',
        alpha=0.7
    )
    # Customize the plot
    plt.title('UMAP Projection of Encoded Data', fontsize=16)
    plt.xlabel('UMAP1', fontsize=12)
    plt.ylabel('UMAP2', fontsize=12)
    plt.legend(title='Labels', title_fontsize='13', fontsize='11',
               loc='center left', bbox_to_anchor=(1, 0.5))
    # Adjust the layout
    plt.tight_layout()
    plt.show()
plot_umap(encoded_data, labels)
Here, you can easily distinguish the digits by their clusters. Each cluster represents one of the labels from the MNIST dataset.
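One way to put a number on what the plot shows (a sketch, assuming scikit-learn is installed): if same-digit embeddings really cluster together, a simple k-NN classifier trained on the 64-dimensional embeddings should score high.
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
# Hold out 20% of the embeddings and classify them by their 5 nearest neighbours
X_tr, X_te, y_tr, y_te = train_test_split(
    encoded_data, labels, test_size=0.2, random_state=42
)
knn = KNeighborsClassifier(n_neighbors=5, metric='cosine').fit(X_tr, y_tr)
print(f'k-NN accuracy on the embeddings: {knn.score(X_te, y_te):.3f}')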
You can refer to the GitHub repository for the data and the code file - https://github.com/Hrishikesh332/Large-Multimodel-Guide/tree/main/M1%20Overview%20Multimodality%20MNIST
If you're looking for the MNIST train and test datasets, you can refer to - https://www.kaggle.com/competitions/digit-recognizer
Research Papers:
A Simple Framework for Contrastive Learning of Visual Representations: https://arxiv.org/pdf/2002.05709v3
Multimodal Contrastive Training for Visual Representation Learning: https://openaccess.thecvf.com/content/CVPR2021/html/Yuan_Multimodal_Contrastive_Training_for_Visual_Representation_Learning_CVPR_2021_paper.html
Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts: https://proceedings.neurips.cc/paper_files/paper/2022/hash/3e67e84abf900bb2c7cbd5759bfce62d-Abstract-Conference.html
Understanding Multimodal Contrastive Learning and Incorporating Unpaired Data: https://proceedings.mlr.press/v206/nakada23a.html
Identifiability Results for Multimodal Contrastive Learning: https://arxiv.org/abs/2303.09166
Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning: https://proceedings.neurips.cc/paper_files/paper/2022/hash/702f4db7543a7432431df588d57bc7c9-Abstract-Conference.html