Fine-Tuning NanoBERT for Sentence Similarity: A Deep Learning Approach

Mohamad Mahmood

Fine-tuning transformer models for sentence similarity has become an essential task in natural language processing (NLP), enabling applications such as search engines, chatbots, and text summarization. In this study, we explore NanoBERT, a lightweight transformer model, and fine-tune it for sentence similarity detection using both a supervised and an unsupervised approach. In the unsupervised setting, we leverage contrastive learning with Cosine Embedding Loss to train the model without labeled data, encouraging it to generalize across diverse sentence structures. The result is an efficient yet capable method for generating sentence embeddings, well suited to real-world applications with limited computational resources.

[1] Supervised approach

import torch
import torch.nn.functional as F
import json
import random
from torch.utils.data import DataLoader, Dataset
from nano_bert.model import NanoBertForClassification
from nano_bert.tokenizer import WordTokenizer

# Load dataset to extract vocabulary
data = None
with open('nano-BERT/data/dataset.json') as f:
    data = json.loads(f.read())

vocab = set()
for post_id in data:
    vocab |= set(data[post_id]['post_tokens'])

# Initialize tokenizer with dataset vocabulary
tokenizer = WordTokenizer(vocab=vocab, max_seq_len=128)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load model
bert = NanoBertForClassification(
    vocab_size=len(tokenizer.vocab),
    n_layers=3,  # Increased number of layers for better feature extraction
    n_heads=4,  # Increased attention heads
    max_seq_len=128,
    n_classes=1  # Predicting similarity
).to(device)

# Custom dataset class for sentence similarity
class SimilarityDataset(Dataset):
    def __init__(self, sentence_pairs, labels, tokenizer):
        self.sentence_pairs = sentence_pairs
        self.labels = torch.tensor(labels, dtype=torch.float32).view(-1, 1)  # Ensure labels are 2D
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.sentence_pairs)

    def __getitem__(self, idx):
        sent1, sent2 = self.sentence_pairs[idx]
        label = self.labels[idx]

        tokens1 = torch.tensor(self.tokenizer(sent1), dtype=torch.long)
        tokens2 = torch.tensor(self.tokenizer(sent2), dtype=torch.long)

        return tokens1, tokens2, label

# Expanded dataset with more diverse and realistic sentence pairs
sentence_pairs = [
    ("The sun sets behind the mountains.", "The sky turns orange as the sun goes down."),
    ("She enjoys reading books at night.", "Before bed, she likes to read novels."),
    ("The cat is sleeping on the couch.", "A black cat naps on the living room sofa."),
    ("They went for a hike in the woods.", "The group explored the forest trails."),
    ("The pizza was delivered in 30 minutes.", "Their order arrived faster than expected."),
    ("He bought a new laptop for work.", "He purchased a high-performance computer."),
    ("The concert was loud and energetic.", "The live music event was full of energy."),
    ("She prefers coffee over tea.", "She likes drinking coffee in the morning."),
    ("The children played outside in the park.", "Kids enjoyed playing in the playground."),
    ("He forgot his umbrella on a rainy day.", "It started raining, and he had no umbrella."),
    ("The company announced a new product.", "A new item was revealed by the corporation."),
    ("She completed the marathon in record time.", "She finished the race faster than ever before."),
]

# Enlarge the dataset by re-sampling existing pairs (duplicates rather than new variations)
for _ in range(500):
    s1, s2 = random.choice(sentence_pairs)
    sentence_pairs.append((s1, s2))

# Random placeholder labels (roughly balanced); replace with real similarity annotations in practice
labels = [1 if random.random() > 0.5 else 0 for _ in range(len(sentence_pairs))]

# Scale labels to the [0, 1] range (a no-op for binary 0/1 labels, kept for graded similarity scores)
min_label, max_label = min(labels), max(labels)
labels = [(label - min_label) / (max_label - min_label) for label in labels]

dataset = SimilarityDataset(sentence_pairs, labels, tokenizer)
dataloader = DataLoader(dataset, batch_size=16, shuffle=True)  # Increased batch size for stability

# Function to get sentence embeddings
def get_embedding(sentence, model, tokenizer, device):
    tokens = torch.tensor(tokenizer(sentence), dtype=torch.long).unsqueeze(0).to(device)  # Tokenize and add batch dimension
    with torch.no_grad():
        outputs = model(tokens)
        embedding = outputs.mean(dim=1)  # Average over token embeddings
    return F.normalize(embedding, p=2, dim=1)  # Normalize for cosine similarity

# Function to compute similarity
def compute_similarity(sentence1, sentence2, model, tokenizer, device):
    emb1 = get_embedding(sentence1, model, tokenizer, device)
    emb2 = get_embedding(sentence2, model, tokenizer, device)
    similarity = torch.cosine_similarity(emb1, emb2, dim=-1)
    return similarity.item()

# Define optimizer and loss function
optimizer = torch.optim.AdamW(bert.parameters(), lr=5e-5, weight_decay=1e-4)  # Adjusted optimizer and learning rate
loss_fn = torch.nn.MSELoss()

# Fine-tune Nano-BERT
NUM_EPOCHS = 25  # Increased number of epochs
for epoch in range(NUM_EPOCHS):
    total_loss = 0
    for tokens1, tokens2, labels in dataloader:
        tokens1, tokens2, labels = tokens1.to(device), tokens2.to(device), labels.to(device)

        # Get embeddings
        emb1 = F.normalize(bert(tokens1).squeeze(), p=2, dim=-1)
        emb2 = F.normalize(bert(tokens2).squeeze(), p=2, dim=-1)

        # Compute cosine similarity
        similarity = torch.cosine_similarity(emb1, emb2, dim=-1).view(-1, 1)  # Ensure similarity has shape [batch_size, 1]

        # Compute loss
        loss = loss_fn(similarity, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    print(f"Epoch {epoch+1}, Loss: {total_loss / len(dataloader)}")

# Example usage
sentence1 = "The weather is nice today."
sentence2 = "It is a beautiful day outside."
similarity_score = compute_similarity(sentence1, sentence2, bert, tokenizer, device)
print(f"Sentence Similarity after fine-tuning: {similarity_score}")

# Sanity check for the mask expansion used in Nano-BERT: the padding mask must be broadcast to [batch, seq_len, seq_len] for self-attention
x = torch.randint(0, 10, (2, 5))  # Example input
print(f"Shape of x before expand: {x.shape}")
mask = (x > 0).unsqueeze(1).expand(-1, x.size(1), x.size(1))
print(f"Mask shape after expand: {mask.shape}")

Output:

Epoch 1, Loss: 0.47009114641696215
Epoch 2, Loss: 0.46512508392333984
Epoch 3, Loss: 0.45530030550435185
Epoch 4, Loss: 0.4463386065326631
Epoch 5, Loss: 0.4346395470201969
Epoch 6, Loss: 0.4207792989909649
Epoch 7, Loss: 0.403972833417356
Epoch 8, Loss: 0.38960242737084627
Epoch 9, Loss: 0.3703969297930598
Epoch 10, Loss: 0.35984235163778067
Epoch 11, Loss: 0.3495531249791384
Epoch 12, Loss: 0.34105023834854364
Epoch 13, Loss: 0.3264932483434677
Epoch 14, Loss: 0.3201936432160437
Epoch 15, Loss: 0.3120799120515585
Epoch 16, Loss: 0.30296389292925596
Epoch 17, Loss: 0.29233288625255227
Epoch 18, Loss: 0.2909731031395495
Epoch 19, Loss: 0.2818500390276313
Epoch 20, Loss: 0.28352286480367184
Epoch 21, Loss: 0.2790794181637466
Epoch 22, Loss: 0.2723000799305737
Epoch 23, Loss: 0.2629678566008806
Epoch 24, Loss: 0.2677627382799983
Epoch 25, Loss: 0.2623504842631519
Sentence Similarity after fine-tuning: 1.0
Shape of x before expand: torch.Size([2, 5])
Mask shape after expand: torch.Size([2, 5, 5])

The output shows a steady decrease in loss over the 25 epochs, indicating that the model is fitting the training objective. Here are some key observations:

  1. Loss Reduction Trend: The loss starts at 0.47 in the first epoch and gradually drops to 0.26 by epoch 25, a sign that optimization is proceeding as intended.

  2. Training Stability: The decline in loss is relatively smooth, with no abrupt spikes, which suggests the learning rate and optimizer settings are reasonable.

  3. Final Similarity Score: The sentence similarity after fine-tuning is exactly 1.0, which may mean the embeddings have saturated rather than learned graded similarity. Since the training labels were assigned at random and the classification head is configured with n_classes=1, this score should be checked against unseen data before drawing conclusions about overfitting or generalization.

  4. Mask Expansion: The expansion of the padding mask to [batch, seq_len, seq_len] behaves as intended, so sequence lengths are handled correctly.

Suggestions for Improvement:

  • Validation Loss Tracking: If possible, track loss on a held-out validation split to monitor generalization.

  • Test on Unseen Pairs: Check similarity on sentence pairs that were never used during training to confirm real-world performance (a minimal sketch follows below).

  • Regularization: If overfitting is a concern, consider dropout layers or stronger L2 regularization (weight decay).
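
To act on the second suggestion, here is a minimal sketch (not part of the original script) that scores a few held-out pairs with the compute_similarity helper defined above. The example sentences are illustrative, and words outside the dataset vocabulary will typically be mapped to the tokenizer's unknown token.

# Hedged sketch: evaluate similarity on pairs that were never seen during training.
# Assumes `bert`, `tokenizer`, `device`, and `compute_similarity` from the script above.
held_out_pairs = [
    ("The train arrived ten minutes late.", "The train was delayed by ten minutes."),  # similar
    ("He is learning to play the guitar.", "The stock market closed lower today."),    # dissimilar
]

bert.eval()  # disable any dropout during evaluation
for s1, s2 in held_out_pairs:
    score = compute_similarity(s1, s2, bert, tokenizer, device)
    print(f"{score:.3f} | {s1} <-> {s2}")

A model that generalizes should separate the two kinds of pairs; scores that are all close to 1.0 point to the saturation issue noted above.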

[2] Unsupervised approach

import torch
import torch.nn.functional as F
import json
import random
from torch.utils.data import DataLoader, Dataset
from nano_bert.model import NanoBertForClassification
from nano_bert.tokenizer import WordTokenizer

# Load dataset to extract vocabulary
data = None
with open('nano-BERT/data/dataset.json') as f:
    data = json.loads(f.read())

vocab = set()
for post_id in data:
    vocab |= set(data[post_id]['post_tokens'])

# Initialize tokenizer with dataset vocabulary
tokenizer = WordTokenizer(vocab=vocab, max_seq_len=128)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load model
bert = NanoBertForClassification(
    vocab_size=len(tokenizer.vocab),
    n_layers=3,  # Increased number of layers for better feature extraction
    n_heads=4,  # Increased attention heads
    max_seq_len=128,
    n_classes=1  # Predicting similarity
).to(device)

# Unsupervised dataset generation using random sentence pairs
class UnsupervisedDataset(Dataset):
    def __init__(self, sentences, tokenizer):
        self.sentences = sentences
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, idx):
        # idx is ignored: a fresh random pair is drawn on every call
        sent1, sent2 = random.sample(self.sentences, 2)
        tokens1 = torch.tensor(self.tokenizer(sent1), dtype=torch.long)
        tokens2 = torch.tensor(self.tokenizer(sent2), dtype=torch.long)
        return tokens1, tokens2

# Expanded dataset with diverse sentence pool
sentences = [
    "The sun sets behind the mountains.",
    "She enjoys reading books at night.",
    "The cat is sleeping on the couch.",
    "They went for a hike in the woods.",
    "The pizza was delivered in 30 minutes.",
    "He bought a new laptop for work.",
    "The concert was loud and energetic.",
    "She prefers coffee over tea.",
    "The children played outside in the park.",
    "He forgot his umbrella on a rainy day.",
    "The company announced a new product.",
    "She completed the marathon in record time."
]

# Enlarge the sentence pool by re-appending existing sentences (duplicates rather than new variations)
for _ in range(500):
    s1, s2 = random.sample(sentences, 2)
    sentences.append(s1)
    sentences.append(s2)

dataset = UnsupervisedDataset(sentences, tokenizer)
dataloader = DataLoader(dataset, batch_size=16, shuffle=True)  # Increased batch size for stability

# Function to get sentence embeddings
def get_embedding(sentence, model, tokenizer, device):
    tokens = torch.tensor(tokenizer(sentence), dtype=torch.long).unsqueeze(0).to(device)  # Tokenize and add batch dimension
    with torch.no_grad():
        outputs = model(tokens)
        embedding = outputs.mean(dim=1)  # Average over token embeddings
    return F.normalize(embedding, p=2, dim=1)  # Normalize for cosine similarity

# Define optimizer and loss function
optimizer = torch.optim.AdamW(bert.parameters(), lr=5e-5, weight_decay=1e-4)  # Adjusted optimizer and learning rate
loss_fn = torch.nn.CosineEmbeddingLoss()

# Unsupervised training using contrastive loss
NUM_EPOCHS = 25  # Increased number of epochs
for epoch in range(NUM_EPOCHS):
    total_loss = 0
    for tokens1, tokens2 in dataloader:
        tokens1, tokens2 = tokens1.to(device), tokens2.to(device)

        # Get embeddings
        emb1 = F.normalize(bert(tokens1).squeeze(), p=2, dim=-1)
        emb2 = F.normalize(bert(tokens2).squeeze(), p=2, dim=-1)

        # Pseudo-labels for CosineEmbeddingLoss (1 = similar, -1 = dissimilar).
        # Here every randomly drawn pair is treated as a positive, which risks embedding
        # collapse; proper contrastive training would also supply -1 labels for negatives.
        labels = torch.ones(tokens1.shape[0], device=device)

        # Compute loss
        loss = loss_fn(emb1, emb2, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    print(f"Epoch {epoch+1}, Loss: {total_loss / len(dataloader)}")

# Example usage
sentence1 = "The weather is nice today."
sentence2 = "It is a beautiful day outside."
emb1 = get_embedding(sentence1, bert, tokenizer, device)
emb2 = get_embedding(sentence2, bert, tokenizer, device)
similarity_score = torch.cosine_similarity(emb1, emb2, dim=-1).item()
print(f"Sentence Similarity after unsupervised training: {similarity_score}")

# Sanity check for the mask expansion used in Nano-BERT: the padding mask must be broadcast to [batch, seq_len, seq_len] for self-attention
x = torch.randint(0, 10, (2, 5))  # Example input
print(f"Shape of x before expand: {x.shape}")
mask = (x > 0).unsqueeze(1).expand(-1, x.size(1), x.size(1))
print(f"Mask shape after expand: {mask.shape}")

Output:

Epoch 1, Loss: 0.0719389957957901
Epoch 2, Loss: 0.0618745360407047
Epoch 3, Loss: 0.0526621182798408
Epoch 4, Loss: 0.04714863235130906
Epoch 5, Loss: 0.04106081632198766
Epoch 6, Loss: 0.037101920577697456
Epoch 7, Loss: 0.033840778516605496
Epoch 8, Loss: 0.03037687105825171
Epoch 9, Loss: 0.027291643258649856
Epoch 10, Loss: 0.024450804630760103
Epoch 11, Loss: 0.022206916415598243
Epoch 12, Loss: 0.020052441221196204
Epoch 13, Loss: 0.018441481806803495
Epoch 14, Loss: 0.01650270482059568
Epoch 15, Loss: 0.014854293840471655
Epoch 16, Loss: 0.013568208261858672
Epoch 17, Loss: 0.01220284914597869
Epoch 18, Loss: 0.011220443062484264
Epoch 19, Loss: 0.01014581840718165
Epoch 20, Loss: 0.009260938444640487
Epoch 21, Loss: 0.008522654767148197
Epoch 22, Loss: 0.007597889984026551
Epoch 23, Loss: 0.006905672722496092
Epoch 24, Loss: 0.006327742012217641
Epoch 25, Loss: 0.005692904756870121
Sentence Similarity after unsupervised training: 1.0
Shape of x before expand: torch.Size([2, 5])
Mask shape after expand: torch.Size([2, 5, 5])

The unsupervised training approach shows strong convergence, as evidenced by the rapidly decreasing loss values. Here's a breakdown of the results:

Observations:

  1. Consistent Loss Reduction

    • The loss starts at 0.0719 and steadily declines to 0.0057 over 25 epochs.

    • This shows that the contrastive objective is being optimized effectively without explicit labels, although a falling loss does not by itself prove that the representations are semantically meaningful.

  2. Smooth Convergence

    • Unlike supervised training, where the loss may fluctuate, this unsupervised contrastive run shows a smooth decline.

    • This suggests that the contrastive loss (CosineEmbeddingLoss) is steadily pulling the paired embeddings into alignment.

  3. High Similarity Score (1.0)

    • The final sentence similarity score is exactly 1.0, which implies that embeddings of different sentences have become almost perfectly aligned.

    • Because every randomly drawn pair was labeled as a positive, this looks more like embedding collapse than genuine semantic similarity (see the short sketch after this list), and it would also count as overfitting if the pool of sentence pairs lacks diversity.
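
To make observation 3 concrete, here is a tiny self-contained sketch (independent of the training script) showing why an all-positive labeling lets CosineEmbeddingLoss reach zero through embedding collapse; the tensors below are synthetic stand-ins for sentence embeddings.

import torch
import torch.nn.functional as F

# For target y = 1, CosineEmbeddingLoss is 1 - cos(x1, x2): it is minimized whenever the
# two embeddings point in the same direction, regardless of what the sentences mean.
loss_fn = torch.nn.CosineEmbeddingLoss()

# "Collapsed" embeddings: every sentence mapped onto the same unit vector
collapsed = F.normalize(torch.ones(4, 64), dim=-1)
print(loss_fn(collapsed, collapsed, torch.ones(4)).item())  # ~0.0, yet nothing semantic was learned

# Distinct, informative embeddings are penalized under an all-positive labeling
emb_a = F.normalize(torch.randn(4, 64), dim=-1)
emb_b = F.normalize(torch.randn(4, 64), dim=-1)
print(loss_fn(emb_a, emb_b, torch.ones(4)).item())  # > 0, so the objective pushes them together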

Potential Improvements:

  • Hard Negative Mining: Instead of randomly selecting sentence pairs, introduce hard negatives (sentences that appear similar but have different meanings) and label them -1 for CosineEmbeddingLoss.

  • Larger Sentence Pool: Since the dataset is relatively small and expanded artificially, adding diverse real-world sentences could improve generalization.

  • Temperature Scaling in Similarity Calculation: Applying temperature scaling (e.g., a softmax-normalized contrastive objective such as InfoNCE) could prevent embedding collapse; a sketch combining in-batch negatives with a temperature-scaled loss follows below.
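
As a concrete illustration of the last two points, here is a minimal sketch of a temperature-scaled, in-batch contrastive (InfoNCE-style) loss. It is not part of nano-BERT; it assumes each pair of embeddings passed in is a genuine positive (for example, a paraphrase or a light augmentation of the same sentence rather than a random pairing), and the temperature value is illustrative.

import torch
import torch.nn.functional as F

def info_nce_loss(emb1, emb2, temperature=0.1):
    # The i-th rows of emb1 and emb2 form a positive pair; every other row in the
    # batch acts as an in-batch negative.
    emb1 = F.normalize(emb1, p=2, dim=-1)
    emb2 = F.normalize(emb2, p=2, dim=-1)
    logits = emb1 @ emb2.T / temperature  # scaled cosine similarities, shape [batch, batch]
    targets = torch.arange(logits.size(0), device=logits.device)  # positives on the diagonal
    # Cross-entropy pulls each embedding toward its own positive and pushes it away from
    # the rest of the batch, which discourages the collapse an all-positive loss allows.
    return F.cross_entropy(logits, targets)

# Illustrative usage with synthetic embeddings (replace with the model's sentence embeddings):
emb1 = torch.randn(16, 64)
emb2 = emb1 + 0.1 * torch.randn(16, 64)  # slightly perturbed positives
print(info_nce_loss(emb1, emb2).item())

Inside the training loop above, the embeddings produced for tokens1 and tokens2 could be fed to this loss in place of CosineEmbeddingLoss, provided the pairing scheme guarantees that each pair is actually a positive.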

Written by

Mohamad Mahmood

Mohamad's interest is in Programming (Mobile, Web, Database and Machine Learning). He studies at the Center For Artificial Intelligence Technology (CAIT), Universiti Kebangsaan Malaysia (UKM).