LECTURE-2: Implementing Word2Vec with Negative Sampling from Scratch


Natural Language Processing (NLP) has taken tremendous strides over the last decade, and at the heart of many modern NLP techniques lies the idea of word embeddings: vector representations of words that capture their semantic and syntactic meanings.
One of the earliest breakthroughs in this field came from Mikolov et al. in their seminal 2013 paper:
Efficient Estimation of Word Representations in Vector Space
Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean
This paper introduced Word2Vec, a family of models that learns embeddings by predicting words from their context, or vice versa. A key optimization, introduced in the follow-up paper "Distributed Representations of Words and Phrases and their Compositionality" (Mikolov et al., 2013), is Negative Sampling, which allows the model to scale to massive corpora by simplifying the training objective.
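Concretely, for a center word c with input vector v_c and an observed context word o with output vector u_o, negative sampling replaces the full softmax with a small binary-classification objective: the true pair is scored against k "noise" words drawn from a noise distribution P_n(w) (the unigram distribution raised to the 3/4 power). The per-pair objective to maximize is

$$ \log \sigma\left(u_o^{\top} v_c\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\left[\log \sigma\left(-u_{w_i}^{\top} v_c\right)\right] $$

The PyTorch model we build below implements exactly this quantity and minimizes its negative.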
What is Word2Vec?
Word2Vec is a shallow neural network that converts words into dense vectors based on the context in which they appear. It comes in two main architectures:
CBOW (Continuous Bag of Words): Predicts a word from its surrounding context.
Skip-Gram: Predicts surrounding context from a given word.
In this post, we'll build Skip-Gram with Negative Sampling from scratch, using only PyTorch and NumPy on a toy dataset. This minimal example is perfect for learning the math and code behind one of NLP's most famous algorithms.
1. Importing Dependencies
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt
import random  # used when drawing negative samples
import time    # used to time each epoch
We use:
NumPy for data manipulation
PyTorch for neural network modeling
Matplotlib for visualizing the training loss and the learned embeddings
2. Creating a Simple Corpus
corpus = ["apple banana fruit", "banana apple fruit", "banana fruit apple",
          "dog cat animal", "cat animal dog", "cat dog animal"]
This is a toy dataset with two semantic groups:
Fruit-related: apple, banana, fruit
Animal-related: dog, cat, animal
3. Tokenizing the Corpus
corpus = [sent.split(" ") for sent in corpus]
Each sentence is split into individual words, converting the corpus into a list of lists.
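The later steps use word2index, word_counts, vocab, and voc_size without showing how they are built; here is a minimal sketch of that missing step, assuming a simple frequency count over the tokenized corpus:

from collections import Counter

# Assumed vocabulary-building step (not shown in the original post)
flatten = [word for sentence in corpus for word in sentence]
word_counts = Counter(flatten)                     # word -> frequency, used later for negative sampling
vocab = list(word_counts.keys())                   # unique words
word2index = {w: i for i, w in enumerate(vocab)}   # word -> integer id
voc_size = len(vocab)                              # 6 words in this toy corpus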
4. Preparing Training Data (Skip-Gram)
window_size = 1
training_data = []
for sentence in corpus:
    for center_pos in range(len(sentence)):
        center_word = sentence[center_pos]
        for w in range(-window_size, window_size + 1):
            context_pos = center_pos + w
            # skip positions outside the sentence and the center word itself
            if context_pos < 0 or context_pos >= len(sentence) or context_pos == center_pos:
                continue
            context_word = sentence[context_pos]
            training_data.append((word2index[center_word], word2index[context_word]))
Generates (center, context) word pairs using a context window of size 1.
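For the first sentence, "apple banana fruit", this yields the index pairs for (apple, banana), (banana, apple), (banana, fruit), and (fruit, banana). The code below also calls a random_batch helper that the post never defines; a minimal sketch, assuming it samples index pairs uniformly from training_data and wraps each index in a list so the resulting tensors have shape [batch_size, 1]:

def random_batch(batch_size, corpus):
    # Hypothetical helper (not shown in the post): sample (center, context) pairs.
    # The corpus argument is unused here; it is kept only to match the calls below.
    chosen = np.random.choice(len(training_data), batch_size, replace=False)
    inputs = [[training_data[i][0]] for i in chosen]    # center word ids, shape [batch_size, 1]
    targets = [[training_data[i][1]] for i in chosen]   # context word ids, shape [batch_size, 1]
    return np.array(inputs), np.array(targets)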
5. Negative Sampling
word_freqs = np.array(list(word_counts.values()), dtype=np.float32)
word_freqs = word_freqs / word_freqs.sum()   # unigram probabilities
word_freqs = word_freqs ** (3 / 4)           # raise to the 3/4 power, as in the paper
word_freqs = word_freqs / word_freqs.sum()   # renormalize
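The negative_sampling function below draws words from a unigram_table and converts them to indices with a prepare_seq helper; neither appears in the post, so here is a minimal sketch, assuming the table simply repeats each word in proportion to its adjusted frequency:

# Assumed helpers (not shown in the original post)
table_size = 1000   # hypothetical size of the sampling table
unigram_table = []
for i, word in enumerate(vocab):
    unigram_table.extend([word] * max(1, int(word_freqs[i] * table_size)))

def prepare_seq(seq, word2index):
    # Map a list of words to a LongTensor of vocabulary indices.
    return torch.LongTensor([word2index[w] for w in seq])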
def negative_sampling(targets, unigram_table, k):
    batch_size = targets.shape[0]
    neg_samples = []
    for i in range(batch_size):
        nsample = []
        target_index = targets[i].item()
        while len(nsample) < k:
            neg = random.choice(unigram_table)
            # never use the true context word as a negative sample
            if word2index[neg] == target_index:
                continue
            nsample.append(neg)
        neg_samples.append(prepare_seq(nsample, word2index))
    return torch.stack(neg_samples)
batch_size = 2
x, y = random_batch(batch_size, corpus)
x_tensor = torch.LongTensor(x)
y_tensor = torch.LongTensor(y)
Draws k negative words for each target from the unigram distribution raised to the 3/4 power (the heuristic recommended in the paper), rejecting any draw that happens to be the true context word.
6. Defining and Training the Model
class SkipgramNegSampling(nn.Module):
    def __init__(self, vocab_size, embed_size):
        super(SkipgramNegSampling, self).__init__()
        self.embedding_v = nn.Embedding(vocab_size, embed_size)  # center (input) embedding
        self.embedding_u = nn.Embedding(vocab_size, embed_size)  # context (output) embedding
        self.logsigmoid = nn.LogSigmoid()

    def forward(self, center_words, target_words, negative_words):
        center_embeds = self.embedding_v(center_words)    # [batch_size, 1, emb_size]
        target_embeds = self.embedding_u(target_words)    # [batch_size, 1, emb_size]
        neg_embeds = -self.embedding_u(negative_words)    # [batch_size, num_neg, emb_size]

        # [batch_size, 1, emb_size] @ [batch_size, emb_size, 1] -> [batch_size, 1, 1] -> [batch_size, 1]
        positive_score = target_embeds.bmm(center_embeds.transpose(1, 2)).squeeze(2)
        # [batch_size, k, emb_size] @ [batch_size, emb_size, 1] -> [batch_size, k, 1]
        negative_score = neg_embeds.bmm(center_embeds.transpose(1, 2))

        loss = self.logsigmoid(positive_score) + torch.sum(self.logsigmoid(negative_score), dim=1)
        return -torch.mean(loss)  # negate: the optimizer minimizes the negative log-likelihood

    def prediction(self, input):
        embeds = self.embedding_v(input)
        return embeds
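The model keeps two embedding matrices, embedding_v for center (input) words and embedding_u for context (output) words, and scores each (center, context) pair against num_neg sampled negatives with a log-sigmoid loss, the binary-classification view of negative sampling. As a quick sanity check (not in the original post), a single forward pass on a hypothetical dummy batch returns a scalar loss:

# Hypothetical shape check: batch_size = 2, k = 2 negatives per pair
dummy_center = torch.LongTensor([[0], [1]])            # [batch_size, 1]
dummy_context = torch.LongTensor([[1], [2]])           # [batch_size, 1]
dummy_negatives = torch.LongTensor([[3, 4], [4, 5]])   # [batch_size, k]
check_model = SkipgramNegSampling(voc_size, 2)
print(check_model(dummy_center, dummy_context, dummy_negatives))   # a single scalar loss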
Training Loop
batch_size = 2          # mini-batch size
embedding_size = 2      # 2-D embeddings so we can plot them later
model = SkipgramNegSampling(voc_size, embedding_size)
num_neg = 10            # number of negative samples per pair
optimizer = optim.Adam(model.parameters(), lr=0.001)

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs
num_epochs = 10000
for epoch in range(num_epochs):
    start = time.time()
    input_batch, target_batch = random_batch(batch_size, corpus)
    input_batch = torch.LongTensor(input_batch)
    target_batch = torch.LongTensor(target_batch)
    neg_batch = negative_sampling(target_batch, unigram_table, num_neg)
    optimizer.zero_grad()
    loss = model(input_batch, target_batch, neg_batch)
    loss.backward()
    optimizer.step()
    end = time.time()
    epoch_mins, epoch_secs = epoch_time(start, end)
    if (epoch + 1) % 1000 == 0:
        print(f"Epoch : {epoch + 1} | cost : {loss.item():.6f} | time : {epoch_mins}m {epoch_secs}s")
Trains the model for 10000 epochs and prints loss every 1000 epochs.
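The import section mentions visualizing the training loss, which the post never shows; a minimal sketch, assuming you create losses = [] before the loop and call losses.append(loss.item()) inside it:

# Hypothetical loss curve (requires collecting losses during training as described above)
losses = []   # filled inside the training loop with losses.append(loss.item())
plt.figure(figsize=(6, 3))
plt.plot(losses)
plt.xlabel("epoch")
plt.ylabel("loss")
plt.show()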
7. Plotting the Embeddings
def get_embedding(word):
    id_tensor = torch.LongTensor([word2index[word]])
    v_embed = model.embedding_v(id_tensor)
    u_embed = model.embedding_u(id_tensor)
    word_embed = (v_embed + u_embed) / 2   # average the center and context embeddings
    x, y = word_embed[0][0].item(), word_embed[0][1].item()
    return x, y
plt.figure(figsize=(6, 3))
for i, word in enumerate(vocab[:20]):   # loop over each unique vocabulary word
    x, y = get_embedding(word)
    plt.scatter(x, y)
    plt.annotate(word, xy=(x, y), xytext=(5, 2), textcoords='offset points')
plt.show()
Plots the learned 2-D embeddings; after training, the fruit words (apple, banana, fruit) and the animal words (dog, cat, animal) should form two separate clusters.
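As an extra check (not part of the original post), you can compare cosine similarities between the averaged embeddings; words from the same group should score noticeably higher than words from different groups:

import torch.nn.functional as F

def cosine(word_a, word_b):
    # Cosine similarity between the averaged (v + u) / 2 embeddings of two words.
    a = torch.tensor(get_embedding(word_a)).unsqueeze(0)
    b = torch.tensor(get_embedding(word_b)).unsqueeze(0)
    return F.cosine_similarity(a, b).item()

print(cosine("apple", "banana"))   # same group: expected to be high
print(cosine("apple", "dog"))      # different groups: expected to be lower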
Conclusion
In this post, you learned:
How Word2Vec works at a low level
Why negative sampling is efficient
How to build and train the model from scratch using PyTorch
This minimal implementation gives you complete control and understanding of the inner workings of word embeddings.
Dream.Achieve.Repeat