LECTURE-2: Implementing Word2Vec with Negative Sampling from Scratch


Natural Language Processing (NLP) has taken tremendous strides over the last decade, and at the heart of many modern NLP techniques lies the idea of word embeddings: vector representations of words that capture their semantic and syntactic meanings.
One of the earliest breakthroughs in this field came from Mikolov et al. in their seminal 2013 paper:
Efficient Estimation of Word Representations in Vector Space
Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean
This paper introduced Word2Vec, a family of models that learns embeddings by predicting words from their context, or vice versa. A key optimization, introduced in the follow-up paper "Distributed Representations of Words and Phrases and their Compositionality" (Mikolov et al., 2013), is Negative Sampling, which allows the model to scale to massive corpora by simplifying the training objective.
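Concretely, for a center word c with input vector v_c and an observed context word o with output vector u_o, negative sampling replaces the full softmax with a small binary-classification objective: the true pair is scored against k "noise" words drawn from a noise distribution P_n(w) (the unigram distribution raised to the 3/4 power). The per-pair objective to maximize is

$$ \log \sigma\left(u_o^{\top} v_c\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\left[\log \sigma\left(-u_{w_i}^{\top} v_c\right)\right] $$

The PyTorch model we build below implements exactly this quantity and minimizes its negative.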
What is Word2Vec?
Word2Vec is a shallow neural network that converts words into dense vectors based on the context in which they appear. It comes in two main architectures:
CBOW (Continuous Bag of Words): Predicts a word from its surrounding context.
Skip-Gram: Predicts surrounding context from a given word.
In this post, we'll build Skip-Gram with Negative Sampling from scratch, using only PyTorch and NumPy on a toy dataset. This minimal example is perfect for learning the math and code behind one of NLP's most famous algorithms.
1. Importing Dependencies
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt
import random  # used when drawing negative samples
import time    # used to time each epoch
We use:
NumPy for data manipulation
PyTorch for neural network modeling
Matplotlib for visualizing the training loss and the learned embeddings
2. Creating a Simple Corpus
corpus = ["apple banana fruit", "banana apple fruit", "banana fruit apple",
          "dog cat animal", "cat animal dog", "cat dog animal"]
This is a toy dataset with two semantic groups:
Fruit-related: apple, banana, fruit
Animal-related: dog, cat, animal
3. Tokenizing the Corpus
corpus = [sent.split(" ") for sent in corpus]
Each sentence is split into individual words, converting the corpus into a list of lists.
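The later steps use word2index, word_counts, vocab, and voc_size without showing how they are built; here is a minimal sketch of that missing step, assuming a simple frequency count over the tokenized corpus:

from collections import Counter

# Assumed vocabulary-building step (not shown in the original post)
flatten = [word for sentence in corpus for word in sentence]
word_counts = Counter(flatten)                     # word -> frequency, used later for negative sampling
vocab = list(word_counts.keys())                   # unique words
word2index = {w: i for i, w in enumerate(vocab)}   # word -> integer id
voc_size = len(vocab)                              # 6 words in this toy corpus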
4. Preparing Training Data (Skip-Gram)
window_size = 1
training_data = []
for sentence in corpus:
    for center_pos in range(len(sentence)):
        center_word = sentence[center_pos]
        for w in range(-window_size, window_size + 1):
            context_pos = center_pos + w
            # skip positions outside the sentence and the center word itself
            if context_pos < 0 or context_pos >= len(sentence) or context_pos == center_pos:
                continue
            context_word = sentence[context_pos]
            training_data.append((word2index[center_word], word2index[context_word]))
Generates (center, context) word pairs using a context window of size 1.
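For the first sentence, "apple banana fruit", this yields the index pairs for (apple, banana), (banana, apple), (banana, fruit), and (fruit, banana). The code below also calls a random_batch helper that the post never defines; a minimal sketch, assuming it samples index pairs uniformly from training_data and wraps each index in a list so the resulting tensors have shape [batch_size, 1]:

def random_batch(batch_size, corpus):
    # Hypothetical helper (not shown in the post): sample (center, context) pairs.
    # The corpus argument is unused here; it is kept only to match the calls below.
    chosen = np.random.choice(len(training_data), batch_size, replace=False)
    inputs = [[training_data[i][0]] for i in chosen]    # center word ids, shape [batch_size, 1]
    targets = [[training_data[i][1]] for i in chosen]   # context word ids, shape [batch_size, 1]
    return np.array(inputs), np.array(targets)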
5. Negative Sampling
word_freqs = np.array(list(word_counts.values()), dtype=np.float32)
word_freqs = word_freqs / word_freqs.sum()   # unigram probabilities
word_freqs = word_freqs ** (3 / 4)           # raise to the 3/4 power, as in the paper
word_freqs = word_freqs / word_freqs.sum()   # renormalize
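The negative_sampling function below draws words from a unigram_table and converts them to indices with a prepare_seq helper; neither appears in the post, so here is a minimal sketch, assuming the table simply repeats each word in proportion to its adjusted frequency:

# Assumed helpers (not shown in the original post)
table_size = 1000   # hypothetical size of the sampling table
unigram_table = []
for i, word in enumerate(vocab):
    unigram_table.extend([word] * max(1, int(word_freqs[i] * table_size)))

def prepare_seq(seq, word2index):
    # Map a list of words to a LongTensor of vocabulary indices.
    return torch.LongTensor([word2index[w] for w in seq])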
def negative_sampling(targets, unigram_table, k):
    batch_size = targets.shape[0]
    neg_samples = []
    for i in range(batch_size):
        nsample = []
        target_index = targets[i].item()
        while len(nsample) < k:
            neg = random.choice(unigram_table)
            # never use the true context word as a negative sample
            if word2index[neg] == target_index:
                continue
            nsample.append(neg)
        neg_samples.append(prepare_seq(nsample, word2index))
    return torch.stack(neg_samples)
batch_size = 2
x, y = random_batch(batch_size, corpus)
x_tensor = torch.LongTensor(x)
y_tensor = torch.LongTensor(y)
Draws k negative words for each target from the unigram distribution raised to the 3/4 power (the heuristic recommended in the paper), rejecting any draw that happens to be the true context word.
6. Defining and Training the Model
class SkipgramNegSampling(nn.Module):
    def __init__(self, vocab_size, embed_size):
        super(SkipgramNegSampling, self).__init__()
        self.embedding_v = nn.Embedding(vocab_size, embed_size)  # center (input) embedding
        self.embedding_u = nn.Embedding(vocab_size, embed_size)  # context (output) embedding
        self.logsigmoid = nn.LogSigmoid()

    def forward(self, center_words, target_words, negative_words):
        center_embeds = self.embedding_v(center_words)    # [batch_size, 1, emb_size]
        target_embeds = self.embedding_u(target_words)    # [batch_size, 1, emb_size]
        neg_embeds = -self.embedding_u(negative_words)    # [batch_size, num_neg, emb_size]

        # [batch_size, 1, emb_size] @ [batch_size, emb_size, 1] -> [batch_size, 1, 1] -> [batch_size, 1]
        positive_score = target_embeds.bmm(center_embeds.transpose(1, 2)).squeeze(2)
        # [batch_size, k, emb_size] @ [batch_size, emb_size, 1] -> [batch_size, k, 1]
        negative_score = neg_embeds.bmm(center_embeds.transpose(1, 2))

        loss = self.logsigmoid(positive_score) + torch.sum(self.logsigmoid(negative_score), dim=1)
        return -torch.mean(loss)  # negate: the optimizer minimizes the negative log-likelihood

    def prediction(self, input):
        embeds = self.embedding_v(input)
        return embeds
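The model keeps two embedding matrices, embedding_v for center (input) words and embedding_u for context (output) words, and scores each (center, context) pair against num_neg sampled negatives with a log-sigmoid loss, the binary-classification view of negative sampling. As a quick sanity check (not in the original post), a single forward pass on a hypothetical dummy batch returns a scalar loss:

# Hypothetical shape check: batch_size = 2, k = 2 negatives per pair
dummy_center = torch.LongTensor([[0], [1]])            # [batch_size, 1]
dummy_context = torch.LongTensor([[1], [2]])           # [batch_size, 1]
dummy_negatives = torch.LongTensor([[3, 4], [4, 5]])   # [batch_size, k]
check_model = SkipgramNegSampling(voc_size, 2)
print(check_model(dummy_center, dummy_context, dummy_negatives))   # a single scalar loss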
Training Loop
batch_size = 2          # mini-batch size
embedding_size = 2      # 2-D embeddings so we can plot them later
model = SkipgramNegSampling(voc_size, embedding_size)
num_neg = 10            # number of negative samples per pair
optimizer = optim.Adam(model.parameters(), lr=0.001)

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs
num_epochs = 10000
for epoch in range(num_epochs):
    start = time.time()
    input_batch, target_batch = random_batch(batch_size, corpus)
    input_batch = torch.LongTensor(input_batch)
    target_batch = torch.LongTensor(target_batch)
    neg_batch = negative_sampling(target_batch, unigram_table, num_neg)
    optimizer.zero_grad()
    loss = model(input_batch, target_batch, neg_batch)
    loss.backward()
    optimizer.step()
    end = time.time()
    epoch_mins, epoch_secs = epoch_time(start, end)
    if (epoch + 1) % 1000 == 0:
        print(f"Epoch : {epoch + 1} | cost : {loss.item():.6f} | time : {epoch_mins}m {epoch_secs}s")
Trains the model for 10000 epochs and prints loss every 1000 epochs.
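The import section mentions visualizing the training loss, which the post never shows; a minimal sketch, assuming you create losses = [] before the loop and call losses.append(loss.item()) inside it:

# Hypothetical loss curve (requires collecting losses during training as described above)
losses = []   # filled inside the training loop with losses.append(loss.item())
plt.figure(figsize=(6, 3))
plt.plot(losses)
plt.xlabel("epoch")
plt.ylabel("loss")
plt.show()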
7. Plotting the Embeddings
def get_embedding(word):
    id_tensor = torch.LongTensor([word2index[word]])
    v_embed = model.embedding_v(id_tensor)
    u_embed = model.embedding_u(id_tensor)
    word_embed = (v_embed + u_embed) / 2   # average the center and context embeddings
    x, y = word_embed[0][0].item(), word_embed[0][1].item()
    return x, y
plt.figure(figsize=(6, 3))
for i, word in enumerate(vocab[:20]):   # loop over each unique vocabulary word
    x, y = get_embedding(word)
    plt.scatter(x, y)
    plt.annotate(word, xy=(x, y), xytext=(5, 2), textcoords='offset points')
plt.show()
Plots the learned 2-D embeddings; after training, the fruit words (apple, banana, fruit) and the animal words (dog, cat, animal) should form two separate clusters.
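As an extra check (not part of the original post), you can compare cosine similarities between the averaged embeddings; words from the same group should score noticeably higher than words from different groups:

import torch.nn.functional as F

def cosine(word_a, word_b):
    # Cosine similarity between the averaged (v + u) / 2 embeddings of two words.
    a = torch.tensor(get_embedding(word_a)).unsqueeze(0)
    b = torch.tensor(get_embedding(word_b)).unsqueeze(0)
    return F.cosine_similarity(a, b).item()

print(cosine("apple", "banana"))   # same group: expected to be high
print(cosine("apple", "dog"))      # different groups: expected to be lower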
Conclusion
In this post, you learned:
How Word2Vec works at a low level
Why negative sampling is efficient
How to build and train the model from scratch using PyTorch
This minimal implementation gives you complete control and understanding of the inner workings of word embeddings.
Dream.Achieve.Repeat