Creating BERT Static Word Embedding Model

Mohamad Mahmood

Note: BERT is designed to produce contextual embeddings, so deriving static embeddings from it defeats its purpose. Treat this as an exercise in the mechanics.

[1] Install Required Libraries

Ensure that the necessary libraries are installed, i.e. torch, gensim, numpy, and transformers (Hugging Face's library).

!pip install torch gensim numpy transformers
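A quick way to confirm the installation is to print each library's version:

import torch, gensim, numpy, transformers
print(torch.__version__, gensim.__version__, numpy.__version__, transformers.__version__)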

[2] Load the Pretrained BERT Model

Load the pretrained bert-base-uncased model and its tokenizer from Hugging Face using the from_pretrained method.

from transformers import BertTokenizer, BertModel

model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)
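Optionally, put the model in evaluation mode (disabling dropout) and confirm its hidden size, which becomes the embedding dimensionality used in step [4]:

model.eval()  # disable dropout for deterministic outputs
print(model.config.hidden_size)  # 768 for bert-base-uncased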

[3] Extract Static Word Embeddings

Extract the static word embeddings from the BERT model.

Iterate over the list of words and pass each word through the BERT tokenizer.

Then, use the BERT model to obtain the corresponding word embeddings.

import numpy as np
import torch

words = ['man', 'queen', 'london', 'woman', 'washington', 'king']

word_vectors = {}
for word in words:
    # Tokenize the word (adds the [CLS] and [SEP] special tokens)
    encoded_input = tokenizer.encode_plus(word, add_special_tokens=True, return_tensors='pt')
    with torch.no_grad():
        # Last hidden state, shape: (1, sequence_length, 768)
        embeddings = model(**encoded_input)[0].numpy()
    # Average over the sequence dimension to get one fixed-size vector
    word_vectors[word] = np.mean(embeddings, axis=1).squeeze()

The encode_plus method tokenizes the word and adds the special tokens ([CLS] and [SEP]) required by BERT.

The resulting token embeddings are averaged along the sequence dimension (special tokens included) to obtain a fixed-size 768-dimensional vector for each word.
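Because the average above includes the [CLS] and [SEP] vectors, you may prefer to average only over the word's own subword tokens. A minimal variant, using the same tokenizer and model, could look like this:

def word_vector_no_special(word):
    # Tokenize without special tokens so only the word's subwords remain
    encoded = tokenizer(word, add_special_tokens=False, return_tensors='pt')
    with torch.no_grad():
        hidden = model(**encoded).last_hidden_state  # (1, num_subwords, 768)
    return hidden.mean(dim=1).squeeze(0).numpy()

word_vectors_ns = {w: word_vector_no_special(w) for w in words}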

[4] Save Word Embeddings in Gensim's KeyedVectors Format

Save the word embeddings in the KeyedVectors format using Gensim.

This format allows the embeddings to be loaded and used efficiently.

from gensim.models import KeyedVectors
embedding_size=768
wv = KeyedVectors(vector_size=embedding_size)
wv.add_vectors(list(word_vectors.keys()), list(word_vectors.values()))
wv.save_word2vec_format('bert_static_embeddings_768.bin', binary=True)

In the above code, embedding_size is the dimensionality of the BERT word embeddings (768 for bert-base-uncased). The add_vectors method adds the word and vector pairs to the KeyedVectors model, and save_word2vec_format saves the embeddings in binary word2vec format.
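Rather than hard-coding 768, the dimensionality can also be read from the model configuration, and stored vectors can be looked up by key:

embedding_size = model.config.hidden_size  # 768 for bert-base-uncased
print(wv['king'].shape)  # (768,)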

Test wv:

# test wv
wv.most_similar('king')
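most_similar returns (word, cosine similarity) pairs ranked over the stored vocabulary; pairwise similarities are also available:

# Cosine similarity between two stored vectors
print(wv.similarity('king', 'queen'))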

[5] Load Word Embeddings in Gensim's KeyedVectors Format

wv_from_bin = KeyedVectors.load_word2vec_format("bert_static_embeddings_768.bin", binary=True)
wv_from_bin  # displays the loaded KeyedVectors object

Test wv_from_bin:

# test wv_from_bin
wv_from_bin.most_similar('king')
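As a round-trip check, the reloaded vectors should match the ones computed in step [3]:

import numpy as np

for word in words:
    assert np.allclose(wv[word], wv_from_bin[word], atol=1e-6)
print('round-trip OK')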

Colab Notebook:

https://colab.research.google.com/drive/1E62blnJpJrgwEOW97RFZqbv2lPfqIw4W?usp=sharing
