Embeddings in NLP: From One-Hot to Transformers

Tanayendu Bari

Introduction: What Are Embeddings?

In natural language processing (NLP) and deep learning, embeddings refer to a technique used to transform discrete, symbolic data, such as words, characters, or even entire sentences, into continuous, dense vectors. These vectors typically have a few hundred dimensions, far fewer than one dimension per vocabulary word. This process is essential because most machine learning models cannot directly work with raw text; they require numerical input.

[Figure: A plot of word embeddings generated by Word2Vec]

Embeddings provide a way to represent these symbolic inputs as numerical vectors that capture semantic meaning and relationships. For instance, in a well-trained embedding space, similar words (like king and queen, or Paris and France) tend to be located near each other, reflecting their related meanings or usage contexts. This allows models to not only understand the input in numerical form but also to leverage linguistic patterns and contextual similarities during learning.
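
To make the idea of "nearby" vectors concrete, here is a minimal sketch of how similarity between word vectors is usually measured with cosine similarity. The 3-dimensional vectors below are made up purely for illustration; real embeddings are learned from data and typically have hundreds of dimensions.

import numpy as np

# Toy, hand-crafted 3-d vectors purely for illustration; real embeddings
# are learned from data and usually have hundreds of dimensions.
vectors = {
    "king":  np.array([0.80, 0.65, 0.10]),
    "queen": np.array([0.78, 0.70, 0.12]),
    "apple": np.array([0.05, 0.10, 0.95]),
}

def cosine_similarity(a, b):
    # Cosine similarity: close to 1.0 means the vectors point the same way.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(vectors["king"], vectors["queen"]))  # high (~1.0)
print(cosine_similarity(vectors["king"], vectors["apple"]))  # much lower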

In essence, embeddings are the bridge between human language and mathematical models—converting text into a format that algorithms can understand and learn from.

Why Do We Need Embeddings in NLP?

At its core, a computer doesn’t understand language the way humans do. It sees everything as numbers—binary digits, to be precise. When we work with natural language, we deal with words, sentences, and meaning. But machines can’t process raw text directly, because models and algorithms operate mathematically and require numerical input.

For example, the word “apple” may mean a fruit or a tech company depending on the context—but to a computer, it's just a string of characters. If we try to feed that string directly into a model, it won’t know what to do with it.

That’s where embeddings come in.

Embeddings convert discrete tokens (like words or subwords) into continuous numerical vectors. These vectors are not random—they're learned in such a way that they capture important information about the meaning, usage, and context of each word. Words that appear in similar contexts (like king and queen) end up with similar vector representations.

This transformation is crucial for several reasons:

  • Numerical Input for Models: Machine learning models like neural networks require numerical input. Embeddings enable that transformation.

  • Semantic Understanding: Unlike simple encodings like one-hot vectors, embeddings preserve relationships between words (e.g., man to king is similar to woman to queen; see the sketch after this list).

  • Efficiency: Dense embeddings reduce the dimensionality of input compared to sparse representations, making learning faster and more memory-efficient.

  • Generalization: With well-learned embeddings, models can generalize better, understanding even unseen words based on their proximity to known ones in vector space.
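
As a rough illustration of the analogy point above, the classic "king - man + woman ≈ queen" relationship can be checked with simple vector arithmetic. The tiny vectors below are made up for this sketch; with real pretrained embeddings (e.g., Word2Vec or GloVe) the same arithmetic works on vectors with hundreds of dimensions.

import numpy as np

# Tiny illustrative vectors (not real embeddings), chosen so that the
# gender and royalty directions are consistent across the word pairs.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.9, 0.1, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "queen": np.array([0.1, 0.8, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# king - man + woman should land near "queen" in the vector space.
target = emb["king"] - emb["man"] + emb["woman"]
closest = max(emb, key=lambda w: cosine(emb[w], target))
print(closest)  # "queen" with these toy vectors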

In summary, embeddings act as the foundation for modern NLP, enabling machines to process, learn from, and generate human language with much greater fluency and accuracy.

Real-World Applications of Embeddings

Embeddings aren't just a theoretical concept—they're at the heart of many real-world NLP applications that we use every day. By converting text into meaningful vectors, embeddings help machines understand and work with language more effectively. Here are some key use cases:


1. Semantic Search

Traditional keyword-based search engines often fail when users phrase queries differently than the stored content. With embeddings, search systems can go beyond exact keyword matching and focus on meaning.

Example: Searching “How to fix a leaking tap” can return articles with titles like “Plumbing guide for faucet repair”, because embeddings recognize the semantic similarity.
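
A minimal sketch of how this might look in practice, assuming the sentence-transformers package and the pretrained all-MiniLM-L6-v2 model are available (the model choice and document texts here are only for illustration):

from sentence_transformers import SentenceTransformer, util

# Load a small pretrained sentence-embedding model (assumed to be installed).
model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Plumbing guide for faucet repair",
    "Best laptops for students in 2024",
    "How to train a puppy to sit",
]
query = "How to fix a leaking tap"

# Encode the query and documents into dense vectors.
doc_embeddings = model.encode(documents, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Rank documents by cosine similarity to the query.
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
best = scores.argmax().item()
print(documents[best])  # expected: the faucet-repair article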


2. Machine Translation

Embeddings are crucial in neural machine translation systems like Google Translate. They help represent words and phrases from different languages in a shared vector space, enabling the model to translate meaningfully instead of word-for-word.

Example: The English word “cat” and the Spanish word “gato” can have very similar embeddings in a multilingual model, making translation more accurate.


3. Sentiment Analysis

In sentiment analysis, embeddings allow models to detect tone and emotion in text. Words like “amazing”, “excellent”, and “wonderful” cluster together in vector space, helping the model classify the sentiment as positive, even when the wording changes.

Example: A review like “The movie was a blast!” can be classified as positive even if the word “good” isn't used explicitly.


4. Text Classification

From spam detection to topic modeling, embeddings enable classification models to group and label text based on content and context.

Example: Emails with phrases like “win money now” or “claim your prize” might be far apart textually but close in embedding space—helping models detect spam reliably.


5. Question Answering & Chatbots

Embeddings help AI systems understand user queries and generate relevant, context-aware responses. They form the backbone of modern chatbots and question-answering systems like ChatGPT.

Example: A customer support bot can match a user's question to similar past queries, even if phrased differently, to provide quick and accurate answers.

Why Not Just Use One-Hot Encoding?

Before the age of deep learning and smart language models, NLP relied on a simpler way to represent words—one-hot encoding. Think of it like writing names on identical name tags: easy to assign, but not very informative.

At first glance, it seems like a decent approach. Each word gets its own unique vector—a list full of zeros with a single 1 in the spot that belongs to that word.

What Exactly is One-Hot Encoding?

Let’s say our vocabulary looks like this:

["cat", "dog", "apple", "car"]

With one-hot encoding, we represent each word as:

Word  | One-Hot Vector
cat   | [1, 0, 0, 0]
dog   | [0, 1, 0, 0]
apple | [0, 0, 1, 0]
car   | [0, 0, 0, 1]

Every word gets a unique vector the size of the entire vocabulary. The vector is mostly zeros—with just one "hot" 1.

Now imagine doing this for a vocabulary of 100,000 words. That’s 100,000-dimensional vectors, most of which are zeros—clearly not very memory-friendly or smart.

Why It Was Used

  • Simplicity: One-hot encoding is dead simple. No training, no math. Just assign an index and go.

  • Interpretability: Each vector’s position maps directly to a known word. You know exactly what you're looking at.

Why It Falls Apart

Here’s where the cracks show:

  • No meaning: One-hot vectors don’t capture any meaning. To the model, cat and dog are just as unrelated as cat and car—even though one pair are pets, and the other... not so much.

  • No similarity: Every word is orthogonal to the others. There's no way to measure how alike king and queen are, or how far apple is from banana.

  • No learning: You can't improve these vectors. They're fixed and static—no way to evolve as your model gets smarter.

  • Sparse and inefficient: With large vocabularies, you’re just wasting memory storing mostly zeros.

Code Example: One-Hot Encoding vs Embeddings

import numpy as np
import torch
import torch.nn as nn

# Sample vocabulary
vocab = ["cat", "dog", "apple", "car"]
word_to_index = {word: idx for idx, word in enumerate(vocab)}

# One-Hot Encoding
def one_hot_encode(word, vocab_size=len(vocab)):
    vec = np.zeros(vocab_size)
    vec[word_to_index[word]] = 1
    return vec

print("One-hot encoding for 'cat':")
print(one_hot_encode("cat"))  # Output: [1. 0. 0. 0.]

# Embedding Layer (PyTorch)
# Note: the embedding weights below are randomly initialized; in a real model
# they would be updated during training so that they capture word meaning.
embedding_dim = 3
embedding_layer = nn.Embedding(num_embeddings=len(vocab), embedding_dim=embedding_dim)

# Input word index
word_idx = torch.tensor([word_to_index["cat"]])

# Get embedding vector
embedding_vector = embedding_layer(word_idx)

print("\nLearned embedding for 'cat':")
print(embedding_vector)

Sample output (the embedding values are randomly initialized, so they will differ on each run):

One-hot encoding for 'cat':
[1. 0. 0. 0.]

Learned embedding for 'cat':
tensor([[ 0.1675, -0.4910,  0.3022]])

One-hot encoding may have been a good starting point, but language is full of nuance, context, and subtle relationships. Embeddings bring language into a space where words carry meaning, relationships are learned, and models get smarter with time.

That’s why, in today’s NLP world, embeddings are the backbone of everything from translation to chatbots to semantic search.

Types of Embeddings

Now that we understand what embeddings are and why they're better than one-hot encodings, let’s explore the different types of embeddings used in natural language processing. Over the years, researchers have developed various techniques to generate embeddings—ranging from simple context-independent vectors to powerful context-aware representations.

[Figure: Techniques for generating embeddings in NLP]

1. Frequency-Based Embeddings

Before deep learning models like Word2Vec and BERT took over NLP, frequency-based embeddings were among the earliest and most intuitive ways to represent words as numerical vectors. These methods rely on the frequency and co-occurrence of words in a large corpus to create meaningful representations.

Let’s break it down.

Bag of Words

Bag of Words (BoW) is a foundational technique that represents text by counting the frequency of each word in a document, disregarding grammar and word order. Each document is converted into a vector where each dimension corresponds to a word from the vocabulary, and the value is the number of times that word appears. BoW is simple, easy to implement, and works well for many basic NLP tasks, but it cannot capture the context or meaning of words.
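
For example, a Bag of Words representation can be built with scikit-learn's CountVectorizer (a minimal sketch, assuming scikit-learn is installed; the sample sentences are arbitrary):

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(corpus)  # sparse document-term count matrix

print(vectorizer.get_feature_names_out())  # vocabulary, one column per word
print(bow.toarray())                       # raw word counts per document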

TF-IDF

Term Frequency-Inverse Document Frequency (TF-IDF) builds on BoW by not only counting word occurrences but also weighing them according to how unique or important they are to a specific document in a corpus. Words that appear frequently in one document but not in many others get higher scores, helping to highlight keywords and reduce the influence of common words. TF-IDF is widely used in information retrieval and text mining.
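
Using the same toy corpus, scikit-learn's TfidfVectorizer produces weighted scores instead of raw counts (again just a sketch under the same assumptions):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)

# Each row is a document, each column a vocabulary word; the score combines
# how often the word appears in the document (TF) with how rare it is
# across the corpus (IDF).
print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))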

Co-occurrence Matrices

Co-occurrence matrices capture the frequency with which pairs of words appear together within a certain window in the text. This method provides more information than BoW by reflecting relationships between words, which can be used for more advanced analyses like topic modeling and building word association networks. However, co-occurrence matrices can become very large and sparse with big vocabularies.
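
A small sketch of building a co-occurrence matrix with a symmetric window of size 1, using only NumPy on a toy corpus:

import numpy as np

corpus = ["the cat sat on the mat", "the dog sat on the rug"]
window = 1  # count neighbours within 1 position on each side

# Build the vocabulary.
tokens = [sentence.split() for sentence in corpus]
vocab = sorted({w for sent in tokens for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Fill the co-occurrence matrix.
cooc = np.zeros((len(vocab), len(vocab)), dtype=int)
for sent in tokens:
    for i, word in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                cooc[idx[word], idx[sent[j]]] += 1

print(vocab)
print(cooc)  # cooc[i, j] = how often vocab[i] and vocab[j] appear side by side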


2. Prediction-Based Word Embeddings

Word2Vec (CBOW, Skip-gram)

Word2Vec is a neural network-based method that learns vector representations of words by predicting a word based on its context (Continuous Bag of Words, CBOW) or predicting the context based on a word (Skip-gram). These embeddings capture semantic relationships, so words used in similar contexts have similar vectors. Word2Vec revolutionized NLP by providing dense, meaningful word representations.
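
A minimal Word2Vec sketch using the gensim library (assuming gensim is installed; the tiny corpus here is far too small to learn meaningful vectors and is only meant to show the API):

from gensim.models import Word2Vec

# Each sentence is a list of tokens; a real corpus would have millions of them.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "pets"],
]

# sg=1 selects Skip-gram; sg=0 would use CBOW instead.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["cat"])                       # 50-dimensional dense vector
print(model.wv.most_similar("cat", topn=3))  # nearest neighbours in the learned space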

GloVe

Global Vectors for Word Representation (GloVe) is another prediction-based embedding technique that combines the benefits of matrix factorization and context prediction. GloVe constructs word vectors by analyzing global word-word co-occurrence statistics from a corpus, resulting in embeddings that capture both local and global statistical information about words.
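
GloVe vectors are usually consumed as pretrained files rather than trained from scratch. One convenient way, assuming gensim is installed, internet access is available, and the "glove-wiki-gigaword-50" dataset is offered via gensim-data, is:

import gensim.downloader as api

# Downloads the pretrained 50-dimensional GloVe vectors on first use
# (assumption: the "glove-wiki-gigaword-50" dataset is available via gensim-data).
glove = api.load("glove-wiki-gigaword-50")

print(glove["king"][:10])                   # first 10 dimensions of the vector
print(glove.most_similar("paris", topn=3))  # words used in similar contexts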

FastText

FastText, developed by Facebook, extends Word2Vec by representing words as bags of character n-grams. This allows it to generate embeddings for rare or out-of-vocabulary words and better handle morphologically rich languages. FastText is especially useful for dealing with misspellings or new words.
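
The subword idea can be seen with gensim's FastText implementation, which can produce a vector even for a word that never appeared in training (a sketch on a toy corpus, assuming gensim is installed):

from gensim.models import FastText

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "kitten", "played", "with", "the", "cat"],
]

# min_n/max_n control the character n-gram sizes used for subword vectors.
model = FastText(sentences, vector_size=50, window=2, min_count=1,
                 min_n=3, max_n=5, epochs=50)

# "catz" was never seen in training, but its character n-grams overlap with
# "cat", so FastText can still produce a (rough) vector for it.
print(model.wv["catz"][:5])
print(model.wv.similarity("cat", "catz"))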


3. Contextualized Word Embeddings

ELMo

Embeddings from Language Models (ELMo) generate word representations that are context-dependent, meaning the same word can have different embeddings depending on its surrounding words. ELMo uses deep, bidirectional LSTM networks to capture complex characteristics of word use, making it much more powerful for tasks where context is crucial.

BERT

Bidirectional Encoder Representations from Transformers (BERT) is a transformer-based model that produces highly contextualized word embeddings by considering both left and right context in all layers. BERT has set new standards in many NLP benchmarks and enables fine-tuning for a wide range of language tasks, from question answering to sentiment analysis.
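
A sketch of extracting contextual embeddings with the Hugging Face transformers library (assuming it and the bert-base-uncased weights are available). Note how the vector for "bank" differs between the two sentences:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed_word(sentence, word):
    # Return the hidden state of the given word's token in the sentence.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

river = embed_word("i sat on the bank of the river", "bank")
money = embed_word("i deposited cash at the bank", "bank")

# Same word, different contexts -> different embeddings (similarity well below 1.0).
print(torch.cosine_similarity(river, money, dim=0).item())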

GPT

Generative Pretrained Transformer (GPT) models are transformer-based architectures designed for generating text and understanding context. GPT embeddings are contextualized, meaning they change based on the sentence in which a word appears. GPT models are particularly strong in text generation and conversational AI applications.

Conclusion

Embeddings have come a long way: from simple frequency-based representations like Bag of Words and TF-IDF, through prediction-based models such as Word2Vec, GloVe, and FastText, to contextualized representations from ELMo, BERT, and GPT. Each generation captures more of the meaning and context of language, and understanding how each type works, its strengths, and where it fits in this evolution makes it much easier to choose the right representation for a given NLP task.
