From Words to Vectors: Understanding the Magic of Text Embedding

karthik hubli

Transformers are the backbone of large language models (LLMs) and have revolutionized natural language processing by enabling models to handle vast amounts of text efficiently. Unlike traditional neural networks, transformers use a self-attention mechanism, which allows the model to focus on different parts of the input text simultaneously. This attention-based approach helps transformers understand the context and relationships between words, regardless of their position in a sentence. Transformers excel at processing long sequences of text, capturing both local and global dependencies, which makes them highly effective for tasks like translation, summarization, and text generation. Their scalability and versatility have made them the architecture of choice for state-of-the-art LLMs like GPT, BERT, and T5.

LLMs are, at their core, programs running on large, sophisticated, and powerful computers (or even your phone, if the model is optimized), performing various mathematical operations on the text you enter. But in basic computer science we learn that a computer can only understand numbers, so how is it interpreting text here? The blue boxes at the bottom of the figure above represent the key component of an LLM that enables this.

Two key concepts, tokenization and word embedding, enable the model to interpret text in English or any other language and convert it into a set of numbers, or vectors.

Tokenizers and Word Embeddings

Tokenization is a fundamental step in Natural Language Processing (NLP), where text is broken down into smaller units called tokens. These tokens can range from entire words to subwords or individual characters, depending on the task and the model in use. Tokenization allows language models to convert raw text into a structured, manageable format that can be processed. This step is essential for models like GPT and BERT, enabling them to understand and generate language more effectively, including handling rare, complex, or compound words.

Tokenizers are critical because language models cannot process whole sentences or paragraphs directly. Instead, they rely on tokens to create interpretable chunks of information, ensuring that every part of the input text is accessible. Different tokenization strategies strike different balances between vocabulary size, computational efficiency, and language diversity.

A long text can be broken into tokens in several ways. The simplest approach splits a sentence into words, with each word becoming a token; more complex schemes may split one word into multiple tokens or merge multiple words into a single token. In the example below, we see different approaches to tokenizing the same text.

Sample Tokens

Each tokenizer brings distinct advantages based on the language, task complexity, and desired balance between granularity and efficiency.
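As a rough sketch, assuming the Hugging Face transformers library is available and using its bert-base-uncased tokenizer as one convenient choice, subword tokenization looks like this:

```python
# A minimal sketch of subword tokenization, assuming the Hugging Face
# `transformers` library (pip install transformers) and the bert-base-uncased vocabulary.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Tokenization helps models handle rare or compound words."
tokens = tokenizer.tokenize(text)               # subword pieces; rare words are split into smaller units
ids = tokenizer.convert_tokens_to_ids(tokens)   # one integer ID per token

print(tokens)
print(ids)
```

The exact splits and IDs depend entirely on which tokenizer and vocabulary you load; a GPT-style byte-pair-encoding tokenizer would produce different pieces for the same sentence.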

Once tokenized, the next step is embedding, where each token is mapped into a high-dimensional vector space. These embeddings serve as numerical representations of words or tokens, capturing the semantic and contextual relationships between them. By placing similar words close to each other in the embedding space, LLMs can better understand the meaning and nuances of language. For example, words like “king” and “queen” might appear close together in the vector space due to their similar meanings and contexts.
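As a small illustration of that mapping, here is a toy sketch in PyTorch, where an embedding table turns token IDs into dense vectors. The vocabulary size, dimension, and IDs below are arbitrary values chosen only for the example:

```python
# Toy embedding lookup in PyTorch: each token ID indexes a row of a learned matrix.
# The vocabulary size (30_000), dimension (128), and token IDs are illustrative assumptions.
import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=30_000, embedding_dim=128)

token_ids = torch.tensor([[42, 17, 256]])   # a batch containing three arbitrary token IDs
vectors = embedding(token_ids)              # shape: (1, 3, 128), one dense vector per token

print(vectors.shape)
```

In a real LLM these vectors are learned during training, which is what pushes semantically similar tokens close together in the embedding space.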

The importance of tokenizers and embeddings cannot be overstated. Without tokenization, LLMs would be unable to handle language input efficiently. Without embeddings, the models would lack the ability to grasp the underlying structure and meaning of the text. Together, these components enable LLMs to understand, interpret, and generate human language with remarkable accuracy and coherence, fueling advances in NLP tasks like translation, summarization, and conversational AI.

Assigning IDs

Once the text is broken down into tokens, a unique ID is assigned to each token, picked from the corpus. The corpus, or vocabulary, is the collection of all possible tokens in a given language: for example, all possible words in English make up an English corpus. Similarly, Mandarin will have its own corpus.

If a token is not part of the corpus (names, for instance), the tokenizer adds special marker characters and breaks the token into smaller, known pieces; these normalized pieces are then used to pick IDs from the corpus.
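As a sketch, again assuming the Hugging Face bert-base-uncased tokenizer (exact splits and IDs vary by tokenizer), this is what ID assignment and out-of-vocabulary handling look like:

```python
# Sketch of ID assignment, assuming the Hugging Face bert-base-uncased tokenizer.
# Exact token splits and IDs depend on the tokenizer and its vocabulary.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A common word maps to a single vocabulary ID.
word_tokens = tokenizer.tokenize("earth")
print(word_tokens, tokenizer.convert_tokens_to_ids(word_tokens))

# A rare name is split into smaller known pieces (continuation pieces are
# marked with '##' by this tokenizer), each with its own ID, so nothing
# falls outside the vocabulary.
name_tokens = tokenizer.tokenize("Brahmagupta")
print(name_tokens, tokenizer.convert_tokens_to_ids(name_tokens))
```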

Text Embedding

In the earlier example of tokenization and ID assignment, the word “earth” is represented by the number 3011. This gives it a unique number but conveys no other information. It would be nice if we could also embed information such as: Is it a planet? Does it have life? How many satellites does it have? Likewise, the ID for “Great” is 2307 while that for “Awesome” is 12476; from the numbers alone it is not clear that both words have very similar meanings. A neural network or an LLM needs much more information to properly process and extract meaning from a token. To achieve this, text embedding is used.

Text embedding converts a token (or a set of tokens) into a vector (or a series of vectors), encoding many attributes of the token. These vectors can have several hundred dimensions, making them information dense; the space they live in is known as the embedding space. For ease of understanding, let’s assume we have just 2 dimensions and visualize a simple embedding.

The attributes of each token can be normalized and represented as values between 0 and 1, like probabilities. In the example below, the vector tells us a lot about the token beyond merely giving it a unique number.

In the example above, we can see that there is a lot of similarity between cat and dog, while horse and penguin are less similar but share some attributes. Car is the most distinct in the group. If we plot the vectors, we can see that tokens with similar attributes are clustered together.
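To make the toy example concrete, here is a small sketch with invented 2-dimensional vectors (the attribute names and values are made up purely for illustration):

```python
# Toy 2-D "embedding": each token gets two hand-picked, normalized attribute scores.
# The values are invented for illustration only; real embeddings are learned, not hand-written.
import numpy as np

# Columns: [is_a_pet, is_alive], each normalized to the 0..1 range.
embeddings = {
    "cat":     np.array([0.95, 1.0]),
    "dog":     np.array([0.90, 1.0]),
    "horse":   np.array([0.40, 1.0]),
    "penguin": np.array([0.05, 1.0]),
    "car":     np.array([0.00, 0.0]),
}

def distance(a, b):
    """Euclidean distance between two tokens in this tiny embedding space."""
    return float(np.linalg.norm(embeddings[a] - embeddings[b]))

print(distance("cat", "dog"))   # small: cat and dog have very similar attributes
print(distance("cat", "car"))   # large: car is the most distinct token in the group
```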

We can perform mathematical operations on these vectors, with some interesting results. We can find the relative distance between words in the embedding space, and there are several techniques (such as cosine similarity) for measuring how close two vectors are. An example can be visualized in a 2-dimensional embedding space: the offset between ‘hot’ and ‘cold’ is very similar to that between ‘fire’ and ‘ice’.

In other words, if we know the offset between ‘hot’ and ‘cold’ and we know the vector for ‘ice’, the vector for ‘fire’ can be estimated with reasonably good accuracy.
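A minimal sketch of this, again with made-up 2-dimensional vectors (real embeddings have hundreds of dimensions):

```python
# Cosine similarity and a simple vector analogy in a made-up 2-D embedding space.
# The vectors are invented for illustration; the idea carries over to higher dimensions.
import numpy as np

vectors = {
    "hot":  np.array([0.90, 0.20]),
    "cold": np.array([0.10, 0.20]),
    "fire": np.array([0.95, 0.40]),
    "ice":  np.array([0.15, 0.40]),
}

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(vectors["hot"], vectors["fire"]))  # high: similar direction
print(cosine_similarity(vectors["hot"], vectors["cold"]))  # lower: different direction

# If we know the offset between 'hot' and 'cold' and the vector for 'ice',
# we can estimate the vector for 'fire'.
estimated_fire = vectors["ice"] + (vectors["hot"] - vectors["cold"])
print(estimated_fire, vectors["fire"])  # the estimate lands very close to the real vector
```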

The embeddings can be sparse or dense in nature.

Sparse vector embeddings are high-dimensional vectors with mostly zero values, commonly used in earlier NLP methods. Examples include:

One-Hot Encoding: Each word is represented by a vector of all zeros, except for a single 1 at the position corresponding to the word. For example, “dog” in a vocabulary of 5 words could be [0, 1, 0, 0, 0].

Bag-of-Words (BoW): Represents a document by the frequency of each word in the vocabulary. With a vocabulary of [“cat”, “dog”, “eats”, “fish”], the sentence “cat eats fish” would have the vector [1, 0, 1, 1].

TF-IDF: Weights words based on their frequency across documents, giving rare words higher values.

Count Vectorization: Similar to BoW, representing words by their count in the text.

Sparse vectors are large and inefficient compared with dense embeddings such as Word2Vec.
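A short sketch of the sparse representations listed above, assuming scikit-learn is available (the example sentences are arbitrary):

```python
# Sparse bag-of-words and TF-IDF vectors, assuming scikit-learn (pip install scikit-learn).
# The documents are arbitrary examples.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["cat eats fish", "dog eats meat", "cat chases dog"]

bow = CountVectorizer()
bow_matrix = bow.fit_transform(docs)        # sparse matrix of raw word counts
print(bow.get_feature_names_out())          # the learned vocabulary
print(bow_matrix.toarray())                 # mostly zeros: sparse by construction

tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(docs)    # words rare across documents get higher weights
print(tfidf_matrix.toarray().round(2))
```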

Dense embedding vectors are low-dimensional, continuous vector representations where most of the elements are non-zero. These embeddings capture semantic relationships between words or concepts. Here are a few examples:

Word2Vec: Word2Vec represents each word as a dense vector (e.g., 100–300 dimensions) where similar words have similar vector values. Example: The word “king” might have a vector like [0.21, -0.34, 0.45, …], and “queen” would have a similar vector, reflecting their semantic relationship.

GloVe (Global Vectors for Word Representation): GloVe is another popular word embedding technique. It captures co-occurrence statistics of words in a corpus and produces dense vectors. Example: “Paris” could be represented as [0.41, 0.37, -0.59, …] and “France” as [0.50, 0.40, -0.65, …], reflecting their geographical relationship.

BERT Embeddings: BERT generates context-aware dense embeddings, meaning the same word can have different vectors based on its usage in a sentence. Example: “bank” in “river bank” may have a vector like [0.12, -0.47, 0.31, …], while “bank” in “financial bank” would have a different vector.

FastText: FastText generates dense embeddings by taking into account subword information (character n-grams), which allows it to handle rare and misspelled words. Example: The word “play” could be represented as [0.18, 0.23, -0.31, …].

Dense vectors are typically compact (e.g., 100–300 dimensions) and capture richer semantic meaning than sparse vectors, making them more efficient for NLP tasks.
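As a rough sketch of dense embeddings, here is a tiny Word2Vec model trained with gensim on a toy corpus (the corpus and hyperparameters are illustrative only; real models are trained on billions of words):

```python
# Training a tiny Word2Vec model with gensim (pip install gensim) on a toy corpus.
# vector_size, window, and epochs are illustrative; real embeddings use far more data.
from gensim.models import Word2Vec

sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["paris", "is", "the", "capital", "of", "france"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

print(model.wv["king"][:5])                  # first few dimensions of a dense vector
print(model.wv.similarity("king", "queen"))  # cosine similarity between the two words
```

On a corpus this small the similarities are not meaningful; the point is only to show the shape of the API and of the dense vectors it produces.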

Word embeddings are crucial in generative AI models because they provide a way to represent words in a meaningful, dense, and continuous vector space that captures semantic relationships and contextual information. By transforming language into these structured representations, embeddings enable models to understand and generate human-like text. They form the foundation for advanced architectures like transformers, which rely on embeddings to process and learn from vast amounts of data. These embeddings allow models to generalize across different contexts, improve language comprehension, and enhance the quality of generated text, making word embeddings a vital component in the success of modern generative AI applications like machine translation, chatbots, and creative content generation.
