Breaking Down LLMs: How Words Become Meaning in AI

Apoorva Shukla

Ever wondered how AI translates words into meaning? Let's unravel the inner mechanics of tokenization, vector embeddings, and positional encoding.

Let’s understand the key processes involved in AI-powered response generation.

  1. Tokenization – The text is broken into smaller units (tokens) so the model can process it efficiently.

  2. Vector Embeddings – Each token is mapped to a numerical vector, allowing the AI to understand meanings and relationships between words.

  3. Positional Encoding – Since LLMs process sentences all at once (not sequentially), positional encoding ensures the AI retains word order.

  4. Self-Attention & Context Understanding – The model analyzes relationships between words, adjusting meaning based on surrounding words.

  5. Prediction & Response Generation – AI uses its trained knowledge and probabilities to generate a coherent and context-aware response.

Let’s understand all these processes via a case study.

Case Study: "Bank" in Different Contexts

Introduction:

Language models like GPT don’t “understand” words the way humans do—they rely on patterns, relationships, and probabilities to derive meaning. A single word can mean entirely different things based on context.

The following two sentences both contain the word "bank":

  1. "The bank approved my loan."

  2. "I sat by the bank of the river."

Take the word "bank"—in one sentence, it refers to a financial institution, and in another, the side of a river. How does AI distinguish between meanings? Let's break it down step by step.

Step 1: Tokenization – Breaking Sentences into Pieces

Before AI can process a sentence, it must break it down into tokens. A token can be a word, subword, or even a character.

Example Tokenization

Given our two sentences:

"The bank approved my loan."

Tokenized as:

["The", "bank", "approved", "my", "loan", "."] as [527, 8913, 1248, 204, 4201, 15]

"I sat by the bank of the river."

Tokenized as:

["I", "sat", "by", "the", "bank", "of", "the", "river", "."] as [345, 7896, 682, 527, 8913, 328, 527, 9612, 15]

Key Takeaways:

The word "bank" gets the same token ID (8913) in both sentences, but its meaning differs based on surrounding words.

Step 2: Vector Embeddings – Giving Meaning to Words

Since AI doesn’t process words as raw text, each token is converted into a numerical vector in a high-dimensional space. Words with similar meanings end up close to each other in that vector space.

How "Bank" is Represented Differently

Even though "bank" appears in both sentences, its vector embedding changes depending on surrounding words.

  1. "bank" in Sentence 1 (financial context):

    bank → [0.76, -0.43, 1.22, ...] (closer to "finance", "loan", "approval")

  2. "bank" in Sentence 2 (geographical context):

    bank → [-0.75, 0.22, 0.45, ...] (closer to "river", "shore", "water")

Why Does the Embedding Change?

  1. AI learns from past training data, understanding which words often co-occur.

  2. Self-attention mechanism ensures "bank" relates to "loan" in the first case and "river" in the second (see the sketch below).
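The vectors in the bullets above are made-up placeholders. To see the effect with real numbers, here is a minimal sketch, assuming the Hugging Face transformers and torch packages are installed, that pulls the contextual vector for "bank" out of a small BERT-style encoder for each sentence and compares them:

```python
# Minimal sketch, assuming `pip install transformers torch`.
# bert-base-uncased stands in for any encoder with self-attention.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence: str) -> torch.Tensor:
    """Return the contextual hidden-state vector for the token 'bank'."""
    inputs = tokenizer(sentence, return_tensors="pt")
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    idx = tokens.index("bank")                      # position of "bank" in this sentence
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape: (1, seq_len, 768)
    return hidden[0, idx]

v_finance = bank_vector("The bank approved my loan.")
v_river = bank_vector("I sat by the bank of the river.")

# Same word, two different vectors: cosine similarity is well below 1.0.
print(torch.nn.functional.cosine_similarity(v_finance, v_river, dim=0).item())
```

A static embedding table would return the exact same vector in both cases; the difference comes from the attention layers mixing in the surrounding words.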

Step 3: Positional Encoding – Preserving Sentence Structure

Transformers (the architecture behind LLMs) do not process words sequentially; they process the entire sentence at once.

To retain word order, positional encoding assigns each word a unique positional vector, modifying its embedding.

Why Positional Encoding Matters

Imagine you input the sentence:

  1. "The cat is on the cat."

  2. "The mat is on the cat."

Without positional encoding, these sentences would look identical to the model: they contain exactly the same words, just in a different order.

With positional encoding, AI recognizes word order, ensuring correct comprehension.
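A tiny sketch of why this matters: strip away positions and the two sentences collapse into the same bag of words.

```python
# Without positions, both sentences reduce to the same multiset of words,
# so a model that ignored order could not tell them apart.
from collections import Counter

s1 = "the cat is on the mat".split()
s2 = "the mat is on the cat".split()

print(Counter(s1) == Counter(s2))  # True: identical word counts
print(s1 == s2)                    # False: the order differs
```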

Example Positional Encoding: "The bank approved my loan."

For Sentence 1, each token's base embedding is combined with a positional vector for its slot in the sentence: position 0 for "The", position 1 for "bank", and so on.

Key Takeaways:

  1. Base Embeddings – Represent each word’s meaning in a vector space.

  2. Positional Encoding – Adds unique values based on word order, ensuring sentence structure is maintained.

  3. Final Representation – The modified embedding after positional encoding is applied.
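Putting the three takeaways together: here is a small sketch of the sinusoidal positional encoding from the original Transformer paper ("Attention Is All You Need"), added to toy base embeddings for Sentence 1. The base embeddings are random placeholders, not real model weights.

```python
# Sketch of sinusoidal positional encoding (Vaswani et al., 2017).
# Base embeddings below are random placeholders, not real model weights.
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]         # even dimensions 0, 2, 4, ...
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

tokens = ["The", "bank", "approved", "my", "loan", "."]
d_model = 8                                          # toy dimension for readability
base = np.random.randn(len(tokens), d_model)         # 1. base embeddings (placeholders)
pe = positional_encoding(len(tokens), d_model)       # 2. positional encoding
final = base + pe                                    # 3. final representation

for tok, vec in zip(tokens, final):
    print(f"{tok:>8}: {np.round(vec[:4], 2)}")
```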

Step 4: Response Generation – Understanding Meaning in Context

Once tokenization, embedding, and positional encoding are complete, AI uses self-attention mechanisms to analyze relationships between words.
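Under the hood, that analysis is scaled dot-product attention: every token builds query, key, and value vectors and weights every other token by how relevant it is. A minimal numpy sketch of the formula softmax(Q K^T / sqrt(d_k)) V, with random placeholder matrices standing in for the learned projections:

```python
# Minimal sketch of scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
# Q, K, V are random placeholders for the learned projections of token embeddings.
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # how strongly each token attends to every other token
    weights = softmax(scores)         # each row sums to 1
    return weights @ V, weights

seq_len, d_k = 6, 8                   # six tokens: "The bank approved my loan ."
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(seq_len, d_k)) for _ in range(3))

output, weights = self_attention(Q, K, V)
print(weights.round(2))               # one row of attention weights per token
```

In a trained model, the row for "bank" tends to put noticeable weight on "loan" in Sentence 1 and on "river" in Sentence 2, which is how the two meanings get pulled apart.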

How AI Determines the Meaning of "Bank"

  1. AI detects "loan" in Sentence 1 → associates "bank" with finance.

  2. AI detects "river" in Sentence 2 → associates "bank" with geography.

  3. Based on learned patterns, it generates the correct response.

Final AI Understanding

  1. Sentence 1: "Bank" → Financial institution

  2. Sentence 2: "Bank" → Riverbank

Conclusion

AI doesn’t rely on individual words—it considers context, relationships, and sentence structure to infer meaning.

By using tokenization, embeddings, and positional encoding, transformers understand language dynamically, making them powerful tools for AI applications.
