Decoding AI Jargon with Chai ☕

Suraj Patel

The original Transformer was introduced by Google in the 2017 paper titled "Attention is All You Need" for the specific use case of translating text from one language to another.

AI models use the Transformer to take a piece of text and predict what comes next in the passage. This prediction takes the form of a probability distribution.

🧩 Tokenization

In this step, the machine breaks the given text into smaller units called tokens. These tokens can be:

  • Words

  • Subwords

  • Characters

  • Or even punctuation marks

For language models like GPT, these tokens are converted into numbers (token IDs) that the model understands.

Example:

Text: “I am understanding tokenization.”

The tokens might look like: [40, 939, 10335, 6602, 2860, 13]

✅ How to Test Tokenization

🔗 Website: https://tiktokenizer.vercel.app

🐍 Or Python script (requires setup below)

Create a virtual environment with "uv venv", then activate it with "source .venv/bin/activate".

Install tiktoken with "uv pip install tiktoken".

import tiktoken

# Load the tokenizer that matches the gpt-4o model
encoder = tiktoken.encoding_for_model("gpt-4o")

text = "I am understanding tokenization."
tokens = encoder.encode(text)  # convert the text into a list of token IDs

print("Tokens: ", tokens)

📚 Vocab Size

Vocabulary size refers to the total number of unique tokens that a tokenizer can recognize and assign a unique ID to. In simple terms, it's the size of the "dictionary" that the model knows.

Script to check the gpt-4o vocab size:

import tiktoken

encoder = tiktoken.encoding_for_model("gpt-4o")

# n_vocab is the total number of unique token IDs this tokenizer can produce
print("Vocab Size: ", encoder.n_vocab)

🔢 Vector Embedding

What is a vector?

A vector is a one-dimensional array (or list) of numbers. Each number in the vector is called a component or an element.

Example: a = [12,4,24]

Embedding

An embedding maps words or tokens to vectors of numbers in a way that similar meanings are placed closer together in a high-dimensional space.

Example

apple → [1.2, 3.4]
banana → [1.1, 3.5]
orange → [1.3, 3.3]

tiger → [7.8, 2.1]
lion → [7.7, 2.0]

In the example above:

  • apple, banana, and orange are clustered together because they’re all fruits.

  • tiger and lion are close to each other but far from the fruit cluster — because they’re animals.

This shows that the embedding space captures meaning and categories just from how words are used in context — without being explicitly told “apple is a fruit.”
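
A minimal sketch of how that "closeness" can be measured, using the toy 2-D vectors above and cosine similarity (the vectors and values here are made up purely for illustration):

import math

def cosine_similarity(a, b):
    # Cosine similarity = dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

apple = [1.2, 3.4]
banana = [1.1, 3.5]
tiger = [7.8, 2.1]

print(cosine_similarity(apple, banana))  # close to 1.0 -> similar meaning
print(cosine_similarity(apple, tiger))   # noticeably lower -> different meaning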

🧭 Positional Encoding

Transformers have no built-in sense of word order, so positional encoding is used to give the model information about the position of each token in a sequence.

It adds a unique pattern to each token’s embedding based on its position in the input. This helps the model understand word order and structure in sentences.

Think of token embeddings as words without context. Positional encoding is like adding GPS coordinates to each word so the model knows where it appears in the sentence.

Example
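
The original "Attention is All You Need" paper uses fixed sine/cosine patterns for this. Below is a minimal sketch of that idea for a tiny embedding size of 4 (the sizes and values are only illustrative):

import math

def positional_encoding(position, d_model=4):
    # Sinusoidal encoding: even dimensions use sin, odd dimensions use cos,
    # with each pair of dimensions using a different frequency.
    pe = []
    for i in range(d_model):
        angle = position / (10000 ** ((2 * (i // 2)) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

for pos, word in enumerate(["The", "cat", "sat"]):
    print(word, [round(v, 3) for v in positional_encoding(pos)])

Each of these position vectors is added to the corresponding token's embedding, so the same word at different positions ends up with a slightly different representation.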

🧠 Semantic Encoding

In machine learning, semantic encoding refers to transforming text into numerical representations (like embeddings) that capture the meaning of the words, phrases, or sentences — not just their appearance or order.

For example:

  • "He is very happy."

  • "He is extremely joyful."

These two sentences may look different, but semantic encoding tries to map them to similar representations, because their meanings are similar.
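
One way to see this in practice is with the sentence-transformers library ("uv pip install sentence-transformers"); the model name below is just one common small embedding model, not the only choice:

from sentence_transformers import SentenceTransformer, util

# "all-MiniLM-L6-v2" is one commonly used small sentence-embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")

a = model.encode("He is very happy.")
b = model.encode("He is extremely joyful.")

# A cosine similarity close to 1 means the two sentences are semantically similar
print("Similarity: ", util.cos_sim(a, b).item())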

🔐 Encoder

The encoder is the part of a Transformer model (used in models like BERT, or the encoder half of the original translation Transformer) that processes the input text and converts it into a numerical representation, usually a sequence of embeddings that capture the meaning and context of each word or token.

Example

Input Sentence: "The cat sat on the mat."

Step 1: Tokenization

The encoder first splits the sentence into tokens: ["The", "cat", "sat", "on", "the", "mat", "."]

Each of these tokens is mapped to a unique token ID, like: [101, 456, 987, 210, 101, 675, 102]

Step 2: Embedding

Each token ID is converted into an embedding vector. Example:

Token    Embedding Vector
"The"    [0.10, 0.20, 0.30]
"cat"    [0.50, 0.60, 0.70]
"sat"    [0.25, 0.35, 0.45]
"on"     [0.15, 0.25, 0.35]
"the"    [0.10, 0.20, 0.30]
"mat"    [0.40, 0.55, 0.65]
"."      [0.05, 0.15, 0.25]

🔁 Self Attention

Self-attention is a mechanism that allows the model to focus on different words in a sentence when understanding a particular word.

It helps the model understand the context of each word by comparing it with every other word in the sentence.
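
A minimal sketch of the core computation (scaled dot-product attention) with random toy matrices, assuming NumPy is installed. In a real Transformer, the queries, keys and values come from learned projections of the token embeddings:

import numpy as np

def softmax(x):
    # Softmax along the last axis, stabilised by subtracting the max
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(Q, K, V):
    # Each token's query is compared with every token's key,
    # and the resulting weights mix together the value vectors.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (tokens, tokens) attention scores
    weights = softmax(scores)         # each row sums to 1
    return weights @ V                # context-aware vectors, one per token

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(4, 8))   # 4 tokens, 8-dimensional vectors (toy sizes)
print(self_attention(Q, K, V).shape)  # (4, 8)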

🔀 Multi Head Attention

Multi-head attention means that the model doesn’t just look at the sentence one way, but in multiple ways at the same time — like looking at it through different lenses.

Each “head” focuses on different parts of the sentence.

Example:

Sentence: “This is a book”

Let’s say we have 3 attention heads.

Each one pays attention to the sentence in a slightly different way:

Head    Focus Example
1️⃣      Focuses on the subject → “This”
2️⃣      Focuses on the object → “book”
3️⃣      Focuses on grammar → “is”, “a”
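
Here is a rough sketch of the same idea in code, again with toy sizes: the embedding dimension is split into slices ("heads"), attention runs on each slice independently, and the results are concatenated. Real models also apply learned projection matrices per head.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention (same computation as in the self-attention sketch)
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_attention(X, num_heads):
    # Split the embedding dimension into `num_heads` slices, attend per slice, concatenate
    heads = np.split(X, num_heads, axis=-1)
    return np.concatenate([attention(h, h, h) for h in heads], axis=-1)

X = np.random.default_rng(1).normal(size=(4, 12))  # 4 tokens, 12-dim embeddings
print(multi_head_attention(X, num_heads=3).shape)  # (4, 12)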

🔓 Decoder

The decoder takes the numerical representation (from the encoder or previous tokens) and generates output, usually in the form of text, one token at a time.
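
As a rough sketch of that one-token-at-a-time loop (the next_token_probs function here is hypothetical, standing in for a real model's forward pass):

def generate(prompt_tokens, next_token_probs, max_new_tokens=20, end_token=None):
    # Autoregressive decoding: repeatedly ask the model for the next token,
    # append it, and feed the longer sequence back in.
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)        # hypothetical model call
        next_token = max(probs, key=probs.get)  # greedy: pick the most likely token
        tokens.append(next_token)
        if next_token == end_token:
            break
    return tokens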

🌡️Temperature

Temperature controls how random or confident a model is when generating text.

It’s a hyperparameter used during sampling (i.e., picking the next word/token).

Example:

Suppose the model thinks the next word should be:

Word     Probability
"cat"    0.6
"dog"    0.3
"frog"   0.1
  • Low temperature → likely chooses "cat".

  • High temperature → might choose "dog" or even "frog".
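
A minimal sketch of how temperature reshapes that distribution before sampling (applied here to the log-probabilities, which is how it acts on a model's logits):

import math, random

probs = {"cat": 0.6, "dog": 0.3, "frog": 0.1}

def apply_temperature(probs, temperature):
    # Divide the log-probabilities by the temperature and re-normalise.
    # T < 1 sharpens the distribution, T > 1 flattens it.
    scaled = {w: math.exp(math.log(p) / temperature) for w, p in probs.items()}
    total = sum(scaled.values())
    return {w: v / total for w, v in scaled.items()}

print(apply_temperature(probs, 0.5))  # "cat" dominates even more
print(apply_temperature(probs, 2.0))  # "dog" and "frog" become more likely

# Sample the next word from the adjusted distribution
adjusted = apply_temperature(probs, 1.5)
print(random.choices(list(adjusted), weights=list(adjusted.values()), k=1)[0])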

📅Knowledge Cutoff

The knowledge cutoff refers to the most recent date up to which the model was trained and can reliably provide information.
Any events, facts, or developments that occurred after this date are unknown to the model unless it has access to live tools (like web search).

Example:

If a model has a knowledge cutoff of July 2023, it:

  • Knows about events before or during July 2023.

  • Doesn't know about things that happened after (e.g., a 2024 cricket match result or new tech launches).
