Decoding AI Jargons With Chai

Abhyudaya Dubey
4 min read

How ChatGPT Works

  • GPT stands for Generative Pre-trained Transformer, which predicts the next word based on the current context.

  • Generative Pre-trained Transformer here means: 'Generative' because it generates new text, 'Pre-trained' because it has already been trained on a large amount of data, and 'Transformer' because it uses a neural network architecture called the Transformer.

  • Updating LLMs requires massive GPU resources and computational power, so it's not feasible to retrain them frequently. As a result, they have a knowledge cutoff, which means they are not up to date with recent events.

Ever wondered what happens behind the scenes after you give a prompt to GPT?

There are several phases involved before you get a response. Let's dive into the details.

Tokenization Process

Tokenization is the process of breaking down the input text into a sequence of tokens, which can be words, subwords, or characters.

In simple words, let's say we have a user input, “Babu Rao Chai Piyega Chai?”, and a dictionary in which each word is mapped to a unique number. If we break it down, 'Babu' might be mapped to 200264, 'Rao' to 17360, 'Chai' to 200266, and 'Piyega' to 21721.

When we tokenize 'Babu Rao Chai Piyega Chai?', we get a sequence of tokens like this:

200264, 17360, 200266, 33, 21721, 171935, 1036, 1361, 398, 3403, 11420, 1036, 1361, 30, 200265, 200264, 173781, 200266

Every model has a vocabulary size, which refers to the number of unique tokens the model can recognize. For example, the GPT-4o model has a vocabulary size of 200,019.

Let’s Code the Tokenizer

import tiktoken

# Load the tokenizer (encoder) used by GPT-4o
encoder = tiktoken.encoding_for_model("gpt-4o")

print("Vocab Size", encoder.n_vocab)

text = "Babu Rao Chai Piyega Chai?"

# Convert the text into its token IDs
babu_rao_kai_token = encoder.encode(text)
print("Babu Rao ka text", babu_rao_kai_token)

Babu Rao felt happy after seeing his tokens.
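And if Babu Rao wants his sentence back, the same encoder can decode the token IDs into text. A quick sketch, reusing `encoder` and `babu_rao_kai_token` from the code above:

# Turn the token IDs back into the original text
decoded_text = encoder.decode(babu_rao_kai_token)
print("Babu Rao ka wapas mila text:", decoded_text)  # Babu Rao Chai Piyega Chai?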

Vector Embeddings

Vector embedding is the process of capturing the semantic meaning of tokens, not just their surface form, but what they mean and how they relate to each other. To do this, we turn them into vectors in a high-dimensional space.

🧠 Understanding Vector Embeddings — The Hera Pheri Way

Think of vector embeddings like a semantic map of the Phir Hera Pheri universe. Each character (token) is placed in this high-dimensional space based on meaning, behavior, and vibe.

If you set off on a spaceship from "Raju", you’ll drift toward:

  • 🪙 "150 Wala Kachra Seth"

  • 🎩 "Totla Seth"

  • 💰 "Lalach" (greed)
    Why? Because Raju's vibe is all about greedy hustle and falling for scams.

But if you launch from "Shyam", you’ll end up near:

  • 👩‍💼 "Anuradha"

  • 🧠 "Logic"

  • 🪨 "Thoda Soch Samajh ke"
    Shyam's vector is closer to caution and reasoning—even if he gets dragged into chaos.

Let’s Code Vector Embeddings Using GenAI

import os
from dotenv import load_dotenv
import google.generativeai as genai

# Load GEMINI_API_KEY from a .env file
load_dotenv()

genai.configure(api_key=os.getenv("GEMINI_API_KEY"))

# Ask the embedding model to turn the sentence into a vector
response = genai.embed_content(
    model="models/embedding-001",
    content="Babu Rao Chai Piyega Chai?",
    task_type="SEMANTIC_SIMILARITY"
)

embedding = response["embedding"]
print(f"Vector length: {len(embedding)}")
print(f"Sample: {embedding[:5]} ...")
Vector length: 768
Sample: [0.06781081, -0.054971334, -0.036417995, -0.018049834, 0.04838467] ...
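To see the "how they relate to each other" part in action, here's a small sketch that builds on the same genai setup above: it embeds a few sentences and compares them with cosine similarity. The example sentences and helper names are just for illustration; your exact numbers will differ, but related sentences should score closer to 1.

import numpy as np

def embed(text):
    # Reuses the genai client configured above
    response = genai.embed_content(
        model="models/embedding-001",
        content=text,
        task_type="SEMANTIC_SIMILARITY"
    )
    return np.array(response["embedding"])

def cosine_similarity(a, b):
    # Close to 1 = very similar meaning, near 0 = unrelated
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

chai_hindi = embed("Babu Rao Chai Piyega Chai?")
chai_english = embed("Babu Rao, will you drink some tea?")
stock_market = embed("The stock market crashed today.")

print("Chai vs tea question:", cosine_similarity(chai_hindi, chai_english))
print("Chai vs stock market:", cosine_similarity(chai_hindi, stock_market))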

Positional Encoding

Positional encoding preserves the order of words in a sentence by adding position information to each token's embedding.

For example, we might get the same unique tokens for both:

  • "Babu Rao Chai Piyega Chai?"
    and

  • "Rao Chai Chai Babu Piyega"

Since the same tokens are used, their embeddings could be the same, but the meaning has changed due to the different word order.
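One standard way to fix this is the sinusoidal positional encoding from the original Transformer paper: each position gets a unique vector that is added to the token's embedding. Here's a minimal numpy sketch; the 5-token, 8-dimension sizes are just toy values for our Babu Rao sentence.

import numpy as np

def positional_encoding(seq_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    positions = np.arange(seq_len)[:, np.newaxis]      # (seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]           # (1, d_model)
    angle_rates = 1 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])              # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])              # odd dimensions use cosine
    return pe

# Toy example: 5 tokens ("Babu", "Rao", "Chai", "Piyega", "Chai?"), 8-dim embeddings
pe = positional_encoding(seq_len=5, d_model=8)
print(pe.shape)  # (5, 8) -- one unique vector per position, added to each token embedding

Because each position gets a different vector, "Babu Rao Chai Piyega Chai?" and "Rao Chai Chai Babu Piyega" no longer look identical to the model.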

Self-Attention & Multi-head Attention

Self-attention allows tokens to interact with each other to refine their embeddings based on context. Instead of using a single self-attention mechanism, we use multi-head attention, which allows the model to focus on different aspects of the tokens simultaneously.

🎯 Example:

Consider these two sentences:

  1. "Babu Rao Chai Piyega Chai?"

  2. "Rao Chai Chai Babu Piyega"

Both have the same words, but in a different order, and that changes the meaning completely.
Without positional encoding and self-attention, a model might treat both the same, which is clearly wrong.

🔍 How Multi-Head Attention Makes It Better:

Now imagine multiple heads doing this attention in parallel:

  • Head 1 focuses on who is doing the action → Babu Rao ↔ Piyega

  • Head 2 focuses on what is being acted upon → Chai ↔ Piyega

  • Head 3 catches repetition → Chai ↔ Chai

  • Head 4 might even question → "Chai?"

Multi-head attention looks at the sentence from these different aspects in parallel, helping the model understand who's doing what, to whom, and in what context, even if the word order changes, as the sketch below shows.
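To make this concrete, here's a minimal numpy sketch of scaled dot-product self-attention, the computation each head performs. The embeddings and weight matrices here are random placeholders, not real GPT weights; a real multi-head layer runs several of these in parallel with different learned weights and concatenates the results.

import numpy as np

np.random.seed(42)

def softmax(x):
    # Numerically stable softmax over the last axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    # x: (seq_len, d_model) token embeddings (with positional encoding added)
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # how much each token attends to every other token
    weights = softmax(scores)         # each row sums to 1
    return weights @ V                # context-aware embeddings

seq_len, d_model, d_head = 5, 8, 4    # 5 tokens: Babu, Rao, Chai, Piyega, Chai?
x = np.random.randn(seq_len, d_model) # placeholder embeddings
Wq, Wk, Wv = (np.random.randn(d_model, d_head) for _ in range(3))

out = self_attention(x, Wq, Wk, Wv)
print(out.shape)  # (5, 4) -- each token's embedding now mixes in context from the others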

After all this, you might feel like "uff, it's a lot of jargon!", just like Babu Rao would say in one of his classic moments.

After multi-head attention processes the tokens, they are passed to a feed-forward neural network for further refinement; during training, the model learns through forward passes and backpropagation. The End


Written by

Abhyudaya Dubey