The Code Whisperer: Understanding the Power of Generative Pre-trained Transformers

Vikas Kumar
7 min read

Before anything else, let's be clear: ChatGPT, Gemini, and similar tools are all examples of LLMs (large language models), and they are generative in nature.

Let’s understand it more deeply in simple terms. Go to ChatGPT and ask: "Hey, how are you?"

The output it gives us seems magical, as if there were a person talking to us in real time. So what is happening under the hood? All the magic here comes from the GPT (Generative Pre-trained Transformer), but in reality it is all maths and science. Let’s break it down and understand each part:

Generative → It can create new content (like text, code, or answers) instead of just retrieving existing data.

Pre-trained → It is trained in advance on a massive amount of data before being fine-tuned for specific tasks.

Transformer → A type of AI model architecture that processes and understands relationships between words in context.

So, GPT basically means a model that can generate content, is already trained on a lot of data, and uses the transformer architecture.

Example: search for anything on a search engine (say, Google). Search engines are not generative; they are not able to generate content.

Instead, they have a data index: they crawl the web and index what they find, which helps them return relevant results, and that's why they are search engines. They are not capable of generating content themselves.

LLMs (like Gemini and GPT), on the other hand, are capable of generating text based on our input; they create sequences in real time. They are generative in nature.

Note: It generates the next sequence on the basis of its pre-trained data (historical data, internet data, etc.).

So far we have covered what "generative" and "pre-trained" mean. Let’s move on to the transformer. A transformer is a type of AI model architecture that processes and understands relationships between words in context; put simply, it is the part of GPT that generates output and works on the pre-trained data. The transformer comes from Google's white paper "Attention Is All You Need". All modern LLMs are built on the architecture released in that paper; it is the heart of every LLM out there.

So these transformers are capable of predicting the next item in a sequence, whether that is the next word of text or the next step of a text-to-image generation.

But do transformers really understand the text of the sequence they are producing? No. Everything is stored as numbers (ultimately in binary format). We need to walk through how they work to understand this better.

Step 1: Encoding phase (tokenization)

When a user gives an input, the first step is to tokenize it. Let’s say our input is "Hello chaicode". Here, "Hello" is a separate token and "chaicode" is a separate token (it may even be split further), and each token is then assigned an equivalent number. The model has a master dictionary inside it, which consists of all the tokens it knows and their mapping to numbers. In this case, both "Hello" and "chaicode" are mapped to numbers according to that dictionary, and every model has its own way of tokenization; there is no single fixed algorithm.

You can visit Tiktokenizer to see how different models do tokenization. In our case, "Hello chaicode" came out as 11 tokens, each mapped to its equivalent number.

This means models have something like a dictionary inside them, a vocab, which handles the mapping. Every model has its own vocab size; a larger vocab size means a larger mapping table and a more complex tokenization scheme. You can try this step yourself with the sketch below.
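As a rough illustration (assuming Python and OpenAI's tiktoken library are available; the exact splits and IDs vary by model and encoding), here is how you could peek at tokenization yourself:

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the vocabulary used by several OpenAI models;
# other models ship their own vocabularies and token counts.
enc = tiktoken.get_encoding("cl100k_base")

text = "Hello chaicode"
token_ids = enc.encode(text)                 # text -> list of integer token IDs
print(token_ids)                             # the IDs depend on the encoding
print([enc.decode([t]) for t in token_ids])  # how the text was split into tokens
print(enc.decode(token_ids))                 # IDs -> back to "Hello chaicode"
```

Running it shows that tokens are not always whole words; a rare word like "chaicode" usually gets split into several sub-word pieces.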

Step 2: Vector embedding (turn IDs into vectors)

A vector embedding is just a way to turn words, sentences, or other data into numbers so that computers can understand them. After tokenization, these numbers are arranged into vectors (lists of numbers), and words with similar meanings or related contexts will have vectors that are close together in this number space. For example:

  • cat and milk are near each other in this "number space" because cats often drink milk.

  • dog and pedigree are near each other because dogs eat Pedigree.

Here, each token ID is mapped to a vector from a learned table. Notice that the cat and milk vectors are very close (their numbers are similar), which signals related meaning, and the dog and pedigree vectors are very close to each other too. Think of it like plotting them on a 3D map:

  • cat 🐱 and milk 🥛 are near each other.

  • dog 🐶 and pedigree 🍖 are near each other.

  • If you measure the distance between vectors, a smaller distance means they are more related.

In short:
Vector embedding = turning words into numbers so related things are close together in math space.
It’s like putting all words into a huge 3D map where meaning decides how near or far they are.
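To make "close together in math space" concrete, here is a toy sketch. The 3-dimensional vectors below are made-up numbers, not real learned embeddings (real models use hundreds or thousands of dimensions), and cosine similarity is one common way to measure how related two vectors are:

```python
import numpy as np

# Hand-made toy "embeddings" -- purely illustrative, not from a real model.
embeddings = {
    "cat":      np.array([1.2, -0.8, 0.5]),
    "milk":     np.array([1.1, -0.7, 0.6]),
    "dog":      np.array([0.9,  0.1, -0.6]),
    "pedigree": np.array([0.8,  0.2, -0.5]),
}

def cosine_similarity(a, b):
    # Close to 1.0 = pointing the same way (related); lower = less related.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["cat"], embeddings["milk"]))      # high -> related
print(cosine_similarity(embeddings["cat"], embeddings["pedigree"]))  # lower -> less related
```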

Step 3: Positional encoding (add word order)

Suppose we have this sentence: "The cat chased the dog." If we only have the embeddings, something like:

  • "cat" → [1.2, -0.8, 0.5]

  • "dog" → [0.9, 0.1, -0.6]

  • "the" → [0.3, 0.7, -0.2] the LLM or model knows meanings, but it doesn’t know if “the dog chased the cat” or “the cat chased the dog” → both look the same set of words here.

Here is what positional encoding does: it adds word-order information (position numbers → 1st, 2nd, 3rd, ...) to the embeddings. So each word vector = embedding (meaning) + position (order). Same example, with the numbers:

  • Position 1 (The) → Embedding [0.3, 0.7, -0.2] + Positional [0.1, 0.0, 0.1] = [0.4, 0.7, -0.1]

  • Position 2 (cat) → [1.2, -0.8, 0.5] + [0.2, 0.1, -0.1] = [1.4, -0.7, 0.4]

  • Position 3 (chased) → [1.0, 0.5, -0.3] + [0.3, 0.2, -0.2] = [1.3, 0.7, -0.5]

Now the model knows not just what the words mean, but also the sequence in which they appear.

Note: Think of embedding as describing what ingredient is on a pizza 🍕 (cheese, tomato, olives), while positional encoding tells you where on the pizza each ingredient is placed.
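The position vectors in the example above are made-up numbers. For reference, here is a minimal sketch of the sinusoidal positional encoding described in "Attention Is All You Need" (many newer models use learned or rotary position embeddings instead, but the idea of adding order information to meaning is the same):

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of positional encodings."""
    positions = np.arange(seq_len)[:, None]          # 0, 1, 2, ... (word order)
    dims = np.arange(d_model)[None, :]
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    encodings = np.zeros((seq_len, d_model))
    encodings[:, 0::2] = np.sin(angles[:, 0::2])     # even dimensions use sine
    encodings[:, 1::2] = np.cos(angles[:, 1::2])     # odd dimensions use cosine
    return encodings

# Toy embeddings for "The cat chased the dog" (5 tokens, 4 dimensions each).
embeddings = np.random.randn(5, 4)
with_position = embeddings + sinusoidal_positions(5, 4)   # meaning + order
print(with_position.shape)                                 # (5, 4)
```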

Step 4: Self-attention (single head)

Example: you are reading a book and the sentence is "The cat drank milk because it was thirsty." The question here is: what does “it” refer to? 🤔 We know that "it" refers to the cat, not to the milk, and to figure that out, our brain looks at all the words and finds which word is most related to "it".

Self-attention does the same thing for the model after positional encoding. It allows every word to look at every other word and decide:

  1. Which words are important.

  2. How much attention it needs to give them.

  3. How to create a contextual meaning.

How it works internally: each word embedding goes through three “lenses”:

  1. Query (Q): What am I looking for?

  2. Key (K): What do I have to offer?

  3. Value (V): What info do I pass on?

The model compares Q & K (like asking “Do I match?”), then uses V (the actual info) if the match is strong. A minimal sketch of this computation follows.
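Here is a toy sketch of single-head (scaled dot-product) attention. The projection matrices Wq, Wk, and Wv would normally be learned during training; random matrices stand in for them here just to show the mechanics:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """x: (seq_len, d_model) word vectors. Returns context-mixed vectors."""
    Q = x @ Wq                               # what each word is looking for
    K = x @ Wk                               # what each word has to offer
    V = x @ Wv                               # the info each word passes on
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # "do I match?" for every pair of words
    weights = softmax(scores)                # how much attention each word pays to the others
    return weights @ V                       # blend the values by those weights

seq_len, d_model = 8, 16                     # e.g. "The cat drank milk because it was thirsty."
x = np.random.randn(seq_len, d_model)
Wq, Wk, Wv = (np.random.randn(d_model, d_model) for _ in range(3))
print(self_attention(x, Wq, Wk, Wv).shape)   # (8, 16)
```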

Step 5: Multi-head attention (rich understanding)

It runs several attention operations over the same sentence at the same time and combines them all for a rich understanding. It already has contextual understanding from single-head attention; multi-head attention adds parallelization on top.

Instead of doing this once, the model does it many times in parallel (“heads”) -

  • One head may focus on the subject ↔ verb.

  • Another head may focus on adjectives ↔ nouns.

  • Another may focus on long-distance words in the sentence.

Then all the heads are combined, giving a rich understanding (see the sketch below).
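Building on the single-head sketch above, here is a toy multi-head version: the model splits its representation across several smaller heads, runs attention in each, then concatenates and mixes the results. The random matrices are again stand-ins for learned weights:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

def multi_head_attention(x, num_heads):
    """Run attention once per head on a smaller slice, then concatenate."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        # Each head gets its own (random, stand-in) projection matrices.
        Wq, Wk, Wv = (np.random.randn(d_model, d_head) for _ in range(3))
        heads.append(attention(x @ Wq, x @ Wk, x @ Wv))  # (seq_len, d_head)
    combined = np.concatenate(heads, axis=-1)            # back to (seq_len, d_model)
    Wo = np.random.randn(d_model, d_model)               # final mixing projection
    return combined @ Wo

x = np.random.randn(8, 16)
print(multi_head_attention(x, num_heads=4).shape)        # (8, 16)
```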

Step 6: Add & Norm (Residual Connection + Normalization): repeated in every block

This step is for safety and balance (a small sketch follows the list below):

  • Add: The original input (before self-attention) is added back to the attention output.

  • Example: You write an essay, then add back your original notes so you don’t lose the main idea.

  • Norm (Normalization): Keeps the numbers balanced, so nothing explodes or shrinks too much.

  • Example: If one student talks too loudly and another too softly, the teacher tells both to speak at a normal tone.
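A toy sketch of the Add & Norm step, using layer normalization over made-up numbers (real models also learn a per-dimension scale and shift, omitted here for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each word's vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

seq_len, d_model = 8, 16
original_input = np.random.randn(seq_len, d_model)     # before self-attention (the "original notes")
attention_output = np.random.randn(seq_len, d_model)   # stand-in for the attention result

added = original_input + attention_output   # Add: keep the main idea from the input
normalized = layer_norm(added)              # Norm: keep the numbers at a "normal speaking volume"
print(normalized.mean(axis=-1).round(6))    # roughly 0 for every position
```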

Step 7: Feed Forward (Small Neural Network Layer)

Think of it like a mini brain inside each word position.

  • After attention mixes the words, each word gets passed through a little neural network (the same for every word).

  • It helps the model learn more complex patterns. Example:
    Sentence = “The cat drank milk because it was thirsty.”

  • Self-attention already told us "it" relates to "cat."

  • The feed-forward network then helps refine that meaning further — like "cat is an animal, thirsty relates to animals, so this connection is stronger."

  • Why?
    It adds depth, like a second layer of reasoning. Attention tells us who should talk to whom, and feed-forward helps them say something smarter. A small sketch follows.
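Here is a toy sketch of that feed-forward layer: a small two-layer network applied independently to every word position, with random matrices standing in for learned weights (many real models use GELU instead of ReLU, and a much larger hidden size):

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feed-forward: expand, apply a non-linearity, project back."""
    hidden = np.maximum(0.0, x @ W1 + b1)   # ReLU non-linearity
    return hidden @ W2 + b2                 # back to the original width

seq_len, d_model, d_hidden = 8, 16, 64      # hidden layer is typically wider than d_model
x = np.random.randn(seq_len, d_model)       # output of the Add & Norm step
W1, b1 = np.random.randn(d_model, d_hidden), np.zeros(d_hidden)
W2, b2 = np.random.randn(d_hidden, d_model), np.zeros(d_model)
print(feed_forward(x, W1, b1, W2, b2).shape)   # (8, 16): same shape, refined meaning
```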

In the end, the model de-tokenizes the result back into human-readable output 🛸. The full pipeline looks like this (a toy decoding sketch follows the list):

Text
 → Tokenize
 → IDs
 → Embeddings + Positions
 → [Q,K,V] → Attention → Multi-Head
 → Add & Norm
 → Feed-Forward (MLP)
 → Add & Norm
 → (repeat blocks N times)
 → Decode (greedy/top-k/top-p/temperature)
 → Next token
 → Loop for more tokens
 → Detokenize → Output text
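To close the loop, here is a toy sketch of the decoding step near the end of the pipeline. The model's final layer produces one score (a logit) per vocabulary token; the next token is then chosen greedily or sampled with temperature / top-k. The tiny vocabulary and logits below are made-up numbers:

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

# Made-up scores over a tiny 5-word vocabulary.
vocab = ["the", "cat", "milk", "thirsty", "drank"]
logits = np.array([1.0, 3.2, 0.5, 2.8, 1.5])

# Greedy decoding: always pick the highest-scoring token.
print("greedy:", vocab[int(np.argmax(logits))])

# Temperature + top-k sampling: higher temperature = more varied output.
def sample_next(logits, temperature=0.8, top_k=3):
    scaled = logits / temperature
    top = np.argsort(scaled)[-top_k:]        # keep only the k best candidates
    probs = softmax(scaled[top])
    return int(np.random.choice(top, p=probs))

print("sampled:", vocab[sample_next(logits)])
# The chosen token is appended to the input and the whole loop runs again.
```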
