Decoding AI Jargons Over Chai ☕ – A Simple Guide to How GPT Works


Tagline: Ever wondered how GPT (Generative Pre-trained Transformer) understands and responds like a human? Let’s break down the buzzwords over a cup of chai.
What is GPT?
GPT (Generative Pre-trained Transformer) is a type of AI that can generate human-like text.
It's trained on tons of text from the internet to understand and generate natural language.
It’s based on the Transformer model — a breakthrough AI architecture introduced by Google in 2017.
Big Picture Flow of GPT:
Text → Tokenize (split into pieces)
Tokens → Vectors (number representation)
Vectors → Transformer Layers
Transformer uses Attention to understand
Generates next word (token)
Repeats until done!
What is a Transformer?
A Transformer is the brain behind GPT.
It reads input, understands relationships between words, and generates meaningful output.
Introduced by Google in 2017, it processes all the words in a sentence in parallel, which makes it faster and better at handling long-range context than older models like RNNs or LSTMs.
The Transformer architecture was originally designed for machine translation (like Google Translate), and is now the core foundation of models like GPT.
Encoder – Turning Words into Meaningful Numbers
An Encoder is the first part of a Transformer model (used in tasks like translation or search). Its job is to take a sentence as input and convert each word into a numerical representation — also called vectors or embeddings.
These vectors capture the meaning, position, and context of words in a sentence, making it easier for the model to understand relationships between them.
Example: Let’s take the sentence: “The cat sat on the mat”
In a model like GPT-4o, this sentence might first be tokenized into token IDs like: [976, 9059, 10139, 402, 290, 2450]
Decoder – Generating Text from Understanding
The Decoder is the part of a Transformer that takes the encoded information (vectors from the encoder) and generates new text, one token at a time.
In models like GPT, which are built specifically for text generation, the decoder is used without the encoder — because GPT doesn't translate between languages, it just predicts the next word/token based on everything it has seen so far.
Example: Input: “The cat sat on the”
The decoder predicts: “mat” (based on training data and attention patterns)
Output: “The cat sat on the mat”
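Here’s a minimal sketch of that predict-then-repeat loop in Python. A hand-written lookup table stands in for the trained network (the table entries are invented for illustration; a real model scores every token in its vocabulary instead of looking up one answer):

```python
# Toy sketch of decoder-style generation: a lookup table stands in
# for the trained network that predicts the next token.
next_token_table = {
    ("cat", "sat", "on"): "the",   # hypothetical "learned" associations
    ("sat", "on", "the"): "mat",
}

def generate(tokens, max_new_tokens=2):
    tokens = list(tokens)
    for _ in range(max_new_tokens):
        context = tuple(tokens[-3:])           # look at the last 3 tokens
        next_token = next_token_table.get(context)
        if next_token is None:                 # nothing learned for this context
            break
        tokens.append(next_token)              # feed the output back in
    return tokens

print(generate(["The", "cat", "sat", "on"]))
# → ['The', 'cat', 'sat', 'on', 'the', 'mat']
```

The key idea is the feedback loop: each generated token is appended to the input and the model predicts again, until it decides to stop.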
Vector – The Numeric Form of Language
A vector is just a list of numbers that represents a word, phrase, or token in a way a computer can understand.
Since machines don’t understand human language, every word we give to a model like GPT is converted into a vector of numbers — like turning "cat" into something like: [0.12, 0.88, -0.45, 0.22, ...]
Vector Embedding – Adding Meaning to Numbers
A vector embedding is a specially trained vector that captures the meaning and relationships of a word with others.
In vector space:
Words that have similar meanings or usage are close together.
Words that are unrelated are far apart.
For example:
The vectors for "king", "queen", and "royalty" will be close together in the vector space.
But "cat" and "laptop" will be far apart — because they mean totally different things.
Just as "man" relates to "woman", GPT learns that "boy" relates to "girl", so given "Man : Woman :: Boy : ?", it predicts "Girl".
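We can see both ideas with a tiny hand-crafted example. The 3-number vectors below are made up for illustration (real models learn hundreds of dimensions from data), but they show how cosine similarity measures "closeness" and how the classic analogy arithmetic works:

```python
import math

# Hand-crafted toy embeddings (dimensions: royalty, gender, animal-ness).
# These values are invented for illustration only.
emb = {
    "king":  [0.9,  0.3, 0.0],
    "queen": [0.9, -0.3, 0.0],
    "man":   [0.1,  0.3, 0.0],
    "woman": [0.1, -0.3, 0.0],
    "cat":   [0.0,  0.0, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Related words sit close together, unrelated ones far apart.
print(cosine(emb["king"], emb["queen"]))   # high
print(cosine(emb["king"], emb["cat"]))     # low (0.0 here)

# The classic analogy: king - man + woman lands near queen.
analogy = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]
print(cosine(analogy, emb["queen"]))       # close to 1.0
```

With these toy vectors the analogy lands exactly on "queen"; in a real embedding space it lands *near* it, and the model picks the closest word.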
What is Vocabulary Size (Vocab Size)?
Vocab size = how many unique tokens GPT understands.
GPT-4o has a vocab size of ~200,000 tokens!
That means it can recognize about 200k distinct pieces of language – full words, subwords, symbols, emojis, etc.
Tokenization – Breaking Text into Pieces
GPT breaks every sentence into smaller chunks called tokens.
Example Text: "This is GPT-4o example"
Token count: 18
Tokens: [200264, 17360, 200266, 851, 382, 329, 555, 19, 21465, 4994, 200265, 200264, 1428, 200266, 200265, 200264, 173781, 200266]
(The count is higher than the word count because the sentence is split into subwords, and the IDs above also include special chat-formatting tokens.)
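Here’s the core idea as a toy greedy longest-match tokenizer in Python. The subwords and IDs in this vocabulary are invented for illustration; real tokenizers (like GPT-4o’s BPE with ~200k tokens) learn their vocabulary from data, but the matching works the same way:

```python
# Toy subword tokenizer: greedy longest-match against a tiny made-up vocabulary.
vocab = {"This": 1, " is": 2, " GPT": 3, "-4": 4, "o": 5, " ex": 6, "ample": 7}

def tokenize(text):
    ids = []
    while text:
        # Take the longest vocabulary entry that prefixes the remaining text.
        match = max((t for t in vocab if text.startswith(t)), key=len, default=None)
        if match is None:
            raise ValueError(f"no token covers: {text!r}")
        ids.append(vocab[match])
        text = text[len(match):]
    return ids

print(tokenize("This is GPT-4o example"))  # → [1, 2, 3, 4, 5, 6, 7]
```

Notice how "example" splits into " ex" + "ample" – subword tokenization is why token counts rarely match word counts.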
Positional Encoding – Knowing Word Order in a Sentence
Transformers (like GPT) process all words at once — not in order like humans read (left to right).
But word order matters!
For example:
“The dog chased the cat” is not the same as
“The cat chased the dog”
This is where Positional Encoding comes in — it helps the model understand which word came first, second, third, etc., by adding position-based values to the word vectors.
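The original Transformer paper did this with fixed sine/cosine patterns: each position gets its own unique wave-based fingerprint, which is added to the word’s vector. A small sketch:

```python
import math

# Sinusoidal positional encoding from the original Transformer paper:
#   PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
#   PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
def positional_encoding(pos, d_model=8):
    pe = []
    for i in range(d_model):
        angle = pos / (10000 ** ((i // 2 * 2) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

# Position 0 always encodes to [0, 1, 0, 1, ...]; every later position
# gets a distinct pattern, so the model can tell word order apart.
print(positional_encoding(0))
print(positional_encoding(1))
print(positional_encoding(2))
```

Modern GPT models use learned position embeddings or rotary variants instead, but the goal is the same: stamp each token with "where am I in the sentence?".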
Semantic Meaning – Understanding the Meaning, Not Just the Words
Semantic meaning is about what words mean in context — not just what they look like or how they’re spelled.
GPT doesn’t just read words; it tries to understand the meaning behind them using vector embeddings, context, and relationships.
For example: “ICICI BANK” & “RIVER BANK”, both have same word “BANK” but have different meanings.
Self-Attention – How GPT Focuses on Important Words
Self-Attention allows GPT to look at all the words in a sentence and figure out which ones are most important to each other — no matter where they appear.
How it works (Simplified):
For each word, the model asks:
“Which other words do I need to look at to understand this one?”
It then gives each word a weight (importance score) – higher weight = more attention.
Self-Attention lets the model decide which words matter most to each word.
It helps GPT understand long sentences, context, and word relationships — even across many words.
It’s the heart of the Transformer, making it smarter than older models.
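A stripped-down sketch of scaled dot-product self-attention in plain Python. To keep it short, each word’s (made-up) vector serves as its own query, key, and value; real Transformers learn separate Q/K/V projection matrices:

```python
import math

words = ["the", "cat", "sat"]
# Toy word vectors, invented for illustration.
vecs = {"the": [0.1, 0.0], "cat": [0.9, 0.3], "sat": [0.8, 0.4]}

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(word):
    q = vecs[word]
    d = len(q)
    # Score this word against every word in the sentence (itself included),
    # scaled by sqrt(d) as in the Transformer paper.
    scores = [sum(qi * ki for qi, ki in zip(q, vecs[w])) / math.sqrt(d)
              for w in words]
    weights = softmax(scores)  # how much attention `word` pays to each word
    # Output: a weighted mix of all the value vectors.
    out = [sum(w * vecs[words[j]][i] for j, w in enumerate(weights))
           for i in range(d)]
    return weights, out

weights, _ = self_attention("cat")
for w, a in zip(words, weights):
    print(f"attention of 'cat' on '{w}': {a:.2f}")
```

With these toy vectors, "cat" attends most to itself and to the related word "sat", and least to "the" – exactly the kind of importance ranking the section describes.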
Multi-Head Attention – Looking at Everything, From Different Angles
While Self-Attention helps GPT focus on the most relevant words in a sentence,
Multi-Head Attention takes it a step further — it lets the model look at the same sentence in multiple ways at the same time.
Each “head” learns to focus on a different relationship between words — like grammar, meaning, or emotion.
Example:
Sentence: “The doctor who treated the patient was very kind.”
One attention head might focus on: "doctor" ↔ "treated" (subject-action)
Another head might focus on: "patient" ↔ "treated" (object-action)
Another might focus on: "doctor" ↔ "kind" (who was kind)
By combining all these heads, GPT gets a richer understanding of the sentence!
Softmax – Turning Scores into Probabilities
When GPT is choosing the next word to generate, it uses Softmax to decide how likely each word is based on the context.
Softmax turns raw numbers (called logits) into probabilities that add up to 100% — so GPT can pick the best next word, but also have a chance to pick creative alternatives. It’s what makes GPT both smart and slightly creative.
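In code, softmax is just "exponentiate, then normalize". The logits below are hypothetical scores for three candidate next words after "The cat sat on the":

```python
import math

# Softmax: turn raw logits into probabilities that sum to 1.
def softmax(logits):
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for candidate next words.
candidates = {"mat": 4.0, "sofa": 2.0, "moon": 0.5}
probs = softmax(list(candidates.values()))
for word, p in zip(candidates, probs):
    print(f"{word}: {p:.1%}")
```

"mat" gets most of the probability mass, but "sofa" and "moon" keep a nonzero chance – that leftover probability is where GPT’s occasional creative word choices come from.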
Temperature – Controlling How Creative GPT Gets
Temperature is a setting that controls how random or predictable GPT’s responses are when choosing the next word.
Low temperature → more focused, logical, and predictable answers.
High temperature → more creative, diverse, and surprising responses.
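Temperature is applied by dividing the logits before softmax. Using the same hypothetical scores as before:

```python
import math

def softmax_with_temperature(logits, temperature):
    # Dividing logits by the temperature before softmax reshapes the
    # distribution: T < 1 sharpens it, T > 1 flattens it.
    scaled = [x / temperature for x in logits]
    exps = [math.exp(x) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 0.5]  # hypothetical scores for "mat", "sofa", "moon"
print(softmax_with_temperature(logits, 0.5))  # low T: almost always "mat"
print(softmax_with_temperature(logits, 2.0))  # high T: more variety
```

At low temperature the top word dominates (predictable answers); at high temperature the probabilities even out, so sampling picks surprising words more often.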
Knowledge Cutoff
GPT doesn’t learn continuously. It has a training data cutoff.
📌 GPT-4o, for example, was trained on data only up to late 2023
So it may not know about events after that.
🚀 Explore my hands-on GenAI experiments with tokenization, vector embeddings, and more:
👉 https://github.com/Ajay-Goswami/GenAI