Demystifying AI Jargon 🤖🧠


We have all heard about GPT, or simply ChatGPT, which generates solutions to all our problems. But have you ever wondered how it accomplishes this? What is the fundamental logic behind it?
Don't worry, we'll go through this article step by step to understand how it works.
So, what is GPT?
GPT stands for Generative Pre-trained Transformer. From "Generative + Pre-trained," we understand that it generates text based on the data it was trained on. It doesn't have real-time data, so it can't tell you today's weather or which team won this IPL season. It only has data up to the point when its training was completed. This point is known as the knowledge cutoff. Now, let's talk about the next part, "Transformer." It's the most important part of any text-generating model available today. It is a neural network architecture that works on the principle of "next word prediction."
From the word "prediction," we can infer that probabilities are involved. But how does an English sentence we write, for example, "Hi, how are you?", get transformed into mathematics, resulting in a predicted output like "I am fine, please tell me how I can help you"? This brings us to our first concept: tokenization.
Tokenization
Tokenization is the process of breaking down the input text into smaller, more manageable pieces called tokens. These tokens can be a word or a sub word. In simpler terms, we break a bigger sentence into smaller words which can be mapped to numbers. Let's look at an example of how tokenization works for a simple sentence like "How are you?".
From the image above, we can see how the words are mapped to numbers:
"How" → 5299
"are" → 553
"you" → 481
"?" → 1423
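If you want to try this yourself, here's a minimal sketch using OpenAI's tiktoken library (assuming it is installed); the exact IDs you get depend on which tokenizer you load and may differ from the ones shown above.

```python
# pip install tiktoken
import tiktoken

# o200k_base is the encoding used by GPT-4o; other models use different encodings.
enc = tiktoken.get_encoding("o200k_base")

token_ids = enc.encode("How are you?")
print(token_ids)                              # a short list of integer token IDs
print([enc.decode([t]) for t in token_ids])   # the text piece each ID maps back to
```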
So, now we have seen how words are mapped to numbers. You can also explore how different models tokenize various words using Tiktokenizer. But this raises a question: how are these numbers generated? Are they random? This leads us to our next concept: vocab size.
Vocab Size
It refers to the number of unique tokens that a tokenizer has learned during the training phase. Each model has a different tokenizer trained to map tokens to a unique ID. For example, GPT-4o has a vocab size of 200,019, and Llama 3.1 has 128,000 tokens. But the question is, how do these unique token IDs or numbers convey the meaning of a word in the real world? This leads us to our next concept: vector embeddings.
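If you are curious, tiktoken also exposes the vocabulary size of an encoding directly; a small sketch, again assuming tiktoken is installed:

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # the tokenizer used by GPT-4o
print(enc.n_vocab)                         # total number of distinct token IDs (~200k)
```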
Vector Embeddings
Vector embeddings are a way to turn words and sentences into numbers that capture their meaning and relationships. This helps the model represent what a word means in the real world. Each token is represented as a vector in an n-dimensional vector space. For example, the word "King," which has a token ID of 6962, is embedded in a 768-dimensional vector space. This means 6962 is mapped to a vector like [0.01, -0.03, ..., 0.07]. So, each token in the sentence is mapped to its own 768-dimensional vector.
Where do we get these vectors from? An embedding lookup table provides the same vector representation for the same token every time it appears. In the picture, we can see how the sentence "Ganga is the longest river" is tokenized and each token is mapped to a 768-dimensional vector.
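Here's a minimal sketch of such a lookup table using PyTorch's nn.Embedding. The token ID is taken from the "King" example above; the table here is deliberately small and randomly initialised, whereas a real GPT-4o-sized table would have roughly 200k learned rows.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 10_000, 768              # small table for the sketch; real models use ~100k-200k rows
embedding = nn.Embedding(vocab_size, d_model)  # the lookup table (random here, learned in a real model)

token_ids = torch.tensor([6962])               # e.g. the token ID for "King"
vector = embedding(token_ids)                  # the same ID always returns the same row
print(vector.shape)                            # torch.Size([1, 768])
```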
These embeddings also help to cluster words with similar meanings close to each other. From the diagram below, we can see that the names of rivers form a cluster near each other in the vector space, while the mountain "Mount Everest" is located far from them.
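Closeness in this vector space is usually measured with cosine similarity. The sketch below only shows how the comparison is made; with trained embeddings, related words like "Ganga" and "Yamuna" would score higher than unrelated ones, but the random vectors used here are just placeholders.

```python
import torch
import torch.nn.functional as F

# Placeholder vectors; in practice these would come from a trained embedding table.
ganga, yamuna, everest = (torch.randn(768) for _ in range(3))

print(F.cosine_similarity(ganga, yamuna, dim=0))   # higher value would mean more similar meaning
print(F.cosine_similarity(ganga, everest, dim=0))  # lower value would mean less related
```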
Referring to the transformer architecture in Figure 1, these embeddings are the output of the input embeddings block. But now you might be wondering: if we change the position of each word, will it result in the same vector embeddings? How do we differentiate between them? For example, "Ganga is the longest river," "Ganga river is the longest," and "Ganga the is longest river" would produce the same set of token embeddings, just in a different order. So, how do we account for the position of the words? This leads us to our next concept: positional encoding.
Positional Encoding
Positional encoding provides information about the position of tokens in a sentence. As humans, we understand the order in which tokens should appear to form a meaningful sentence, and we know how changing their positions can change the meaning. Transformers capture this notion of position using positional encoding. The classic positional encoding is a sinusoidal function that gives values in the range [-1, 1], which are added to the input embeddings to produce the final embeddings of the input.
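Below is a small sketch of the sinusoidal positional encoding from the original transformer paper: even dimensions use a sine, odd dimensions use a cosine, and the result is added element-wise to the token embeddings.

```python
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    pos = torch.arange(seq_len).unsqueeze(1)        # (seq_len, 1) token positions
    i = torch.arange(0, d_model, 2)                 # even dimension indices
    angle = pos / (10_000 ** (i / d_model))         # (seq_len, d_model/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)                  # even dims: sine
    pe[:, 1::2] = torch.cos(angle)                  # odd dims: cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=5, d_model=768)
print(pe.shape)   # (5, 768): added to the 5 token embeddings of the input
```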
So, now we know how sentences are tokenized and converted to vector embeddings that capture both the semantic meaning of the tokens and their position in the input sequence. The next question is how the tokens within a sequence are related. For example, in the sentences "The river Bank" and "The ICICI Bank," how does the transformer understand the context of the word "Bank" in each sentence? This brings us to our next concept: the self-attention mechanism.
Multi Head Self Attention
Self-attention allows each token to communicate with other tokens and capture contextual information and relationships between words. It helps to find relationships within the sequence. This is the core component of the transformer block. Each token's embedding vector is projected into three subspaces: Query (Q), Key (K), and Value (V).
Query: This represents the token we want to learn more about.
Key: This represents the possible tokens it can relate to for more contextual information, helping to identify which tokens are important.
Value: This represents the actual information carried by those tokens, used to update the original token's understanding.
By using these Q, K, and V values, the model calculates attention scores, which determine how much focus each token should receive when making predictions. In a transformer model, this is calculated across multiple heads and repeated over a number of transformer blocks to produce updated embeddings that capture the contextual information about each token.
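Here's a simplified, single-head sketch of scaled dot-product self-attention. Real models use learned projection matrices and run many such heads in parallel, but the core computation looks like this.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    Q = x @ w_q                                     # queries: what each token is looking for
    K = x @ w_k                                     # keys:    what each token offers
    V = x @ w_v                                     # values:  the information to be mixed in
    d_k = K.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (seq_len, seq_len) attention scores
    weights = F.softmax(scores, dim=-1)             # each row sums to 1
    return weights @ V                              # context-aware token representations

seq_len, d_model = 4, 768
x = torch.randn(seq_len, d_model)                   # token embeddings (random placeholders)
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)       # torch.Size([4, 768])
```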
Next comes how we add richer patterns, features, and semantic meaning to our model. This brings us to our next concept: the feed-forward network.
Feed-Forward Network
It is a simple neural network that's applied individually and identically to each token's representation. It's used to add non-linearity and help the model learn richer patterns after attention. For each token, it projects the representation into a higher-dimensional space to add more features, applies a non-linear function to capture complex patterns, and then compresses it back to the original dimension.
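A common sketch of this block: expand to a wider space (often 4x the model dimension), apply a non-linearity, and project back down. The exact widths and activation function vary by model.

```python
import torch
import torch.nn as nn

d_model = 768
feed_forward = nn.Sequential(
    nn.Linear(d_model, 4 * d_model),   # project each token up to a wider feature space
    nn.GELU(),                         # non-linearity to capture richer patterns
    nn.Linear(4 * d_model, d_model),   # compress back to the original dimension
)

x = torch.randn(4, d_model)            # per-token representations after attention
print(feed_forward(x).shape)           # torch.Size([4, 768]): same shape, richer features
```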
Add & Normalisation Layer
In this layer, the original input is added to the sub-layer's output, ensuring the model remembers what it already knew along with the newly learned features and patterns; the result is then normalised to keep the values in a stable range, which helps training. We can understand the need for the residual ("add") part with a simple analogy. Let's say you are given a math problem for homework, and you solve it using one method. The next day, your teacher explains two more methods to solve the same problem, and you write them in your notes. Now, you have three methods for the same problem. In the future, if you encounter an unfamiliar problem and can't use the two new methods, you can always fall back on your original method to solve it.
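In code, the idea is just a residual connection followed by layer normalisation; a minimal sketch, where the sub-layer output is random and stands in for the attention or feed-forward output:

```python
import torch
import torch.nn as nn

d_model = 768
norm = nn.LayerNorm(d_model)

x = torch.randn(4, d_model)              # original input to the sub-layer (what the model "already knew")
sublayer_out = torch.randn(4, d_model)   # stand-in for the attention / feed-forward output

out = norm(x + sublayer_out)             # "add" keeps the original information, then normalise
print(out.shape)                         # torch.Size([4, 768])
```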
Linear Transformation and Softmax Probability Function
The output from the series of transformer blocks is then passed through a linear transformation to project it into a space with dimensions equal to the vocabulary size. For example, if the vocabulary size is 128,000, it is projected into a 128,000-dimensional space. Each token receives a logit score indicating how likely it is to be the next word. Softmax converts these scores into probabilities, allowing us to choose the next token based on these probabilities.
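A sketch of this final projection, using the sizes mentioned above; in a real model the linear layer's weights are learned and the hidden state comes from the transformer blocks rather than being random.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, vocab_size = 768, 128_000
lm_head = nn.Linear(d_model, vocab_size)   # projects a token vector onto the whole vocabulary

hidden = torch.randn(1, d_model)           # final hidden state of the last position (random here)
logits = lm_head(hidden)                   # (1, 128000): one score per vocabulary token
probs = F.softmax(logits, dim=-1)          # scores -> probabilities that sum to 1
print(torch.argmax(probs, dim=-1))         # ID of the most likely next token
```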
Temperature
In the final step of predicting the next word, temperature plays a key role: it scales the logits before the softmax is applied.
temp=1 : softmax behaves normally
temp<1 : The probability distribution becomes sharper, making the model output more predictable and less creative.
temp>1 : The probability distribution becomes flatter, making the model output more creative.
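The sketch below shows the effect on a few made-up logits: dividing by the temperature before softmax sharpens or flattens the resulting distribution.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])     # made-up scores for four candidate tokens

for temp in (0.5, 1.0, 2.0):
    probs = F.softmax(logits / temp, dim=-1)     # temperature rescales logits before softmax
    print(temp, probs)
# temp < 1 -> sharper (more predictable); temp > 1 -> flatter (more varied).
```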
Let's walk through a quick example to understand how the model predicts the next word in a sequence. Take the input "Name the longest river of India." Below is a step-by-step look at how the next word is chosen from the probabilities computed by the softmax function.
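Here's a small stand-in sketch with hypothetical candidate words and made-up logits, showing how the next word can be sampled from the softmax probabilities.

```python
import torch
import torch.nn.functional as F

# Hypothetical candidates for the word after "Name the longest river of India:"
candidates = ["Ganga", "Yamuna", "Brahmaputra", "Godavari"]
logits = torch.tensor([4.0, 1.5, 1.0, 0.5])      # made-up scores

probs = F.softmax(logits, dim=-1)
for word, p in zip(candidates, probs):
    print(f"{word:12s} {p.item():.3f}")

# Sample the next word according to its probability (greedy decoding would just take the max).
next_word = candidates[torch.multinomial(probs, num_samples=1).item()]
print("Predicted next word:", next_word)
```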
Conclusion
From this blog, we’ve gained a clearer understanding of how language models generate answers — not through magic, but through the power of mathematics and deep learning. Every prediction is the result of learned patterns, attention mechanisms, and probability distributions working together in a structured and logical way.