Understanding the model behind ChatGPT

Akshat Gupta
6 min read

ChatGPT is one of the most widely used large language models (LLMs) today. When it was released in 2022, it felt almost magical — users could ask complex questions, receive detailed explanations, and even generate working code with just a simple prompt. But have you ever wondered what actually powers ChatGPT? Once you understand how it works, you'll realize it’s not magic — and it’s certainly not here to take your job.

Predicting the Next Word

At its core, ChatGPT is essentially a highly advanced word predictor. Whatever you input, ChatGPT tries to predict the next word (or more precisely, the next token). For example, if you write:

"Hi, my name is _____"

You might guess the next word is "Akshat" — if you knew the author of this article. ChatGPT does something similar. Based on context and training data, it predicts the most likely next token. This ability to generate fluent, coherent language comes from how it was trained and the model architecture it uses.
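You can see this next-token machinery directly in code. ChatGPT's own weights aren't public, so the minimal sketch below uses the small open-source GPT-2 model via the Hugging Face transformers library; the principle is the same: score every token in the vocabulary and pick the most likely continuation.

```python
# A minimal sketch of next-token prediction with the open GPT-2 model.
# Requires: pip install torch transformers
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "Hi, my name is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits          # shape: (1, seq_len, vocab_size)

next_token_logits = logits[0, -1]            # scores for the token that comes after the prompt
next_token_id = int(next_token_logits.argmax())
print(tokenizer.decode([next_token_id]))     # the single most likely continuation
```

Run it a few times with different prompts and you'll see the model is always doing the same thing: ranking every token in its vocabulary by how well it fits what came before.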

GPT: Generative Pre-trained Transformer

The name "ChatGPT" reveals a lot about its inner workings. GPT stands for Generative Pre-trained Transformer:

  • Generative: It can generate new content, whether that’s answering a question, writing an essay, or summarizing a document.

  • Pre-trained: It’s trained on a massive dataset of text from books, websites, code, and more.

  • Transformer: The model architecture that powers GPT. Introduced in the paper "Attention is All You Need", the Transformer revolutionized deep learning by enabling better handling of long-range dependencies in text.

Real-Time Information and Knowledge Cutoff

Ask ChatGPT, "What is the weather in Jaipur today?" and it may give you a real-time answer, but only if it has access to tools like web search. Otherwise, it can’t answer real-time questions because its knowledge is frozen at a training cutoff, which varies by model version (for GPT-4 Turbo, for example, the cutoff is December 2023).

How the Transformer Works

Let’s break down the Transformer architecture step-by-step.

1. Input Embedding

Tokenization

First, the input is broken down into tokens. A token can be a word, subword, or even a character depending on the tokenizer. For simplicity, let’s assume each word is a token.

Input: "The cat sat on the mat" Tokens: [The, cat, sat, on, the, mat] → 6 tokens

These tokens are then mapped to numbers based on a predefined vocabulary.

Vocabulary Mapping: [The → 10, cat → 13, sat → 45, on → 32, the → 11, mat → 46]
Token IDs: [10, 13, 45, 32, 11, 46]

You can visualize tokenization at tiktokenizer.vercel.app.
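You can also reproduce the mapping in code with OpenAI's tiktoken library. The IDs it prints will differ from the simplified ones above, because real tokenizers split text into subword pieces drawn from a much larger vocabulary:

```python
# Tokenize a sentence with the same BPE tokenizer family used by OpenAI models.
# Requires: pip install tiktoken
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
token_ids = enc.encode("The cat sat on the mat")

print(token_ids)                                  # a list of integer token IDs
print([enc.decode([tid]) for tid in token_ids])   # the text piece behind each ID
```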

Embeddings and Semantic Meaning

These token IDs are then converted into high-dimensional vectors — called embeddings — which capture semantic meaning. For example, the word "bank" in:

  • "The river bank"

  • "The ICICI bank"

has different meanings. Embeddings help the model distinguish between these meanings based on context.

Imagine a 3D space where words are positioned based on meaning. Related words like "Queen" and "Woman" sit close together, and the direction from "Man" to "King" is roughly the same as the direction from "Woman" to "Queen". This is how relationships between words get encoded geometrically.
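Here's a toy sketch of that geometric intuition using made-up 3-dimensional vectors. Real embeddings have hundreds or thousands of dimensions, and the numbers below are purely illustrative:

```python
# Toy illustration of "meaning as direction" with hypothetical 3-D embeddings.
import numpy as np

emb = {
    "king":  np.array([0.80, 0.60, 0.10]),
    "queen": np.array([0.78, 0.58, 0.90]),
    "man":   np.array([0.10, 0.20, 0.12]),
    "woman": np.array([0.08, 0.18, 0.92]),
}

def cosine(a, b):
    # Cosine similarity: 1.0 means the vectors point in the same direction.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# If the analogy holds, "king" - "man" + "woman" should land near "queen".
analogy = emb["king"] - emb["man"] + emb["woman"]
print(cosine(analogy, emb["queen"]))   # ≈ 1.0 for these hand-picked toy vectors
```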

Positional Encoding

Transformers don’t process text word-by-word. Instead, they handle the entire sequence in parallel. But this parallelism means the model has no built-in sense of word order. To solve this, positional encodings are added to the token embeddings — not replacing them, but combining with them.

A token embedding is a vector that represents the meaning of a word. A positional encoding is another vector, based on the position of that word in the sentence. These two vectors are added together to form a new vector that carries both the meaning of the word and its position.

For example, let’s say:

  • "cat" is tokenized and gets an embedding like [0.2, 0.4, 0.1]

  • and it’s the 2nd word in the sentence, so the positional encoding for position 2 is [0.05, -0.1, 0.03]

Then the final vector passed into the Transformer would be:

[0.2 + 0.05, 0.4 + (-0.1), 0.1 + 0.03] = [0.25, 0.3, 0.13]

This way, the model knows it’s seeing "cat" and that it’s in the second position.

There are different ways to compute positional encodings — some use fixed sine/cosine patterns, others use learnable embeddings — but the goal is the same: inject order information into otherwise orderless input.
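As a concrete illustration, here is a small NumPy sketch of the fixed sine/cosine scheme from the original Transformer paper; learned positional embeddings are an equally valid alternative:

```python
# Sinusoidal positional encodings, as defined in "Attention Is All You Need".
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    positions = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                      # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                        # (seq_len, d_model)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])                   # odd dimensions use cosine
    return pe

token_embeddings = np.random.rand(6, 8)                     # 6 tokens, toy model width of 8
x = token_embeddings + positional_encoding(6, 8)            # add to the embeddings, don't replace them
```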

2. Self-Attention

The real magic of Transformers lies in the self-attention mechanism. It allows each token to consider every other token in the sequence when understanding its meaning in context. This means that instead of processing words in isolation or strictly in order, the model can dynamically focus on the most relevant parts of the sentence for each word.

Take the sentence:

"The cat sat on the mat because it was tired."

To figure out what "it" refers to, you need to connect it to "cat". Self-attention lets the model do just that — it assigns more importance (or attention) to words like "cat" when interpreting "it", helping the model grasp that "it" likely means the cat, not the mat. This deep contextual understanding is what makes Transformers so powerful for language tasks.
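Under the hood, each token's vector is projected into a query, a key, and a value, and the attention weights come from comparing queries against keys. Here's a minimal single-head sketch with random toy weights, following the standard formula softmax(Q·Kᵀ / √d_k) · V:

```python
# Scaled dot-product self-attention for a single head, with random toy weights.
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, W_q, W_k, W_v):
    Q, K, V = x @ W_q, x @ W_k, x @ W_v        # queries, keys, values
    scores = Q @ K.T / np.sqrt(Q.shape[-1])    # how strongly each token attends to every other token
    weights = softmax(scores)                  # each row sums to 1
    return weights @ V                         # weighted mix of value vectors

seq_len, d_model = 6, 8                        # e.g. "The cat sat on the mat"
x = np.random.rand(seq_len, d_model)           # token embeddings + positional encodings
W_q, W_k, W_v = (np.random.rand(d_model, d_model) for _ in range(3))
context = self_attention(x, W_q, W_k, W_v)     # (6, 8): one context-aware vector per token
```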

3. Multi-Head Attention

Instead of performing attention just once, Transformers do it multiple times in parallel — a technique known as multi-head attention. Each of these parallel attention layers is called a head, and each head has its own learnable weights (or parameters). Because each head learns to focus on different patterns or relationships — like grammar, meaning, or word position — the model can capture various aspects of the input simultaneously. The outputs from all heads are then combined, giving the model a richer and more nuanced understanding of the context.
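A rough sketch of how that can be arranged: the model width is split across heads, each head runs its own smaller attention with its own weights, and the per-head outputs are concatenated and projected back together. The shapes and weights below are toy values, not the real model's:

```python
# Multi-head attention sketch: several smaller attention heads run in parallel.
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def head_attention(x, W_q, W_k, W_v):
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_attention(x, heads, W_o):
    outputs = [head_attention(x, *head) for head in heads]   # one result per head
    return np.concatenate(outputs, axis=-1) @ W_o            # combine heads and mix them back together

seq_len, d_model, n_heads = 6, 8, 2
d_head = d_model // n_heads                                  # each head works in a smaller subspace
x = np.random.rand(seq_len, d_model)
heads = [tuple(np.random.rand(d_model, d_head) for _ in range(3)) for _ in range(n_heads)]
W_o = np.random.rand(n_heads * d_head, d_model)              # output projection
out = multi_head_attention(x, heads, W_o)                    # (6, 8)
```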

4. Feed-Forward Networks

After the attention layers, the output goes through a feed-forward neural network — essentially a few dense layers applied independently to each token. This helps further refine the representations.
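A minimal sketch of this position-wise feed-forward step, with toy dimensions (in practice the inner layer is typically about four times wider than the model dimension):

```python
# Position-wise feed-forward network: two dense layers applied to each token independently.
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    hidden = np.maximum(0, x @ W1 + b1)    # ReLU non-linearity (GPT models use a GELU variant)
    return hidden @ W2 + b2                # project back down to the model dimension

seq_len, d_model, d_ff = 6, 8, 32          # toy sizes; the inner layer is the wide one
x = np.random.rand(seq_len, d_model)       # output of the attention block
W1, b1 = np.random.rand(d_model, d_ff), np.zeros(d_ff)
W2, b2 = np.random.rand(d_ff, d_model), np.zeros(d_model)
out = feed_forward(x, W1, b1, W2, b2)      # (6, 8): refined token representations
```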

5. Stacking and Layers

Transformers stack multiple such attention and feed-forward blocks on top of each other, building increasingly abstract representations at each level.

6. Output Generation

Finally, the model predicts the next token in the sequence. This process involves several steps:

  • Linear Transformation: The final hidden state — which is the model’s internal representation of each token after processing all the layers — is passed through a linear (dense) layer that maps it to the size of the vocabulary. This produces a vector of raw scores (called logits) for each possible token. These scores indicate how likely each token is to come next, before applying a softmax to convert them into probabilities.

  • Softmax Activation: These logits are passed through a softmax function, which converts them into probabilities. The token with the highest probability is selected as the next token in the sequence — unless sampling strategies are used.

  • Temperature Parameter: The temperature setting controls the randomness of predictions. A lower temperature (e.g., 0.2) makes the model more confident and deterministic, sticking to high-probability choices. A higher temperature (e.g., 1.0 or above) increases randomness, allowing more diverse and creative outputs by making the probability distribution flatter.

    You can control this parameter when using the OpenAI API, for example, by setting temperature in your request payload. It’s commonly used by developers and researchers when generating text through code — such as in chatbots, creative writing tools, or content generators — where controlling the creativity level is important. For instance:

    { "model": "gpt-4", "prompt": "Write a sci-fi story about Mars.", "temperature": 0.9 }

The model continues this token-by-token prediction process until it generates a complete response or hits a stopping condition.
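Putting those last steps together, here is a minimal sketch of logits → softmax → temperature → sampled token, using a hypothetical five-token vocabulary:

```python
# Logits -> temperature-scaled softmax -> sampled next token.
import numpy as np

def sample_next_token(logits, temperature=1.0):
    logits = np.asarray(logits, dtype=float) / max(temperature, 1e-6)  # low T sharpens, high T flattens
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                                               # softmax: probabilities over the vocabulary
    return int(np.random.choice(len(probs), p=probs))

# Hypothetical logits for a toy five-token vocabulary.
logits = [2.0, 1.0, 0.5, 0.1, -1.0]
print(sample_next_token(logits, temperature=0.2))   # almost always token 0 (near-greedy)
print(sample_next_token(logits, temperature=1.5))   # more varied, "creative" choices
```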


Understanding ChatGPT demystifies a lot of the AI hype. It's not magic — it’s mathematics, data, and clever engineering. While it’s a powerful tool, it's still just that — a tool. Knowing how it works empowers you to use it more effectively and responsibly.
