Decode AI Jargon with Chai ☕️

Abhijith Kale

Have you ever wondered how AI like ChatGPT can write poems, code, or even answer complex questions? At first, I couldn't wrap my head around it. It felt like I was peeking into a box of mysteries, except instead of secrets, it was all math and matrices. So I decided to dig deeper, and here's what I found out.

What is GPT?

You’ve probably used ChatGPT by now. It’s essentially a chatbot powered by something called GPT—but what exactly does GPT mean?

GPT stands for Generative Pre-trained Transformer. It's a type of AI model that can generate human-like text based on the data it was trained on. Think of GPT as a very smart storyteller. But instead of drawing on emotions and personal memories, it relies on patterns in data. It's powered by a special model architecture called a Transformer, which changed the game for how machines understand language.

GPT also has a knowledge cutoff, which means it doesn’t know anything that happened after a certain date. So if you ask it about the current weather or recent events, it probably won’t have a clue.

What is a Transformer?

It all started when Google released a paper in 2017 called “Attention is All You Need.” That’s when the Transformer architecture was introduced—a deep learning model that became the backbone of modern NLP (Natural Language Processing).

So let’s break down how a Transformer works.

1. Input and Tokenization

Whenever you give text to a model, let's say "the cat sat on the mat", the first step is to tokenize it. That means breaking the sentence down into smaller pieces (tokens) and mapping each one to a number.

For example:
['The', 'cat', 'sat', 'on', 'the', 'mat'] ---> [2, 1175, 4401, 3173, 611, 573]

This is possible because the model has a vocabulary, a large list of all the tokens it knows. The vocab size refers to the total number of these unique tokens.

For example, OpenAI's newer models such as GPT-4o use a tokenizer with a vocabulary of around 200,000 tokens, meaning they can recognize and work with about 200,000 unique pieces of language.
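If you want to see tokenization for yourself, here's a minimal sketch using OpenAI's open-source tiktoken library (install it with pip install tiktoken). The exact token IDs depend on which encoding you load, so treat the numbers above as illustrative:

```python
# A minimal tokenization sketch using OpenAI's open-source tiktoken library.
# Install it first with: pip install tiktoken
import tiktoken

# "o200k_base" is the encoding used by OpenAI's newer models (~200k-token vocabulary).
enc = tiktoken.get_encoding("o200k_base")

text = "The cat sat on the mat"
token_ids = enc.encode(text)                    # text -> list of integer token IDs
tokens = [enc.decode([t]) for t in token_ids]   # each ID back to its piece of text

print(tokens)       # e.g. ['The', ' cat', ' sat', ' on', ' the', ' mat']
print(token_ids)    # the numeric IDs the model actually sees
print(enc.n_vocab)  # total vocabulary size (around 200,000)
```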

2. Vector Embeddings

Once we have tokens, they’re converted into vector embeddings—mathematical representations that carry meaning.

Imagine placing each word as a dot on a map. Words with similar meanings—like “king” and “queen”—end up closer together. If “king” to “man” gets you a certain direction, then moving that same way from “woman” might land you at “queen.” That’s the magic of embeddings—they capture relationships between words.
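Here's a toy sketch of that "king" and "queen" idea. The 3-dimensional vectors below are invented for illustration only; real embeddings are learned by the model and have hundreds or thousands of dimensions:

```python
# Toy illustration of the "king - man + woman ≈ queen" idea.
# These 3-D vectors are made up for demonstration; real embeddings
# learned by a model have hundreds or thousands of dimensions.
import numpy as np

embeddings = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.8, 0.9]),
    "man":   np.array([0.5, 0.2, 0.1]),
    "woman": np.array([0.5, 0.2, 0.9]),
}

def cosine(a, b):
    # Cosine similarity: 1.0 means the vectors point in the same direction.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Move from "king" in the same direction that takes "man" to "woman".
result = embeddings["king"] - embeddings["man"] + embeddings["woman"]

# The closest word to the result should be "queen".
nearest = max(embeddings, key=lambda word: cosine(result, embeddings[word]))
print(nearest)  # queen
```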

3. Positional Encoding

Here’s a problem: embeddings treat all words the same, no matter where they appear in a sentence. But word order matters.

Compare:

  • The cat sat on the mat

  • The mat sat on the cat

Same words, totally different meaning.

Since Transformers don’t inherently understand order, we add positional encodings—extra information that tells the model the position of each word in the sequence.
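One classic way to do this is the sinusoidal encoding from the "Attention is All You Need" paper. Here's a rough NumPy sketch (GPT-style models often learn their position embeddings instead, but the idea is the same: every position gets its own vector, which is added to the word embedding):

```python
# Sinusoidal positional encoding, as described in "Attention is All You Need".
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1): 0, 1, 2, ...
    dims = np.arange(0, d_model, 2)[None, :]       # even embedding dimensions
    angles = positions / (10000 ** (dims / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions get cosine
    return pe

# One positional vector per token; it gets added to the word embeddings
# so the model knows where each token sits in the sentence.
pe = positional_encoding(seq_len=6, d_model=8)
print(pe.shape)  # (6, 8)
```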

4. Self-Attention

This is where things get cool.

Self-attention lets the model look at all the words in a sentence and figure out which ones are important when trying to understand a particular word.

Take the sentence:
"The cat sat on the mat because it was tired."
The model uses attention to figure out that “it” refers to “cat.”

Self-attention is what helps GPT stay coherent and context-aware.
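Under the hood, self-attention is just a handful of matrix multiplications. Here's a stripped-down, single-head sketch in NumPy; the weight matrices are random and the sizes tiny, purely to show the mechanics:

```python
# Scaled dot-product self-attention for a single head, in plain NumPy.
# The weight matrices are random and the sizes tiny, purely to show the mechanics.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, W_q, W_k, W_v):
    Q, K, V = x @ W_q, x @ W_k, x @ W_v       # project each token into queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # how strongly each token attends to every other token
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights               # weighted mix of the value vectors

rng = np.random.default_rng(0)
seq_len, d_model = 10, 16   # roughly the token count of "The cat sat on the mat because it was tired"
x = rng.normal(size=(seq_len, d_model))       # stand-in for token embeddings + positional encodings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out, attn = self_attention(x, W_q, W_k, W_v)
print(attn.shape)  # (10, 10): one attention distribution per token
```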

5. Multi-Head Attention

Now imagine running multiple self-attention mechanisms in parallel. That’s called multi-head attention.

Each "head" focuses on a different aspect of the sentence: one might look at syntax, another at meaning, another at position. Together, they give a richer understanding of the input.
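Here's a rough sketch of the same idea in NumPy: several small attention heads run side by side on slices of the embedding, and their outputs are concatenated at the end (again, random weights, just to show the shape of the computation):

```python
# Multi-head attention sketch: several attention "heads" run in parallel on
# smaller slices of the embedding, and their outputs are concatenated.
# Weights are random and sizes tiny, just to show the shape of the computation.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(x, W_q, W_k, W_v):
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
    return weights @ V

def multi_head_attention(x, num_heads, rng):
    seq_len, d_model = x.shape
    d_head = d_model // num_heads              # each head works in a smaller subspace
    outputs = []
    for _ in range(num_heads):
        W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        outputs.append(attention_head(x, W_q, W_k, W_v))   # (seq_len, d_head) per head
    return np.concatenate(outputs, axis=-1)    # concatenated back to (seq_len, d_model)

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 16))                  # stand-in for token embeddings
print(multi_head_attention(x, num_heads=4, rng=rng).shape)  # (10, 16)
```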

6. Feed-Forward Networks

Once the model has figured out all the relationships, the data goes through a feedforward neural network—just layers of math doing calculations to guess the next word.

At each step, the model makes a prediction, compares it to the correct answer, calculates the loss, and updates its weights. This is how it learns.

Layer normalization helps keep things stable during this training, and yes—this is the part where your GPU starts sweating. The model has multiple stacked layers doing all this in sequence.
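Here's a rough sketch of one feed-forward block with a residual connection and layer normalization. Real models learn these weights during training and typically use a GELU activation; the random weights and ReLU below are just to show what the computation looks like:

```python
# One Transformer feed-forward block with a residual connection and layer norm.
# Real models learn these weights and usually use GELU; random weights and ReLU
# are used here only to show what the computation looks like.
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)     # keeps activations in a stable range

def feed_forward(x, W1, b1, W2, b2):
    hidden = np.maximum(0, x @ W1 + b1)        # expand to a wider layer, apply ReLU
    return hidden @ W2 + b2                    # project back down to the model size

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 6, 16, 64             # the hidden layer is usually ~4x wider
x = rng.normal(size=(seq_len, d_model))        # stand-in for the attention layer's output

W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

# Residual connection (x + ...) followed by layer normalization.
out = layer_norm(x + feed_forward(x, W1, b1, W2, b2))
print(out.shape)  # (6, 16)
```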

7. Softmax and Temperature

Once the model has a list of possible next words, it uses softmax to turn those into probabilities—basically figuring out which words are most likely to make sense next.

Then there’s temperature, which controls how creative the model gets. A low temperature (like 0.3) keeps things safe and predictable. Crank it up to 1.0 or more, and the model starts taking risks—getting creative, surprising, and sometimes a little weird.

It’s like adjusting the AI’s mood: chill and careful or wild and imaginative.
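Here's a small sketch of softmax with temperature. The scores (called logits) below are made up; in a real model they come out of the final layer for every token in the vocabulary:

```python
# Softmax with temperature: turning raw model scores (logits) into probabilities.
# The logits below are made up; in a real model they come from the final layer.
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()                     # subtract the max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

# Hypothetical scores for four candidate next words after "The cat sat on the ..."
words = ["mat", "sofa", "roof", "moon"]
logits = [4.0, 2.5, 1.0, 0.2]

for temp in (0.3, 1.0, 1.5):
    probs = softmax_with_temperature(logits, temp)
    print(temp, dict(zip(words, probs.round(3))))

# Low temperature: "mat" dominates, so the output is safe and predictable.
# Higher temperature: the probabilities flatten out, so riskier words get sampled more often.
```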

So What’s the Takeaway?

These were some common AI concepts that kept showing up in my research. And after learning about them, I realized something important:

GPT isn’t magic—it’s math with meaning.

It’s trained to find patterns, understand context, and predict what comes next based on tons of data. What makes it amazing is how all these components—attention, embeddings, softmax, temperature—come together like instruments in an orchestra.

Each one plays its part to create what feels like an intelligent conversation.

And now, whenever I talk to ChatGPT, I can’t help but think: I’m not just chatting with AI—I’m watching a symphony of math in motion.
