🧠 Attention Made Simple


Let’s break down a key concept behind models like GPT: Attention — the secret sauce that helps models understand words in context.
What is Attention?
Think of attention as spotlighting the right words in a sentence.
For example, take this sentence:
"The cat was being chased by the dogs."
If someone asks: "Who was chasing?"
You instantly focus on “dogs”. That’s attention — focusing on the relevant part of the sentence to make sense of it.
Attention uses three ingredients to compute those attention scores: Queries, Keys, and Values.
Query (Q) - It's like a question each word asks, as in: “Who should I pay attention to so I can understand myself better?”
Key (K) - It's like a tag each word carries about what it offers, as in: “Here's the info I hold.”
Value (V) - It's the actual information that can be shared, as in: “Here's my content, if you need it.”
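In the notation most papers use, all three pieces fit into one formula, the scaled dot-product attention from “Attention Is All You Need”:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

Here $d_k$ is just the size of the Key vectors; dividing by $\sqrt{d_k}$ keeps the scores from blowing up before the softmax. The example below walks through what each piece means.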
Explained with a simple example!
Let’s use the sentence:
“The cat sat on the mat.”
Focus on the word “sat”.
Now imagine “sat” becomes a Query. It asks,
“Who can help me understand what I mean here?”
All the other words — “The”, “cat”, “on”, “the”, “mat” — are turned into Key + Value pairs.
Now, “sat” will:
1. Compare its Query with the Keys of all the other words (using a dot product).
2. Get attention scores: higher when a Key matches the Query.
3. Pass the scores through Softmax to turn them into weights, like:
- cat -> 70%
- mat -> 20%
- others -> 10%
4. Use these weights to mix the Values of those words, giving “sat” a context-aware meaning.
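If you prefer code to prose, here is a minimal NumPy sketch of those four steps. Everything in it is a toy stand-in: the embeddings and projection matrices are random, not learned, so the printed weights won't match the 70/20/10 example above, but the mechanics are the same.

```python
import numpy as np

np.random.seed(0)
words = ["The", "cat", "sat", "on", "the", "mat"]
d = 4                                         # toy embedding size
emb = {w: np.random.randn(d) for w in words}  # random stand-in embeddings

W_q = np.random.randn(d, d)                   # query projection (random stand-in)
W_k = np.random.randn(d, d)                   # key projection
W_v = np.random.randn(d, d)                   # value projection

# "sat" asks its question (Query); every word offers a Key and a Value.
q_sat  = emb["sat"] @ W_q
keys   = np.stack([emb[w] @ W_k for w in words])
values = np.stack([emb[w] @ W_v for w in words])

# Steps 1 & 2: dot product of the Query with every Key -> raw attention scores.
scores = keys @ q_sat / np.sqrt(d)            # scaling keeps the softmax well-behaved

# Step 3: softmax turns the scores into weights that sum to 1.
weights = np.exp(scores) / np.exp(scores).sum()

# Step 4: mix the Values with those weights -> a context-aware vector for "sat".
sat_in_context = weights @ values

for w, a in zip(words, weights):
    print(f"{w:>4}: {a:.2f}")
```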
Why is it called Self-Attention?
Because every word goes through this same process, comparing itself with all the others (including itself). This mutual attention gives the model a deep, layered understanding of how words relate.
That’s the core idea behind self-attention — the building block of Transformers.
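Because every word plays the Query role, the whole computation collapses into a few matrix multiplications, which is how it is usually implemented. A minimal sketch, again with random stand-in embeddings and projections:

```python
import numpy as np

np.random.seed(0)
n, d = 6, 4                               # 6 words, 4-dim toy embeddings
X = np.random.randn(n, d)                 # one random embedding per word
W_q, W_k, W_v = (np.random.randn(d, d) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v       # every word gets a Query, Key, and Value
scores  = Q @ K.T / np.sqrt(d)            # n x n: every word scored against every word
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # row-wise softmax
out     = weights @ V                     # one context-aware vector per word
print(out.shape)                          # (6, 4)
```

One output row per word: each row is that word's mix of Values, weighted by how much it attended to everyone else.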
