🧠 Attention Made Simple


Let’s break down a key concept behind models like GPT: Attention — the secret sauce that helps models understand words in context.
What is Attention?
Think of attention as spotlighting the right words in a sentence.
For example, take this sentence:
"The cat was being chased by the dogs."
If someone asks: "Who was chasing?"
You instantly focus on “dogs”. That’s attention — focusing on the relevant part of the sentence to make sense of it.
Attention uses three ingredients to compute those attention scores: Queries, Keys, and Values.
Query (Q) - It's like a question each word asks, as in: “Who should I pay attention to so I can understand myself better?”
Key (K) - It's like a tag each word carries about what it offers, as in: “Here's the info I hold.”
Value (V) - It's the actual information that can be shared, as in: “Here's my content, if you need it.”
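In the notation most papers use, all three pieces fit into one formula, the scaled dot-product attention from “Attention Is All You Need”:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

Here $d_k$ is just the size of the Key vectors; dividing by $\sqrt{d_k}$ keeps the scores from blowing up before the softmax. The example below walks through what each piece means.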
Explained with a simple example!
Let’s use the sentence:
“The cat sat on the mat.”
Focus on the word “sat”.
Now imagine “sat” becomes a Query. It asks,
“Who can help me understand what I mean here?”
All the other words — “The”, “cat”, “on”, “the”, “mat” — are turned into Key + Value pairs.
Now, “sat” will:
1. Compare its Query with the Keys of all the other words (using a dot product).
2. Get attention scores: higher when a Key matches the Query.
3. Pass the scores through Softmax to turn them into weights, like:
- cat -> 70%
- mat -> 20%
- others -> 10%
4. Use these weights to mix the Values of those words, giving “sat” a context-aware meaning.
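If you prefer code to prose, here is a minimal NumPy sketch of those four steps. Everything in it is a toy stand-in: the embeddings and projection matrices are random, not learned, so the printed weights won't match the 70/20/10 example above, but the mechanics are the same.

```python
import numpy as np

np.random.seed(0)
words = ["The", "cat", "sat", "on", "the", "mat"]
d = 4                                         # toy embedding size
emb = {w: np.random.randn(d) for w in words}  # random stand-in embeddings

W_q = np.random.randn(d, d)                   # query projection (random stand-in)
W_k = np.random.randn(d, d)                   # key projection
W_v = np.random.randn(d, d)                   # value projection

# "sat" asks its question (Query); every word offers a Key and a Value.
q_sat  = emb["sat"] @ W_q
keys   = np.stack([emb[w] @ W_k for w in words])
values = np.stack([emb[w] @ W_v for w in words])

# Steps 1 & 2: dot product of the Query with every Key -> raw attention scores.
scores = keys @ q_sat / np.sqrt(d)            # scaling keeps the softmax well-behaved

# Step 3: softmax turns the scores into weights that sum to 1.
weights = np.exp(scores) / np.exp(scores).sum()

# Step 4: mix the Values with those weights -> a context-aware vector for "sat".
sat_in_context = weights @ values

for w, a in zip(words, weights):
    print(f"{w:>4}: {a:.2f}")
```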
Why is it called Self-Attention?
Because every word goes through this same process, comparing itself with all the others (including itself). This mutual attention gives the model a deep, layered understanding of how words relate.
That’s the core idea behind self-attention — the building block of Transformers.
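Because every word plays the Query role, the whole computation collapses into a few matrix multiplications, which is how it is usually implemented. A minimal sketch, again with random stand-in embeddings and projections:

```python
import numpy as np

np.random.seed(0)
n, d = 6, 4                               # 6 words, 4-dim toy embeddings
X = np.random.randn(n, d)                 # one random embedding per word
W_q, W_k, W_v = (np.random.randn(d, d) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v       # every word gets a Query, Key, and Value
scores  = Q @ K.T / np.sqrt(d)            # n x n: every word scored against every word
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # row-wise softmax
out     = weights @ V                     # one context-aware vector per word
print(out.shape)                          # (6, 4)
```

One output row per word: each row is that word's mix of Values, weighted by how much it attended to everyone else.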
