Summary of Chapter 3 from Building LLMs from Scratch


If you’ve ever wondered how AI models like GPT-4 seem to “understand” language so well, the secret sauce is attention—and Chapter 3 of Sebastian Raschka’s Build a Large Language Model (From Scratch) dives deep into this powerful idea.

The chapter kicks off with a simple insight: not every word in a sentence matters equally when predicting the next word or figuring out meaning. Think of attention as an intelligent spotlight, allowing the model to dial up focus on more relevant words, and dial down the rest. Instead of treating every word the same (as old-school models might), attention weighs what’s important at each step.

Early on, you’ll learn about three main characters in the attention play: queries, keys, and values. Each word in a sentence gets mapped to these three vectors using trainable weight matrices. The big idea is that a word (the query) poses a question to every word in the sequence (the keys), and the resulting match scores (the attention scores) determine how much each word’s information (the value) should count.
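To make that concrete, here is a minimal sketch of the projections (the embedding and output sizes below are arbitrary toy choices of mine, not the book's exact settings):

```python
import torch

torch.manual_seed(123)

# Six tokens, each represented by a 3-dimensional embedding (toy sizes).
inputs = torch.rand(6, 3)

d_in, d_out = 3, 2  # input embedding size and projection size (illustrative)

# Trainable weight matrices; in a real model these are learned during training.
W_query = torch.nn.Parameter(torch.rand(d_in, d_out))
W_key   = torch.nn.Parameter(torch.rand(d_in, d_out))
W_value = torch.nn.Parameter(torch.rand(d_in, d_out))

queries = inputs @ W_query  # what each token is "asking" about
keys    = inputs @ W_key    # what each token offers to be matched against
values  = inputs @ W_value  # the information each token contributes
print(queries.shape, keys.shape, values.shape)  # each: torch.Size([6, 2])
```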

This all comes together in self-attention, where every word can peek at every other word, including itself, to build a stronger understanding of meaning. The technical part? You take each word’s query, dot it with every key, scale the scores, softmax them so they add up to one, and finally blend the words’ values based on those weights to produce a context vector for each word.

Step by step, the calculation goes like this (a short code sketch follows the list):

  1. Project each token’s embedding into query, key, and value vectors.

  2. For a given token, compute the dot product between its query and the keys of all tokens, producing raw scores.

  3. Divide these scores by the square root of the key dimensionality to keep the softmax gradients stable.

  4. Apply the softmax function to obtain normalized attention weights.

  5. Multiply each value vector by its corresponding attention weight and sum them, producing a context vector that fuses information from across the sequence based on learned relevance.
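Put together, those five steps amount to scaled dot-product attention. The sketch below (the shapes and names are my own, not lifted from the book) computes context vectors for every token at once:

```python
import torch

def self_attention(queries, keys, values):
    """Scaled dot-product self-attention over a whole sequence.

    queries, keys: (num_tokens, d_k); values: (num_tokens, d_v)
    Returns context vectors of shape (num_tokens, d_v).
    """
    d_k = keys.shape[-1]
    scores = queries @ keys.T            # step 2: query-key dot products
    scores = scores / d_k**0.5           # step 3: scale by sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)  # step 4: rows sum to one
    return weights @ values              # step 5: blend the value vectors

torch.manual_seed(123)
x = torch.rand(6, 4)                     # six 4-dimensional token embeddings
W_q, W_k, W_v = (torch.rand(4, 4) for _ in range(3))
context = self_attention(x @ W_q, x @ W_k, x @ W_v)  # step 1: project, then attend
print(context.shape)                     # torch.Size([6, 4])
```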

But attention doesn’t stop there. The book quickly steps up to “trainable” attention: those query, key, and value weight matrices aren’t just arbitrary—they’re learned during training to fit your language data, making the model flexible and powerful. Even cooler, you can choose different sizes for the output of this mixing, giving your model more expressive oomph.
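A compact way to express that trainable version, loosely in the spirit of the book's PyTorch modules (the class name and layer choices here are my own sketch), is to wrap the three projections in `nn.Linear` layers so their weights are learned with the rest of the model, and to let the output size differ from the input size:

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        # Trainable projections; d_out can differ from d_in for more (or less) capacity.
        self.W_query = nn.Linear(d_in, d_out, bias=False)
        self.W_key   = nn.Linear(d_in, d_out, bias=False)
        self.W_value = nn.Linear(d_in, d_out, bias=False)

    def forward(self, x):                  # x: (num_tokens, d_in)
        q, k, v = self.W_query(x), self.W_key(x), self.W_value(x)
        weights = torch.softmax(q @ k.T / k.shape[-1]**0.5, dim=-1)
        return weights @ v                 # (num_tokens, d_out)

sa = SelfAttention(d_in=3, d_out=2)
print(sa(torch.rand(6, 3)).shape)          # torch.Size([6, 2])
```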

One of the most genius tweaks is causal attention, which is absolutely vital for text generation. Imagine you’re writing a sentence: at any point, you should only know what’s already been written, not what comes next. Causal attention uses a clever mask to block the model from peeking ahead, ensuring the model is playing fair and not “cheating” by using future words to predict present ones. The mask itself is just a triangular matrix of ones and zeros that marks every position above the diagonal as off-limits; the scores for those future positions are pushed to negative infinity before the softmax, so their attention weights come out as zero.
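Here is one common way to implement that mask (a sketch; the book's exact code may differ in details):

```python
import torch

torch.manual_seed(123)
num_tokens = 5
scores = torch.rand(num_tokens, num_tokens)   # stand-in for query-key scores

# Upper-triangular mask: True above the diagonal = "future token, not allowed".
mask = torch.triu(torch.ones(num_tokens, num_tokens), diagonal=1).bool()
masked_scores = scores.masked_fill(mask, float("-inf"))

weights = torch.softmax(masked_scores, dim=-1)
print(weights)  # each row has nonzero weights only for itself and earlier tokens
```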

As you round out the chapter, you’ll learn that attention can run in multiple “heads” at once—a bit like having a team of detectives, each looking for different patterns or relationships in the sentence. The book even touches on ways to keep these computations efficient and scalable, which is crucial when you’re processing massive amounts of data.
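One simple way to picture multi-head attention is to run several independent single-head attention modules side by side and concatenate their outputs, as in the sketch below (the book ultimately uses a more efficient version that splits one large projection; the names here are my own):

```python
import torch
import torch.nn as nn

class AttentionHead(nn.Module):
    """One self-attention head with its own trainable Q/K/V projections."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W_q = nn.Linear(d_in, d_out, bias=False)
        self.W_k = nn.Linear(d_in, d_out, bias=False)
        self.W_v = nn.Linear(d_in, d_out, bias=False)

    def forward(self, x):
        q, k, v = self.W_q(x), self.W_k(x), self.W_v(x)
        weights = torch.softmax(q @ k.T / k.shape[-1]**0.5, dim=-1)
        return weights @ v

class MultiHeadAttentionWrapper(nn.Module):
    """Runs several heads in parallel and concatenates their outputs."""
    def __init__(self, d_in, d_out, num_heads):
        super().__init__()
        self.heads = nn.ModuleList(AttentionHead(d_in, d_out) for _ in range(num_heads))

    def forward(self, x):
        # Each head: (num_tokens, d_out) -> concatenated: (num_tokens, num_heads * d_out)
        return torch.cat([head(x) for head in self.heads], dim=-1)

mha = MultiHeadAttentionWrapper(d_in=3, d_out=2, num_heads=4)
print(mha(torch.rand(6, 3)).shape)  # torch.Size([6, 8])
```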

In short, Chapter 3 is the perfect introduction to why modern language models are so good at understanding and generating text. It takes you from simple intuitions to hands-on implementation, showing that attention isn’t just a buzzword—it’s the very heart of how large language models work their magic.

Thank you for reading,

Abhyut Tangri
