"From Convolutions to Attention: Unpacking the Architecture Behind Modern AI"

Introduction
Transformers have revolutionized machine learning — powering models like BERT, GPT, T5, and even recent image models like ViT. But where did it all start?
This blog breaks down the 2017 paper “Attention Is All You Need” by Vaswani et al., the blueprint of modern NLP and generative AI.
What Makes This Paper Important?
Before Transformers, models like LSTMs and GRUs dominated sequence modeling. But they had limitations:
They couldn't be parallelized well during training, since each step depends on the previous hidden state
They struggled to capture long-range dependencies across a sequence
Transformers solved this using:
Self-Attention (no recurrence!)
Positional Encoding
Multi-Head Attention
Key Components (Simplified)
1. Self-Attention Mechanism
Each word looks at all other words in the sequence to understand context.
Instead of processing sequentially, it handles all positions in parallel.
Formula:
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) * V
Where:
Q = Queries
K = Keys
V = Values
dₖ = dimension of the key vectors (dividing by √dₖ keeps the dot products from growing too large)
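To make the formula concrete, here is a minimal NumPy sketch of scaled dot-product attention for a single sequence; the toy shapes and random inputs are illustrative choices, not values from the paper.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K: (seq_len, d_k); V: (seq_len, d_v)
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # pairwise similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over the keys
    return weights @ V                                   # weighted sum of value vectors

# Toy example: 3 tokens, d_k = d_v = 4
rng = np.random.default_rng(0)
Q, K, V = [rng.standard_normal((3, 4)) for _ in range(3)]
print(scaled_dot_product_attention(Q, K, V).shape)       # (3, 4)
```

Each row of the output is a context-aware mixture of the value vectors, weighted by how strongly that query matches every key.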
2. Multi-Head Attention
Improves learning by letting the model attend to information from different representation subspaces.
Instead of a single attention function, it runs several attention heads in parallel, then concatenates and projects their outputs.
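As a rough illustration (not the paper's exact implementation), the sketch below splits a single input X of shape (seq_len, d_model) into heads, with random matrices standing in for the learned projections W_q, W_k, W_v, and W_o.

```python
import numpy as np

def multi_head_attention(X, num_heads, W_q, W_k, W_v, W_o):
    # X: (seq_len, d_model); each W_*: (d_model, d_model)
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = []
    for h in range(num_heads):
        s = slice(h * d_head, (h + 1) * d_head)          # this head's slice of the subspace
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_head)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)               # softmax per query position
        heads.append(w @ V[:, s])
    return np.concatenate(heads, axis=-1) @ W_o          # concatenate heads, project back

# Toy usage: 5 tokens, d_model = 8, 2 heads; random matrices stand in for learned weights
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))
W_q, W_k, W_v, W_o = [0.1 * rng.standard_normal((8, 8)) for _ in range(4)]
print(multi_head_attention(X, 2, W_q, W_k, W_v, W_o).shape)   # (5, 8)
```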
3. Positional Encoding
- Since self-attention has no built-in notion of word order, the model adds sine/cosine positional encodings to the input embeddings to inject position information.
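The sinusoidal encoding can be written in a few lines; this sketch assumes an even d_model and returns a (seq_len, d_model) matrix that is added to the token embeddings.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(seq_len)[:, None]                    # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]                # even feature indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe                                            # added to the token embeddings

print(positional_encoding(seq_len=10, d_model=16).shape) # (10, 16)
```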
4. Feed-Forward Networks
- After each attention sub-layer, the model passes every position independently through a small two-layer dense network (a position-wise feed-forward network).
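A minimal sketch of that feed-forward block: two linear maps with a ReLU in between. The dimensions below are toy values; the paper uses d_model = 512 with an inner dimension of 2048.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # Two dense layers with a ReLU in between, applied to each position independently
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

# Toy usage: d_model = 8, inner dimension 32
rng = np.random.default_rng(1)
x = rng.standard_normal((5, 8))                          # 5 token positions
W1, b1 = 0.1 * rng.standard_normal((8, 32)), np.zeros(32)
W2, b2 = 0.1 * rng.standard_normal((32, 8)), np.zeros(8)
print(feed_forward(x, W1, b1, W2, b2).shape)             # (5, 8)
```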
5. Encoder–Decoder Architecture
Encoder: a stack of layers that turns the input sequence into contextual representations
Decoder: a stack of layers that generates the output (e.g., translated text) while attending to the encoder's representations
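To show how the pieces fit together, here is a stubbed-out sketch of the encoder-decoder data flow. The layer functions are placeholders; in the real model each one combines the attention and feed-forward blocks above with residual connections and layer normalization.

```python
import numpy as np

def encode(src, encoder_layers):
    x = src
    for layer in encoder_layers:                # each layer: self-attention + feed-forward
        x = layer(x)
    return x                                    # the "memory" the decoder attends to

def decode(tgt, memory, decoder_layers):
    y = tgt
    for layer in decoder_layers:                # masked self-attention, then attention over memory
        y = layer(y, memory)
    return y                                    # a final linear + softmax follows in the full model

# Placeholder layers just to show the wiring; real ones use the blocks sketched above
enc_layers = [lambda x: x for _ in range(2)]
dec_layers = [lambda y, m: y + m.mean(axis=0) for _ in range(2)]

rng = np.random.default_rng(2)
src = rng.standard_normal((7, 8))               # 7 source tokens, d_model = 8
tgt = rng.standard_normal((5, 8))               # 5 target tokens
memory = encode(src, enc_layers)
print(decode(tgt, memory, dec_layers).shape)    # (5, 8)
```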
Intuition
Imagine reading a sentence and deciding which word depends on which — that’s what attention layers do, but mathematically.
This shift allowed:
Parallel training across all positions in a sequence
Better modeling of long-range dependencies
Scaling to very large datasets and models (which paved the way for GPT and BERT)
Transformers are now used beyond NLP:
Vision Transformers (ViT)
Audio transformers
Multimodal AI
Final Thoughts
Even if you're not building Transformers from scratch yet, understanding how they work sets you apart as an ML practitioner. Reading and explaining complex research is a valuable skill — and this paper is the best place to start.
References
Vaswani, A., et al. "Attention Is All You Need." arXiv:1706.03762 [cs.CL], 2017.
Official paper link: https://arxiv.org/abs/1706.03762
Cardiff University Library permalink: https://librarysearch.cardiff.ac.uk/permalink/44WHELF_CAR/b7291a/cdi_arxiv_primary_1706_03762