Introduction to GenAI For Devs

Setting the context
This article is based on a task given by Chaicode for the first class of the "GenAI with Python v2" batch. The first class was an introduction to GenAI for developers. The "for Developers" part is important: in this article, we will explore the components of GenAI from the perspective of a developer who will build applications powered by GenAI, not from the perspective of a researcher who wants to delve into the depths of GenAI. So, if you want to understand the depths of transformers and similar concepts, this article is not for you.
Some Terminology
In AI, the rough equivalent of a word is a token. A token can represent a whole word, a subword, or even a single character. Just as a collection of words is called a sentence, a collection of tokens is called a sequence.
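To make this concrete, here is a minimal sketch using the tiktoken library (my choice for illustration; any tokenizer demonstrates the same idea):

```python
# pip install tiktoken
import tiktoken

# Load a tokenizer vocabulary (the one used by GPT-4-era models).
enc = tiktoken.get_encoding("cl100k_base")

sentence = "Hey I am good."
token_ids = enc.encode(sentence)             # the sequence: one integer per token
print(token_ids)
print([enc.decode([t]) for t in token_ids])  # the text piece behind each ID
```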
Introduction
GenAI (Generative AI) has become quite the buzzword today. Some view it as a groundbreaking technology, while others see it as nothing more than autocorrect on steroids. But what exactly is this "steroid"? It is the transformer architecture, which was a breakthrough in AI. In 2017, Google published a research paper titled "Attention Is All You Need," introducing the transformer architecture. This innovation laid the foundation for transformer models like BERT and GPT.
Why Did Google Create the Transformer Architecture?
You might think, "Google created it to develop their AI," but that’s not quite right. In fact, they developed the transformer architecture to improve Google Translate. Let me show you what I mean with an example.
If we convert the sentence “I miss you” into Hindi, word by word:
“I” becomes “मैं”
“miss” becomes “चूकना”
“you” becomes “आप”
This gives us "मैं तुम्हें चूकता हूँ," which doesn’t really convey the meaning of "I miss you" in Hindi.
That’s why Google developed the transformer: to help machines understand the meaning of a sentence rather than just translating words individually.
What is GenAI?
Generative AI (GenAI) refers to models that are designed to mimic human creativity. How do these models achieve this? They are trained on extremely large datasets, allowing them to learn patterns, styles, and structures from the data. Based on this pre-trained knowledge, the models can generate new content that resembles human-created work, including text, images, code, and more. There’s no need for an example here; everyone knows it, and everyone uses it.
Life before transformers
Earlier AI systems used rule-based logic to handle every possible scenario. These systems were rigid and could not handle the unpredictability of human language or creativity. Later, deep learning models marked a shift in AI by enabling machines to learn from data. The introduction of sequence models like RNNs (Recurrent Neural Networks) and LSTMs (Long Short-Term Memory networks) brought improvements in natural language tasks. However, there were still significant limitations, especially when working with longer text sequences. These gaps were later addressed with the introduction of transformers.
To learn more about the history of these models, refer to this article: https://medium.com/@kirudang/language-model-history-before-and-after-transformer-the-ai-revolution-bedc7948a130
The Breakthrough: Transformer
RNNs and LSTMs process tokens sequentially, whereas transformers can process all tokens in parallel. This parallel-processing capability generated significant profits for NVIDIA, since GPUs are designed for parallel execution, unlike CPUs, which execute tasks sequentially. It was also a significant loss for gamers, because NVIDIA stopped paying attention to gamers' needs.
The core of the transformer architecture is the concept of “attention,” which allows the model to weigh the importance of different words relative to each other, regardless of their position in a sentence.
Let’s understand it with an example. Consider the sentence:
"The cat sat on the mat because it was tired."
A traditional model might struggle to understand what "it" refers to, especially in longer sentences. However, with attention, the transformer can focus on "cat" as the important word linked to "it" and correctly interpret the meaning. This mechanism helps the model capture context and relationships between words, even when they are far apart in the text.
Anatomy of a Transformer
Above is the transformer architecture. On the left is the encoder, and on the right is the decoder. As an overview: we give the input to the encoder, which generates contextual embeddings. These embeddings are then passed to the decoder along with any previously generated output tokens. The decoder then tries to predict the next token for the given input.
Now let’s look at the components of the transformer.
Input Embedding
When you provide text input, it is first tokenized. This means the input sequence is broken down into tokens. Below is an image that shows the tokens for the phrase “Hey I am good.”
In the above screenshot, I had to choose a model. So, does that mean tokens are generated specifically for each LLM? Yes: for the same text, different models can produce different tokens. The reason is that different LLMs are trained with different vocabularies. A vocabulary is like a dictionary or database that assigns a specific number to a particular word, subword, or character. Based on this vocabulary, the model generates tokens. In the screenshot below, I have shown the tokens generated by the GPT-4o model for the same text.
In this screenshot, you can see that for the same text we get different tokens, because GPT-4o was trained on a different vocabulary than the previous model, Gemma3:7B. You can also see several other things, like "system," "user," and "assistant." These are chat roles given to the model: system prompts define how the model should respond in the current chat context, user messages are what the user asks, and assistant messages are the responses generated by the LLM. There are also separators between these messages. But that is not today’s topic.
Now we have generated tokens, but why do we need to generate tokens in the first place? Because LLMs do not understand words; they understand numbers.
Next, we convert the tokens into vector embeddings using embedding models, which helps provide semantic meaning to the tokens. This is the first step in the transformer architecture.
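To illustrate the idea, here is a toy embedding lookup in NumPy. The sizes and token IDs are made up, and the matrix is random here; in a real model it is learned during training:

```python
import numpy as np

vocab_size, d_model = 50_000, 512            # assumed sizes, for illustration
rng = np.random.default_rng(0)

# In a real LLM this matrix is learned; here it is random.
embedding_matrix = rng.normal(size=(vocab_size, d_model))

token_ids = [9906, 358, 1097, 1695]          # hypothetical IDs for "Hey I am good"
embeddings = embedding_matrix[token_ids]     # one 512-dimensional vector per token
print(embeddings.shape)                      # (4, 512)
```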
Positional Encodings
This is another vector that contains information about the position of each token in the sequence; think of it as metadata for the token. This vector is added to the input embeddings to give the model positional context, as transformers do not inherently understand word order.
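Here is a minimal sketch of the sinusoidal positional encoding described in "Attention Is All You Need" (one common choice; many models instead learn positional embeddings). It assumes an even d_model:

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encodings from "Attention Is All You Need"."""
    positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]         # (1, d_model / 2)
    angles = positions / (10000 ** (dims / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                     # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)                     # odd dimensions use cosine
    return pe

# The encoding is simply added to the input embeddings:
# x = embeddings + positional_encoding(embeddings.shape[0], embeddings.shape[1])
print(positional_encoding(4, 512).shape)             # (4, 512)
```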
Multi-Head Attention
First, we need to understand self-attention.
Self-attention is a mechanism that allows the model to understand relationships between tokens, enabling it to decide which parts of the input are most important when processing a particular token. In a sentence, certain key words or tokens can define the overall meaning; if a particular keyword changes, the meaning of the entire sentence can also change.
For example, in the phrase "The cat sat on the mat," when focusing on the word "sat," the model might pay more attention to "cat" (the subject) and "mat" (the location) than to the other words.
After positional encoding is applied to the input embeddings, the resulting vectors are compared with each other. Based on these comparisons, attention computes weighted sums of the vectors, where the weights depend on how relevant the other tokens are to the current token: weights are increased for relevant tokens and decreased for less relevant ones.
So far, this describes just a single attention head. Now imagine multiple attention heads working in parallel, each independently performing self-attention with its own learned projections. This combination forms multi-head attention, allowing the model to capture various types of relationships between tokens.
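Here is a minimal NumPy sketch of scaled dot-product self-attention and a naive multi-head wrapper. The projection matrices are random here (and the final output projection is omitted for brevity); in a real transformer they are all learned:

```python
import numpy as np

def self_attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                    # token-to-token relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ v                               # weighted sum of value vectors

def multi_head_attention(x, num_heads=4):
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    rng = np.random.default_rng(0)
    heads = []
    for _ in range(num_heads):
        # Each head gets its own projections (random here, learned in practice).
        wq, wk, wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        heads.append(self_attention(x @ wq, x @ wk, x @ wv))
    return np.concatenate(heads, axis=-1)            # (seq_len, d_model)

x = np.random.default_rng(1).normal(size=(6, 64))    # 6 tokens, d_model = 64
print(multi_head_attention(x).shape)                 # (6, 64)
```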
Add & Norm
"Add" refers to the residual connection, where the original input of a sub-layer is added to its output. Instead of just passing a sub-layer's output to the next layer, transformers add the sub-layer's input to its output. This preserves information from earlier layers, helps gradient flow during training, and mitigates vanishing gradients.
"Norm" refers to layer normalization, which is applied after the residual addition. It adjusts and stabilizes the output by normalizing the combined result, improving convergence and model stability during training.
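A compact sketch of this pattern, with a hand-rolled layer normalization (frameworks provide this built in, usually with learned scale and shift parameters, which are omitted here):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token's vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer_output):
    # Residual connection: add the sub-layer's input to its output,
    # then stabilize the result with layer normalization.
    return layer_norm(x + sublayer_output)

x = np.random.default_rng(0).normal(size=(6, 64))
attn_out = x * 0.5                                   # stand-in for a sub-layer's output
print(add_and_norm(x, attn_out).shape)               # (6, 64)
```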
Feed Forward
The feed-forward layer is a fully connected neural network applied independently to each token in the sequence. It enables the model to learn more complex patterns and relationships between tokens. The feed-forward network usually consists of two linear transformations with an activation function in between, which introduces non-linearity and improves the model's representational capacity.
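A sketch of that structure with random (normally learned) weights; the original paper used ReLU as the activation, while many modern models use GELU:

```python
import numpy as np

def feed_forward(x, w1, b1, w2, b2):
    """Two linear transformations with a ReLU in between,
    applied independently to each token (position)."""
    hidden = np.maximum(0, x @ w1 + b1)              # expand and apply non-linearity
    return hidden @ w2 + b2                          # project back down to d_model

d_model, d_ff = 64, 256                              # d_ff is typically ~4x d_model
rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
w2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

x = rng.normal(size=(6, d_model))                    # 6 tokens
print(feed_forward(x, w1, b1, w2, b2).shape)         # (6, 64)
```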
Types of Transformers
Based on the general transformer architecture, different types of transformer models were introduced, such as the encoder-only model BERT and the decoder-only model GPT. There are more variants than these two, but we will only look at these two in this article.
BERT (Bidirectional Encoder Representations from Transformers)
BERT is a transformer-based model that uses only the encoder part of the original Transformer architecture. It was developed by Google in 2018. BERT is designed for tasks that require understanding the context of input sequences.
GPT (Generative Pre-trained Transformer)
GPT is a transformer-based model that uses only the decoder part of the original Transformer architecture. It was developed by OpenAI, with the first version released in 2018, followed by GPT-2, GPT-3, and GPT-4. GPT is optimized for text generation, summarization, and dialogue systems.
How does GPT generate text?
A user provides an input, called a prompt, which is passed to GPT. GPT encodes the prompt and predicts the next most relevant token for the input sequence. The predicted token is appended to the input, and the model then predicts the following token, and so on. This process continues until a stopping condition is met or the maximum number of tokens is reached.
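Here is that loop as pseudocode-style Python. The `model` and `tokenizer` objects and their methods are hypothetical stand-ins, not a real library API:

```python
def generate(model, tokenizer, prompt, max_tokens=100):
    """Autoregressive decoding: predict one token, append it, repeat."""
    token_ids = tokenizer.encode(prompt)               # hypothetical tokenizer
    for _ in range(max_tokens):
        # The model scores every vocabulary token given the sequence so far;
        # greedy decoding simply picks the highest-scoring one.
        next_id = model.predict_next_token(token_ids)  # hypothetical model call
        if next_id == tokenizer.eos_id:                # stopping condition reached
            break
        token_ids.append(next_id)
    return tokenizer.decode(token_ids)
```

In practice, models sample from the predicted probability distribution (controlled by parameters like temperature) rather than always taking the single most likely token.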
Training and Inference
During training, the model learns patterns and relationships from massive datasets, including large portions of text from the internet. This process involves updating the model's internal weights using machine learning techniques, enabling it to predict the next word in a sequence based on previous context. Inference, on the other hand, is the stage after training. During inference, the model is presented with input it has never seen before and generates new content based on its learned knowledge. In this phase, the model’s weights remain fixed, and it simply applies its learned patterns to predict outputs for the given input.
Conclusion
In this class, I learned why transformers were introduced in the first place and how they revolutionized GenAI. I also gained a deeper understanding of the inner workings of transformers, from the moment I send a prompt to the model to the moment it generates a response. This journey covered key concepts like tokenization, embedding, attention, and decoding. By grasping these core ideas, I now have a clearer view of how modern GenAI models function behind the scenes.