Transformers and Attention Mechanisms: From Basics to GPTMini

Abhinav T B

Transformers have revolutionized natural language processing by using attention mechanisms to model long-range dependencies. In this post, we’ll journey from the origins of attention to building a mini GPT model (“GPTMini”) from scratch. We’ll start with how attention arose to overcome RNN limitations, then break down the Transformer architecture, including self-attention, positional encoding, multi-head attention, feed-forward layers, and residual connections, in simple terms. Finally, we’ll walk through implementing a GPTMini project with a custom tokenizer, WikiText dataset, Transformer decoder model, training loop, and text generation.

A Brief History: From RNNs to “Attention Is All You Need”

Recurrent Neural Networks (RNNs) and their gated variants (LSTMs) once dominated sequence modeling, but they struggled with long-range relationships due to sequential processing and vanishing gradients. In 2014, Bahdanau et al. introduced the attention mechanism to help RNN-based translation models focus on relevant parts of the input sequence at each decoding step. Instead of encoding an entire sentence into a single vector (a “context vector”) as in vanilla sequence-to-sequence models, attention allowed the decoder to dynamically look at (“attend to”) specific encoder states (words) when producing each output word. This alignment mechanism significantly improved translation quality by bypassing the fixed-length bottleneck of RNNs.

Fast-forward to 2017: Vaswani et al.'s paper "Attention Is All You Need" dispensed with recurrence altogether and proposed the Transformer – an architecture built entirely on attention layers and feed-forward networks. The Transformer's breakthrough was showing that you can achieve state-of-the-art results in translation (and soon many other tasks) using only attention mechanisms in a highly parallelizable model, making training faster than with RNNs. The Transformer introduced several key ideas we'll unpack below: self-attention, multi-head attention, positional encoding, and residual connections. These innovations enabled models to capture global context and long-term dependencies more effectively than RNNs.

Transformer Fundamentals: Attention and Friends

Before building our mini GPT, let’s understand the core concepts of the Transformer architecture in an approachable way.

Self-Attention: Paying Attention to Context

At the heart of Transformers is the self-attention mechanism. In simple terms, self-attention lets every token (word or subword) in a sequence look at every other token and decide how much attention to give to each. This means each word can gather useful information from the whole sequence when computing its representation, rather than only from nearby words or previous words. As a result, the model can capture relationships like long-distance dependencies (e.g. pronoun-reference links, subject-verb agreements) much better than an RNN, which processes words one by one.

How does self-attention work? Imagine we have a sentence, and we want to compute new representations for each word that include contextual information. The Transformer does this by creating three vectors for each token: a Query vector (Q), a Key vector (K), and a Value vector (V). You can think of this like an information retrieval system: each token’s Query is like a question, and each other token has a Key (an identifying summary) and a Value (the content info). To figure out how much one word should pay attention to another, we compute a similarity score between the Query of the first word and the Key of the second word (for all pairs). These scores (often scaled dot-products) form an attention matrix of size n×n (for n tokens). We then apply a softmax to turn these scores into attention weights that sum to 1 for each query word. Finally, each word’s output representation is computed as a weighted sum of all Value vectors, using these attention weights.

In essence, self-attention produces a context-aware representation for each token. If word A's Query aligns strongly with word B's Key, then word A's representation will include a lot of word B's information (Value). For example, in the sentence "The animal didn't cross the road because it was tired," the pronoun "it" might have a high attention weight on "animal", effectively letting "it" incorporate context from "animal" into its representation to clarify the reference.

This mechanism is powerful because it's order-agnostic (any token can attend to any other, regardless of position) and parallelizable (all token relationships can be computed simultaneously). It gives Transformers the flexibility to capture both local and long-distance relationships in a sequence. Notably, Vaswani et al. used a specific form called scaled dot-product attention: scores are computed as dot products of Q and K, scaled by \(\sqrt{d_k}\) (the key/query dimension) to stabilize gradients, and passed through a softmax to obtain the attention weights.
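Written out, this is the attention formula from the original paper:

\[\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V\]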

Positional Encoding: Adding Order to Sequences

One catch: since self-attention doesn't inherently know the positions of tokens (unlike RNNs, which process in order), Transformers need a way to inject sequence order information. This is done via positional encoding. In the original Transformer, a deterministic positional embedding is added to the token embeddings, built from sine and cosine functions of varying frequencies for each position. Intuitively, these positional encodings produce a unique pattern for each position, and the model can learn to infer relative positions from them (for example, the dot product between two encodings depends on their relative offset). In practice, some implementations use learned positional embeddings (trainable vectors for each position index) instead of the sinusoidal formulas; both approaches give the model a sense of word order.

The result is that the input to the Transformer is enriched with positional information: TokenEmbedding + PositionEmbedding for each token. This allows the self-attention layer to be aware of sequence ordering (so it doesn’t treat a sentence as a bag of words).
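For reference, here's a minimal PyTorch sketch of the sinusoidal scheme (a quick illustration only; GPTMini itself will use a learned position embedding table instead):

import torch

def sinusoidal_positional_encoding(max_seq_len, d_model):
    # pe[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # pe[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    positions = torch.arange(max_seq_len).unsqueeze(1).float()             # (max_seq_len, 1)
    div_term = torch.pow(10000.0, torch.arange(0, d_model, 2).float() / d_model)
    pe = torch.zeros(max_seq_len, d_model)
    pe[:, 0::2] = torch.sin(positions / div_term)
    pe[:, 1::2] = torch.cos(positions / div_term)
    return pe  # (max_seq_len, d_model), added to the token embeddings

# Example: encodings for a 128-token context with d_model = 384
pe = sinusoidal_positional_encoding(128, 384)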

Multi-Head Attention: Many Perspectives at Once

A single self-attention head (one set of Q, K, V projections) is great, but the Transformer improves on this with multi-head attention. Instead of computing one attention distribution, the Transformer uses multiple attention “heads” in parallel. Each head operates on the same input but with its own learned projection matrices, so it might focus on different aspects of the sequence. For example, one head could attend to syntactic relationships (e.g., noun-adjective), another to long-range dependencies, and yet another to anaphora resolution. The idea is that multiple heads allow the model to capture diverse types of relationships simultaneously.

In practice, if the model’s hidden size is \(d_{model}\), and we choose \(h\) heads, we project the input into \(h\) different subspaces of dimension \(d_k = d_{model}/h\). Each head yields its own attention output (size \(n \times d_v\); often \(d_v\) is chosen equal to \(d_k\)). We then concatenate the outputs of all heads (getting back \(n \times d_{model}\)), and pass it through a final linear layer so that multi-head attention still produces an output of the same dimension \(d_{model}\).

Formally, if \(head_i(Q,K,V)\) is the output of the \(i\)-th attention head, then:

\[MultiHead(Q,K,V) = Concat(head_1, ..., head_h) W^O\]

where \(W^O\) is a learned projection matrix.

Don’t worry if the details sound complex; the key takeaway is that each head can attend to different patterns or parts of the sequence, and the model merges this information. Empirically, this improves the model’s capacity and performance.

Feed-Forward Networks and Residual Connections

After the attention mechanism, each Transformer layer includes a simple position-wise feed-forward network (FFN). This is just an MLP applied to each token’s vector independently: typically a linear layer expanding the dimension (e.g., from \(d_{model}\) to some larger \(d_{ff}\)), a non-linearity (ReLU), then a linear layer back to \(d_{model}\). For example, in Vaswani et al., \(d_{model}=512\) and \(d_{ff}=2048\) – so a token’s 512-dim representation is converted to 2048-dim, non-linearly transformed, then back to 512-dim. This FFN gives the model extra transformation power at each position (learning higher-level features or mixing the information that attention collected). Importantly, because the feed-forward is applied to each position separately (no interactions across tokens), it does not destroy the contextual mixing done by attention; instead, it processes each token’s enriched representation in isolation. This also means it can be parallelized for all tokens.
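In equation form, the position-wise feed-forward network from the original paper is

\[\text{FFN}(x) = \max(0,\; xW_1 + b_1)\,W_2 + b_2\]

where \(W_1\) maps from \(d_{model}\) to \(d_{ff}\) and \(W_2\) maps back from \(d_{ff}\) to \(d_{model}\).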

Critically, Transformers employ residual connections (skip connections) and layer normalization around both the attention sub-layer and the FFN sub-layer. In each layer, the input to a sub-layer is added to that sub-layer's output ("adding the skip"), and layer normalization is applied either after the addition (post-norm, as in the original paper) or to the sub-layer's input (pre-norm, as in GPT-2 and in our GPTMini). These residual connections help gradients flow and allow the model to retain the original input features while the sub-layer learns a refinement. Layer normalization stabilizes training by normalizing the activations at each layer. In simple terms, the computation in one (pre-norm) Transformer layer looks like:

Self-Attention layer:

\[X_{\text{attn}} = \text{SelfAttn}(\text{LayerNorm}(X_{\text{in}}))\]

Then:

\[X_{\text{mid}} = X_{\text{in}} + X_{\text{attn}}\]

(This applies attention to normalized input, then adds it back to the input).

Feed-Forward layer:

\[X_{\text{ffn}} = \text{FFN}(\text{LayerNorm}(X_{\text{mid}}))\]

Then:

\[X_{\text{out}} = X_{\text{mid}} + X_{\text{ffn}}\]

This "add & normalize" structure is repeated in every layer. It allows deep stacking of layers without losing the original signal or causing training to diverge.

Transformer Architecture Recap

Putting it all together, the original Transformer architecture consists of an Encoder (stack of N identical layers, each with self-attention + FFN) and a Decoder (stack of N layers, each with self-attention, encoder-decoder attention, aka cross-attention, and FFN). The encoder-decoder attention in each decoder layer allows the decoder to attend to the encoder’s output (source sentence) in addition to attending to previous decoder tokens. This is crucial for tasks like translation, where the decoder needs to focus on the source sequence. In purely generative settings (like GPT language models), we don’t use an encoder at all; we only have decoder blocks with self-attention.

One important detail for decoder self-attention (used in language generation) is masking. The decoder operates autoregressively: when predicting the next token, it should not peek at later positions. Thus, decoder self-attention uses a causal mask to zero out attention weights to future tokens. This mask is a triangular matrix that prevents information flow from position j to i if j > i. In training, this masking ensures the model can’t trivially see the ground-truth next word; at inference, it naturally only attends to already generated tokens.
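As a quick illustration in PyTorch, here is the causal mask for a 4-token sequence; row \(i\) is the query position, and a 1 means that position is allowed to attend there:

import torch

T = 4
causal_mask = torch.tril(torch.ones(T, T)).bool()
print(causal_mask.int())
# tensor([[1, 0, 0, 0],
#         [1, 1, 0, 0],
#         [1, 1, 1, 0],
#         [1, 1, 1, 1]], dtype=torch.int32)
# Position 2 (third row) can attend to positions 0-2 but not to position 3.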

Now that we’ve covered the theory, let’s get our hands dirty with a practical implementation of these ideas in a mini GPT model.

Building GPTMini: An End-to-End Example

Our project, GPTMini, is a scaled-down GPT-like language model. We will outline each component: a custom Byte-Pair Encoding tokenizer, preparing the WikiText dataset, defining the Transformer (decoder-only) architecture in code, training the model, and finally generating text. The goal is to demystify how these pieces fit together in code, step by step.

Byte-Pair Encoding (BPE) Tokenizer

Real-world Transformer models don't typically operate on characters or whole words directly – they use subword tokenization. Byte-Pair Encoding (BPE) is a popular subword tokenization algorithm used by GPT-2, GPT-3, and many other models (BERT uses the closely related WordPiece algorithm). BPE was originally a data compression technique (Gage, 1994) in which the most frequent pair of bytes in a text is iteratively merged into a single symbol. In NLP tokenization, we start with all characters (or bytes) as initial tokens and repeatedly merge the most frequent adjacent token pair to form a new token until we reach a desired vocabulary size. This allows common multi-character sequences (like "ing" or "tion") to become one token, while rare sequences remain split into smaller pieces. The result is a vocabulary of subword units that balances vocabulary size against coverage of word parts, and it handles unknown words gracefully (any word can be composed from subword pieces).

In GPTMini, we implemented a custom BPE tokenizer from scratch. First, we gather the corpus text and initialize the vocabulary with all unique characters. Then we count character pair frequencies and merge the most frequent pair into a new token, update the corpus by replacing that pair everywhere, and repeat. For example, suppose the corpus has “low” and “lowest” frequently – BPE might merge “l” + “o” -> “lo”, then perhaps “lo” + “w” -> “low”, and eventually “low” + “est” -> “lowest”. After enough merges, common words or chunks become single tokens.

Here’s a simplified snippet (for illustration) of how one might learn BPE merges in Python:

from collections import Counter

def learn_bpe_merges(corpus_tokens, num_merges):
    # corpus_tokens is a list of lists of tokens (start with char tokens per word)
    vocab = {tok for tokens in corpus_tokens for tok in tokens}  # initial character vocabulary
    merges = []  # merge rules, recorded in the order they are learned
    for _ in range(num_merges):
        # Count all adjacent token pairs in the corpus
        pair_counts = Counter()
        for tokens in corpus_tokens:
            for i in range(len(tokens) - 1):
                pair_counts[(tokens[i], tokens[i + 1])] += 1
        if not pair_counts:
            break
        # Find the most frequent pair
        best_pair = max(pair_counts, key=pair_counts.get)
        # Merge the best pair in all token sequences
        new_token = best_pair[0] + best_pair[1]
        for i, tokens in enumerate(corpus_tokens):
            corpus_tokens[i] = merge_tokens(tokens, best_pair, new_token)
        # Record the merge rule and add the new token to the vocabulary
        merges.append(best_pair)
        vocab.add(new_token)
    return vocab, merges

# Helper to merge a specific pair in one token list
def merge_tokens(token_list, pair, merged_token):
    i = 0
    merged_list = []
    while i < len(token_list):
        if i < len(token_list)-1 and (token_list[i], token_list[i+1]) == pair:
            merged_list.append(merged_token)
            i += 2  # skip the merged pair
        else:
            merged_list.append(token_list[i])
            i += 1
    return merged_list

In practice, one would also handle spaces or use byte-level BPE (operating on raw bytes so that every possible character is covered), which is what GPT-2 did for robustness. After training BPE on our WikiText corpus, we save the merge rules and vocabulary. Then we can use them to encode any text into a sequence of token IDs and decode IDs back to text.
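For illustration, here's a minimal sketch of how encoding a single word might work once the merges are learned. It reuses the merge_tokens helper above and assumes the merge rules are stored as a list of pairs in learned order (as returned by learn_bpe_merges):

def bpe_encode_word(word, learned_merges):
    # Start from individual characters and replay the learned merges in order
    tokens = list(word)
    for pair in learned_merges:
        tokens = merge_tokens(tokens, pair, pair[0] + pair[1])
    return tokens

# Hypothetical example, assuming ("l","o"), ("lo","w"), ("e","r") were learned as merges:
# bpe_encode_word("lower", [("l", "o"), ("lo", "w"), ("e", "r")])  ->  ["low", "er"]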

For example, after training, encoding a word like “transformers” might yield tokens ["trans", "form", "ers"] if those subwords are in the vocab. The tokenizer can also handle special tokens as needed (such as a padding token or a GPT-style end-of-text token; GPT models don't use explicit start tokens).

Why BPE? It provides a balanced vocabulary where common words are one token and uncommon words are split into meaningful pieces. This drastically reduces out-of-vocabulary issues compared to word-level tokenization and is more efficient than character-level tokenization (which would make sequences very long). BPE is used by GPT-2 and GPT-3, for instance, to achieve a vocabulary of around 50,000 tokens that can encode text from a variety of languages.

Dataset Preparation with WikiText

For training GPTMini, we need a large text corpus. We chose the WikiText dataset, a popular benchmark for language modeling. WikiText-2 (the smaller version) contains about 2 million tokens of Wikipedia articles (extracted from “Good” and “Featured” quality articles), and WikiText-103 (the full version) has over 100 million tokens. These are clean, curated Wikipedia texts that are great for training a language model from scratch.

We downloaded the WikiText corpus and applied our BPE tokenizer to it, converting the text into a stream of token IDs. Since language models are usually trained to predict the next token in a sequence, we formatted the data as sequences of a fixed length (say, 128 tokens long). Each training sample is a snippet of the tokenized text (128 tokens), and the target is the same sequence shifted one position to the left (so at each position, the model learns to predict the next token). We slide this window through the entire corpus to generate many training examples. Alternatively, one can randomly sample chunks of text of length 128 each epoch.

Here’s how we can create training sequences in code:

token_ids = tokenizer.encode(full_text)  # convert entire text to list of token IDs
seq_length = 128
train_data = []
for i in range(0, len(token_ids) - seq_length):
    input_seq = token_ids[i : i+seq_length]
    target_seq = token_ids[i+1 : i+seq_length+1]
    train_data.append((input_seq, target_seq))

In practice, we wouldn’t explicitly materialize train_data as a list of all sequences (which could be huge); instead, we use a Dataset object that reads batches on the fly. For example, in PyTorch, you might create a torch.utils.data.Dataset that wraps the token list and yields slices of length 128. We also shuffle the data or randomize starting positions each epoch so the model doesn't always see the same fixed segmentation of the corpus.
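A minimal sketch of such a Dataset (assuming token_ids is the full list of token IDs produced by our tokenizer) might look like this:

import torch
from torch.utils.data import Dataset, DataLoader

class LMSequenceDataset(Dataset):
    def __init__(self, token_ids, seq_length=128):
        self.token_ids = token_ids
        self.seq_length = seq_length

    def __len__(self):
        # One (input, target) pair per valid starting position
        return len(self.token_ids) - self.seq_length

    def __getitem__(self, i):
        chunk = self.token_ids[i : i + self.seq_length + 1]
        x = torch.tensor(chunk[:-1], dtype=torch.long)  # input tokens
        y = torch.tensor(chunk[1:], dtype=torch.long)   # targets, shifted one position left
        return x, y

# train_loader = DataLoader(LMSequenceDataset(token_ids), batch_size=32, shuffle=True)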

After this preparation, we have a dataset of input-target pairs of token sequences. Now we’re ready to build the Transformer model that will learn to predict tokens.

The GPTMini Model: Transformer Decoder Architecture

GPTMini’s architecture is a Transformer decoder, essentially the same building blocks as the original Transformer’s decoder part, but possibly with fewer layers or smaller dimensions for simplicity. Conceptually, our model will do the following:

Take a sequence of input tokens, embed them, add positional encodings, pass them through a stack of Transformer decoder blocks (with self-attention and feed-forward sublayers), and output a probability distribution for the next token at each position. Because we train it with next-token prediction, when we feed a sequence, the model’s output at position t is trying to predict the token at t+1.

Let’s break down the components in code. We’ll use PyTorch-like pseudo-code for clarity:

Embeddings: We need an embedding layer for token IDs, and we’ll add positional encodings. If we use learnable position embeddings, we can just have a second embedding table for positions.

import math
import torch
import torch.nn as nn

class GPTMini(nn.Module):
    def __init__(self, vocab_size, d_model, max_seq_len, n_layers, n_heads, d_ff):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb   = nn.Embedding(max_seq_len, d_model)
        self.layers    = nn.ModuleList([
            DecoderBlock(d_model, n_heads, d_ff) for _ in range(n_layers)
        ])
        self.norm      = nn.LayerNorm(d_model)
        self.out_head  = nn.Linear(d_model, vocab_size)

Here DecoderBlock will be a class implementing one Transformer decoder layer (self-attention + feed-forward). We pass in n_heads and d_ff (hidden size of feed-forward) as well. We also have a final linear out_head layer that maps the final hidden state of each position to vocabulary logits for prediction.

Self-Attention Sub-layer: We implement this as part of the DecoderBlock. We need to compute multi-head masked self-attention. Let’s define a single attention head first, then generalize.

A simplified single-head attention (without batching for brevity) could be:

def single_head_attention(Q, K, V, mask=None):
    scores = Q @ K.T              # (T_q × T_k) raw attention scores
    scores = scores / math.sqrt(Q.size(1))  # scale by √d_k
    if mask is not None:
        scores.masked_fill_(mask == 0, float('-inf'))  # mask out future positions
    weights = scores.softmax(dim=-1)   # (T_q × T_k) attention weights
    return weights @ V                # weighted sum of values (T_q × d_v)

In a multi-head setting, we have n_heads different projection matrices for Q, K, V. In code, we often combine them for efficiency. For example, we project the input X (of shape [batch_size, T, d_model]) into Q, K, V of shapes [batch_size, T, d_model] each, then split those into [batch_size, n_heads, T, d_k]. Each head then performs the above operation in parallel. After that, we concatenate the heads and project back to d_model. Here’s how a multi-head self-attention forward might look:

class MultiheadSelfAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_k = d_model // n_heads
        self.n_heads = n_heads
        # Projection layers for Q, K, V (we use one big weight for all heads for each)
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        # Output projection
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, X, mask=None):
        B, T, C = X.shape  # batch, seq_length, d_model
        # Project inputs to multi-head Q, K, V
        Q = self.W_q(X)  # shape (B, T, C)
        K = self.W_k(X)
        V = self.W_v(X)
        # Split into heads: reshape to (B, T, n_heads, d_k) then transpose
        Q = Q.view(B, T, self.n_heads, self.d_k).transpose(1, 2)  # (B, n_heads, T, d_k)
        K = K.view(B, T, self.n_heads, self.d_k).transpose(1, 2)  # (B, n_heads, T, d_k)
        V = V.view(B, T, self.n_heads, self.d_k).transpose(1, 2)  # (B, n_heads, T, d_k)
        # Compute scaled dot-product attention for each head
        # (We transpose K to (B, n_heads, d_k, T) for matmul)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)  # (B, n_heads, T, T)
        if mask is not None:
            # mask has shape (T, T); broadcast it over the batch and head dimensions
            scores = scores.masked_fill(mask[None, None, :, :] == 0, float('-inf'))
        weights = torch.softmax(scores, dim=-1)  # (B, n_heads, T, T)
        attn_output = torch.matmul(weights, V)   # (B, n_heads, T, d_k)
        # Concatenate heads back together
        attn_output = attn_output.transpose(1, 2).contiguous().view(B, T, C)  # (B, T, d_model)
        return self.W_o(attn_output)  # final linear proj

Don’t worry if the tensor manipulations are a bit much; the idea is that we compute attention for each head in parallel and then recombine the results. The mask we pass in is a (T, T) tensor that is 0 (False) at positions we want to block (future tokens beyond the current timestep in the decoder) and 1 (True) where attention is allowed. This ensures the model only attends to positions ≤ the current position.

Decoder Block: Now we create the DecoderBlock using the attention and a feed-forward sub-layer:

class DecoderBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.self_attn = MultiheadSelfAttention(d_model, n_heads)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x, mask=None):
        # Self-attention sub-layer with residual connection
        attn_out = self.self_attn(self.ln1(x), mask=mask)
        x = x + attn_out
        # Feed-forward sub-layer with residual connection
        ff_out = self.feed_forward(self.ln2(x))
        x = x + ff_out
        return x

Each DecoderBlock applies layer norm to the input (ln1(x)), performs masked self-attention, adds it back, then layer norm again and feed-forward, and adds it back. We don’t show an encoder-decoder attention here because GPTMini is decoder-only (no encoder to attend to).

Now we can complete the GPTMini.forward method:

    def forward(self, idx):
        B, T = idx.shape  # idx is input token indices (batch, seq_length)
        # Create causal mask for attention (shape T×T, lower triangular ones)
        device = idx.device
        mask = torch.tril(torch.ones(T, T, device=device)).bool()  # True where allowed
        # Embed tokens and positions
        token_embeddings = self.token_emb(idx)            # (B, T, d_model)
        positions = torch.arange(0, T, device=device).unsqueeze(0)  # (1, T)
        pos_embeddings = self.pos_emb(positions)          # (1, T, d_model)
        x = token_embeddings + pos_embeddings             # combine token + position
        # Apply each Transformer decoder layer
        for layer in self.layers:
            x = layer(x, mask=mask)
        # Final layer norm (often used at end of Transformer decoder stack, e.g. GPT-2)
        x = self.norm(x)  # (B, T, d_model)
        # Output logits for each position and token in vocab
        logits = self.out_head(x)  # (B, T, vocab_size)
        return logits

A few things to note: we constructed a mask using torch.tril (lower-triangular matrix of ones) which will be broadcast to shape (B, n_heads, T, T) inside attention to mask out future positions. We add token and positional embeddings to get our initial x. After the Transformer layers, logits contains scores for each vocabulary token at each sequence position.

During training, we will call model(input_ids) to get logits, which has shape [batch_size, seq_length, vocab_size]. We then compare it to the target_ids (the ground-truth next tokens) to compute loss. Typically we use CrossEntropy loss applied to each position. In PyTorch, we’d flatten the predictions and targets to shape [batch_size*seq_length] when computing F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1)).

To recap, GPTMini’s architecture includes everything we discussed: token + positional embeddings, multi-head masked self-attention in each layer, residual connections with layer norm, feed-forward networks, and an output projection. It’s essentially a small GPT model.

Training Loop

With the model and data ready, we train GPTMini using a next-token prediction objective (language modeling). We iterate over the dataset of input-target sequences, and for each batch, do the following:

  1. Forward pass: Compute logits = model(input_ids).

  2. Loss computation: Compare logits to target_ids. We use cross-entropy loss, which is perfect for classification over the vocabulary for each position. In code: loss = F.cross_entropy(logits.view(-1, vocab_size), target_ids.view(-1)). This calculates the average negative log-likelihood of the correct tokens.

  3. Backward pass: optimizer.zero_grad(); loss.backward(); optimizer.step(). We typically use Adam or its weight-decay variant AdamW, often with a learning rate warmup schedule similar to the one in the original Transformer paper.

A simplified training loop:

import torch
import torch.nn.functional as F

model = GPTMini(vocab_size, d_model=384, max_seq_len=128, n_layers=6, n_heads=6, d_ff=1024)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

for epoch in range(num_epochs):
    for input_ids, target_ids in train_loader:
        logits = model(input_ids)              # (batch, T, vocab_size)
        loss = F.cross_entropy(logits.view(-1, vocab_size), target_ids.view(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f"Epoch {epoch} loss: {loss.item():.4f}")

We also validate the model on a held-out validation set (WikiText provides train/validation/test splits) to ensure it’s not overfitting and to track its performance, often measured in perplexity, which is \(\exp(\text{loss})\). Perplexity tells us how well the model predicts the held-out text on average: the lower, the better.
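A minimal evaluation sketch (assuming a val_loader built the same way as train_loader) could compute the average validation loss and perplexity like this:

@torch.no_grad()
def evaluate(model, val_loader):
    model.eval()
    total_loss, n_batches = 0.0, 0
    for input_ids, target_ids in val_loader:
        logits = model(input_ids)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), target_ids.view(-1))
        total_loss += loss.item()
        n_batches += 1
    model.train()
    avg_loss = total_loss / n_batches
    return avg_loss, math.exp(avg_loss)  # (validation loss, perplexity)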

Text Generation with GPTMini

Once trained, we can use GPTMini to generate text by sampling from the model’s output distribution iteratively. Here’s the general procedure for text generation (also known as autoregressive decoding):

  1. Start with a prompt (some initial text). Tokenize it to get input IDs.

  2. Feed the input through the model to get output logits for the next token.

  3. Convert logits to probabilities (e.g., apply softmax, possibly with a temperature to control randomness).

  4. Sample a token from the probability distribution (or pick the highest probability token for greedy decoding).

  5. Append the sampled token to the input sequence.

  6. Repeat steps 2–5 until you reach a desired length or an end-of-sequence token.

For example, suppose we want GPTMini to generate text after the prompt "Once upon a time":

prompt = "Once upon a time"
input_ids = tokenizer.encode(prompt)  # list of token IDs
temperature = 0.8  # sampling temperature (a typical choice; 1.0 = unscaled)
model.eval()
with torch.no_grad():
    for _ in range(50):  # generate 50 tokens
        # Use at most the last max_seq_len (128) tokens as context
        idx_tensor = torch.tensor(input_ids[-128:])[None, :]  # shape (1, seq_len)
        logits = model(idx_tensor)                    # forward pass
        next_token_logits = logits[0, -1, :]          # logits for the last position
        # Apply temperature scaling (and optionally top-k / top-p filtering):
        probs = torch.softmax(next_token_logits / temperature, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1).item()  # sample next token
        input_ids.append(next_id)

After this loop, input_ids will contain the original prompt tokens plus 50 newly generated tokens. We can then decode input_ids back to text with tokenizer.decode(input_ids).

To improve quality, one often uses strategies like top-k or top-p (nucleus) sampling to avoid low-probability tokens and encourage more plausible outputs. For instance, top-k sampling picks from the k most probable tokens (e.g., k=50) instead of the entire distribution, which reduces the chance of odd words. Top-p sampling chooses the smallest set of tokens whose cumulative probability exceeds a threshold p (e.g., 0.9), and samples from that set.
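As a small sketch (with a hypothetical k=50), top-k filtering can be applied to the logits before the softmax in the generation loop above:

def top_k_filter(logits, k=50):
    # Keep only the k largest logits; everything else becomes -inf (probability ~0 after softmax)
    topk_values, _ = torch.topk(logits, k)
    cutoff = topk_values[-1]  # smallest logit that survives
    return torch.where(logits < cutoff, torch.full_like(logits, float('-inf')), logits)

# Inside the generation loop, replace the softmax line with:
# probs = torch.softmax(top_k_filter(next_token_logits, k=50) / temperature, dim=-1)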

GPTMini, being small, won’t produce Shakespearean prose, but it will learn basic English structure and generate somewhat coherent sentences related to Wikipedia topics in the training data. For example, if prompted with “Once upon a time”, it might continue with something like “in a small village, a young girl named Alice was reading about kings and queens.” (The output will vary each time due to randomness in sampling.)

Putting It All Together

We’ve now gone from theoretical foundations to a working miniature GPT model:

  • Attention mechanisms enabled the model to handle dependencies without RNNs, examining all words at once.

  • The Transformer architecture uses stacked layers of multi-head self-attention and feed-forward networks with residual connections, allowing deep, parallel computation.

  • We built a BPE tokenizer to preprocess text into subword units, which is exactly how real GPT models handle open-vocabulary input.

  • We prepared the WikiText dataset as a training corpus, a standard benchmark with millions of tokens from Wikipedia.

  • We defined the GPTMini model in code, including masked multi-head self-attention and all the trimmings of a Transformer decoder.

  • We ran a training loop optimizing next-word prediction, and then used the trained model to generate text with a simple sampling strategy.

Key Takeaways and Future Directions

Transformers and attention mechanisms have transformed (pun intended) the field of NLP. By letting models attend to all parts of the input, we overcome the bottlenecks of older architectures and achieve greater parallelism and context integration. The self-attention mechanism is the backbone of modern large language models, enabling them to capture meaning across entire documents. In our GPTMini, we saw how these ideas materialize in code, from tokenization to model forward pass.

Going further, one could experiment with improvements and extensions: for example, adding Dropout in the model (the original Transformer used dropout in various places) for regularization, using adaptive softmax or tied input-output embeddings for efficiency, scaling up the model size, or pretraining on a larger corpus. Modern large language models also introduce techniques like much deeper and wider decoder stacks, learning rate warmups, gradient clipping, and more – but the essence remains the architecture we’ve built.

Transformers aren’t limited to text; they’re used in vision (ViT – Vision Transformer), audio, and multi-modal models. The versatility of attention mechanisms means they can be applied wherever we have sequential or set data that benefits from global interactions.

We hope this guided tour – from the intuition of attention to coding a mini GPT – has demystified how these powerful models work. With these fundamentals, you can better understand breakthroughs like GPT-3 and beyond, or even build your own Transformer models for fun and learning. Happy experimenting with attention and Transformers; sometimes a little attention goes a long way!
