AI Terms Explained: A Beginner's Guide

Transformer
A transformer is a deep learning architecture introduced by Google in the 2017 paper “Attention Is All You Need”. It forms the backbone of LLMs like GPT, Llama, and Gemini.
It handles sequential data (like language) efficiently and in parallel, as opposed to earlier models like RNNs and LSTMs, which processed sequences one step at a time. It is based on:
- Next Word prediction
Transformer models operate on the principle of next-word prediction. Given an input, the model tries to find the most probable next word.
E.g.
The input is “The cat sat on the”.
The model calculates the probability distribution over its vocabulary. It picks the most probable next word — let’s say, "mat" — based on what it has learned from massive amounts of training data.
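This step can be sketched with made-up scores over a tiny candidate vocabulary (real models score every token in a vocabulary of tens or hundreds of thousands; all numbers below are illustrative):

```python
import math

# Hypothetical scores ("logits") a model might assign to candidate next words
# for the prompt "The cat sat on the" (all numbers are made up for illustration).
vocab = ["mat", "roof", "chair", "moon"]
logits = [4.0, 2.5, 2.0, 0.1]

# Softmax turns raw scores into a probability distribution over the candidates.
exps = [math.exp(x) for x in logits]
total = sum(exps)
probs = [e / total for e in exps]

# Greedy decoding: pick the highest-probability word.
next_word = vocab[probs.index(max(probs))]
print(next_word)  # mat
```

In practice models often sample from this distribution rather than always picking the top word, which is where temperature (covered below) comes in.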
- Self-Attention
Self-attention is the core innovation and power of transformers. It allows them to understand the meaning of the words in context, even across long distances.
e.g.
"The animal didn’t cross the street because it was too tired."
→ What does "it" refer to? The street? The animal? Self-attention lets the model link "it" back to "the animal".
This innovation leads to much better performance in text generation, translation, summarization, and more.
Tokenizer
Unlike humans, transformers understand numerical data rather than words. The user input needs to be converted to a numerical representation before it can be processed by the LLM. A tokenizer is a tool or algorithm that does the job of:
- Splitting the input text into tokens, i.e. usually words, subwords, or characters.
e.g.
Text: The food is in the plate.
Tokens: [“The”, “food”, “is”, “in”, “the”, “plate”]
- Mapping each token to a unique ID that the model can understand.
E.g.
The food is in the plate.
Each of these tokens is represented by a number, e.g. “The” is represented by 976, “food” by 4232, and so on.
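Both steps can be sketched with a toy word-level tokenizer. The IDs for "The" and "food" follow the example above; the rest are made up, and real tokenizers like BPE split text into subwords rather than whole words:

```python
# A toy word-level tokenizer (illustrative only: only the IDs for "The" and
# "food" match the example above; the other IDs are made up).
vocab = {"The": 976, "food": 4232, "is": 382, "in": 563, "the": 290, "plate": 14651}

def encode(text):
    # Map each whitespace-separated word to its unique ID.
    return [vocab[word] for word in text.split()]

print(encode("The food is in the plate"))
```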
Vocabulary size
The tokenizer has a vocabulary that it uses to convert text to numerical data; the number of entries in that vocabulary is the vocabulary size. Each model uses a different tokenizer, and thus has a different vocabulary and a different vocabulary size.
Vector embedding
An embedding is a dense vector representation of a token (word, subword, or character) that is used as input to a neural network. A vector embedding represents semantic meaning. Over the course of training, these embeddings learn to reflect semantic similarity:
"king" and "queen" → similar vectors
"dog" and "bark" → closer than "dog" and "car"
"run" and "ran" → closer due to contextual usage
These relationships are learned, i.e. not manually defined.
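The idea of "similar vectors" can be made concrete with cosine similarity. A minimal sketch with made-up 3-d embeddings (real embeddings have hundreds or thousands of dimensions, and the numbers below are purely illustrative):

```python
import math

# Toy 3-d embeddings (made-up numbers purely for illustration).
emb = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.85, 0.82, 0.15],
    "car":   [0.10, 0.20, 0.90],
}

def cosine(a, b):
    # Cosine similarity: 1.0 means same direction, 0.0 means unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine(emb["king"], emb["queen"]))  # close to 1.0
print(cosine(emb["king"], emb["car"]))    # noticeably lower
```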
Positional embeddings
A positional embedding adds information to each token embedding indicating the position of that token in the sequence.
E.g.
The Cat sat on the Rat
The Rat sat on the Cat
Both have the same tokens, but their positions change the semantics of the text completely. That is why positional embeddings are needed: the positional encoding for The Cat sat on the Rat is different from that for The Rat sat on the Cat.
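One common scheme is the sinusoidal positional encoding in the style of "Attention Is All You Need"; a minimal sketch, with `d_model=4` chosen small for readability (real models use hundreds of dimensions):

```python
import math

# Sinusoidal positional encoding (d_model=4 is an assumption for readability).
def positional_encoding(pos, d_model=4):
    pe = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe.append(math.sin(angle))  # even dimension
        pe.append(math.cos(angle))  # odd dimension
    return pe

# The same token at position 1 vs position 5 receives different signals,
# which is how "Cat" in slot 1 differs from "Cat" in slot 5.
print(positional_encoding(1))
print(positional_encoding(5))
```

These vectors are added to the token embeddings, so identical tokens at different positions end up with different inputs to the model.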
Encoder
The encoder transforms text into a sequence of numbers.
The process of converting text into tokens is known as encoding. Each model has a different token representation for the same text. The example below shows encoding using gpt-4o; a different model would generate different tokens.
```python
import tiktoken

encoder = tiktoken.encoding_for_model("gpt-4o")
print("Vocab size", encoder.n_vocab)

text = "The food is inside the plate."  # Example text to encode
tokens = encoder.encode(text)
print("Tokens:", tokens)
```
Output
```
Vocab size 200019
Tokens: [976, 4232, 382, 6772, 290, 14651]
```
Decoder
Decoding is the process of converting tokens back to text.
```python
import tiktoken

encoder = tiktoken.encoding_for_model("gpt-4o")
print("Vocab size", encoder.n_vocab)

text = "The food is inside the plate."  # Example text to encode
tokens = encoder.encode(text)
print("Tokens:", tokens)
print("Decoded text:", encoder.decode(tokens))
```
Output
```
Vocab size 200019
Tokens: [976, 4232, 382, 6772, 290, 14651]
Decoded text: The food is inside the plate.
```
Self-Attention
Tokens talk to each other to adjust their embeddings. Self-attention enables the model to determine the most important parts by looking at different parts of the sequence. The model calculates attention scores, which determine how much focus each token should receive when generating predictions.
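A minimal sketch of scaled dot-product attention over a tiny sequence (3 tokens with 2-d embeddings; all numbers are illustrative, and as a simplification the queries, keys, and values are the raw embeddings, whereas a real transformer learns separate Q, K, and V projection matrices):

```python
import math

# Tiny sequence: 3 tokens, each a 2-d embedding (made-up numbers).
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
d_k = len(tokens[0])

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    out = []
    for q in x:
        # Score this token against every token (including itself), scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in x]
        weights = softmax(scores)  # how much focus each token receives
        # New embedding: attention-weighted average of all token embeddings.
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d_k)])
    return out

updated = self_attention(tokens)
print(updated)
```

Each output row is a blend of all input embeddings, weighted by how relevant the other tokens are; this is the mechanism that lets "it" pull in information from "the animal".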
Multi-head Attention
Multi-head attention looks at different aspects of the sequence; for example, separate heads may focus on aspects such as:
What
When
Who
The vectors are split across multiple heads, and each head processes a segment of the embeddings independently, capturing different syntactic and semantic relationships.
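The splitting itself can be sketched as slicing the embedding (8 dimensions and 2 heads are made-up sizes for illustration; real models use far larger embeddings and many more heads):

```python
# Splitting one embedding across attention heads (sizes are illustrative).
embedding = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
num_heads = 2
head_dim = len(embedding) // num_heads

# Each head attends over its own slice of the embedding...
heads = [embedding[i * head_dim:(i + 1) * head_dim] for i in range(num_heads)]

# ...and the per-head results are concatenated back together afterwards.
merged = [value for head in heads for value in head]
print(heads)
```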
Temperature
Temperature determines randomness. It scales the logits before the softmax function, controlling how "creative" the output is.
temperature = 1: It has no effect on the softmax outputs.
temperature < 1: Lower temperature makes the model more confident and deterministic, leading to more predictable outputs.
temperature > 1: Higher temperature allows more randomness in the generated text – what some refer to as model “creativity”.
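The three cases above can be sketched by dividing the logits by the temperature before applying softmax (the logits below are made-up scores for three candidate tokens):

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    # Dividing by temperature sharpens (T < 1) or flattens (T > 1) the distribution.
    scaled = [x / temperature for x in logits]
    exps = [math.exp(x) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens

print(softmax_with_temperature(logits, 1.0))  # unchanged softmax
print(softmax_with_temperature(logits, 0.5))  # sharper: top token dominates
print(softmax_with_temperature(logits, 2.0))  # flatter: more randomness when sampling
```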
Softmax
Softmax helps answer the question:
“Given a list of scores, how likely is each option compared to the others?”
It converts a vector of K real numbers into a probability distribution of K possible outcomes.
In a neural network that predicts digits (0–9), the output might look like this:
Raw output: [1.5, 2.8, -0.3, 5.1, ..., 0.9]
Softmax: [0.01, 0.04, 0.005, 0.85, ..., 0.002]
So the model thinks:
“It’s probably a 3 with 85% confidence.”
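The digit example can be sketched end to end. The original raw-output list elides some values, so all ten scores below are made up for illustration; they are chosen so that digit 3 comes out on top, as in the example:

```python
import math

# Made-up scores for the digits 0-9 (illustrative only).
scores = [1.5, 2.8, -0.3, 5.1, 0.2, -1.0, 0.7, 1.1, -0.5, 0.9]

# Softmax: exponentiate each score, then normalize so they sum to 1.
exps = [math.exp(s) for s in scores]
total = sum(exps)
probs = [e / total for e in exps]

best = probs.index(max(probs))
print("Predicted digit:", best)  # digit 3 has the largest score, hence the largest probability
```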
Knowledge cut-off
The knowledge cutoff refers to the latest date up to which an AI model (like ChatGPT) has been trained on information from the internet, books, articles, etc.