AI Terms Explained: A Beginner's Guide

Transformer
A transformer is a deep learning architecture introduced by Google in the 2017 paper “Attention Is All You Need”. It forms the backbone of LLMs like GPT, Llama, and Gemini.
It handles sequential data (like language) efficiently and in parallel, as opposed to earlier models like RNNs and LSTMs, which processed sequences one step at a time. It is based on:
- Next Word prediction
Transformer models operate on the principle of next-word prediction. Given an input, the model tries to find the most probable next word.
E.g.
The input is “The cat sat on the”.
The model calculates the probability distribution over its vocabulary. It picks the most probable next word — let’s say, "mat" — based on what it has learned from massive amounts of training data.
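This step can be sketched with made-up scores over a tiny candidate vocabulary (real models score every token in a vocabulary of tens or hundreds of thousands; all numbers below are illustrative):

```python
import math

# Hypothetical scores ("logits") a model might assign to candidate next words
# for the prompt "The cat sat on the" (all numbers are made up for illustration).
vocab = ["mat", "roof", "chair", "moon"]
logits = [4.0, 2.5, 2.0, 0.1]

# Softmax turns raw scores into a probability distribution over the candidates.
exps = [math.exp(x) for x in logits]
total = sum(exps)
probs = [e / total for e in exps]

# Greedy decoding: pick the highest-probability word.
next_word = vocab[probs.index(max(probs))]
print(next_word)  # mat
```

In practice models often sample from this distribution rather than always picking the top word, which is where temperature (covered below) comes in.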
- Self-Attention
Self-attention is the core innovation and power of transformers. It allows them to understand the meaning of the words in context, even across long distances.
e.g.
"The animal didn’t cross the street because it was too tired."
→ What does "it" refer to? The street? The animal? Self-attention lets the model link "it" back to "the animal".
This innovation leads to much better performance in text generation, translation, summarization, and more.
Tokenizer
Unlike humans, transformers understand numerical data rather than words. The user input needs to be converted to a numerical representation before it can be processed by the LLM. A tokenizer is a tool or algorithm that does the job of:
- Splitting the input text into tokens, i.e. usually words, subwords, or characters.
e.g.
Text: The food is in the plate.
Tokens: [“The”, “food”, “is”, “in”, “the”, “plate”]
- Mapping each token to a unique ID that the model can understand.
E.g.
The food is in the plate.
Each of these tokens is represented by a number, e.g. “The” is represented by 976, “food” by 4232, and so on.
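Both steps can be sketched with a toy word-level tokenizer. The IDs for "The" and "food" follow the example above; the rest are made up, and real tokenizers like BPE split text into subwords rather than whole words:

```python
# A toy word-level tokenizer (illustrative only: only the IDs for "The" and
# "food" match the example above; the other IDs are made up).
vocab = {"The": 976, "food": 4232, "is": 382, "in": 563, "the": 290, "plate": 14651}

def encode(text):
    # Map each whitespace-separated word to its unique ID.
    return [vocab[word] for word in text.split()]

print(encode("The food is in the plate"))
```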
Vocabulary size
The tokenizer has a vocabulary that it uses to convert text to numerical data; the number of entries in that vocabulary is the vocabulary size. Each model uses a different tokenizer, and thus has a different vocabulary and a different vocabulary size.
Vector embedding
An embedding is a dense vector representation of a token (word, subword, or character) that is used as input to a neural network. A vector embedding represents semantic meaning. Over the course of training, these embeddings learn to reflect semantic similarity:
"king" and "queen" → similar vectors
"dog" and "bark" → closer than "dog" and "car"
"run" and "ran" → closer due to contextual usage
These relationships are learned, i.e. not manually defined.
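The idea of "similar vectors" can be made concrete with cosine similarity. A minimal sketch with made-up 3-d embeddings (real embeddings have hundreds or thousands of dimensions, and the numbers below are purely illustrative):

```python
import math

# Toy 3-d embeddings (made-up numbers purely for illustration).
emb = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.85, 0.82, 0.15],
    "car":   [0.10, 0.20, 0.90],
}

def cosine(a, b):
    # Cosine similarity: 1.0 means same direction, 0.0 means unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine(emb["king"], emb["queen"]))  # close to 1.0
print(cosine(emb["king"], emb["car"]))    # noticeably lower
```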
Positional embeddings
A positional embedding adds information to each token embedding indicating the position of that token in the sequence.
E.g.
The Cat sat on the Rat
The Rat sat on the Cat
Both have the same tokens, but their positions change the semantics of the text completely. That is why positional embeddings are needed: the positional encoding for The Cat sat on the Rat is different from that for The Rat sat on the Cat.
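One common scheme is the sinusoidal positional encoding in the style of "Attention Is All You Need"; a minimal sketch, with `d_model=4` chosen small for readability (real models use hundreds of dimensions):

```python
import math

# Sinusoidal positional encoding (d_model=4 is an assumption for readability).
def positional_encoding(pos, d_model=4):
    pe = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe.append(math.sin(angle))  # even dimension
        pe.append(math.cos(angle))  # odd dimension
    return pe

# The same token at position 1 vs position 5 receives different signals,
# which is how "Cat" in slot 1 differs from "Cat" in slot 5.
print(positional_encoding(1))
print(positional_encoding(5))
```

These vectors are added to the token embeddings, so identical tokens at different positions end up with different inputs to the model.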
Encoder
The encoder transforms text into a sequence of numbers.
The process of converting text into tokens is known as encoding. Each model has a different token representation for the same text. The example below shows encoding using gpt-4o; a different model would generate different tokens.
```python
import tiktoken

encoder = tiktoken.encoding_for_model("gpt-4o")
print("Vocab size", encoder.n_vocab)

text = "The food is inside the plate."  # Example text to encode
tokens = encoder.encode(text)
print("Tokens:", tokens)
```
Output
```
Vocab size 200019
Tokens: [976, 4232, 382, 6772, 290, 14651]
```
Decoder
Decoding is the process of converting tokens back to text.
```python
import tiktoken

encoder = tiktoken.encoding_for_model("gpt-4o")
print("Vocab size", encoder.n_vocab)

text = "The food is inside the plate."  # Example text to encode
tokens = encoder.encode(text)
print("Tokens:", tokens)
print("Decoded text:", encoder.decode(tokens))
```
Output
```
Vocab size 200019
Tokens: [976, 4232, 382, 6772, 290, 14651]
Decoded text: The food is inside the plate.
```
Self-Attention
Tokens talk to each other to adjust their embeddings. Self-attention enables the model to determine the most important parts by looking at different parts of the sequence. The model calculates attention scores, which determine how much focus each token should receive when generating predictions.
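A minimal sketch of scaled dot-product attention over a tiny sequence (3 tokens with 2-d embeddings; all numbers are illustrative, and as a simplification the queries, keys, and values are the raw embeddings, whereas a real transformer learns separate Q, K, and V projection matrices):

```python
import math

# Tiny sequence: 3 tokens, each a 2-d embedding (made-up numbers).
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
d_k = len(tokens[0])

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    out = []
    for q in x:
        # Score this token against every token (including itself), scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in x]
        weights = softmax(scores)  # how much focus each token receives
        # New embedding: attention-weighted average of all token embeddings.
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d_k)])
    return out

updated = self_attention(tokens)
print(updated)
```

Each output row is a blend of all input embeddings, weighted by how relevant the other tokens are; this is the mechanism that lets "it" pull in information from "the animal".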
Multi-head Attention
Multi-head attention looks at different aspects of the sequence; for example, separate heads may focus on aspects such as:
What
When
Who
The vectors are split across multiple heads, and each head processes a segment of the embeddings independently, capturing different syntactic and semantic relationships.
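The splitting itself can be sketched as slicing the embedding (8 dimensions and 2 heads are made-up sizes for illustration; real models use far larger embeddings and many more heads):

```python
# Splitting one embedding across attention heads (sizes are illustrative).
embedding = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
num_heads = 2
head_dim = len(embedding) // num_heads

# Each head attends over its own slice of the embedding...
heads = [embedding[i * head_dim:(i + 1) * head_dim] for i in range(num_heads)]

# ...and the per-head results are concatenated back together afterwards.
merged = [value for head in heads for value in head]
print(heads)
```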
Temperature
Temperature determines randomness. It scales the logits before the softmax function, controlling how "creative" the output is.
temperature = 1: It has no effect on the softmax outputs.
temperature < 1: Lower temperature makes the model more confident and deterministic, leading to more predictable outputs.
temperature > 1: Higher temperature allows more randomness in the generated text – what some refer to as model “creativity”.
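The three cases above can be sketched by dividing the logits by the temperature before applying softmax (the logits below are made-up scores for three candidate tokens):

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    # Dividing by temperature sharpens (T < 1) or flattens (T > 1) the distribution.
    scaled = [x / temperature for x in logits]
    exps = [math.exp(x) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens

print(softmax_with_temperature(logits, 1.0))  # unchanged softmax
print(softmax_with_temperature(logits, 0.5))  # sharper: top token dominates
print(softmax_with_temperature(logits, 2.0))  # flatter: more randomness when sampling
```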
Softmax
Softmax helps answer the question:
“Given a list of scores, how likely is each option compared to the others?”
It converts a vector of K real numbers into a probability distribution of K possible outcomes.
In a neural network that predicts digits (0–9), the output might look like this:
Raw output: [1.5, 2.8, -0.3, 5.1, ..., 0.9]
Softmax: [0.01, 0.04, 0.005, 0.85, ..., 0.002]
So the model thinks:
“It’s probably a 3 with 85% confidence.”
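The digit example can be sketched end to end. The original raw-output list elides some values, so all ten scores below are made up for illustration; they are chosen so that digit 3 comes out on top, as in the example:

```python
import math

# Made-up scores for the digits 0-9 (illustrative only).
scores = [1.5, 2.8, -0.3, 5.1, 0.2, -1.0, 0.7, 1.1, -0.5, 0.9]

# Softmax: exponentiate each score, then normalize so they sum to 1.
exps = [math.exp(s) for s in scores]
total = sum(exps)
probs = [e / total for e in exps]

best = probs.index(max(probs))
print("Predicted digit:", best)  # digit 3 has the largest score, hence the largest probability
```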
Knowledge cut-off
The knowledge cutoff refers to the latest date up to which an AI model (like ChatGPT) has been trained on information from the internet, books, articles, etc.