Decoding AI Jargons with Chai

Table of Contents
Transformer
Tokenization
Encoding
Decoding
Vocab Size
Vector Embeddings
Positional Encoding
Self Attention
Softmax
Temperature
Knowledge Cutoff
Transformer
What it is:
A Transformer is a model used in artificial intelligence, especially for understanding and generating language.
Imagine you are reading a sentence like:
“The cat sat on the mat. It was very soft.”
While reading the second sentence, we know that “It” refers to the cat because we remember the first sentence. The Transformer does the same thing: it looks at the whole content and understands the context of every word based on the words around it.
It is like a super reader which reads the whole book and remembers all the important connections between the words, no matter how far apart they are.
Why it is important:
Transformers are the backbone of AI models. They help machines understand language the way we humans do, simply by keeping context in mind.😊
Tokenization
It is a process of breaking down the text into smaller pieces called tokens, which are easier for a machine to understand and process.
For example:
"I love ice cream!"
We split it into pieces
["I", "love", "ice", "cream", "!"]
After that, we can encode these tokens into numerical IDs, which the computer understands.
So before a Transformer model does its magic, we need to feed it the proper ingredients, and the Transformer, like a chef, can cook something amazing for us.😊
Encoding
Encoding is the process of turning tokens into numbers—because as I said, machines don’t understand words, they understand numbers.
So far we have tokens:
["I", "love", "ice", "cream", "!"]
Now let's say we give each token a number like this:
"I" → 101
"love" → 202
"ice" → 303
"cream" → 404
So the sentence becomes:
[101, 202, 303, 404]
These numbers, after some further operations, are what get fed into the Transformer model.
So we are bridging the gap between human language and the machine, which only understands math.
There are tools and libraries for tokenization and encoding:
tiktoken (by OpenAI)
Hugging Face Tokenizers
SentencePiece (by Google)
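For example, here's a tiny sketch using tiktoken (a minimal example, assuming you have a recent version installed via pip install tiktoken):

```python
import tiktoken

# Load the tokenizer used by GPT-4o
enc = tiktoken.encoding_for_model("gpt-4o")

token_ids = enc.encode("I love ice cream!")
print(token_ids)  # a list of numerical IDs (the exact numbers depend on the tokenizer)
```

Note that real tokenizers often split text into sub-words, so the pieces and IDs won't match our simple word-based example above.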
Decoding
Decoding is the process of converting numbers (or tokens) back into human-readable text.
[101, 202, 303, 404, 505] turns back into
"I love ice cream!"
Vocab Size
Vocab size is the total number of unique tokens that a model knows about — basically, the "word library" the model can understand and use.
The model’s vocabulary = Dictionary
Vocab Size = The number of entries in that dictionary
GPT-4o has around 199,997 tokens in its vocabulary.
We need to balance between small and large vocab sizes: a small vocab size won't let the model understand or generate many words, while a large vocab size is harder to train and uses more memory.
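You can peek at a vocab size yourself with tiktoken (a minimal sketch, assuming o200k_base, the encoding GPT-4o uses):

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # GPT-4o's tokenizer
print(enc.n_vocab)  # roughly 200,000 entries in the model's "dictionary"
```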
Vector Embeddings
A vector embedding is a way to represent a word (or token) as a list of numbers that capture its meaning in a mathematical form.
Let’s say the word “cat” is tokenized as 101.
Now instead of just using 101, the model wants to understand what "cat" means — so it turns it into a vector embedding like:
"cat" → [0.27, -0.13, 0.91, 0.44, ..., 0.02]
This will be a long list of numbers (one per dimension), and each number captures some aspect of the word's meaning, like:
One number might relate to it being an animal
Another might relate to its cuteness
Another might place it near “dog” but far from “car” in meaning
Remember how I said earlier that Transformers help in keeping context? This is how they do it; this is how a word relates to other words.
So embeddings capture the meaning and relationships between words; this is how the AI understands relations. Similar words sit very close to each other, and you can even do math with them, for example:
king - man + woman = queen
Words like "cat", "dog", and "pet" live close together on the map.
Words like "car" and "banana" are in very different places.
It is essentially a multidimensional map that places each word by its real meaning.
Tools for embeddings:
OpenAI Embedding API (e.g., text-embedding-ada-002)
Hugging Face Transformers (model.get_input_embeddings())
spaCy and Gensim (for Word2Vec, FastText)
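As a hands-on illustration, here's a minimal sketch using Gensim with small pretrained GloVe vectors (the vectors download on first run, and the exact numbers you see may differ):

```python
import gensim.downloader as api

# Small pretrained word vectors: 50 numbers per word
wv = api.load("glove-wiki-gigaword-50")

print(wv["cat"][:5])                   # first 5 numbers of the "cat" vector
print(wv.similarity("cat", "dog"))     # high score: close together on the "map"
print(wv.similarity("car", "banana"))  # low score: far apart in meaning

# The famous analogy: king - man + woman ≈ queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```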
Positional Encoding
Positional encoding is how Transformers keep track of word order in a sentence, since they don't read words in sequence like humans or older models (such as RNNs) do.
The order matters a lot here, right?
"The cat chased the mouse" → makes sense
"The mouse chased the cat" → totally different meaning!
But here’s the catch:
Transformers don’t process words one-by-one in order.
They look at all words at the same time — like scanning a whole sentence in parallel.
So how does the Transformer know the order? This is where positional encoding comes to the rescue.
Positional encoding adds info like:
“This is the 1st word”
“This is the 2nd word”
…and so on.
But where?
Yes, you guessed it right: in the vector embeddings.
So the final input to a transformer becomes a blend of two things:
What the word means
Where the word is in the sentence
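One common recipe, from the original Transformer paper, builds the position info out of sine and cosine waves of different frequencies. Here's a minimal NumPy sketch with toy sizes (real models use far bigger dimensions):

```python
import numpy as np

def positional_encoding(seq_len, dim):
    # Even dimensions get sine waves, odd dimensions get cosine waves,
    # each at a different frequency, so every position gets a unique pattern
    positions = np.arange(seq_len)[:, None]                # (seq_len, 1)
    freqs = np.power(10000.0, np.arange(0, dim, 2) / dim)  # one frequency per pair
    pe = np.zeros((seq_len, dim))
    pe[:, 0::2] = np.sin(positions / freqs)
    pe[:, 1::2] = np.cos(positions / freqs)
    return pe

embeddings = np.random.rand(6, 8)                     # 6 tokens, 8-dim embeddings
model_input = embeddings + positional_encoding(6, 8)  # meaning + position, blended
```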
Self Attention
Self-attention is the mechanism that lets a Transformer model focus on different words in a sentence when processing each word, so it can understand context deeply.
Let’s continue with the same example
“The cat sat on the mat.”
When the model reads the word “sat”, it shouldn’t look at just “sat” in isolation.
It should think:
Who sat?
Where?
What’s the subject?
So it looks at:
"cat" (to know who sat)
"mat" (to know where it sat)
maybe even "the" (for structure)
This process of looking around the sentence and deciding which words to "pay attention" to is called self-attention.
Instead of just going left to right (like traditional models), self-attention lets each word:
Look at every other word
Assign a score for how important that word is for understanding the current word
Blend information together based on those scores
So while processing “sat”, the model might say:
“cat” is very important → weight: 0.9
“mat” is somewhat important → weight: 0.6
“the” is less important → weight: 0.2
Then it creates a weighted summary that helps it understand the meaning of "sat" in context.
Benefits:
It helps in capturing long-range relationships.
It enables parallel processing, which makes it faster.
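Here's a minimal NumPy sketch of the core computation, known as scaled dot-product attention (real models learn the projection matrices during training; here they're random, just to show the mechanics):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # subtract max for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

np.random.seed(0)
x = np.random.rand(6, 8)                               # 6 token embeddings, 8 dims each
Wq, Wk, Wv = (np.random.rand(8, 8) for _ in range(3))  # toy projection matrices

Q, K, V = x @ Wq, x @ Wk, x @ Wv  # queries, keys, values
scores = Q @ K.T / np.sqrt(8)     # how relevant every word is to every other word
weights = softmax(scores)         # each row sums to 1: the attention weights
output = weights @ V              # a weighted summary for every word
```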
Softmax
Softmax is a math function that turns a bunch of raw numbers into probabilities — a clean way of saying “how confident” the model is about each option.
Imagine the model is trying to guess the next word after this:
“The cat sat on the ___”
It thinks:
“mat” → score: 8.5
“sofa” → score: 6.2
“roof” → score: 3.9
“piano” → score: 2.1
These scores are called logits — they’re raw, unscaled, and not super useful yet.
Now comes Softmax — it takes those scores and transforms them into probabilities that add up to 1.0 (or 100%), like this:
“mat” → 0.72 (72% confident)
“sofa” → 0.17
“roof” → 0.08
“piano” → 0.03
Now the model can make a decision:
"I think ‘mat’ is the most likely next word."
Softmax is like the voting machine in the model’s brain.
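Here's a tiny NumPy sketch of softmax applied to the logits above (the percentages in the text are just illustrative; the real values depend on the gaps between the scores):

```python
import numpy as np

logits = np.array([8.5, 6.2, 3.9, 2.1])  # mat, sofa, roof, piano
probs = np.exp(logits) / np.exp(logits).sum()

for word, p in zip(["mat", "sofa", "roof", "piano"], probs):
    print(f"{word}: {p:.2f}")  # the four probabilities add up to 1.0
```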
Temperature
Temperature is a setting that controls how random or confident the model is when choosing the next word.
How Temperature Affects Output:
Temperature = 1.0
Normal randomness (default balance). The model picks based on the original probabilities.
Temperature < 1.0 (e.g. 0.2)
Less randomness. The model becomes more confident and mostly picks the highest-probability word (like “mat”). It plays it safe.
Temperature > 1.0 (e.g. 1.5 or 2.0)
More randomness. The model spreads out the probabilities, so lower-scoring options become more likely. It might say “throne” or “floor” just to be creative.
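Under the hood, temperature simply divides the logits before softmax. A minimal sketch:

```python
import numpy as np

def probs_with_temperature(logits, temperature):
    scaled = np.array(logits) / temperature  # temperature rescales the logits
    return np.exp(scaled) / np.exp(scaled).sum()

logits = [8.5, 6.2, 3.9, 2.1]                # mat, sofa, roof, piano
print(probs_with_temperature(logits, 0.2))   # almost all weight on "mat"
print(probs_with_temperature(logits, 1.0))   # the original probabilities
print(probs_with_temperature(logits, 2.0))   # weight spreads to the other words
```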
Knowledge Cutoff
A knowledge cutoff is the latest point in time covered by the data an AI model was trained on.
Anything that happened after that date? The model doesn’t know it unless it has access to the internet or gets manually updated.
Training huge models takes months and millions of dollars. You can’t just update them daily like a phone app.
Thanks for reading!!!