Gen-AI Pilot

Generative AI (GenAI) is a type of artificial intelligence that can create new content—like text, images, audio, or code—by learning patterns from existing data. It powers tools like ChatGPT, DALL·E, and GitHub Copilot, enabling machines to generate human-like outputs for creative and practical tasks.
GPT
GPT (Generative Pre-trained Transformer) is an advanced AI model developed by OpenAI. It understands and generates human-like text using deep learning. GPT is pre-trained on large text datasets and then fine-tuned for specific tasks like answering questions, writing code, or summarising content.
Full Form: Generative Pre-trained Transformer
Definition: A neural network-based language model that generates coherent and contextually relevant text by predicting the next word in a sequence.
G – Generative:
Refers to the model's ability to generate new content such as text, answers, stories, or code based on learned patterns.
P – Pre-trained:
Indicates that the model is initially trained on a massive dataset of text from the internet before being fine-tuned for specific tasks.
T – Transformer:
A type of neural network architecture that enables the model to understand context and relationships in language using attention mechanisms.
Together, GPT is a model that generates text using pre-learned knowledge via the transformer architecture.
🔷 What Are Transformers?
Transformers are a deep learning architecture introduced in the paper “Attention is All You Need” (Vaswani et al., 2017). They revolutionized how machines understand language, becoming the backbone of models like GPT, BERT, and T5.
🧠 Key Concepts of Transformers
1. Input Representation
Text is first converted into numerical vectors using tokenization and embeddings.
These embeddings are passed through the model.
2. Self-Attention Mechanism
This is the core innovation.
Each word in a sentence attends to every other word to understand context.
For example, in the sentence “The cat sat on the mat”, the model understands that "cat" is the subject of "sat", even if they are not adjacent.
🔹 Attention scores decide how much importance to give to other words (see the sketch after this list).
3. Multi-Head Attention
Instead of a single attention mechanism, the model uses multiple “heads” in parallel.
Each head captures different types of relationships (e.g., subject-verb, adjective-noun, etc.).
4. Positional Encoding
Since transformers don't process input sequentially (like RNNs), they need positional information.
Positional encodings are added to word embeddings to tell the model the order of words.
5. Feed-Forward Neural Networks
After attention, the output goes through fully connected layers (like in traditional neural nets) for non-linear transformation.
6. Layer Normalization and Residual Connections
These techniques stabilize training and help preserve information from earlier layers.
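To make the self-attention step (point 2 above) concrete, here is a minimal NumPy sketch of scaled dot-product attention. The token vectors and projection matrices are random stand-ins rather than values from a real trained model:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy setup: 3 tokens, embedding size 4 (random stand-in values)
X = np.random.randn(3, 4)

# In a real model these projection matrices are learned during training
Wq, Wk, Wv = np.random.randn(4, 4), np.random.randn(4, 4), np.random.randn(4, 4)
Q, K, V = X @ Wq, X @ Wk, X @ Wv

scores = Q @ K.T / np.sqrt(K.shape[-1])   # attention scores between every pair of tokens
weights = softmax(scores, axis=-1)        # each row sums to 1: how much a token attends to the others
output = weights @ V                      # context-aware representation for each token

print(weights.round(2))   # the attention pattern
print(output.shape)       # (3, 4): one contextual vector per token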
📦 Transformer Architecture (Basic Blocks)
Encoder: Converts input into meaningful representations (used in BERT).
Decoder: Generates output from representations (used in GPT).
Encoder-Decoder: Used in translation tasks (e.g., T5, original Transformer).
🚀 Advantages of Transformers
Parallel processing (faster than RNNs/LSTMs)
Long-range context handling
Scalable and efficient
Foundation for LLMs (Large Language Models)
This diagram illustrates how a Transformer model (especially a GPT model) works in two different contexts:
🔹 Top Part: Basic Transformer for Translation
Input: "Hello"
Transformer Model: Takes the entire input and produces an output in another language.
Output: "Namaste"
This represents a sequence-to-sequence task like machine translation, where the model translates a full sentence from one language to another.
🔹 Bottom Part: GPT (Generative Pretrained Transformer)
This shows how GPT predicts text one token at a time, auto-regressively:
Input: "Hello my name is Pi"
GPT predicts the next token: "y" → forming "Piy"
Then it takes "Hello my name is Piy" and predicts the next token: "u" → "Piyu"
Again, it continues with "Hello my name is Piyu" → predicts "s" → "Piyus"
This loop continues token-by-token, using previous output as part of the next input, which is the core of GPT's autoregressive generation.
🧠 Key Concept:
GPT doesn't see the whole sentence in advance like translation models.
Instead, it generates output step-by-step, always guessing the next word or character based on everything generated so far.
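To make this loop concrete, here is a small sketch of autoregressive generation. generate and predict_next_token are made-up names for illustration, not real API calls; the toy "model" below simply replays a fixed continuation so the mechanics are visible:

# Hypothetical sketch: predict_next_token stands in for one forward pass of a real model.
def generate(prompt, predict_next_token, max_new_tokens=10, stop_token="<eos>"):
    text = prompt
    for _ in range(max_new_tokens):
        next_token = predict_next_token(text)   # the model sees everything generated so far
        if next_token == stop_token:
            break
        text += next_token                      # the output becomes part of the next input
    return text

# Toy "model" that just replays a fixed continuation, one character-sized token at a time
continuation = iter("yus")
print(generate("Hello my name is Pi", lambda text: next(continuation, "<eos>")))
# -> Hello my name is Piyus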
✅ Summary:
| Section | Purpose | Behavior |
| --- | --- | --- |
| Top | Transformer for translation | Takes full input → gives full translated output |
| Bottom | GPT model (transformer) | Predicts next token step-by-step (autoregressive) |
Original Transformer Architecture:
The original Transformer was designed with machine translation in mind (the kind of task behind Google Translate): you give the model a full sentence in one language as input, and it produces the equivalent sentence in another language.
This image shows the original Transformer architecture from the paper "Attention Is All You Need" and explains how transformers process sequences. Let's break it down in simple parts:
🧠 What You’re Seeing
This is an Encoder-Decoder Transformer, useful for tasks like translation (e.g., English to Hindi). It has two main parts:
🔷 1. Encoder (Left Side):
Takes the input sentence (e.g., “Hello, how are you”).
Passes it through multiple layers (N× times, typically 6 or more). Each layer has:
Multi-Head Attention: Looks at all words in the input to understand context.
Feed Forward Layer: Transforms the info.
Add & Norm: Ensures stability in learning (like normalizing + residual connection).
🔷 2. Decoder (Right Side):
Generates the output sentence, one token at a time.
Each layer also repeats N× times and has three parts:
Masked Multi-Head Attention: Looks only at previous tokens, to prevent cheating by seeing future words (see the sketch after this list).
Multi-Head Attention: Looks at the encoder's outputs (context from the input sentence).
Feed Forward and Add & Norm (same as in the encoder).
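Here is a small NumPy sketch of how the masking works: scores for future positions are set to negative infinity before the softmax, so they receive zero attention weight. This is an illustration of the idea, not code from any particular library:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len = 4
scores = np.random.randn(seq_len, seq_len)   # raw attention scores (random stand-ins)

# Causal mask: True above the diagonal marks the "future" positions to hide
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores = np.where(mask, -np.inf, scores)     # future positions get -infinity ...
weights = softmax(scores, axis=-1)           # ... so their attention weight becomes 0

print(weights.round(2))   # each row attends only to itself and earlier positions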
🔑 Additional Components
➕ Positional Encoding:
Since transformers have no concept of word order, we add positional information to embeddings.
➕ Input/Output Embeddings:
These convert words into vectors that the model can understand.
➕ Linear + Softmax:
The final output layer:
Linear: Converts decoder output to vocabulary-size vector.
Softmax: Turns that vector into probabilities, from which the most likely next word is chosen.
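A tiny NumPy sketch of this final step, using made-up sizes (hidden size 8, a toy vocabulary of 5 words) just to show the shapes involved:

import numpy as np

vocab = ["hello", "world", "how", "are", "you"]    # toy vocabulary

hidden = np.random.randn(8)          # decoder output for one position (hidden size 8)
W = np.random.randn(8, len(vocab))   # linear projection to vocabulary size

logits = hidden @ W                             # Linear: hidden vector -> vocab-size vector
probs = np.exp(logits) / np.exp(logits).sum()   # Softmax: scores -> probabilities

print(dict(zip(vocab, probs.round(2))))
print("Predicted next word:", vocab[int(np.argmax(probs))])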
✅ Summary Table
| Component | Purpose |
| --- | --- |
| Encoder | Understands the input sentence |
| Decoder | Generates the output sentence |
| Multi-Head Attention | Helps the model focus on different parts of the sentence |
| Masked Attention | Prevents the decoder from seeing future tokens |
| Positional Encoding | Adds word position info |
| Add & Norm | Stabilizes training |
| Feed Forward | Adds learning capacity |
🆚 GPT vs This Transformer
| Feature | Original Transformer (Image) | GPT |
| --- | --- | --- |
| Architecture | Encoder-Decoder | Decoder only |
| Use Case | Translation | Text generation |
| Attention | Bidirectional encoder | Left-to-right decoder only |
Some Important Keywords:
Tokens: Smallest units of text (like words, subwords, or characters) that a model processes.
Sequence: An ordered list of tokens representing a complete input or output (e.g., a sentence).
Input Tokens: Tokens given to the model as context or prompt to generate or understand text.
Output Tokens: Tokens generated by the model as a response or prediction.
🔹 1. Tokenization
📘 Definition:
Tokenization is the process of breaking a text into smaller units called tokens. These tokens can be words, subwords, or characters depending on the tokenizer used.
✅ Why it matters:
LLMs can't directly understand raw text — it has to be tokenized first.
Tokenization converts text into numerical token IDs that the model can process.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

text = "Hello I am Mannu"
tokens = enc.encode(text)
print(tokens)    # [13225, 357, 939, 23959, 84]

decoded = enc.decode(tokens)
print(decoded)   # Hello I am Mannu
🔹 2. Vector Embeddings
📘 Definition:
Vector embeddings are dense numerical representations of tokens or sentences in continuous vector space. Each word or sentence is mapped to a high-dimensional vector that captures its meaning.
✅ Why it matters:
Embeddings enable semantic similarity comparison.
Power search, clustering, classification, and many LLM tasks.
Here is a model with an embedding dimension of 384:
from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

text = "Dog chases cat"
embedding = embeddings.embed_query(text)
print(embedding)
print("Embedding length:", len(embedding))   # 384
And here is a model with an embedding dimension of 1024:
from langchain.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-roberta-large-v1")
text = "Dog chases cat"
embedding = embeddings.embed_query(text)
print(embedding)
print("Embedding length:", len(embedding))
🔹 What Are Embedding Dimensions?
When a sentence or word is converted into a vector embedding, it is transformed into a list of real numbers of fixed size — called its dimension.
For example:
[0.01, -0.05, 0.32, ..., 0.27] ← a 384-dimensional vector
If you're using a model like all-MiniLM-L6-v2, the output embedding will always have 384 dimensions, no matter the input sentence (unless it's too long and gets truncated).
🔍 Why Do Dimensions Matter?
| 🔧 Purpose | 💡 Explanation |
| --- | --- |
| Representation Power | Higher dimensions can capture more nuanced patterns and relationships. Think of it like storing more features about meaning. |
| Similarity Computation | Embeddings are compared using cosine similarity or Euclidean distance; these rely on the dimensionality to measure how close two vectors (meanings) are. |
| Downstream Tasks | Classification, clustering, search, and ranking all use these vectors; consistent dimensionality is critical for such ML tasks. |
| Model Efficiency | Larger dimensions give better expressiveness but cost more compute and memory. Models like MiniLM (384-dim) balance speed and performance. |
🧠 Analogy
Imagine you're describing a fruit:
3D: You only say color, size, and taste. (Basic idea)
300D: You describe texture, smell, shape, ripeness, species, etc. (Rich meaning)
The more dimensions, the more semantic detail the vector can encode.
🔢 Summary
| Model | Embedding Dimensions |
| --- | --- |
| all-MiniLM-L6-v2 | 384 |
| bert-base-nli-mean-tokens | 768 |
| text-embedding-3-small (OpenAI) | 1536 |
| text-embedding-3-large (OpenAI) | 3072 |
🔹 Why Do We Need Positional Encoding?
Transformers do not have built-in sequence order understanding — they treat input as a set, not a sequence.
So for inputs like:
"Dog chases cat"
and
"Cat chases dog"
Without positional encoding, the Transformer would see both as the same, because it only sees the embeddings of the individual tokens, not their order.
🔹 What Is Positional Encoding?
Positional Encoding is a way to add information about the position of each token in a sentence to its vector representation (embedding).
Each token’s final input embedding is a combination of:
Word Embedding + Positional Encoding
This allows the model to distinguish between:
- “dog chased cat” vs. “cat chased dog”
🔹 Types of Positional Encoding
Sinusoidal Encoding (used in the original Transformer paper):
Uses mathematical sine and cosine functions to assign each position a unique, fixed vector.
Patterns repeat periodically and help the model generalize to unseen sequence lengths (see the sketch after this list).
Learned Positional Embeddings:
The model learns a trainable vector for each position during training.
Used in GPT and BERT.
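Here is a minimal NumPy sketch of the sinusoidal variant, following the sine/cosine formulation from the original paper; the sequence length and model size are small arbitrary numbers for illustration:

import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    positions = np.arange(seq_len)[:, None]                 # shape (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                # even dimension indices
    angles = positions / np.power(10000, dims / d_model)    # pos / 10000^(2i / d_model)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=6, d_model=8)
print(pe.shape)   # (6, 8): one vector per position, same size as the token embeddings
# model_input = token_embeddings + pe   # added element-wise to the word embeddings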
🔹 Properties of Positional Encoding
Unique: Every position in a sentence gets a different vector.
Deterministic (in sinusoidal): Same position always gets the same encoding.
Adds sequence-awareness: Lets attention layers differentiate token order.
Compatible: Has the same dimensionality as token embeddings, so they can be added directly.
🔹 Summary
| Aspect | Explanation |
| --- | --- |
| Purpose | Inject order information into Transformers |
| Problem it solves | Transformers don’t know token sequence |
| How it works | Adds a vector (position info) to each token embedding |
| Types | Sinusoidal (fixed), Learned (trainable) |
| Used in | BERT, GPT, T5, etc. |
🔹 Self-Attention
✅ Definition:
Self-attention is a mechanism that allows each word in a sentence to look at other words and decide how much attention to give them when forming its own representation.
It's the core idea behind how Transformers understand relationships within a sentence.
🎯 Feel:
Imagine reading the sentence:
"The cat sat on the mat because it was tired."
When you reach "it", you ask:
"Who is 'it' referring to?" → Probably "cat"
Self-attention enables the model to do exactly this — for every word, it asks:
"Which other words help me make sense of myself?"
So:
"cat" might focus on "sat"
"tired" might focus on "it" and "cat"
"because" might glance at both clauses
Every word builds a contextual understanding by attending to others — like a room full of people briefly looking around to understand the bigger picture.
🔹 Multi-Head Attention
✅ Definition:
Multi-head attention is an extension of self-attention where multiple attention mechanisms (called heads) run in parallel. Each head captures different kinds of relationships in the sentence.
It's like having multiple “perspectives” to understand the sentence better.
🎯 Feel:
Imagine a group of readers analyzing the same sentence:
One focuses on grammar.
One focuses on semantics (word meaning).
One focuses on pronouns and references.
Another looks for action words.
Each “head” has a different focus — and together, their combined insight creates a much richer understanding of the sentence than one head alone.
So while self-attention is like a person reading a sentence and glancing around,
Multi-head attention is like a team of experts, each analyzing the same sentence from a unique angle — and then combining their thoughts.
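A compact NumPy sketch of the idea: the representation is split into several smaller heads, each head runs its own attention, and the results are concatenated. For brevity the sketch just slices the embedding; in a real model every head has its own learned Q/K/V projections:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores, axis=-1) @ V

seq_len, d_model, num_heads = 5, 16, 4
head_dim = d_model // num_heads

X = np.random.randn(seq_len, d_model)   # token representations (random stand-ins)

head_outputs = []
for h in range(num_heads):
    chunk = X[:, h * head_dim:(h + 1) * head_dim]   # this head's slice of the embedding
    head_outputs.append(attention(chunk, chunk, chunk))

combined = np.concatenate(head_outputs, axis=-1)   # heads are concatenated (and normally projected again)
print(combined.shape)                              # (5, 16): same shape as the input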
📌 Summary
| Term | Definition | Feel / Analogy |
| --- | --- | --- |
| Self-Attention | Each word pays attention to other words in the same sentence | Like one person reading a sentence and deciding which words matter most |
| Multi-Head Attention | Multiple attention heads working in parallel, each learning a different pattern | Like a group of experts, each focusing on a different aspect of the sentence |
🔹 The Two Phases of a Model
1. Training Phase
2. Inference Phase
🔹 1. Training Phase
✅ Definition:
This is when the model learns. It’s shown lots of data and taught to make predictions. If it makes mistakes, it corrects itself through a process called backpropagation.
🧠 Feel:
Like a student studying from a textbook and practicing questions. Every time they make a mistake, they learn from it and try to improve.
💡 What Happens:
Input data is passed to the model.
The model makes a prediction.
The loss (error) is calculated by comparing the prediction to the actual answer.
Using backpropagation, the model adjusts its internal parameters (weights) to reduce future errors.
🔹 Role of Backpropagation in Training
✅ Definition:
Backpropagation is the method used to compute and apply corrections to the model’s internal parameters based on the error made.
🧠 Feel:
Imagine you solve a math problem and get the wrong answer. You go back step-by-step through your work to see where you went wrong, and you adjust your method.
Backpropagation does exactly this:
It starts at the output (where the error is),
Moves backward through the network,
And updates each layer’s weights to reduce that error.
This process happens millions of times during training, gradually making the model smarter.
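A toy sketch of that loop, fitting a single weight with gradient descent. The gradient here is derived by hand for a squared-error loss; full backpropagation does the same thing automatically across millions of weights and many layers:

# Learn w so that prediction = w * x matches y = 3 * x (data made up for the example)
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
w = 0.0               # initial weight
learning_rate = 0.05

for epoch in range(100):
    for x, y in data:
        prediction = w * x              # forward pass
        error = prediction - y          # how wrong we were
        gradient = 2 * error * x        # d(loss)/dw for loss = (prediction - y)**2
        w -= learning_rate * gradient   # update the weight to reduce future error

print(round(w, 3))   # close to 3.0: the model has "learned" the underlying rule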
🔹 2. Inference Phase
✅ Definition:
This is when the model is already trained and is now used to make predictions on new, unseen data.
🧠 Feel:
Like the same student now taking an exam. No more learning — just applying what they’ve already studied.
💡 What Happens:
The trained model receives new input.
It passes through the network without updating anything.
The output (prediction) is returned.
No loss is computed, no backpropagation, and no weights are changed.
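As a sketch of what that looks like in code, assuming a PyTorch model (the model and input here are placeholders, not a real trained network):

import torch
import torch.nn as nn

model = nn.Linear(4, 2)   # placeholder standing in for a trained model
model.eval()              # switch layers like dropout/batch-norm to inference behavior

x = torch.randn(1, 4)     # a new, unseen input
with torch.no_grad():     # no gradients are tracked, so nothing can be updated
    prediction = model(x)

print(prediction)         # just the output; no loss, no backpropagation, no weight changes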
📌 Summary Table
| Phase | Purpose | Learning? | Involves Backpropagation? | Analogy |
| --- | --- | --- | --- | --- |
| Training | Teach the model | ✅ Yes | ✅ Yes | Student studying + revising |
| Inference | Use the trained model | ❌ No | ❌ No | Student answering an exam |