Stepping Into the Gen AI World: A Beginner's Guide to All the Buzzwords

What is OpenAI GPT?
Let’s start with the name that’s on everyone’s lips: OpenAI GPT.
OpenAI: This is the company behind popular AI models like ChatGPT, DALL·E, and Codex. OpenAI’s mission is to ensure that artificial general intelligence (AGI)—highly autonomous systems that outperform humans at most economically valuable work—benefits all of humanity.
GPT stands for Generative Pre-trained Transformer.
Let’s break that down:
Generative
It means the model can generate new content—like writing essays, answering questions, creating poems, or even writing code.
Pre-trained
Before you even interact with it, the model has already been trained on a massive dataset pulled from books, websites, and other online content. This pre-training gives it a general understanding of how language works.
Transformer
This refers to the underlying architecture used to build GPT. It's a kind of neural network that uses something called self-attention to understand relationships between words in a sentence.
Together, GPT is a model that generates human-like text based on patterns it learned from a massive amount of training data.
Now let's dive deeper into Transformers:
When we say transformer in the Gen AI world, we’re not talking about Optimus Prime—we’re talking about one of the most revolutionary deep learning architectures ever created.
The Transformer was introduced in a 2017 research paper titled “Attention is All You Need” by Vaswani et al. It changed the way we approach tasks like translation, summarization, and text generation.
At a high level, text flows through a Transformer model in two main parts: the Encoder and the Decoder. Let's go layer by layer.
Before diving into the Transformer, let's first understand what tokenization is.
What is Tokenization?
Before a model like GPT can understand or generate anything, the first step is tokenization—breaking the input text into smaller parts called tokens.
A token can be:
A whole word ("apple")
Part of a word ("play" and "ing")
Or even just punctuation ("!")
For example:
"Transformers are powerful!"
gets split into:
Tokens: ["Transform", "ers", "are", "power", "ful", "!"]
Token IDs: [10521, 982, 401, 8921, 110, 0]
(example numbers)
These token IDs are what actually go into the model—not the original text.
Each model has a vocabulary—a big list of all the tokens it knows. GPT-3, for example, has about 50,000 tokens in its vocabulary. More tokens = more flexibility, but also more complexity.
In short:
Tokenization is the first step that turns words into numbers, so the model can make sense of them.
What Are Encoders in Transformers?
In the world of transformers, encoders are the first key component that processes the input. Their job is to read the input data (usually a sequence of tokens or words) and turn it into something the model can understand: a contextualized representation.
The Goal of the Encoder
Think of an encoder like a translator. It takes in the raw text (like "The cat sat on the mat") and turns it into a form that the model can understand better, called an embedding (a high-dimensional representation of the input).
The encoder does this by capturing the relationships between words in the sentence, so that the model can later generate accurate and context-aware outputs.
In short, the process of converting user input into tokens can be termed encoding.
A simple code snippet showing how encoding can be done:
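As a rough sketch, here is a toy encoder built on a tiny hand-made vocabulary (real models like GPT use a learned byte-pair-encoding vocabulary of roughly 50,000 tokens; the words and IDs below are made up for illustration):

```python
# Toy encoder: maps text to token IDs using a tiny hand-made vocabulary.
# Leading spaces are part of some tokens, mimicking how GPT-style tokenizers work.
vocab = {"Transform": 10521, "ers": 982, " are": 401, " power": 8921, "ful": 110, "!": 0}

def encode(text: str) -> list[int]:
    """Greedily match the longest known token at each position."""
    ids = []
    while text:
        for token in sorted(vocab, key=len, reverse=True):
            if text.startswith(token):
                ids.append(vocab[token])
                text = text[len(token):]
                break
        else:
            raise ValueError(f"No token matches: {text!r}")
    return ids

print(encode("Transformers are powerful!"))  # [10521, 982, 401, 8921, 110, 0]
```

The output is the list of token IDs that would actually be fed into the model, matching the example split shown earlier.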
What Are Decoders in Transformers?
In transformers, the decoder is the part of the architecture responsible for generating output. While the encoder focuses on understanding and processing the input, the decoder takes the information from the encoder (or context in some models) and generates the desired output—whether it's text, translation, or even a continuation of a story.
The Goal of the Decoder
Think of the decoder like a creative writer or a translator. Its job is to take all the knowledge gathered by the encoder and produce meaningful output.
For example, in a translation task, the encoder would process the input sentence ("The cat sat on the mat"), and the decoder would generate the translated sentence ("El gato se sentó en la alfombra").
In short, the process of converting tokens back into text can be termed decoding.
A simple code snippet showing how decoding can be done:
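Using the same kind of tiny hand-made vocabulary (a stand-in for a real learned vocabulary like GPT's), decoding is just a reverse lookup from IDs back to text:

```python
# Toy decoder: inverts the token-ID mapping and joins the pieces back into text.
vocab = {"Transform": 10521, "ers": 982, " are": 401, " power": 8921, "ful": 110, "!": 0}
id_to_token = {token_id: token for token, token_id in vocab.items()}

def decode(ids: list[int]) -> str:
    """Look up each ID's token and concatenate them."""
    return "".join(id_to_token[i] for i in ids)

print(decode([10521, 982, 401, 8921, 110, 0]))  # Transformers are powerful!
```

Because GPT-style tokens carry their own leading spaces, simple concatenation is enough to reconstruct the original sentence.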
Let's dive into vector embeddings and explain what they are and how they work, especially in the context of transformers and Gen AI.
What Are Vector Embeddings?
In the world of AI, vector embeddings are a way to represent words, sentences, or even entire documents as numerical vectors (lists of numbers). These vectors are high-dimensional representations that capture the semantic meaning of the input in a mathematical form.
The idea behind embeddings is that similar words or pieces of text should be closer together in this high-dimensional space, and dissimilar ones should be farther apart. This allows the model to work with numbers instead of raw text, while still preserving relationships between the words.
Why Do We Use Vector Embeddings?
Natural language is complex, and the meaning of words can change depending on context. In a sentence like “He saw the bank,” the word "bank" could refer to a financial institution or the side of a river. Vector embeddings allow the model to understand this nuance by representing the word "bank" in a way that reflects its context.
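To make "closer together in this high-dimensional space" concrete, here is a toy sketch with made-up 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions), using cosine similarity as the standard closeness measure:

```python
import math

# Made-up 3-D embeddings for illustration only; real models learn these vectors.
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.85, 0.75, 0.2],
    "car": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """1.0 means the vectors point the same way; values near 0 mean unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(embeddings["cat"], embeddings["dog"]))  # high (similar meanings)
print(cosine_similarity(embeddings["cat"], embeddings["car"]))  # lower (dissimilar)
```

"cat" and "dog" score much higher than "cat" and "car", which is exactly the property the model relies on.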
Once the words are represented as vectors, the model needs to figure out how each word is related to others in the sentence. That's where Self-Attention comes in.
Self Attention
Self-attention helps the model decide which parts of the sentence matter most for understanding each word.
For example, when processing the word "cat" in the sentence "The cat sat on the mat," the model looks at all the other words to decide which ones are important for understanding "cat." It pays more attention to words like "sat" and "mat" because they describe what the cat is doing and where.
In short, self-attention helps the model focus on the right parts of the sentence to get a better understanding of each word in context.
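Under the hood, self-attention is a weighted average: each token's query vector is compared against every token's key vector, the scores are turned into weights, and those weights mix the value vectors. A minimal sketch with random made-up vectors for a 3-token sentence:

```python
import numpy as np

# Scaled dot-product self-attention for 3 tokens with 4-dimensional vectors.
# Q, K, V are random here; real models learn projections that produce them.
np.random.seed(0)
d_k = 4
Q = np.random.randn(3, d_k)  # one query vector per token
K = np.random.randn(3, d_k)  # one key vector per token
V = np.random.randn(3, d_k)  # one value vector per token

scores = Q @ K.T / np.sqrt(d_k)  # how strongly each token attends to each other token
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax per row
output = weights @ V             # each token's new representation: a mix of values

print(weights.round(2))  # each row sums to 1
```

Each row of `weights` says how much attention one token pays to every token in the sentence, including itself.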
Multi Head Attention
Multi-head attention is like having multiple "eyes" that look at different parts of a sentence simultaneously. Each "eye" (or "head") focuses on a different aspect of the sentence, helping the model capture a richer understanding.
For example, while one "head" might focus on the relationship between "cat" and "sat," another could focus on "mat" and "sat." After all the heads have looked at the sentence, their insights are combined to form a more complete understanding.
This approach allows the model to capture various relationships and nuances in the sentence at the same time.
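A rough sketch of the idea: split made-up 8-dimensional token vectors into 2 heads of 4 dimensions each, run attention separately in each head, then concatenate the results (real models use far larger dimensions, more heads, and learned projections):

```python
import numpy as np

np.random.seed(0)
n_tokens, d_model, n_heads = 3, 8, 2
d_head = d_model // n_heads

x = np.random.randn(n_tokens, d_model)  # made-up token representations

def attention(Q, K, V):
    """Scaled dot-product attention over one head's slice."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ V

# Each head looks at a different slice of the representation.
heads = []
for h in range(n_heads):
    sl = slice(h * d_head, (h + 1) * d_head)
    heads.append(attention(x[:, sl], x[:, sl], x[:, sl]))

output = np.concatenate(heads, axis=-1)  # combine the heads' insights
print(output.shape)  # (3, 8)
```

The concatenated output has the same shape as the input, but each half was produced by a head that attended to the sentence in its own way.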
SoftMax
Softmax is a function that takes a bunch of numbers and turns them into probabilities. It makes sure that the numbers are scaled in a way that they all add up to 1, which means you can interpret them as probabilities.
For example, if you have three numbers: 2, 1, and 0.5, Softmax would convert them into values between 0 and 1, with the highest number getting the highest probability. So, it helps decide which option is the most important or likely.
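The example above can be checked in a few lines:

```python
import math

def softmax(xs):
    """Exponentiate each number, then normalize so the results sum to 1."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2, 1, 0.5])
print([round(p, 3) for p in probs])  # [0.629, 0.231, 0.14] — sums to 1
```

The largest input (2) gets the largest probability, and all three values add up to exactly 1.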
Temperature
Temperature is a setting used to control how random or predictable the model’s responses are.
Low temperature (close to 0) makes the model more confident and predictable. It tends to give safer, more common responses.
High temperature (e.g., 1 or higher) makes the model more creative and random. It gives more varied and diverse responses, but they might be less reliable.
In essence, temperature adjusts how much "spice" or randomness the model adds to its output.
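In practice, temperature is applied by dividing the model's raw scores (logits) by the temperature before the softmax. A small sketch with made-up logits:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then apply softmax."""
    scaled = [x / temperature for x in logits]
    exps = [math.exp(x) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2, 1, 0.5]
print(softmax_with_temperature(logits, 0.5))  # sharper: the top option dominates
print(softmax_with_temperature(logits, 2.0))  # flatter: options are more even
```

Dividing by a small temperature exaggerates the gaps between scores (predictable output), while a large temperature shrinks them (more random output).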
Difference between SoftMax and Temperature
Softmax: It turns a list of numbers into probabilities that add up to 1. It helps the model decide which option is most likely.
Temperature: It changes how "confident" the model is when making decisions.
Low temperature makes the model pick the most likely option (more predictable).
High temperature makes the model pick from a wider range of options (more random).
In short: Softmax decides the probabilities, and temperature controls how risky or safe the model's choices are.
Knowledge Cutoff
Knowledge cutoff refers to the point in time when a model like GPT stops learning from new data. This means that any information, events, or developments that happen after the cutoff date are unknown to the model.
For example, if a model's knowledge cutoff is in 2021, it won't be aware of anything that occurred after that year, such as new scientific discoveries, events, or trends.
Written by Karan Singh