The Inner Workings of Generative AI Explained

Pankaj Kumar

The user typed "Hello", and ChatGPT responded with "Hey! How's it going?" Let's decode what happened using computer science.

Let's break the process down into the following concepts and phases:

  1. GPT

  2. Transformers

  3. Tokenization

  4. Vector Embedding

  5. Positional Encoding

  6. Self-Attention/Multi-Headed Attention

  7. Neural Networks

  8. A model has two phases: the training phase and the inference phase.

    1. Training phase

    2. Inference phase


1 - GPT (Generative Pre-trained Transformer)

Let's understand the meaning of each word in GPT.

Generative

  • Meaning: It means the model is capable of generating something — in this case, generating text.

  • In simple words: The model can create sentences, paragraphs, or even whole articles that look like they were written by a human.

Example:
If you give GPT a prompt like "Once upon a time," it can generate the rest of the story.

Pre-trained:

  • Meaning: Before you ask the model to do specific tasks (like answering questions, writing essays, summarizing, etc.), it is already trained on a large amount of general data (like books, articles, websites) in advance.

  • In simple words: It learns first, then fine-tunes or uses that knowledge to help you.

Example:
Imagine a student who has read thousands of books before taking an exam — that's what "pre-trained" means. GPT has "read" a huge amount of text before being used to do any specific task.

2 - Transformer

  • The Transformer model was introduced to "transform" input sequences (like sentences) into output sequences (like translations or answers).

  • It transforms information using attention mechanisms, rather than older sequential methods like RNNs (recurrent neural networks).

  • Instead of reading step-by-step, it pays attention to all parts of the input at once and transforms them into something new.

This refers to the model architecture — the Transformer model introduced in the paper "Attention Is All You Need" by Google in 2017 for language translation.

Let's understand this transformer model.


3 - Tokenization (text → tokens → numbers)

👉 Tokenization means breaking text into small pieces called tokens so that the model can understand and process them.

  • A token can be:

    • A word (e.g., "apple"),

    • A part of a word (e.g., "ap" + "ple"),

    • Or even punctuation (e.g., "." or ",").

The model doesn't work directly with full sentences or paragraphs; it only understands numbers. So tokenization turns text into smaller units, which are then converted into numbers for the model. An online tokenizer visualizer is a good way to see this in action.

Let's convert a sentence into numbers using the gpt-4o tokenizer.

Example: "The people sat on the floor"

```python
import tiktoken

# Get the tokenizer used by gpt-4o
encoder = tiktoken.encoding_for_model("gpt-4o")

text = "The people sat on the floor"
tokens = encoder.encode(text)

print(tokens)  # [976, 1665, 10139, 402, 290, 8350]
```

OUTPUT: "The people sat on the floor" → 6 tokens → [976, 1665, 10139, 402, 290, 8350]


4 - Vector Embedding

👉 Vector embedding turns those numbers into dense vectors — basically, meaningful groups of numbers that the model can understand better.

Step-by-Step:

1. Tokenization:

  • Break the text into tokens.

  • Example:
    "Cats are cute." → ["Cats", "are", "cute", "."]

2. Convert tokens to numbers (IDs):

  • ["Cats" → 512, "are" → 71, "cute" → 400, "." → 13]

3. Vector Embedding:

  • Now, instead of treating 512, 71, 400, etc. as simple numbers,
    each number is mapped to a high-dimensional vector — a list of floating-point numbers (like 128 or 768 numbers long, depending on the model size).

  • An online embedding visualizer can help you understand this better.

Example:

  • 512 → [0.3, -0.1, 0.7, ..., 0.2] (say, a 768-dimensional vector)

  • 71 → [-0.4, 0.8, -0.2, ..., -0.7]

  • 400 → [0.5, -0.9, 0.1, ..., 0.0]

➡️ These vectors capture the meaning of the token, not just its ID.
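In code, an embedding layer is just a lookup table: a matrix with one learned row per vocabulary entry. Here is a minimal NumPy sketch; the vocabulary size, 8 dimensions, and random values are made up for illustration (real models use learned matrices with 768+ dimensions):

```python
import numpy as np

vocab_size, embed_dim = 1000, 8     # toy sizes; GPT-scale models are far larger
rng = np.random.default_rng(0)

# The embedding matrix: one row (vector) per token ID.
embedding_table = rng.normal(size=(vocab_size, embed_dim))

token_ids = [512, 71, 400, 13]      # "Cats", "are", "cute", "." from the example above
vectors = embedding_table[token_ids]  # lookup is plain row indexing

print(vectors.shape)  # (4, 8): one dense vector per token
```

During training, the values in this table are adjusted; the lookup itself never changes.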

Semantic meaning:

✅ After tokenization and vector embedding, each word (or token) is represented as a vector.
✅ These vectors are positioned in a way that captures meaning based on training on huge amounts of text.

How does it learn the meaning?

  • During pretraining, the model (like GPT) reads billions of sentences.

  • It adjusts the embeddings so that:

    • Words that often appear together (like "king" and "queen", or "dog" and "bark") have similar vectors.

    • Words that are related (semantically or logically) are placed closer in the vector space.


Example of semantic meaning in embeddings:

✅ After training:

  • "King" → vector like [0.8, 0.3, ...]

  • "Queen" → vector like [0.79, 0.32, ...]
    (Very close!)

✅ Also cool:

  • King - Man + Woman ≈ Queen
    (Using just vector math! That's semantic understanding in numbers.)
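The king − man + woman ≈ queen relation really is just vector math. The 4-dimensional vectors below are hand-constructed toys chosen so the relation holds; real embeddings learn this kind of structure from data:

```python
import numpy as np

# Toy vectors, hand-constructed so the analogy holds (not real embeddings).
king  = np.array([0.9, 0.8, 0.1, 0.7])
man   = np.array([0.9, 0.1, 0.1, 0.2])
woman = np.array([0.1, 0.1, 0.9, 0.2])
queen = np.array([0.1, 0.8, 0.9, 0.7])

def cosine(a, b):
    # Cosine similarity: 1.0 means same direction, 0 means unrelated.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

result = king - man + woman
print(cosine(result, queen))  # close to 1.0: the analogy holds in vector space
```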

Code to print an embedding

```python
from dotenv import load_dotenv
from openai import OpenAI

# Load your OPENAI_API_KEY from a .env file
load_dotenv()

client = OpenAI()
text = "Eiffel Tower is in Paris and is a famous landmark, it is 324 meters tall"

response = client.embeddings.create(
    input=text,
    model="text-embedding-3-small"
)

print(response.data[0].embedding)
```

5 - Positional Encoding

Positional Encoding is a way to tell the Transformer about the order of the words.

Why do we need Positional Encoding?

  • Transformers look at all words at once (parallel processing, not step-by-step).

  • That means the model doesn’t naturally know if "The cat sat" is different from "Sat cat the" — because it sees all tokens at the same time.

Positional Encoding gives information like:

"Hey model! 'The' comes first, 'cat' comes second, 'sat' comes third!"

How does it work?

  • For each position (word number), we create a special vector called a position vector.

  • These vectors are added to the token embeddings.

Example:

| Token | Embedding (meaning) | + | Positional Encoding (order) | = | Final input to model |
|:---|:---|:---|:---|:---|:---|
| "The" (position 0) | [0.4, -0.1, 0.5...] | + | [0.8, 0.2, -0.5...] | = | [1.2, 0.1, 0.0...] |
| "cat" (position 1) | [0.1, 0.7, -0.2...] | + | [0.3, 0.9, 0.6...] | = | [0.4, 1.6, 0.4...] |
| "sat" (position 2) | [-0.2, 0.3, 0.5...] | + | [0.7, 0.1, -0.7...] | = | [0.5, 0.4, -0.2...] |

The model uses both meaning (from token embedding) and order (from positional encoding).
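The original Transformer paper uses fixed sine and cosine waves of different frequencies as the position vectors. A minimal NumPy sketch, with the sequence length and dimensions shrunk for readability:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # From "Attention Is All You Need":
    #   PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    #   PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pos = np.arange(seq_len)[:, None]         # (seq_len, 1) column of positions
    i = np.arange(0, d_model, 2)[None, :]     # even dimension indices
    angle = pos / np.power(10000, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)               # sine on even dimensions
    pe[:, 1::2] = np.cos(angle)               # cosine on odd dimensions
    return pe

token_embeddings = np.random.randn(3, 8)     # "The", "cat", "sat" as toy 8-d vectors
model_input = token_embeddings + positional_encoding(3, 8)  # add order information
print(model_input.shape)  # (3, 8)
```

Because the waves have different frequencies, every position gets a unique pattern, and the model can also learn relative distances between positions.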


6 - Self-Attention/Multi-Headed Attention

👉 Self-Attention is a technique that lets each word (token) look at all other words in the sentence and decide who is important to it.

Example:

Sentence:

"The cat sat on the mat."

Now, when processing the word "sat",

  • the model pays attention to "cat" (Who sat?)

  • a little less to "mat" (Where did it sit?)

  • very little to "the" (small grammar word).

👉 Each word dynamically looks around and gathers important context.

Simple idea:

  • "sat" cares most about "cat".

  • "mat" cares most about "on" and "the".

  • "cat" might care most about "the".

Each word builds an understanding based on its relationships with others!

How Self-Attention Works (in steps):

✅ For every word, the model creates three vectors:

  • Query vector (Q) — What am I asking about?

  • Key vector (K) — What do I contain that others might be looking for?

  • Value vector (V) — What information do I carry?

✅ The model compares Query vs Key to calculate attention scores (how much to attend to each word).

✅ Then it mixes the Values based on those scores!

Small flowchart:

Each word ➡️ Create (Q, K, V) ➡️ Compare Q and K ➡️ Get scores ➡️ Mix V based on scores ➡️ Output

Quick micro-example:

Imagine "sat" is processing:

| Word | Attention Score (importance to "sat") |
|:---|:---|
| the | 0.1 |
| cat | 0.8 |
| sat | 1.0 (itself) |
| on | 0.3 |
| the | 0.1 |
| mat | 0.5 |
  • So "sat" looks heavily at "cat" and "mat" but not much at "the".
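The Q/K/V steps above can be sketched as scaled dot-product attention. The projection matrices here are random stand-ins; in a real model they are learned:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d = 8                                    # toy embedding size
rng = np.random.default_rng(0)
X = rng.normal(size=(6, d))              # 6 tokens: "The cat sat on the mat"

# Learned projection matrices (random stand-ins here).
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / np.sqrt(d)            # compare every Query with every Key
weights = softmax(scores)                # each row sums to 1: an attention distribution
output = weights @ V                     # mix the Values by attention weight

print(weights[2].round(2))               # how much "sat" attends to each word
```

Row 2 of `weights` is the real-model analogue of the toy score table above, except that softmax guarantees each row sums to 1.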

Multi-Head Attention:

👉 Multi-Head Attention = Many self-attentions happening in parallel, each looking at the words in different ways.

Why multiple heads?

✅ One attention head might focus on subjects ("who is doing what").
✅ Another head might focus on locations ("where things happen").
✅ Another head might focus on tense ("past, present").

Each head learns different relationships!

✅ After that, the model combines (concatenates) all heads together to form a full understanding.

Small analogy:

Think of a detective using multiple magnifying glasses 🔎🔎🔎 to look at a crime scene —
One glass shows footprints, another shows fingerprints, another shows fibers — then he puts it all together.
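The multiple-magnifying-glasses idea can be sketched by running several smaller attentions side by side and concatenating their outputs. Again, the weights are random stand-ins and the sizes are toys:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    w = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return w @ V

d, heads = 8, 2
head_dim = d // heads                    # each head works in a smaller subspace
rng = np.random.default_rng(1)
X = rng.normal(size=(6, d))              # 6 tokens, toy 8-d embeddings

# Each head has its own projections, so it can learn a different "view".
outputs = []
for _ in range(heads):
    Wq, Wk, Wv = (rng.normal(size=(d, head_dim)) for _ in range(3))
    outputs.append(attention(X, Wq, Wk, Wv))

combined = np.concatenate(outputs, axis=-1)  # concatenate heads back to size d
print(combined.shape)  # (6, 8)
```

Real models also apply one more learned projection after the concatenation to mix the heads together.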


7 - Neural Networks

👉 Neural networks are everywhere inside a Transformer!
But specifically, they come mainly in two places:

Feed-Forward Neural Network (FFN) — after Self-Attention

✅ After the model does Self-Attention (where words look at each other),
✅ Each token (word) goes through a tiny Neural Network separately.

That little network is called a Feed-Forward Neural Network (FFN).

What happens there?

  • It transforms the information about each word individually.

  • It applies non-linearity (meaning: helps the model learn more complex patterns, not just simple lines).


Simple flow:

Word's Attention Output ➡️ Feed-Forward Neural Network ➡️ Transformed Output

Each word's information is updated independently!

Example:

Imagine the word "cat" after attention.
It goes into a 2-layer small neural network:

  1. First layer: Multiply by weights + add bias → apply ReLU (activation).

  2. Second layer: Multiply by new weights → final output.

✅ It's just like a mini brain helping each word think better before moving to the next layer.
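The two-layer mini-network described above, sketched in NumPy (toy sizes, random stand-in weights; real models use learned weights and dimensions like 768 → 3072 → 768):

```python
import numpy as np

d, hidden = 8, 32
rng = np.random.default_rng(2)
W1, b1 = rng.normal(size=(d, hidden)), np.zeros(hidden)
W2, b2 = rng.normal(size=(hidden, d)), np.zeros(d)

def ffn(x):
    # Layer 1: multiply by weights, add bias, apply the ReLU non-linearity.
    h = np.maximum(0, x @ W1 + b1)
    # Layer 2: multiply by new weights and project back to the model dimension.
    return h @ W2 + b2

attention_output = rng.normal(size=(6, d))  # one row per token, e.g. "cat"
out = ffn(attention_output)                 # applied to each token independently
print(out.shape)  # (6, 8)
```

Note that `ffn` sees each row (token) on its own; unlike attention, no information flows between tokens here.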


8 - A model has two main phases: Training and Inference

Training Phase

👉 Training = Teaching the model how to understand language.
It’s like school for the model! 📚

In Training:

  • Model sees millions or billions of examples (sentences, paragraphs).

  • It learns to predict words, relationships, meanings, patterns.

  • Adjusts its internal parameters (weights) using a method called backpropagation.

How Transformer trains:

✅ Step-by-step:

| Step | What Happens |
|:---|:---|
| Input Text | Sentences are tokenized into tokens. |
| Embedding | Tokens are turned into vectors. |
| Positional Encoding | Order information is added to the embeddings. |
| Attention + Neural Networks | The Transformer reads relationships between tokens. |
| Prediction | The model predicts the next token or missing words. |
| Loss Calculation | The prediction is compared with the correct answer (the "loss"). |
| Backpropagation | Model weights are adjusted to do better next time. |

This repeats millions of times until the model becomes smart!

Training Example:

Sentence:

"The cat sat on the ___."

The model might predict:

  • "mat" (correct!) ✅

  • "hat" (wrong) ❌

Based on the result, it updates its parameters to get better.
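The "compare prediction vs. correct answer" step is typically cross-entropy loss over the model's next-token probabilities. A toy sketch with a 3-word vocabulary and made-up raw scores (logits):

```python
import numpy as np

vocab = ["mat", "hat", "dog"]
logits = np.array([2.0, 1.0, -1.0])   # made-up raw scores for the next token

probs = np.exp(logits) / np.exp(logits).sum()   # softmax turns scores into probabilities
target = vocab.index("mat")                      # the correct continuation

loss = -np.log(probs[target])   # cross-entropy: small when "mat" gets high probability
print(float(loss))
```

Backpropagation then nudges every weight in the direction that makes this loss smaller, which is exactly "updating parameters to get better".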

Training needs a lot of resources:

  • GPUs, TPUs, huge datasets, long time (days or weeks).

Inference Phase

👉 Inference = Using the trained model to make predictions.
It’s like a graduated student solving problems! 🎓🧠

In Inference:

  • The model is already trained.

  • It does not learn anymore — it just reads input and gives output.

  • It uses fixed weights (no updating, no backpropagation).

How Transformer does inference:

✅ Step-by-step:

| Step | What Happens |
|:---|:---|
| Input Text | You give a prompt ("Once upon a time..."). |
| Tokenization | Break the text into tokens. |
| Embedding + Positional Encoding | Turn tokens into vectors + add order info. |
| Attention + Neural Networks | Read relationships between tokens. |
| Generate Prediction | Output the next token / next word / complete text! |

Inference Example:

You input:

"The sun is shining and the sky is"

Model outputs:

"blue."

✅ No learning, just prediction!
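In practice, inference is a loop: feed the tokens in, pick the next token, append it, and repeat until a stop token. A toy greedy-decoding loop, where `next_token_logits` is a hand-scripted stand-in for a real model's forward pass:

```python
import numpy as np

vocab = ["blue", ".", "sky", "the", "<end>"]

def next_token_logits(tokens):
    # Stand-in for a real model: hand-scripted scores for this toy example.
    script = {0: "blue", 1: ".", 2: "<end>"}
    scores = np.full(len(vocab), -1.0)
    scores[vocab.index(script[len(tokens)])] = 1.0
    return scores

tokens = []
while True:
    logits = next_token_logits(tokens)
    token = vocab[int(np.argmax(logits))]   # greedy: always take the top-scoring token
    if token == "<end>":
        break
    tokens.append(token)

print(" ".join(tokens))  # blue .
```

Note the loop only ever reads the fixed scores and picks a token; nothing is updated, which is why inference is so much cheaper than training.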

🎯 Summary:

| Phase | Purpose | Model Action |
|:---|:---|:---|
| Training | Learn from examples | Adjusts weights by learning from mistakes. |
| Inference | Use the trained knowledge | Predicts based on fixed weights (no more learning). |

✨ Super simple definitions:

  • Training = "Learn how language works." 📚

  • Inference = "Use what I learned to answer you!" 🤖
