Understanding Generative AI: From Tokens to Transformers

Kartikey Jaiswal
10 min read

Introduction

Generative Artificial Intelligence (GenAI) is any type of AI that can generate new, original content based on patterns and examples learned from its training data.

It analyzes vast amounts of data, looking for patterns and relationships, then uses these insights to create fresh content that mimics the original dataset. It does this by leveraging machine learning models. This content can be text, images, video, code, or synthetic data. Examples include ChatGPT, Claude, DALL-E, and Midjourney.

In this post, I’ll break down key concepts like GPT, transformers, tokenization, vector embeddings, and more — explaining how they work together to power generative AI models.

Generative Pre-trained Transformer (GPT)

Generative Pre-trained Transformers (GPT) are a type of deep learning model used to generate human-like text. Let's break down the acronym:

  • Generative: Generative AI is a technology capable of producing content, such as text and imagery.

  • Pre-trained: Pre-trained models are saved networks that have already been taught, using a large data set, to resolve a problem or accomplish a specific task.

  • Transformer: A transformer is a deep learning architecture that transforms an input into another type of output.

There are endless applications for GPT models, and you can fine-tune them on specific data for even better results. Because a transformer is pre-trained once and then reused, it saves computing costs, time, and other resources. Unlike rule-based systems, GPT learns patterns and structures in text data to generate human-like responses.

Common uses include:

  • Answering questions

  • Summarizing text

  • Translating text to other languages

  • Generating code

  • Generating blog posts, stories, conversations, and other content types

What are Transformers?

Transformers are the heart of GPTs. Think of them as the Matrix of Leadership, which allows Optimus Prime to leverage the knowledge of his ancestors to inform his decisions.

A transformer is a type of artificial intelligence model that learns to understand and generate human-like text by analyzing patterns in large amounts of text data. Transformers are a type of deep learning architecture that revolutionized how we build AI models — especially language models like ChatGPT, GPT-4, Claude, etc.

They are designed to process input as tokens and predict the next token, using attention mechanisms to understand the relationship between words.

So, how do they do it?

They are specifically designed to comprehend context and meaning by analyzing the relationship between different elements, and they rely almost entirely on a mathematical technique called attention to do so.

Transformer models are among the most influential recent developments in machine learning. The architecture was introduced in the 2017 Google research paper "Attention Is All You Need".

Transformers are like very smart sentence guessers.

Imagine you're playing a game where your job is to guess the next word in a sentence, like:

“Once upon a…”

You might guess: “time.”

Then the game goes on:

“Once upon a time, there…”

You guess: “was.”

Transformers do this guessing — but way faster, smarter, and using everything they’ve learned from reading the internet.
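
To make that concrete, here's a minimal sketch of the guessing game in code. It assumes the Hugging Face transformers library (plus PyTorch) and the public gpt2 checkpoint are installed; any causal language model would behave similarly.

```python
# A minimal sketch of next-token guessing, assuming the Hugging Face
# `transformers` library and the public "gpt2" checkpoint are available
# (pip install transformers torch).
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Ask the model to continue the prompt a few tokens at a time,
# mirroring the guessing game above. do_sample=False picks the
# single most likely continuation each step.
result = generator("Once upon a", max_new_tokens=3, do_sample=False)
print(result[0]["generated_text"])  # likely continues with "time, ..."
```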

How Does a Transformer Work? It's Not Magic, It's Math

  1. Input Sentence

    "Once upon a time"

  2. Tokenization

    ["On”][”c”][”e"]["up”][”on"]["a"]["tim”][”e"]

    Tokenization is one of the first and most important steps in natural language processing (NLP). It’s the process of breaking down a piece of text — like a sentence or paragraph — into smaller units called tokens.

    This step is crucial because machines don't understand raw text — they understand numbers. So once the text is tokenized, each token is mapped to numbers (called a vector embedding), and then the model can start learning or predicting.
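
As a rough illustration, here's how tokenization looks with the tiktoken library (an assumption on my part; any BPE tokenizer would do), which exposes the byte-pair encoding used by GPT-2:

```python
# A minimal sketch of tokenization, assuming the `tiktoken` library is
# installed (pip install tiktoken); other tokenizers split text differently.
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # the byte-pair encoding used by GPT-2

tokens = enc.encode("Once upon a time")
print(tokens)                             # token IDs, e.g. [7454, 2402, 257, 640]
print([enc.decode([t]) for t in tokens])  # the text piece behind each ID
```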

  3. Vector Embedding

Vector embeddings are digital fingerprints for words or other pieces of data. Instead of using letters or images, they use numbers that are arranged in a specific structure called a vector, which is like an ordered list of values.

Imagine each vector as a point in a multi-dimensional space, where its location carries vital information about the represented word or data. Each vector is like a unique identifier that not only encapsulates a word's meaning but also reflects how this word relates to others. Words with similar definitions often have vectors close together in this numerical space, just like neighboring points on a map. This proximity reveals the semantic connections between words.

Imagine a 3D scatter plot that visualizes the concept of vector embeddings for words. Each point in the space represents a word, with its position determined by its vector embedding. Blue points clustered together represent animal-related words ("Cat," "Dog," "Pet," "Animal"), while red points represent vehicle-related words ("Car," "Vehicle"). The proximity of points indicates semantic similarity: words with related meanings are positioned closer together in this vector space.

For example, the word "cat" might have a vector like this: [0.9, 0.2, 0.7, 0.3, 1, 0, 0, 0, 0.4, 0.8, 0.9] and "freedom" might be: [0.1, 0.8, 0.6, 0.7, 1, 0, 0, 0, 0.7, 0.3, 0.2].

For instance, “Cat” and “Dog” are near each other, reflecting their shared characteristics as common pets. Similarly, “Car” and “Vehicle” are close, showing their related meanings. However, the animal cluster is far from the vehicle cluster, illustrating that these concept groups are semantically distinct.

This spatial representation allows us to visually grasp how vector embeddings capture and represent the relationships between words. It transforms linguistic meaning into geometric relationships that can be measured and analyzed mathematically.

Vector embeddings are a valuable technique for transforming complex data into a format suitable for machine learning algorithms. By converting high-dimensional and categorical data into lower-dimensional, continuous representations, embeddings enhance model performance and computational efficiency while preserving underlying data patterns.
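
To make the "nearby points" idea concrete, here is a toy sketch using made-up 3-dimensional vectors and cosine similarity. Real embeddings have hundreds or thousands of dimensions, and the numbers below are invented purely for illustration:

```python
# A toy sketch of "closeness" in embedding space using made-up 3-D vectors.
import numpy as np

embeddings = {
    "cat": np.array([0.90, 0.80, 0.10]),
    "dog": np.array([0.85, 0.75, 0.20]),
    "car": np.array([0.10, 0.20, 0.90]),
}

def cosine_similarity(a, b):
    # 1.0 means the vectors point the same way; near 0 means unrelated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["cat"], embeddings["dog"]))  # high
print(cosine_similarity(embeddings["cat"], embeddings["car"]))  # much lower
```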

  4. Positional Encoding

Once we’ve converted each token into a vector embedding, we hit a problem:

Transformers have no built-in sense of word order.

That means the model can’t tell:

  • Whether "I love you" is different from "You love I"

  • Or whether "not happy" means the opposite of "happy"

So we need a way to tell the model the position of each word in the sentence.

Solution: Positional Encoding

Positional encoding adds a unique signal to each token’s embedding that tells the model its position in the sequence.

Each token's position in the sequence is encoded into a positional embedding, which is added to the token embedding to capture the positional information. The token and positional embeddings usually have the same dimensionality, so they can be added directly.

How are these positional encodings generated?

These positional embeddings are generated by an equation that uses the token's position together with the sine and cosine functions to produce a unique encoding for each position. Sine is used for even embedding indices, and cosine for odd indices.
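
Here is a minimal NumPy sketch of that equation, following the sinusoidal formulation from "Attention Is All You Need":

```python
# A minimal sketch of sinusoidal positional encoding:
# sine on even embedding indices, cosine on odd ones.
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, np.newaxis]   # token positions 0..seq_len-1
    i = np.arange(d_model)[np.newaxis, :]     # embedding dims 0..d_model-1
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])     # even indices -> sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])     # odd indices -> cosine
    return pe

pe = positional_encoding(seq_len=4, d_model=8)
# These values are simply added to the token embeddings (same shape).
print(pe.shape)  # (4, 8)
```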

  5. Attention

Attention is like giving AI a highlighter and the ability to look back.

Attention mechanisms help language models understand complex structures and represent text more effectively by focusing on important words and their relationships. To better understand how attention works, consider reading a mystery book. As you would focus on clues while ignoring less important content, attention enables models to identify and concentrate on crucial input data.

How does attention work?

Let’s consider the word “bat” in these two sentences:

  1. "Swing the bat!"

  2. "The bat flew at night."

Traditional embedding methods assign a single vector representation to “bat,” limiting their ability to distinguish meaning. Attention mechanisms, however, address this by computing context-dependent weights.

They analyze surrounding words ("swing" versus "flew") and calculate attention scores that determine relevance. These scores are then used to weight the embedding vectors, resulting in distinct representations for "bat" as a sports tool (high weight on "swing") or a flying creature (high weight on "flew").

This allows the model to capture semantic nuances and improve comprehension.
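
As a rough sketch, these context-dependent weights can be computed with scaled dot-product attention, the core operation inside transformers. The toy word vectors below are invented for illustration:

```python
# A minimal sketch of scaled dot-product attention with toy 2-D vectors,
# assuming query/key/value matrices have already been derived from them.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # how relevant each word is to each other
    weights = softmax(scores)        # attention scores that sum to 1 per word
    return weights @ V, weights      # context-aware representations

# Three toy word vectors, e.g. for "swing", "the", "bat".
X = np.array([[1.0, 0.0], [0.1, 0.1], [0.9, 0.2]])
out, w = attention(X, X, X)          # self-attention: Q = K = V = X
print(w.round(2))                    # "bat" attends more to "swing" than "the"
```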

Types of Attention Mechanisms:

  • Self-Attention - Self-attention weighs the importance of each word in a sentence based on the context to capture long-range dependencies.

  • Multi-head Attention - Multi-head attention takes self-attention to the next level by splitting the input into multiple "heads". Each head focuses on different aspects of the relationships between words, allowing the model to learn a richer representation of the text.

Let's look at an example, starting with attention and later extending it to differentiate between self and multi-head attention.

In a group conversation at a party, it is common to selectively pay attention to the most relevant speakers to understand the topic being discussed. By filtering out background noise or less important comments, individuals can focus on the key points of the conversation and understand what is being discussed.

Self-attention can be compared to focusing on each person's words in the group conversation and evaluating their relevance in relation to other people's words. This technique enables the model to weigh each speaker's input and combine them to form a more comprehensive understanding of the conversation.

Multi-head attention involves splitting attention into multiple "channels" that simultaneously focus on different aspects of the conversation. For instance, one channel may concentrate on the speakers' emotions, another on the primary topic, and a third on related side topics. Each aspect is processed independently, and the resulting understandings are merged to gain a holistic perspective of the conversation.

These two attention mechanisms work together to give the model a comprehensive understanding of the sentence.

How Do Both Attention Mechanisms Work Together?

Consider the following sentence:
"The boy went to the store to buy some groceries, and he found a discount on his favorite cereal."

  • Attention: "boy", "store", "groceries", and "discount"

    The model pays more attention to relevant words such as "boy", "store", "groceries", and "discount" to grasp the idea that the boy found a discount on groceries at the store.

  • Self-Attention: "boy" and "he" → Same Person

    When using self-attention, the model might weigh the connection between "boy" and "he" recognizing that they refer to the same person. It also identifies the connection between "groceries" and "cereal" as related items within the store.

  • Multi-Head Attention: Multiple channels

    • Character ("boy")

    • Action ("went to the store," "found a discount")

    • Things involved ("groceries," "cereal")

Multi-head attention is like having multiple self-attention mechanisms working simultaneously. It allows the model to split its focus into multiple channels where one channel might focus on the main character ("boy"), another on what he's doing ("went to the store," "found a discount"), and a third on the things involved ("groceries," "cereal").
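
A shape-only sketch of this "multiple channels" idea: the embedding is sliced into heads, each head would run its own self-attention on its slice, and the results are merged back together. Weights and the attention step itself are omitted here for brevity:

```python
# A minimal sketch of splitting an embedding across attention heads.
import numpy as np

seq_len, d_model, num_heads = 5, 8, 2
head_dim = d_model // num_heads

x = np.random.rand(seq_len, d_model)              # one embedding per word

# Split the last dimension into num_heads independent slices ("channels").
heads = x.reshape(seq_len, num_heads, head_dim).transpose(1, 0, 2)
print(heads.shape)   # (2, 5, 4): two heads, each seeing 4-dim vectors

# ...each head applies its own attention to its slice here...

# Merge the per-head outputs back into one representation per word.
merged = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
print(merged.shape)  # (5, 8)
```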

Training and Inference: How Transformers Learn and Perform

Now that we’ve explored the core steps of a transformer —
Tokenization → Vector Embeddings → Positional Encoding → Attention

It’s time to see how these steps are used during the two main phases of a transformer’s lifecycle: training and inference.

Training Phase: Learning from Data

In the training phase, the transformer is exposed to vast amounts of data and learns by example.

  • It runs the same steps we covered — tokenization, embedding, positional encoding, and attention — on the input text.

  • It tries to make predictions (like the next word).

  • It compares those predictions to the correct answers.

  • Then it adjusts its internal parameters (weights) through backpropagation and runs those steps again, nudging its predictions closer to the correct answers. This is learning (a code sketch follows below).

Training is where the model becomes intelligent, step by step, by refining how it understands language.
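
As a rough sketch, one training step might look like the following, assuming PyTorch and a hypothetical model that maps token IDs to next-token logits. Real training loops add batching, learning-rate schedules, and much more:

```python
# A minimal sketch of one next-token training step, assuming PyTorch and a
# hypothetical `model` that maps token IDs to logits over the vocabulary.
import torch
import torch.nn.functional as F

def training_step(model, optimizer, token_ids):
    inputs, targets = token_ids[:-1], token_ids[1:]  # predict each next token
    logits = model(inputs)                           # (seq_len, vocab_size)
    loss = F.cross_entropy(logits, targets)          # compare to correct answers
    optimizer.zero_grad()
    loss.backward()                                  # backpropagation
    optimizer.step()                                 # adjust the weights
    return loss.item()
```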

Inference Phase: Generating Predictions

In inference, the model uses what it learned during training to respond to real-world inputs.

  • It runs the exact same processing pipeline — tokenization, embeddings, positional encoding, attention — but now, no learning happens.

  • Instead, the model predicts, generates, or answers based on the knowledge it has stored in its trained weights.

Inference is the model doing what it was trained to do — whether that’s chatting with you, translating a sentence, or writing a poem.
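
And here is a matching sketch of inference as repeated next-token prediction, reusing the same hypothetical model from the training sketch, now with its weights frozen:

```python
# A minimal sketch of greedy decoding with a hypothetical trained `model`.
import torch

@torch.no_grad()  # no learning happens at inference time
def generate(model, token_ids, steps):
    for _ in range(steps):
        logits = model(token_ids)          # run the same trained pipeline
        next_token = logits[-1].argmax()   # pick the most likely next token
        token_ids = torch.cat([token_ids, next_token.view(1)])
    return token_ids
```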

Same Steps, Two Different Goals

Whether it’s learning from data during training or generating responses during inference,
the transformer always relies on the same foundational steps covered above.


From breaking text into tokens to attending across entire sequences, transformers use a brilliant sequence of steps to understand and generate language — both while learning (training) and while performing (inference).
Understanding this flow not only demystifies modern AI, but puts you right at the heart of how today’s smartest systems work.
