Introduction to GenAI - All The Theory You Need To Know

Ritochit Ghosh
6 min read

Introduction

Generative AI has become an inseparable part of our journey as developers. In this blog, I’ve tried to keep things crisp, simple, and beginner-friendly — whether you’re just curious about what all the hype is about or you’re an amateur dev looking for a go-to reference for the theory behind this fascinating field.

We’ll start from the very basics — what Generative AI is, where it came from, and why the term “GPT” has become so popular. Then, we’ll dive into the transformer architecture that made it possible, break down the training process into digestible steps, and make sense of jargon like tokenization, embeddings, positional encoding, attention, and backpropagation. So let’s get started.

What is Generative AI?

Generative AI is a branch of Artificial Intelligence mainly concerned with generating new content, including text, images, video, audio, and other modalities. But here’s the catch: the model that generates new content is pre-trained on a very large dataset to learn existing patterns. No single company or person introduced generative AI; it is a broad AI research field enriched by the contributions of numerous researchers and universities.

The term was widely popularized and gained public attention mainly due to OpenAI, with their introduction of ChatGPT. The name is itself self-explanatory: GPT stands for Generative Pre-trained Transformer.

Attention is all you need

“Attention Is All You Need”, a research paper published by a team at Google in 2017, had the primary goal of improving Google Translate-style neural machine translation. It laid the foundations of GPTs.

The transformer model became the major building block of modern large language models like GPT. ChatGPT comes directly from the model architecture in that paper.

How do GPTs work?

By now, you must have developed a clear idea about generative AI and GPTs. So let’s concentrate on how they work behind the scenes. Basically, a GPT-like model has two main phases:

  • Training: The model is trained over large datasets.

  • Inferencing: The trained model is used to generate content.

During training, both the inputs and their output labels are given simultaneously, and the transformer itself figures out the patterns. The steps involved in training a model can broadly be classified into:

  1. Prepare the data (Tokenization, Embeddings, Positional encoding)

  2. Process the data (Self-Attention, Multi-head Attention)

  3. Learn from mistakes (Backpropagation)

Now, let’s understand each of these terms in detail:

  1. Tokenization: A machine can’t understand raw, human-readable text. Tokenization is the process of converting input text into smaller units called tokens that the model can understand and process.

    • A token may be a whole word, a part of a word, or even whitespace, depending on the tokenizer.

    • Different large language models implement different tokenization techniques based on their vocabularies.


💡
Try out tokenizing your input text using tiktokenizer
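To make the idea concrete, here is a toy word-level tokenizer in plain Python. Real LLMs use subword schemes such as byte-pair encoding (which libraries like tiktoken implement); the `build_vocab` and `tokenize` helpers below are made up purely for illustration:

```python
# Toy word-level tokenizer. Real LLMs use subword schemes such as
# byte-pair encoding, but the core idea is the same: map text to
# integer IDs drawn from a vocabulary.

def build_vocab(corpus: str) -> dict[str, int]:
    """Assign a unique ID to every whitespace-separated word."""
    vocab = {}
    for word in corpus.split():
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab

def tokenize(text: str, vocab: dict[str, int]) -> list[int]:
    """Convert text into a list of token IDs (words must be in the vocab)."""
    return [vocab[word] for word in text.split()]

vocab = build_vocab("the cat sat on the mat")
print(tokenize("the cat sat", vocab))  # → [0, 1, 2]
```

A real tokenizer also has to handle unknown words, punctuation, and whitespace, which is exactly why production models fall back to subword units rather than whole words.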
  2. Embeddings: Our input string has been converted into an array of tokens, but the individual token IDs still don’t carry any meaning for the model. After tokenization, each token is mapped to a location in a vector space.

    • Embeddings are dense numerical vectors that represent tokens in a way that captures their semantic meaning.

    • They convert discrete tokens into a continuous vector space, where tokens with similar meanings have similar vector representations.

      Example: Words like “king” and “queen” will have vectors that are close together in embedding space, but differ in certain dimensions (like gender).

💡
Check this one out: projector.tensorflow.org
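A tiny illustration of that idea: the 3-dimensional vectors below are made up for demonstration (real embeddings have hundreds or thousands of learned dimensions), and cosine similarity measures how close two vectors point:

```python
import math

# Made-up 3-dimensional embeddings, purely for illustration.
embeddings = {
    "king":   [0.9, 0.8, 0.1],
    "queen":  [0.9, 0.2, 0.1],
    "banana": [0.1, 0.1, 0.9],
}

def cosine_similarity(a, b):
    """Similarity of two vectors: close to 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# "king" and "queen" point in similar directions; "banana" does not.
print(cosine_similarity(embeddings["king"], embeddings["queen"]))
print(cosine_similarity(embeddings["king"], embeddings["banana"]))
```

In a trained model these vectors are learned from data, so the geometry ends up encoding real semantic relationships rather than hand-picked numbers.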
  3. Positional Encoding: Consider these two sentences: “Virat Kohli plays cricket” and “Cricket plays Virat Kohli”. Though they have completely different meanings, they can give rise to the same set of embeddings. To fix such cases, positional encoding is used: information about each token’s position is added to its embedding.

    There are two common ways to do this:

    • Learned Positional Embeddings – positions are trainable parameters (the model learns position info like it learns word meaning).

    • Sinusoidal Positional Embeddings – positions are generated using sine and cosine functions, giving the model a consistent pattern for positions even beyond what it’s trained on.
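The sinusoidal variant can be sketched directly from the idea in the paper: even dimensions get a sine, odd dimensions get a cosine, each at a different frequency, so every position produces a unique pattern:

```python
import math

def sinusoidal_encoding(position: int, d_model: int) -> list[float]:
    """Sinusoidal positional encoding: even dimensions use sine, odd
    dimensions use cosine, each at a different frequency, so every
    position gets a distinct, consistent vector."""
    encoding = []
    for i in range(d_model):
        angle = position / (10000 ** (2 * (i // 2) / d_model))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding

# Different positions produce different vectors; the model adds these
# to the token embeddings so word order is no longer lost.
print(sinusoidal_encoding(0, 4))  # → [0.0, 1.0, 0.0, 1.0]
print(sinusoidal_encoding(1, 4))
```

Because the pattern comes from fixed functions rather than learned parameters, it stays consistent even for positions longer than anything seen during training.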

  4. Self-Attention: Once the tokens with their positional information enter the transformer model, the first core mechanism is Self-Attention, where each word in the sequence looks at every other word to find out which words are most relevant for understanding its meaning in context.

    Example:
    Consider “Cricket chirps” and “Cricket bat”: the word “Cricket” refers to something completely different in each, because of the word that accompanies it.
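A simplified sketch of the mechanism in plain Python (real transformers first project each token into learned query, key, and value vectors; this toy version skips those learned projections):

```python
import math

def softmax(xs):
    """Turn raw scores into attention weights that sum to 1."""
    exps = [math.exp(x - max(xs)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(embeddings):
    """Simplified self-attention: each token's output is a weighted
    average of all tokens, weighted by scaled dot-product similarity."""
    d = len(embeddings[0])
    outputs = []
    for query in embeddings:
        # Score the query against every token (scaled dot product).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in embeddings]
        weights = softmax(scores)
        # Blend all token vectors according to the attention weights.
        outputs.append([sum(w * v[i] for w, v in zip(weights, embeddings))
                        for i in range(d)])
    return outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(self_attention(tokens))
```

Each output row is a context-aware mix of every input vector, which is exactly how “Cricket” can end up represented differently next to “chirps” than next to “bat”.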

  5. Multi-Head Attention: Instead of running self-attention just once, transformers run it in multiple heads in parallel.

    • Each head focuses on different types of relationships between tokens — one head might capture grammatical structure, another might capture thematic meaning, and so on.

    • The outputs from all heads are then concatenated and combined, giving the model a richer and more nuanced understanding of the input.

💡
Think of it like looking at the same sentence through multiple “interpretation lenses” at the same time!
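A minimal sketch of the split-attend-concatenate idea: each embedding is cut into per-head chunks, a simplified (projection-free) attention runs on each chunk independently, and the per-head results are stitched back together:

```python
import math

def softmax(xs):
    exps = [math.exp(x - max(xs)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(vectors):
    """Scaled dot-product attention over a list of vectors (learned
    query/key/value projections omitted for simplicity)."""
    d = len(vectors[0])
    out = []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in vectors]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, vectors))
                    for i in range(d)])
    return out

def multi_head_attention(embeddings, num_heads):
    """Split each embedding into num_heads chunks, run attention on each
    chunk independently, then concatenate the results, so each head can
    specialize on a different slice of the representation."""
    d = len(embeddings[0])
    head_dim = d // num_heads
    heads = []
    for h in range(num_heads):
        chunk = [e[h * head_dim:(h + 1) * head_dim] for e in embeddings]
        heads.append(attention(chunk))
    # Concatenate per-head outputs back into full-width vectors.
    return [sum((head[i] for head in heads), []) for i in range(len(embeddings))]

tokens = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
print(multi_head_attention(tokens, num_heads=2))
```

In a real transformer each head also has its own learned projection matrices, and a final linear layer mixes the concatenated heads; this sketch keeps only the split-and-recombine structure.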
  6. Backpropagation: After processing the data through several transformer layers, the model produces a prediction. If the prediction is wrong, the loss function calculates how far off it was from the correct answer.

    • Backpropagation then works in reverse through the network, adjusting the model’s parameters, including embeddings, attention weights, and other learned values to reduce future errors.

    • This process repeats millions or even billions of times during training until the model achieves acceptable accuracy.
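The whole learning loop can be boiled down to a one-weight example: predict, measure the loss, compute the gradient, and nudge the weight against it. The numbers here are made up for illustration; real models repeat this for billions of parameters, propagating gradients through every layer via the chain rule:

```python
# Minimal illustration of the backpropagation idea with a single
# weight: predict y = w * x, measure squared-error loss, compute the
# gradient of the loss with respect to w, and nudge w against it.

def train(x, y_true, w=0.0, learning_rate=0.1, steps=50):
    for _ in range(steps):
        y_pred = w * x                     # forward pass
        loss = (y_pred - y_true) ** 2      # how wrong were we?
        grad = 2 * (y_pred - y_true) * x   # d(loss)/d(w)
        w -= learning_rate * grad          # adjust weight to reduce loss
    return w

# Learn the mapping x=2 -> y=6 (the true weight is 3).
print(train(x=2.0, y_true=6.0))
```

After a few dozen steps the weight converges to 3.0; a transformer does the equivalent update across all of its embeddings, attention weights, and other parameters at once.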

These are essentially the steps involved in training a transformer model. During inference, the same data-preparation and attention steps are followed, but there are no output labels and no backpropagation; the trained model simply generates output from its learned parameters.

Conclusion

That brings us to the end of this blog — a complete walkthrough of how Generative AI models like GPT function, starting from raw input text to context-aware, human-like output. By now, you should have a much clearer idea about Generative AI and how it learns and generates content. But this is just the beginning — there’s an entire universe of AI advancements, fine-tuning techniques, and multi-modal models waiting to be explored.

In the next part, we’ll discuss practical applications of Generative AI, including how to use APIs, experiment with fine-tuning, and even build your own simple AI-powered apps.

Hope you enjoyed reading this post as much as I enjoyed putting it together. Your feedback means a lot — it helps me improve and bring you more value-packed content.

Stay tuned for the next chapter in our AI journey. Until then, keep exploring, and keep your curiosity alive!


Written by

Ritochit Ghosh

I am a second-year student pursuing a BTech in CSE at Maulana Abul Kalam University of Technology. I am a tech enthusiast and an Open-Source advocate with an interest in Web Development and AI/ML. Here I will share my tech journey and talk about different related topics.