Decoding AI jargons with chai

Before we dive into decoding AI, it's crucial to understand why it matters. In today's world, it's easier than ever to be misled by fake news, deepfake images, or manipulated information. In this fast-paced digital age, rumors can spread like wildfire. As our attention spans shrink, we often accept information at face value instead of investigating further.
If you're already losing focus after these few lines, you might want to stop here — because what follows is a deep, no-nonsense investigation. By the end of this, you'll truly understand how ChatGPT, Gemini, Claude, and DeepSeek work. And who knows — those confusing AI buzzwords might not seem so confusing anymore.
Let’s simplify GPT
GPT stands for Generative Pre-trained Transformer. It's trained on a massive amount of text data, but it doesn't actually "understand" language the way humans do — it understands numbers and patterns. Its strength lies in predicting the next word or token based on the ones that came before.
To explain it simply, imagine you're watching a crime thriller. During the intermission, a detective is trying to figure out who the kingpin is, armed with only a handful of clues pinned to an evidence board.
Surrounded by evidence, notes, and connections, the detective looks for patterns to figure out who might be behind the disappearances. He weighs different possibilities and, through logic and deduction, makes his best guess about who the criminal mastermind is.
In the same way, GPT works by analyzing patterns in text and making its best prediction about what comes next. Just like the detective, it's not always perfect, but it's surprisingly accurate. Hopefully, that gives you a clearer picture of what GPT actually does.
GPT Is Based on the Transformer Architecture
GPT operates based on a few fundamental components: token embeddings, positional encoding, the self-attention mechanism, multi-head attention, and a feed-forward neural network. In addition to this architecture, GPT goes through two major phases:
Training Phase: During training, the process looks like this: Prediction → Loss → Backpropagation → Weight Update → Repeat. This cycle runs millions of times as the model learns patterns from text data.
Inference Phase: When generating text (inference), the flow is: Input Text → Transformer → Logits → Linear Layer → Softmax → Next Token → Repeat until complete
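As a rough sketch of that inference loop (not the real model — `dummy_transformer` here is a made-up stand-in that just produces some logits), the flow above might look like this:

```python
import math

def softmax(logits):
    # Turn raw scores (logits) into probabilities that sum to 1.
    exps = [math.exp(x - max(logits)) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dummy_transformer(tokens):
    # Stand-in for the real network: returns one fake logit per vocabulary
    # entry. A real transformer computes these via embeddings and attention.
    vocab_size = 5
    return [float((sum(tokens) + i) % vocab_size) for i in range(vocab_size)]

def generate(tokens, steps=3):
    # Input Tokens -> Transformer -> Logits -> Softmax -> Next Token -> Repeat
    for _ in range(steps):
        logits = dummy_transformer(tokens)
        probs = softmax(logits)
        next_token = probs.index(max(probs))  # greedy pick; GPT usually samples
        tokens.append(next_token)
    return tokens

print(generate([1, 2]))  # → [1, 2, 1, 0, 0]
```

The key idea is the loop: each freshly generated token is fed back in, and the model predicts again, one token at a time.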
If you're interested in understanding the Transformer architecture in more depth, you can explore the original research paper: "Attention Is All You Need" (NeurIPS 2017).
Tokenization: The First Step in Understanding Language Models
Picture a LEGO house built from many individual bricks. In this analogy, the house represents a sentence, and each brick represents a word. Every word, in turn, is converted into numbers like 72, 101, 108, 108, 111, 44, 32, 87, 111, 114, 108, 100, 33 (the character codes for "Hello, World!").
GPT doesn’t understand words directly — it only understands these numbers, which are called tokens.
Let’s see how this works through the code example below. This code shows a very simple version of how text is converted to numbers (tokens) and then converted back to text — just like what GPT does behind the scenes.
def convert_chars_to_token(all_characters):
    # Convert each character of the string into its ASCII code.
    char_to_ascii = []
    for char in all_characters:
        ascii_value = ord(char)
        char_to_ascii.append(ascii_value)
    return char_to_ascii
What convert_chars_to_token(all_characters) does:
It takes a string (e.g., "Hello").
Goes through each character in the string.
Converts that character into its ASCII number using ord().
Stores the numbers in a list and returns it.
Example:
Input: "Hi"
Output: [72, 105]
def decode_token(ascii_values):
    # Convert each ASCII code back into its character, then join them.
    decoded_chars = []
    for index in ascii_values:
        character = chr(index)
        decoded_chars.append(character)
    decoded_string = ''.join(decoded_chars)
    return decoded_string
What decode_token(ascii_values) does:
Takes a list of ASCII numbers (tokens).
Converts each number back to a character using chr().
Joins all the characters into a single string again.
Example:
Input: [72, 105]
Output: "Hi"
This is a basic simulation of how language models like GPT process text:
Text → Tokens → Process → Tokens → Text
GPT doesn’t work with raw text — it works with numbers (tokens), just like this code does.
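Putting the two functions together (rewritten compactly here) shows the full round trip from text to numbers and back:

```python
def convert_chars_to_token(text):
    # Text -> Tokens: one ASCII code per character.
    return [ord(char) for char in text]

def decode_token(tokens):
    # Tokens -> Text: map each code back to its character.
    return ''.join(chr(t) for t in tokens)

tokens = convert_chars_to_token("Hello")
print(tokens)                # → [72, 101, 108, 108, 111]
print(decode_token(tokens))  # → Hello
```

Real GPT tokenizers work on chunks of text (subwords) rather than single characters, but the principle is the same: text in, numbers out, numbers in, text out.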
Vector Embeddings
Vector embeddings transform tokens into vectors — arrays of numbers that capture their semantic meaning. This allows tokens with similar meanings to be placed closer together in the vector space.
Let’s understand this with a real-world analogy:
Imagine you're searching "nearby ATMs" on Google Maps. Here's what happens:
Your location coordinates are sent to Google.
Based on your position, Google shows you ATMs that are near you.
It doesn't show restaurants, and it doesn’t show ATMs from another city — only those that are relevant and nearby.
Now, think of words (tokens) like "king" and "queen". Even though they are different tokens, their meanings are related — so they appear close together in the vector space, just like nearby places on a map.
In short, vector embeddings create a map of meaning, where tokens with similar context or meaning are grouped together — just like how ATMs are grouped by physical location on a map.
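We can make this "map of meaning" concrete with cosine similarity, the standard way to measure how close two vectors point. Note that the three-number vectors below are invented for illustration; real embeddings have hundreds of dimensions and are learned during training, not written by hand.

```python
import math

def cosine_similarity(a, b):
    # Closer to 1.0 = the vectors point in a more similar direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up toy vectors for illustration only.
king   = [0.9, 0.8, 0.1]
queen  = [0.8, 0.9, 0.1]
banana = [0.1, 0.1, 0.9]

print(cosine_similarity(king, queen))   # high: related meanings
print(cosine_similarity(king, banana))  # low: unrelated meanings
```

Just like nearby ATMs on the map, "king" and "queen" sit close together, while "banana" lands far away.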
Positional Encoding
Let’s say you’re living in a planned city where all the houses look exactly the same — identical in design and color. Now, how do you figure out which house is yours and which one belongs to your friend?
The answer: by looking at the plot number or address.
Even though the houses (like words) look the same, their position gives them unique meaning.
Now consider these two sentences:
"The cat chased the mouse."
"The mouse chased the cat."
They contain the same words, but in a different order — and that completely changes the meaning. Without positional information, a model wouldn’t know who is chasing whom.
So what does positional encoding do?
It helps GPT understand:
What the word is → through token embeddings
Where it is in the sentence → through positional encoding
Positional encoding is like giving each word an exact address in the sentence — just like a GPS needs both the location (coordinates) and the house number to be useful.
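One common way to assign those "addresses" is the sinusoidal scheme from the "Attention Is All You Need" paper: even dimensions use a sine wave, odd dimensions a cosine wave, with wavelengths that grow with the dimension index. A minimal sketch:

```python
import math

def positional_encoding(position, d_model):
    # Sinusoidal positional encoding: each position gets a unique
    # vector of sin/cos values that the model can learn to read.
    encoding = []
    for i in range(d_model):
        angle = position / (10000 ** (2 * (i // 2) / d_model))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding

print(positional_encoding(0, 4))  # → [0.0, 1.0, 0.0, 1.0]
print(positional_encoding(1, 4))  # a different "address" for position 1
```

These vectors are added to the token embeddings, so each token carries both what it is and where it is.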
Self-Attention Mechanism
Imagine you’re sitting at a dinner table with 10 people. Someone says a sentence like:
“She poured water into the glass because it was empty.”
Now you’re trying to figure out — what does "it" refer to? The glass or the water?
Your brain quickly looks back and considers both possibilities in the sentence and chooses the most likely one.
That’s what self-attention does:
It allows each word (like “it”) to look at and focus on other words in the sentence to understand context.
Every word gets a chance to “pay attention” to all other words in the input — deciding which ones are important for its meaning.
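The arithmetic behind that "paying attention" is scaled dot-product attention. Here is a pure-Python sketch on three toy token vectors (the numbers are made up, and Q, K, V are all set to the input for simplicity; GPT learns separate projections for each):

```python
import math

def softmax(scores):
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    # Each token's query is compared against every token's key;
    # the resulting weights decide how much of each token's value
    # flows into that token's output.
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# Three toy "tokens", each a 2-number vector (invented for illustration).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(self_attention(x, x, x))
```

Each output row is a blend of all the value vectors, weighted by how relevant each other token is — exactly the "looking back at the sentence" behaviour described above.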
Multi-Head Attention
Let’s say you're analyzing a painting.
One lens helps you see colors, another focuses on shapes, another highlights textures. Each lens gives you a different perspective — and when you put them all together, you understand the painting much better.
That’s what multi-head attention does:
It allows the model to look at the same sentence in multiple ways — focusing on different relationships and patterns in parallel.
Each “head” attends to different aspects of the sentence, and then they’re all combined to form a richer understanding.
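Mechanically, the "lenses" come from splitting each embedding into equal chunks, letting each head process its chunk independently, then concatenating the results. A simplified sketch (real heads also apply learned projections, omitted here):

```python
def split_into_heads(vector, num_heads):
    # Divide one embedding into equal chunks, one per head.
    head_dim = len(vector) // num_heads
    return [vector[i * head_dim:(i + 1) * head_dim] for i in range(num_heads)]

def combine_heads(heads):
    # Concatenate the per-head results back into one vector.
    return [value for head in heads for value in head]

embedding = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
heads = split_into_heads(embedding, num_heads=2)
# Each head would run its own attention here, seeing a different "lens".
print(heads)
print(combine_heads(heads))
```

Splitting and recombining is what lets the heads work in parallel on different aspects of the same sentence.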
Feed-Forward Neural Network
Imagine you walk into a jewelry store and say, “I want something elegant, maybe with a touch of gold, but not too flashy.”
The salesperson immediately scans the entire collection.
They:
Pay more attention to items that match your description.
Ignore the ones that don’t fit.
Compare details like color, shine, design, and style.
Even look at how each piece relates to the others in the collection.
Now that the salesperson has picked a few great options,
you meet the jeweler/stylist who:
Refines the choices
Maybe adjusts the length of a chain
Suggests a better matching pendant
Polishes the piece for final presentation
That's what the feed-forward network does: it takes the focused, context-aware information produced by attention and transforms it into something complete and ready for the next step (like the output layer or the next layer in the model).
It's just like refining raw understanding into a final, useful form.
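In code, the feed-forward step is just two linear layers with a ReLU in between: expand, filter out what doesn't help, then compress back down. The tiny weights below are invented purely for illustration:

```python
def feed_forward(x, w1, b1, w2, b2):
    # Layer 1 + ReLU: expand the input and zero out negative signals.
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, row)) + b)
              for row, b in zip(w1, b1)]
    # Layer 2: compress the filtered signals back to the output size.
    return [sum(hi * w for hi, w in zip(hidden, row)) + b
            for row, b in zip(w2, b2)]

# Made-up weights: 2 inputs -> 3 hidden units -> 2 outputs.
w1 = [[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]]  # one row per hidden unit
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.5, -0.5], [0.0, 1.0, 1.0]]     # one row per output unit
b2 = [0.0, 0.0]
print(feed_forward([1.0, 2.0], w1, b1, w2, b2))  # → [0.25, 2.5]
```

In a real transformer this block is applied to every token's vector independently, right after the attention step.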
Conclusion
The details of the training and inference phases are explained in the article I shared earlier.
To wrap things up — the analogies provided were simply meant to make these concepts easier to understand.
Now that you’ve got a clearer picture, it’s important to remember:
GPT’s job is to predict the next token, not to replace you.
Instead, you can use GPT as a powerful tool to boost your productivity and efficiency.
#chaicode
Written by Himandri Mallick