#1 - Demystifying AI

Introduction

The word "AI" seems terrifying to some and mysterious to others at first glance. Movies like The Terminator series have implanted fear in the public, fostering skepticism or even hatred toward AI advancements. But fret not—AI is not as enigmatic as it seems. In fact:

AI => Data + Algorithm

That’s it!

AI is not here to "take away jobs" or "lead the world into destruction." This perception stems, in my opinion, from the mystery shrouding AI concepts. Therefore, it’s crucial to demystify AI to reduce fear and harness its potential for building useful solutions.

Data Science vs Generative AI

First, let’s distinguish data science from generative AI.

  • Data scientists are like airplane manufacturers—they build AI models by studying their inner workings and developing algorithms.

  • Generative AI developers are like pilots—they use pre-trained models without worrying about how they were trained.

A GPT model undergoes two phases:

  1. Training phase (where data scientists build the model).

  2. Inference phase (where generative AI developers use the model).

Just as pilots fly planes built by engineers, effectively using AI models built by data scientists is a skill in itself.

What jargon are we going to demystify?

This article covers the following terms:

  • Transformers

  • Knowledge cutoff

  • Input query

  • Encoder & Decoder

  • Vectors & Embedding

  • Positional Encoding

  • Semantic meaning

  • Self & Multi-headed Attention

  • Softmax & Temperature

  • Tokenization

  • Vocab size

These concepts provide a solid foundation for your Generative AI journey.

Transformers

Did you know GPT in "ChatGPT" stands for Generative Pre-trained Transformer?

A transformer is an AI model that predicts the next word (or "token", as we’ll see later) in a user’s input. Think of it as a block of code that mathematically "guesses" the next word based on its training data. Examples include the models behind ChatGPT, Gemini, Llama 2, and others.

Google engineers introduced the concept of transformers back in 2017 via the paper "Attention Is All You Need" to enhance Google Translate. The diagram above (from the paper) illustrates how transformers work.

Knowledge Cutoff

If you query ChatGPT -

“What is the weather in Delhi today?”

it will return -

I can’t provide real weather updates.


This happens due to knowledge cutoff—the date up to which the AI’s training data extends. If an event occurs after this date, the model does not know about it.

Why? GPT models are trained using a method called ‘backpropagation’, which repeatedly adjusts the model’s internal weights until its outputs match the expected outputs in the training data. Backpropagation is a costly and time-consuming affair, so the model cannot be continuously retrained on live data.

But then how does ChatGPT sometimes fetch live data? Recent updates allow models to search externally (e.g., via API calls) and inject real-time context into responses - this is called context injection. Custom context injection is a deeper topic (to be covered later).
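
To make the idea concrete, here is a minimal sketch of context injection. The get_weather() helper is a hypothetical stand-in for a real external API call; the key point is how its result is placed into the prompt before the model ever sees the question.

def get_weather(city: str) -> str:
    # Hypothetical placeholder for an external API call (e.g. an HTTP request
    # to a weather service); the returned string is dummy data for illustration.
    return f"{city}: 31 degrees C, clear skies"

def build_prompt(user_question: str, city: str) -> str:
    live_context = get_weather(city)
    # The freshly fetched data is prepended as context, so the model answers
    # from this injected information rather than its (outdated) training data.
    return (
        f"Context: {live_context}\n"
        f"Question: {user_question}\n"
        f"Answer using only the context above."
    )

print(build_prompt("What is the weather in Delhi today?", "Delhi"))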

Tokenization

Your input (or query) is converted into tokens through tokenization, where text is split into smaller, AI-processable pieces.

Example:

  • "Hello World"["Hello", "World"]

Since GPT models work with numbers, tokens are mapped to numerical IDs. Tools like OpenAI’s tiktoken tokenizer (or the Tiktokenizer web app) can be used to visualise this process (see diagram above).

Let’s create a simple tokenizer class in Python with just two methods - encode (to encode user input) and decode (to decode the encoded tokens) - to drive this learning home:

class Tokenizer:
    def __init__(self):
        # characters that the tokenizer can encode and decode
        self.alphabets = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ '
        # map each character to a number for the encoder
        self.char_to_index = {char: index + 1 for index, char in enumerate(self.alphabets)}
        # map each number back to its character for the decoder
        self.index_to_char = {index + 1: char for index, char in enumerate(self.alphabets)}

    def encode(self, text):
        tokens = []
        # look up each character in the encoder mapping
        for char in text:
            if char in self.char_to_index:
                # collect the number for each character
                tokens.append(self.char_to_index[char])
        # return the encoded numbers
        return tokens

    def decode(self, tokens):
        text = ""
        # look up each number in the decoder mapping
        for token in tokens:
            if token in self.index_to_char:
                # collect the character for each number
                text += self.index_to_char[token]
        # return the decoded characters
        return text

tokenizer = Tokenizer()
text = "Hello World"
encoded_text = tokenizer.encode(text)
print(f"Encoded: {encoded_text}")  # Output: Encoded: [34, 5, 12, 12, 15, 53, 49, 15, 18, 12, 4]
decoded_text = tokenizer.decode(encoded_text)
print(f"Decoded: {decoded_text}")  # Output: Decoded: Hello World

Vocab Size

Vocabulary size (vocab size) is the total number of unique tokens a model recognizes.

  • ChatGPT’s vocab size: 200,019 (as of writing).

  • Each token is mapped to a vector (a numerical array with a fixed number of elements, typically a few thousand), called an embedding (explained later).

Encoder & Decoder

Remember the Caesar Cipher? It shifts each letter by a fixed, pre-determined number called the ‘shift’ (e.g., with a shift of 2, A→C and M→O). Julius Caesar used this to encode messages and communicate securely with his generals.


Let’s use the Caesar Cipher analogy to understand encoder and decoder.

  • Encoder: Converts input text into numerical vectors (like encoding a message).

  • Decoder: Converts vectors back into human-readable text (like decoding a cipher).

Each company uses its own "shift" (encoding method) for its models.
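
To make the analogy concrete, here is a minimal Caesar Cipher sketch in Python (the shift of 2 is an arbitrary choice for illustration):

class CaesarCipher:
    def __init__(self, shift=2):
        self.shift = shift

    def encode(self, text):
        # shift each letter forward by 'shift' places (A -> C when shift = 2)
        result = ""
        for char in text:
            if char.isalpha():
                base = ord('A') if char.isupper() else ord('a')
                result += chr((ord(char) - base + self.shift) % 26 + base)
            else:
                result += char
        return result

    def decode(self, text):
        # reverse the shift to recover the original message
        result = ""
        for char in text:
            if char.isalpha():
                base = ord('A') if char.isupper() else ord('a')
                result += chr((ord(char) - base - self.shift) % 26 + base)
            else:
                result += char
        return result

cipher = CaesarCipher(shift=2)
secret = cipher.encode("ATTACK AT DAWN")
print(secret)                 # CVVCEM CV FCYP
print(cipher.decode(secret))  # ATTACK AT DAWN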

Vector Embedding

A vector is a mathematical concept: a numerical representation of data (text, images, etc.).


Embedding is the process of capturing the semantic meaning of the tokenized user input as vectors.

Example:

  • Input: "King is to Man as Queen is to ____."

  • The AI represents "King," "Man," and "Queen" as vectors in a high-dimensional space, visualised here in 3-D (see diagram).

  • By doing arithmetic on these vectors and measuring distances, it predicts the missing word: "Woman."

This is how embeddings capture relationships between words.
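
Here is a minimal sketch of that idea using made-up 3-D vectors; real embeddings have thousands of dimensions, and these numbers are purely illustrative:

import numpy as np

# Toy 3-D word vectors, invented for illustration only
words = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.2, 0.1]),
    "queen": np.array([0.9, 0.8, 0.9]),
    "woman": np.array([0.5, 0.2, 0.9]),
    "apple": np.array([0.1, 0.9, 0.5]),
}

# "King is to Man as Queen is to ___": solve queen - king + man
target = words["queen"] - words["king"] + words["man"]

# Pick the closest word (smallest Euclidean distance), excluding "queen" itself
best = min(
    (w for w in words if w != "queen"),
    key=lambda w: np.linalg.norm(words[w] - target),
)
print(best)  # woman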

Positional Encoding

A word can carry a different meaning depending on its position in the sentence. Positional encoding ensures the model understands word order by adding position information to each token’s representation.


A word maps to the same token no matter how many times it appears in a sentence - tokens are constants - but its meaning can change with word order. Positional encoding therefore adds a position-dependent signal to each token’s embedding, so the same word at different positions ends up with a different representation and can be processed correctly.
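
As a concrete illustration, here is a minimal sketch of the sinusoidal positional encoding described in the "Attention Is All You Need" paper, where each position gets a unique pattern of sine and cosine values that is added to the token’s embedding:

import numpy as np

def positional_encoding(num_positions, dim):
    # pe[pos, 2i]   = sin(pos / 10000^(2i/dim))
    # pe[pos, 2i+1] = cos(pos / 10000^(2i/dim))
    positions = np.arange(num_positions)[:, np.newaxis]   # (num_positions, 1)
    div_terms = 10000 ** (np.arange(0, dim, 2) / dim)     # (dim/2,)
    pe = np.zeros((num_positions, dim))
    pe[:, 0::2] = np.sin(positions / div_terms)
    pe[:, 1::2] = np.cos(positions / div_terms)
    return pe

# A 4-token sentence with 8-dimensional embeddings (tiny sizes for readability)
pe = positional_encoding(num_positions=4, dim=8)
print(pe.shape)  # (4, 8)
# Adding pe to the token embeddings means the same token at position 0 and
# position 3 ends up with different final representations.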

Self & Multi-headed attention mechanism

Self-attention is an important concept in natural language processing that allows a model to weigh different parts of a sentence or text while processing it. It is the mechanism by which the position-encoded, vector-embedded tokens ‘talk to each other’ and re-adjust their embeddings based on the context supplied by the surrounding tokens.


Self-attention lets a model focus on relevant parts of a sentence.

  • Single-head attention: One perspective.

  • Multi-head attention: Multiple perspectives analyzed in parallel.

This mechanism helps transformers refine embeddings based on context.
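
Here is a minimal sketch of a single attention head (scaled dot-product attention) on toy numbers; the weight matrices are random placeholders, and multi-head attention simply runs several such heads in parallel and combines their outputs:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    # x: (num_tokens, dim) position-encoded token embeddings
    q, k, v = x @ w_q, x @ w_k, x @ w_v       # queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])   # how much each token attends to every other token
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v                        # context-adjusted embeddings

rng = np.random.default_rng(0)
num_tokens, dim = 4, 8
x = rng.normal(size=(num_tokens, dim))        # toy embeddings for a 4-token sentence
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape) # (4, 8)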

Linear & Soft-Max Functions

The linear function produces a raw score for every possible output token, and the softmax function converts those scores into probabilities so that one output can be picked.

  • Linear function: generates a score for each possible output.

  • Softmax function: converts the scores into probabilities, from which the final output is picked.

Many models expose this step as a "Temperature" or "Creativity" setting:

  • High temperature = More randomness (creative outputs).

  • Low temperature = More predictable outputs.

Let’s use an example to understand this better:

INPUT QUERY -

Hello

Possible OUTPUTS -

OUTPUT | PROBABILITY
Hi, how can I help you? | 0.95
Hello there, what brings you here today? | 0.03
Top of the morning to you! It’s a beautiful day, isn’t it?! So what can I do for you today? | 0.02

With a low temperature, the model almost always picks the highest-probability response (95% here). Raising the temperature flattens these probabilities, so lower-probability, more creative responses get picked more often.
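
Here is a minimal sketch of how temperature reshapes those probabilities before a response is sampled; the logits below are made-up scores chosen to roughly reproduce the probabilities in the table above:

import numpy as np

def softmax_with_temperature(logits, temperature):
    # dividing by the temperature before softmax sharpens (t < 1) or flattens (t > 1) the distribution
    scaled = np.array(logits) / temperature
    e = np.exp(scaled - np.max(scaled))
    return e / e.sum()

# Made-up logits for the three candidate responses in the table above
logits = [5.0, 1.5, 1.1]

for t in (0.5, 1.0, 2.0):
    print(t, softmax_with_temperature(logits, t).round(3))
# Low temperature (0.5) pushes nearly all probability onto the top response;
# high temperature (2.0) spreads probability out, so the more creative replies
# get sampled more often.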

Conclusion

We’ve explored how GPT models process inputs through:

  1. Tokenization

  2. Vector Embedding

  3. Positional Encoding

  4. Self-Attention

  5. Output Generation

By demystifying these terms, I hope AI feels more approachable and less intimidating.

