AI Jargon: What all the fuss is about


The GPT in that ChatGPT
The hype over the past few years in the field of AI has left even the common man confused about what all the fuss is about. It feels like every day there is a new term thrown into the world that we are all supposed to know and catch up with, so having some commonly used AI terms simplified would save my brain some buffering.
One such term is GPT, short for Generative Pre-trained Transformer. The fancy words hide a fairly simple algorithm built by mashing together multiple smaller algorithms and snippets of code.
The following is the diagram of the Transformer model architecture outlined in the Google research paper Attention Is All You Need - https://arxiv.org/pdf/1706.03762
Tokenization
Tokenization is a fundamental step in Natural Language Processing; it is basically the process of splitting text into smaller units called tokens.
The underlying idea of mapping text to numbers or symbols is nothing new; cryptography has used similar substitutions for sending messages secretly for decades, if not centuries.
I have created a script that shows tokenization using OpenAI's tiktoken library, along with a custom script that does a similar encoding using Unicode code points:
https://github.com/ashwinpalve/tokentest/tree/main
Here is a snippet from the Unicode version, a simplified stand-in for what the tiktoken library does for OpenAI (tiktoken uses byte-pair encoding rather than raw code points):
import unicodedata

def hindi_tokenizer(text):
    """Basic tokenizer that splits text into Unicode code points."""
    normalized = unicodedata.normalize('NFC', text)
    return [ord(char) for char in normalized]

hindi_text = "यह एक हिंदी भाषा का उदाहरण है।"  # Can be updated to take Hindi text input from the user
hindi_tokens = hindi_tokenizer(hindi_text)
print("\nHindi text tokens\n", hindi_tokens)
The script has examples for English and Hindi but can also be used for other languages like Japanese, Marathi, Tamil, etc.
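For comparison, here is a minimal sketch of tokenizing the same text with the tiktoken library itself (assuming it is installed; cl100k_base is one of the encodings it ships with). The token IDs will differ from the raw Unicode code points above because tiktoken applies byte-pair encoding.
import tiktoken

# Load one of the byte-pair encodings shipped with tiktoken
enc = tiktoken.get_encoding("cl100k_base")

text = "यह एक हिंदी भाषा का उदाहरण है।"
tokens = enc.encode(text)      # text -> list of integer token IDs
decoded = enc.decode(tokens)   # token IDs -> original text

print("tiktoken IDs:", tokens)
print("Decoded back:", decoded)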
Positional Encoding and Vector Embedding
So far, with tokenization, things worked fine for plain encoding and decoding, but to capture the semantic meaning of words and their relationships with each other, those meanings have to be mapped as well. Computers do this in a high-dimensional space using numbers with magnitude and direction (vectors).
This mapping of the semantic meaning of words into vectors is called a vector embedding.
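To make the idea concrete, here is a minimal sketch of an embedding lookup with made-up numbers: each token ID indexes a row of a matrix, and that row is the token's vector. In a real model the matrix values are learned during training rather than random.
import numpy as np

vocab_size = 1000   # toy vocabulary size (assumption for illustration)
d_model = 8         # embedding dimension, kept tiny for illustration

# Each row is the vector for one token ID; real models learn these values.
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, d_model))

token_ids = [17, 42, 7]                   # pretend output of a tokenizer
embeddings = embedding_matrix[token_ids]  # look up one vector per token

print(embeddings.shape)   # (3, 8): three tokens, each an 8-dimensional vector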
Positional encoding then adjusts the tokens for the positional context of words, since every language has a flow or sequence of words, and that sequence is an important part of its meaning.
So even if two tokens start out with the same vector embedding, after positional encoding those embeddings will differ. In short, the position of each word in the sentence has been added to its meaning.
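The original paper uses fixed sinusoidal functions for positional encoding. A rough sketch of that scheme, with toy dimensions, shows how a position-dependent vector gets added to each token's embedding:
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(same angle)."""
    positions = np.arange(seq_len)[:, np.newaxis]     # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]    # (1, d_model/2)
    angles = positions / np.power(10000, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

seq_len, d_model = 3, 8
token_embeddings = np.random.default_rng(0).normal(size=(seq_len, d_model))

# The same token at a different position now ends up with a different final vector.
encoded = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
print(encoded.shape)   # (3, 8)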
Self Attention
Self-attention: this is the attention that the Google folks go on and on about in their paper Attention Is All You Need. Well, let's see what this is now.
Self-attention is a mechanism (it seems) for understanding relationships between different parts of the same input.
Example:
In the sentence "The cat sat on the mat", when focusing on "cat":
- The model might pay more attention to "sat" and "mat" than "the" or "on," as they are more semantically relevant to "cat."
Simple enough, but complexity arises when the focus has to shift between different subjects and objects within a sentence, each of which adds to the semantic meaning.
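Under the hood this boils down to a few matrix operations. Here is a rough numpy sketch of the scaled dot-product attention described in the paper, with tiny made-up dimensions and random weights standing in for the learned query, key, and value projections:
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product attention over a sequence of token vectors X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # how strongly each token relates to the others
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model = 6, 8                  # e.g. the six tokens of "The cat sat on the mat"
X = rng.normal(size=(seq_len, d_model))  # token embeddings plus positional encoding
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

output, weights = self_attention(X, Wq, Wk, Wv)
print(weights[1].round(2))   # how much the second token ("cat") attends to every other token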
Multi-Head Attention
Multi-head attention is an extension of self-attention that improves its effectiveness by allowing the model to focus on different aspects of the input simultaneously.
Instead of calculating attention once, multi-head attention splits the input embeddings into smaller parts and applies self-attention multiple times (in parallel).
Each head focuses on different relationships or features within the input.
At this point, the tokens are exposed to each other and start adjusting their vector embeddings to reflect their meaning in context.
Multi-head attention allows the model to capture diverse relationships within the data.
For example, in a sentence, one head might focus on grammatical structure while another focuses on semantic meaning.
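A rough sketch of the splitting idea: the embedding dimension is divided across heads, each head runs the same attention computation on its own slice, and the results are concatenated back together. (A real implementation also applies learned projection matrices per head and a final output projection, which are omitted here.)
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, num_heads):
    """Split the embedding dimension into heads, attend per head, then concatenate."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    outputs = []
    for h in range(num_heads):
        # Each head only sees its own slice of the embedding dimension.
        Xh = X[:, h * d_head:(h + 1) * d_head]
        scores = Xh @ Xh.T / np.sqrt(d_head)
        outputs.append(softmax(scores) @ Xh)
    return np.concatenate(outputs, axis=-1)   # back to (seq_len, d_model)

X = np.random.default_rng(0).normal(size=(6, 8))
print(multi_head_attention(X, num_heads=2).shape)   # (6, 8)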
Toggling with Temperature
Most of the models available online expose a temperature setting, a toggle that lets the model converge on the most likely output or diverge from it, depending on the context and the amount of creative chaos needed.
The Linear and Softmax blocks in the diagram are what basically do this: they turn the model's output into a probability distribution over possible next tokens, and the temperature reshapes that distribution before sampling. In a nutshell, it controls how creative, or sometimes how noisy, the model can get. This feature can be most clearly observed in image generation models.
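A minimal sketch of what temperature does at that final step: the model's raw scores (logits, made up here for illustration) are divided by the temperature before the softmax, which sharpens or flattens the probability distribution the next token is sampled from.
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Lower temperature -> sharper, more predictable; higher -> flatter, more 'creative'."""
    scaled = np.array(logits) / temperature
    scaled = scaled - scaled.max()   # subtract max for numerical stability
    e = np.exp(scaled)
    return e / e.sum()

logits = [2.0, 1.0, 0.5, 0.1]        # made-up scores for four candidate tokens

for t in (0.2, 1.0, 2.0):
    print(f"temperature={t}:", softmax_with_temperature(logits, t).round(3))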
Hallucinations: Attention is clearly NOT all you need
After unpacking these complex words, I can see that it's just mathematics nested within mathematics, trying to capture the diversity of language and meaning.
I am sure these models will keep getting better by the day, but even with these simple experiments I could see why hallucinations occur in them.
Hallucinations in AI models occur when the system generates outputs that are factually incorrect or nonsensical, often due to gaps in its training data or limitations in its architecture.
Did you know Google's AI summary tool suggested that astronauts had met and played with cats on the moon during the Apollo 11 mission?
ChatGPT also landed in hot water when it provided made-up legal citations to a New York attorney for a court case. The attorney was representing a client in an injury case and decided to rely on the AI's responses, which turned out to be a huge mistake. He submitted a GPT-written brief to the court, including citations and quotes from several supposed legal cases.
However, upon review, the judge discovered that these citations were entirely fictitious: the cases and quotes did not exist in any legal database.
Vector embeddings, self-attention, and multi-head attention mechanisms play significant roles in both contributing to and mitigating hallucinations.