Understanding GenAI: A Quick Overview

Table of Contents
- 1. Why GenAI?
- 2. Researchers, ML Engineers & Developers — Who Does What?
- 3. GPT — What’s With the Fancy Name?
- 4. “Attention Is All You Need”
- 5. Tokens & Tokenization
- 6. The Vocabulary — The AI’s Secret Dictionary
- 7. Vector Embeddings
- 8. Positional Encoding
- 9. Codes

This blog is basically a TL;DR of GenAI.
1. Why GenAI?
Simple: it’s a trending technology, so learning it gives you an edge.
The core idea behind GenAI is simple: predicting the next word.
It’s a technology that can create things like stories, pictures, or answers to questions, just by learning from examples.
2. Researchers, ML Engineers & Developers — Who Does What?
Researchers and ML engineers work on the core of GenAI, which involves a lot of math 🥲. They build the models and handle all the mathematical operations needed to make them work.
But developers..? We don’t really need deep mathematical knowledge to build applications on top of these models. We can simply use the model’s functionality without worrying about the complex math behind it.
So no worries: as a developer, you can keep reading this blog 👍.
3. GPT — What’s With the Fancy Name?
GPT — (Generative Pre‑trained Transformer)
They basically named it after exactly what it is… kinda like naming a car brand just “Car.”
Generative — It creates new data, not just gives you resource links like Google.
Pre‑trained — Already trained on a huge amount of data (text data, in this case).
Transformer — The underlying architecture. It predicts the next token, appends it to the input, and feeds the whole sequence back in to predict the next one (see the sketch below).
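In other words, generation is just a loop: look at all the tokens so far, predict the next one, append it, and repeat. Here is a rough sketch of that loop (predict_next_token is a made-up placeholder standing in for the model, not a real library call):

# A rough sketch of autoregressive generation.
# predict_next_token() is a hypothetical placeholder for the model itself.
def generate(prompt_tokens, max_new_tokens=20):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = predict_next_token(tokens)  # guess the most likely next token
        tokens.append(next_token)                # feed it back in with everything so far
    return tokens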
4. “Attention Is All You Need”
If you’re curious about the nitty-gritty details and the textbook-style definitions 🫡, check out the “Attention Is All You Need” paper.
Also, take a look at the Transformer architecture diagram from that paper.
5. Tokens & Tokenization
Tokens are the tiny building blocks of text. They can be words, parts of words, or punctuation.
Tokenization is the process of splitting text into tokens. It helps language models understand and process input efficiently.
For a better feel for this, visit the Tiktokenizer website, where you can see the tokenization process interactively.
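To make this concrete, here’s a tiny sketch using the tiktoken library (the full example lives in the Codes section below); notice that tokens don’t always line up neatly with whole words:

import tiktoken

# Tokenize a sentence and show which piece of text each token ID maps back to
enc = tiktoken.encoding_for_model("gpt-4o")
for token_id in enc.encode("Tokenization is fun"):
    print(token_id, repr(enc.decode([token_id])))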
6. The Vocabulary — The AI’s Secret Dictionary
The process of tokenization isn’t random, btw ☝️. The way words are split is based on a predefined set of tokens, generally called that model’s vocabulary.
The vocabulary of an LLM refers to the complete set of unique tokens that the model is trained to recognize and process.
Each unique token in the vocabulary is assigned a unique numerical identifier (an index).
These tokens are then converted into numerical representations called embeddings (more on that in section 7).
Examples of vocabulary sizes in different LLMs (you can check these yourself, as shown after the list):
Mistral/Llama: Around 32,000 tokens.
GPT-4: Around 100,000 tokens.
Google Gemma: Around 256,000 tokens.
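If you want to check a vocabulary size yourself, tiktoken (used again in the Codes section) exposes it directly. A small sketch for GPT-4’s tokenizer:

import tiktoken

# Ask tiktoken which tokenizer GPT-4 uses and how many unique tokens it knows
enc = tiktoken.encoding_for_model("gpt-4")
print(enc.name, enc.n_vocab)  # roughly 100,000 tokens for GPT-4's tokenizer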
7. Vector Embeddings
The numerical representation of tokens is called a vector embedding. You can check the Tiktokenizer output mentioned above to see the numbers each token gets.
Have a look at an embedding visualizer; you’ll see how words are related to each other in a multi-dimensional space.
By default, the length of the embedding vector is 1536 for text-embedding-3-small or 3072 for text-embedding-3-large, so the dimensions range from 1536 to 3072.
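A common way to measure how related two words are is cosine similarity: the closer two embedding vectors point in the same direction, the more related the words usually are. Here’s a minimal sketch with made-up 3-number vectors standing in for real 1536-dimensional embeddings:

import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Tiny made-up vectors just to show the idea; real embeddings have 1536+ dimensions
dog = [0.9, 0.1, 0.3]
puppy = [0.85, 0.15, 0.35]
car = [0.1, 0.9, 0.2]

print(cosine_similarity(dog, puppy))  # close to 1.0 -> related
print(cosine_similarity(dog, car))    # much lower -> not very related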
8. Positional Encoding
‘Dog chases cat’ ≠ ‘Cat chases dog’ — This makes perfect sense to us, because the meaning completely changes when we switch the positions of the words.
Similarly, for a model, embeddings alone aren’t enough to capture the full meaning. Along with the embeddings, we also need to provide the positions of the tokens — so it knows who’s chasing whom.
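One classic way to do this, from the “Attention Is All You Need” paper, is sinusoidal positional encoding: every position gets its own pattern of sine and cosine values, which gets added to the token’s embedding. A minimal sketch:

import math

def positional_encoding(position, dim=8):
    # Each position gets a unique pattern of sine/cosine values across the dimensions
    encoding = []
    for i in range(0, dim, 2):
        angle = position / (10000 ** (i / dim))
        encoding.append(math.sin(angle))
        encoding.append(math.cos(angle))
    return encoding

# The same word at position 0 vs position 2 gets different encodings,
# so the model can tell who is chasing whom
print(positional_encoding(0))
print(positional_encoding(2))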
9. Codes
Tokens
import tiktoken

# Pick the tokenizer that GPT-4o uses
enc = tiktoken.encoding_for_model("gpt-4o")

text = "Hello, I am Piyush Garg"

# Encode: text -> list of token IDs
tokens = enc.encode(text)
print("Tokens:", tokens)
#----------------------------------------
# Decode: token IDs -> text (these are the IDs produced above)
tokens = [13225, 11, 357, 939, 398, 3403, 1776, 170676]
decoded = enc.decode(tokens)
print("Decoded Text:", decoded)
#----------------------------------------
Vector-Embeddings
from dotenv import load_dotenv
from openai import OpenAI

# Loads OPENAI_API_KEY from a .env file
load_dotenv()
client = OpenAI()

text = "dog chases cat"

# Ask OpenAI to turn the text into an embedding vector
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=text
)

print("Vector Embeddings:", response)
print(len(response.data[0].embedding))  # vector length
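Running this should print the full embeddings response and then 1536, the default vector length for text-embedding-3-small mentioned in section 7.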