Decoding AI Jargons with a Cup of Chai


Introduction
Let me ask you a question first — have you ever wondered how Spotify knows and suggests the next song you might like? How do platforms like Netflix, YouTube, and Instagram curate and show you content based on your preferences?
In your mind, you might be saying, “Well, Shubham, they are of course using AI, and I have seen it in The Social Dilemma (a Netflix documentary).” And I would say, yes, you are right, but you might be shocked to know that despite the advanced capabilities of AI, these systems do not understand the real world like we do. They do not understand the real nuance of a song, video, or news article. So the question you might be asking is: how do they understand us? How do they answer my mathematical or coding questions?
So let’s grab a cup of tea or coffee (or whichever drink you like) and explore the answer to the question.
Embeddings & Tokenization
To quench your thirst of curiosity: the magic behind this capability involves a blend of algorithms, AI models and a huge amount of data. Yeah, and that’s it. 😄
But a large portion of that answer involves “Embeddings”. Now the question is: what are these embeddings? We will come to that. For now, think of it like this: when we present a question to an AI, it first needs to translate it into a format it can understand. So we can think of embeddings as the language that AI understands.
From this point on, we are going to use LLMs (Large Language Models) as the reference for exploring our main question. Rest assured, other AI models do something broadly similar. Now, if you have not been living in the Himalayas or deep under the ocean for the last 2 years, you must have heard of LLMs like ChatGPT, Bard, Claude, Grok and many more. The basic idea of these models is that if you ask a question, they will try to give the sort of correct answer you might expect. Let’s understand what happens behind the scenes:
When we ask a question (in our case, we are considering the language to be English), the model first tries to break the question into small parts, which we can think of as tokens. A token could be a single word, part of a bigger word (like un-believe-able), or a single character or punctuation mark.
In my case, despite the question having 6 words, the model split it into 8 tokens. Now the real question is: why does the model need to do this? The answer is that the model converts the tokens into numbers (token IDs) so it can reference them, and the same question ends up being represented as a list of those numbers.
The total number of unique tokens a model can recognise and understand is called the model’s vocab size. While models like GPT-3 and GPT-3.5 have a vocab size of 50,257, GPT-4 is estimated to have a vocab size of around 100,000, depending on the tokenizer version and settings (though no official document has confirmed that).
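To make this concrete, here is a small sketch using the tiktoken library. The exact token IDs and splits depend on the tokenizer, so treat the output as illustrative rather than authoritative:
import tiktoken

# a quick sketch: words the tokenizer does not store as a single unit get split
# into smaller pieces, and every piece maps to a number from the model's vocabulary
encoder = tiktoken.encoding_for_model("gpt-4o")
for word in ["chai", "unbelievable", "tokenization"]:
    ids = encoder.encode(word)
    pieces = [encoder.decode([i]) for i in ids]
    print(word, "->", ids, pieces)  # exact IDs and splits vary by tokenizer version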
Now, converting and representing the question as just a few numbers is of no use to the model on its own, and this is where Embedding comes in. The term embedding is a mathematical concept that refers to placing one object into a different space. Think of it like taking a word that lives in a content (text) space and, after a transformation, representing it in a vector space (2D, 3D or many more dimensions), all while preserving its original meaning and its relationships with other words.
Positional Encoding and Attention Methods
It’s the process of embedding that gives the tokenized numbers a semantic meaning, so that in vector space related tokens end up linked with one another. On top of the embeddings, the model also adds Positional Encoding, which tells it where each token sits in the sentence (attention by itself has no sense of word order). The term Self-attention refers to the process of giving the tokens the power to talk among themselves and adjust their representations; self-attention finds relationships within a sequence. There is also Multi-head attention, which runs several attention operations in parallel with different focus points to capture different aspects/perspectives of a token (a small code sketch of this follows the example below). As in:
The Kingfisher bird… (here “Kingfisher” refers to the bird that completes the sentence)
The Kingfisher airline… (here the same word refers to a company named Kingfisher)
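To get a feel for what “tokens talking to each other” means, here is a minimal sketch of scaled dot-product self-attention in plain NumPy. The token vectors are made-up numbers, and I reuse the same matrix for queries, keys and values just to show the mechanics; a real Transformer learns separate projection matrices and runs several such attention heads in parallel (that is the multi-head part):
import numpy as np

tokens = ["The", "Kingfisher", "bird"]
# made-up 4-dimensional embeddings for the three tokens
X = np.array([
    [0.1, 0.3, 0.2, 0.7],  # "The"
    [0.9, 0.1, 0.8, 0.2],  # "Kingfisher"
    [0.8, 0.2, 0.7, 0.3],  # "bird"
])

Q, K, V = X, X, X  # in a real model these come from learned weight matrices
d_k = Q.shape[-1]

scores = Q @ K.T / np.sqrt(d_k)  # how strongly each token relates to every other token
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # row-wise softmax
contextual = weights @ V  # each token's new, context-aware vector

for token, row in zip(tokens, weights):
    print(token, "attends to", dict(zip(tokens, row.round(2))))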
Now let’s take an example to solidify our understanding. On a two-dimensional graph, let’s say a word like “king” translates to the point (5, 2.5). This point encapsulates the meaning and nuances of the word king in a way that the AI model can understand. Now think of another point, (3.5, 1.5), which refers to the word “prince”. Words that have similar meanings are numerically similar and tend to be positioned close together in the vector space.
The word “queen” is a different word but closely related to “king”, so it might be represented by a vector that’s not too distant. And when we ask a question like “If the prince becomes king, then who becomes queen?”, the model can reuse the same kind of distance (offset) to arrive at the answer: “princess”.
This is one example of how embeddings might capture semantic meaning and relationships between words. A two-dimensional graph is, of course, a massive simplification: real-world embeddings often exist in much higher-dimensional spaces, spanning hundreds or thousands of dimensions. Each dimension, or number in the vector, might capture a different semantic or contextual aspect of a word. This is what allows the AI to recognise and differentiate between the contexts when a word is used in different scenarios.
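Here is a toy sketch of that idea in NumPy. The points for “king” and “prince” are the ones from the example above; the points for “queen” and “princess” are values I made up so the analogy works out, since real embeddings learn these relationships from training data:
import numpy as np

words = {
    "king":     np.array([5.0, 2.5]),
    "prince":   np.array([3.5, 1.5]),
    "queen":    np.array([4.8, 3.0]),   # assumed value for illustration
    "princess": np.array([3.3, 2.0]),   # assumed value for illustration
}

def cosine_similarity(a, b):
    # close to 1.0 means the vectors point in almost the same direction (similar meaning)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(words["king"], words["queen"]))  # high: related words sit close together

# the "royal step down" offset is roughly reusable: king - prince ≈ queen - princess
offset = words["king"] - words["prince"]
print(words["queen"] - offset)  # lands at (3.3, 2.0), i.e. right on "princess"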
Embedding in Code Example:
Now let’s see how embedding is done using the Python programming language (with the OpenAI SDK):
# (this code assumes you have already configured your OpenAI API key)
# first we import the OpenAI client
from openai import OpenAI
client = OpenAI()
# now we construct the embedding request:
response = client.embeddings.create(
    # we have to tell which embedding model to use, by name
    model="text-embedding-3-small",
    # the text we want to embed
    input="a quick brown fox jumps over the lazy dog",
    # we can also mention the encoding format, but it defaults to float
    encoding_format="float",
)
print(response.data[0].embedding)
# this will give us a long list of small floating-point numbers (many of them negative):
'''
-0.006929283495992422,
-0.005336422007530928,
-4.547132266452536e-05,
-0.024047505110502243
'''
Transformers
Before Transformers, sequential processing (understanding a sentence, or constructing a sentence in reply) relied on recurrent models that were slow and tended to forget earlier words in a long sentence. In 2017, Google published a paper called “Attention Is All You Need”, which removed recurrence and introduced a game-changing, fully attention-based architecture for sequence modelling: the Transformer. Now let’s talk about it a bit.
The transformer consists of two main parts:
Encoder
Decoder
Function of the Encoder and Decoder
The decoder’s primary focus is multi-head attention over the encoder’s output, but let’s trace the whole flow from the start:
—> First, the encoder takes our input.
—> Then it converts each token into a vector through the embedding process, and adds positional encoding so the model knows each token’s place in the sentence.
—> Each token learns to look at every other token using self-attention and multi-head attention.
—> This results in a set of smart, context-aware vectors that capture the full meaning of the sentence.
Now the decoder gets active. It works like a smart text predictor:
—> Starts with nothing, then guesses the first word
—> Then uses previous guesses and encoder’s info to guess the next word
—> The process keeps repeating (iterating) until the sentence is complete, to get the best result. A small sketch of this loop follows below.
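Here is a conceptual sketch of that loop in Python. The model function is a hypothetical stand-in that, given the encoder’s vectors and the words generated so far, returns a probability for every word in the vocabulary; the point is only to show the guess, append, repeat cycle:
def greedy_decode(model, encoder_output, max_tokens=50, end_token="<end>"):
    generated = ["<start>"]  # start with nothing but a start marker
    for _ in range(max_tokens):
        probs = model(encoder_output, generated)  # guess a probability for every candidate word
        next_word = max(probs, key=probs.get)     # greedily pick the most likely one
        if next_word == end_token:
            break
        generated.append(next_word)               # feed the guess back in and repeat
    return " ".join(generated[1:])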
Now let’s look at a code example of encoding text into tokens and decoding it back:
import tiktoken
encoder = tiktoken.encoding_for_model('gpt-4o')
print("Vocab Size", encoder.n_vocab) # 2,00,019 (200K)
text = "The cat sat on the mat"
tokens = encoder.encode(text)
print("Tokens", tokens) # Tokens [976, 9059, 10139, 402, 290, 2450]
my_tokens = [976, 9059, 10139, 402, 290, 2450]
decoded = encoder.decode([976, 9059, 10139, 402, 290, 2450])
print("Decoded", decoded)
Softmax and Temperature
Each guess is a big vector of scores; softmax turns it into probabilities, and the model then picks the next word based on those probabilities (by default, favouring the word with the highest chance). Softmax is directly connected with the temperature of the response. Let’s look at it in a code example:
# (this code assumes you have already configured your OpenAI API key)
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # we have to mention the AI model name
    messages=[
        {"role": "system", "content": "You are a creative blog writer"},
        {"role": "user", "content": "Write a blog on how learning HTML is enough for getting a job."},
    ],
    temperature=0.9,  # this controls creativity; the value goes from 0 to 2
    max_tokens=200,  # we impose a token limit; the model has to respond within that limit
)
print(response.choices[0].message.content)
Changing the temperature will change the creativity of the blog. ( 0.2 → short, focused and safe response; 0.9 → more imaginative and creative; 1.5 → wild and sometimes weird)
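To see what temperature actually does to the softmax, here is a minimal NumPy sketch with made-up scores (logits) for four candidate next words. Lower temperatures sharpen the distribution towards the top word; higher temperatures flatten it, which is where the extra “creativity” (and occasional weirdness) comes from:
import numpy as np

logits = np.array([2.0, 1.0, 0.5, 0.1])  # made-up raw scores for four candidate words
words = ["focused", "safe", "creative", "weird"]

def softmax_with_temperature(logits, temperature):
    scaled = logits / temperature        # low T exaggerates differences, high T shrinks them
    exp = np.exp(scaled - scaled.max())  # subtract the max for numerical stability
    return exp / exp.sum()

for t in (0.2, 0.9, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}:", dict(zip(words, probs.round(3))))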
Well, the limitations do not end here. These LLMs cannot give real-time answers. They have been trained on data up to a certain date, and after that their knowledge is cut off (for example, GPT-4 Turbo has a knowledge cutoff of April 2023). Though we can provide real-time data ourselves and let the model do its magic on top of it, that part is reserved for RAG (Retrieval Augmented Generation), which we will discuss in another blog.
Footer Note
I have tried to write blogs in the past but failed miserably. This time it is a solid attempt, and I could not be more excited to get your feedback. One thing I want to mention is that this blog is written from an AI Engineering perspective and not from a pure mathematics or ML perspective, so I urge you to keep this in mind while giving feedback. Any mistakes you see are mine, and any good/better knowledge you gain should be credited to the teaching of my mentors Hitesh Choudhary & Piyush Garg. With that note, have a Good Day 👋🏻