Decoding AI Jargons with chai

Richa Chauhan
8 min read

JARGON seems like a fancy word itself, but it is simply the word for the fancy terms we use in the tech field and, more generally, in professional fields. When these jargons are not understood properly, AI seems like it’s creating things magically. (P.S.: that is not the case at all.)

The following topics will be covered in this article:

  1. Things you need to know about GPT

  2. How would AI not take away the jobs of humans?

  3. Input and Tokenization process

  4. Vector embedding, semantic meaning

  5. Positional Embedding

  6. Self attention, multi head attention

  7. Training phase and Inferencing phase

  8. Temperature, knowledge cutoff

Things you need to know about GPT

  • What exactly is an AI model? In very simple words, an AI model is a trained machine that gives human-like answers to your questions. It has the data you need for the subject you are asking about, unless you are asking questions about very recent events.

  • This lack of data (or minimal data) about current events has everything to do with how AI models are trained.

  • Take the all-time favourite, ChatGPT, one of the AI models that has answered all of our silly as well as logical questions: when asked about recent news from a country, or even a specific area of that country, it won’t be able to answer. And why’s that?

  • So, GPT - Generative Pre-Trained Transformer - as the name suggests, is a trained model, similar to all other AI models.

  • Training these transformers means feeding data to them and teaching them how to answer the user’s request.

  • The thing with training transformers is that it is expensive AND complex.

  • AI models, unlike humans, go through so much trial and error that they basically memorize what they have to answer, and then level up the creativity based on the temperature set by the user or the default temperature.

How would AI not take away the jobs of humans?

  • The talk about AI taking over the jobs of humans is a myth, at least for now.

  • The reason is all the jargons we are about to look at, which describe how the transformers get trained.

  • This training is one tough job, and if no human trains the AI model, how will it ever be able to answer our questions like it does now?

  • Without the data, without proper sentence formation, without knowledge of the subject of study?

  • So AI is not taking jobs.

  • Below are the points I learned in the Gen AI cohort about the working of an AI model.

Input and Tokenization process

  • A transformer, being a human-made machine, does not understand human language.

  • So, to get those human-like responses from it, it first needs to be trained on how to take the input and encode it.

  • What it does is take the input provided by the user and convert it - or, we can say, encode it - into tokens.

  • Tokens are a series of numbers representing the words of a sentence.

  • For example:

    ["The", "cat", "sat", "on", "the", "mat"] ---> [2, 1175, 4401, 3173, 611, 573]

  • AI models have a vocab size - the number of unique tokens that can be represented - through which they assign tokens to words.

  • The higher the vocab size, the slower the response time, but the better the accuracy, since rare words can be tokenized as well. A minimal sketch of tokenization follows below.
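
To make this concrete, here is a small sketch using OpenAI’s tiktoken library (an assumption for illustration - ChatGPT’s real pipeline is more involved, and the exact token IDs depend on which encoding you load):

```python
# pip install tiktoken
import tiktoken

# cl100k_base is one of the encodings used by OpenAI models.
enc = tiktoken.get_encoding("cl100k_base")

tokens = enc.encode("The cat sat on the mat")
print(tokens)              # a list of integers; exact IDs depend on the encoding
print(enc.decode(tokens))  # "The cat sat on the mat"
print(enc.n_vocab)         # the vocab size of this encoding
```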

Vector embedding and Semantic meaning

  • Vector embedding is nothing but representing a word as a vector (a list of numbers) that captures its relationships with other words, through which the transformer gets to the actual meaning.

  • For example: consider the words “Paris” and “France”; given that context, when the word “Tokyo” comes up, people tend to think “Japan”. Just like this, words are placed as dots on a graph (in reality, a space with many more than 3 dimensions), which helps the model get to, or predict, the next word.

  • These vector embeddings capture the semantic meaning of the words.

  • Meaning that they capture how words relate to the same group or class.

  • For example: “Cat”, “Dog” and “Banana”

    Here, cat and dog belong to the same group or class - animals - while banana is a fruit, so its embedding would sit far away from theirs (see the sketch below).
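
Here is a tiny sketch of that idea with made-up numbers (real embeddings have hundreds or thousands of dimensions, learned during training):

```python
import numpy as np

# Hypothetical 4-dimensional embeddings; the values are invented
# purely for illustration.
embeddings = {
    "cat":    np.array([0.90, 0.80, 0.10, 0.05]),
    "dog":    np.array([0.85, 0.75, 0.20, 0.10]),
    "banana": np.array([0.10, 0.05, 0.90, 0.80]),
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """1.0 means the vectors point the same way; near 0 means unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["cat"], embeddings["dog"]))     # high
print(cosine_similarity(embeddings["cat"], embeddings["banana"]))  # low
```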

Positional Embedding

  • When we create the tokens, positional embedding is used to record which token sits at which position in the sentence.

  • If we don’t use positional embedding, the transformer would not know the order of the tokens and might give an incorrect response.

  • For example: “The cat sat on the mat” and “The mat sat on the cat” are two different sentences which carry different meanings.

  • Without positional embedding, the transformer would think they are the same sentence and could give a wrong response. One classic way to encode positions is sketched below.
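
The original Transformer paper (“Attention Is All You Need”) used sinusoidal positional encodings; here is a small sketch of that scheme, with tiny dimensions for illustration:

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encoding: every position gets a unique
    pattern of sines and cosines, so "cat" at position 1 looks
    different from "cat" at position 5."""
    positions = np.arange(seq_len)[:, None]   # shape (seq_len, 1)
    dims = np.arange(d_model)[None, :]        # shape (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])  # even dimensions: sine
    encoding[:, 1::2] = np.cos(angles[:, 1::2])  # odd dimensions: cosine
    return encoding

# "The cat sat on the mat" has 6 tokens; use a tiny 8-dimensional model:
print(positional_encoding(6, 8).round(2))
```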

Self Attention and Multi Head Attention

  • Self attention lets the tokens talk to each other, which is essential.

  • In simple words it lets the word know the environment (sentence) in which it is being used.

  • We may think that vector embedding and positional embedding alone create the response we need, but the tokens need to talk to each other as well.

  • This can be understood by the example, “The river bank” and “The ICICI bank”.

  • Both of these phrases look pretty similar at first glance but bear completely different meanings.

  • If the tokens don’t talk to each other, “bank” would not know whether it is the edge of a river or a financial institution. For “bank” to know what meaning it bears in that particular sentence, it needs to know where it is being used.

  • This is achieved by self attention.

  • Now, sometimes the sentence may be a bit long and a bit confusing for the transformer to determine the context it is referring to.

  • Thus, we introduce multi head attention, which means the model looks at a word from more than one perspective, not just one.

  • Taking the earlier example - “The cat sat on the mat” - we can understand that in multi head attention, one head focuses on the relationship between cat and sat, while another head looks at the relationship between cat and mat.

  • This helps build the perfect sentence required for the response.

  • A tad bit of info which can be included here is Softmax: Softmax basically turns the raw scores into weights, enabling the transformer to weigh the tokens based on the context of the current statement.

  • Example: “He ate pizza with cheese” - here, let’s say the transformer needs to understand what cheese means in this particular statement. Looking at the statement, which word provides the easiest context to understand it?

  • In this case, pizza. So what the transformer does is give each of the words a score, and the word with the highest score is used to understand what the word of study means here. The sketch below shows how softmax turns those scores into weights.
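
Here is a minimal sketch of that scoring step, with made-up attention scores (a real transformer computes them from the token vectors themselves):

```python
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    """Turn raw scores into weights that are positive and sum to 1."""
    exp = np.exp(scores - scores.max())  # subtract max for numerical stability
    return exp / exp.sum()

# Hypothetical scores for how much each word helps explain "cheese"
# in "He ate pizza with cheese" (numbers invented for illustration).
words = ["He", "ate", "pizza", "with", "cheese"]
scores = np.array([0.2, 1.0, 4.5, 0.5, 2.0])

for word, weight in zip(words, softmax(scores)):
    print(f"{word:>6}: {weight:.2f}")
# "pizza" ends up with the largest weight, so it contributes the
# most context for understanding "cheese".
```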

Training phase and Inferencing phase

  • The training phase is where the transformer learns how to form the sentences it responds with.

  • In this phase, the transformer takes the user input and checks which character of the alphabet has the best probability of starting an eligible answer to that input.

  • Let’s understand this with an example: the input is “How are you?”; in most cases the expected answer is “I am fine”.

  • What the transformer is made to do in the training phase is this: it is given an input, and it checks which letter of the alphabet has the highest probability of starting the answer.

  • It checks letters like I, C, R, S, U, V and gives each of them a score.

  • The letter with the highest score is selected.

  • Let’s say “I” had a probability of 90% and got chosen as the first letter; then for the next letter (ignoring spaces here), letters like e, r, c, a, s are checked one by one and given a score after each check.

  • This prediction of which letter to choose next is what happens in the inferencing phase.

  • It is basically a loop: check a letter, give it a score; check another, give it a score; do this for multiple letters, and once the best one is found, start on the next letter and continue.

  • Just like this, after going through a lot of trial and error, it finally arrives at the full statement “I am fine”, which is then used as the output.

  • This process is used for all possible outputs in all possible fields of data the transformer is provided with.

  • Thus, we come to the conclusion that training these transformers is quite difficult, and it is therefore done only once a year or so, depending on the company. The toy loop below sketches the prediction idea.
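
Here is a toy sketch of that prediction loop. The probability table is entirely made up for illustration - a real model learns these scores from huge datasets during training, and actual GPTs predict tokens rather than single letters:

```python
# Hypothetical "learned" probabilities: given the text so far,
# how likely is each candidate next character?
next_char_probs = {
    "":         {"I": 0.90, "C": 0.04, "R": 0.03, "S": 0.03},
    "I":        {" ": 0.95, "t": 0.05},
    "I ":       {"a": 0.85, "w": 0.15},
    "I a":      {"m": 1.00},
    "I am":     {" ": 1.00},
    "I am ":    {"f": 0.70, "g": 0.30},
    "I am f":   {"i": 1.00},
    "I am fi":  {"n": 1.00},
    "I am fin": {"e": 1.00},
}

def generate(max_steps: int = 9) -> str:
    text = ""
    for _ in range(max_steps):
        candidates = next_char_probs.get(text)
        if not candidates:
            break
        # Greedy decoding: always pick the highest-scoring candidate.
        text += max(candidates, key=candidates.get)
    return text

print(generate())  # "I am fine"
```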

Temperature and Knowledge cutoff

  • Temperature, in simple words, is the amount of creativity we want in the transformer’s response.

  • Some models do provide temperature settings so that we can get as creative an answer as possible.

  • A creative answer basically means going through more trial and error so as not to give the same boring outputs, and thus the response might be a bit slower.

  • We can say that as temperature increases, the randomness of the response increases and the accuracy might decrease (a small sketch of this appears after this list).

  • Knowledge cutoff is nothing but the point in time up to which the transformer was trained; it cannot answer about anything that happened after it.

  • In very simple words, if a transformer is trained and updated with data up to August 2024, then when asked about an event that occurred after that time, it might not be able to answer.

  • As we already learned, when asked about the current affairs of a country, the transformer is not able to answer, and the same goes for today’s weather or brand-new technology.
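
Here is a minimal sketch of how temperature changes the output distribution, with made-up scores (real models apply this to the scores over their whole vocabulary):

```python
import numpy as np

def probs_with_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Softmax with temperature: low values sharpen the distribution
    (safe, repetitive picks); high values flatten it (more randomness)."""
    scaled = logits / temperature
    exp = np.exp(scaled - scaled.max())
    return exp / exp.sum()

# Hypothetical scores for four candidate next words.
logits = np.array([4.0, 2.0, 1.0, 0.5])

print(probs_with_temperature(logits, 0.5).round(2))  # ~[0.98 0.02 0.   0.  ]
print(probs_with_temperature(logits, 2.0).round(2))  # much more even spread
```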

Learning about all these AI jargons was fun, and now I know how ChatGPT actually works; I no longer think it magically knows the answers to all my questions. This session made me look at AI models from a whole different perspective, and I didn’t expect it to be this fun.

Now I can talk to ChatGPT without wondering how it knows all of the things I ask it.
