Understanding AI: Essential Terms for Beginners

At its core, an AI model is a prediction mechanism: it predicts one token after another to produce an output. In this article, we are going to understand some basic terminology about the architecture and the working of AI.
As mentioned, it is nothing but a prediction mechanism that predicts based on its training over different forms of data. The more data it is trained on, the more capable it becomes.
Let’s start by understanding the most popular one, ChatGPT. We all understand the word chat (we do a lot of it), so what is GPT then?
GPT stands for Generative Pre-trained Transformer.
Generative: This is already pretty clear from the word. It does not fetch the information from somewhere; instead, it generates it.
Pre-trained: This means the model has already been trained on some kind of data. This signifies two things:
It will be better at performing the tasks it was trained on.
It will not have the latest information; the point in time its training data goes up to is called the knowledge cut-off (we will talk about it later).
Transformer: This is the most important and mysterious part of the whole term. A transformer is the architecture that carries out the predictions and returns an output.
Transformer
The transformer was first introduced in a paper published by Google called Attention Is All You Need.
That paper describes the architecture that most AI models follow today.
Let’s start understanding some parts of it.
Tokens: Everything starts with converting words or characters into tokens. Every model has its own way of converting characters or words into tokens, and these tokens are the entry point of the architecture.
Some libraries can help you tokenize text. One use case is counting your tokens before passing them to the model.
Below is a simple snippet that uses Tiktoken from OpenAI to generate the tokens for a model:
import tiktoken

text = "What is AI"

# Look up the tokenizer that the given model uses.
encoding = tiktoken.encoding_for_model("gpt-4")

# Convert the text into a list of token ids.
tokens = encoding.encode(text)

print(tokens)
print(len(tokens))  # the token count
Embeddings: To give meaning to these tokens, vector embeddings are generated. These capture the intent and semantics of a word rather than just its surface form.
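To make this concrete, here is a minimal sketch of how closeness in meaning can be measured between embeddings. The 3-dimensional vectors below are made up for illustration; real models use hundreds or thousands of dimensions:

```python
import math

# Toy embeddings with made-up values; real embeddings are
# produced by the model itself and are much larger.
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "apple": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Measure how closely two embedding vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # high
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # low
```

Words with related meanings end up close together in this vector space, which is what lets the model work with intent rather than raw spelling.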
Positional encoding: This provides a relative position for each token, giving tokens a relationship to one another. For example, the word trunk can relate to both an elephant and a tree, and position in the sentence helps resolve which one is meant.
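The original transformer paper encodes positions with sine and cosine waves of different frequencies. A minimal sketch of that formula, assuming a tiny 4-dimensional model for readability:

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal positional encoding: even indices use sine,
    odd indices use cosine, each at a different frequency."""
    pe = []
    for i in range(d_model):
        angle = position / (10000 ** (2 * (i // 2) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

# Position 0 is always [sin(0), cos(0), ...] = [0, 1, 0, 1]
print(positional_encoding(0, 4))
print(positional_encoding(1, 4))
```

Each position gets a unique pattern, and the vector is added to the token's embedding so the model knows where in the sequence each token sits.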
Self-attention: This is the process by which the model learns which words carry more importance, so that one word can change the semantics of an upcoming word. The mechanism weighs the importance of tokens or words in an input sequence to better understand the relations between them.
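A minimal sketch of the scaled dot-product attention step, written in plain Python with a made-up 2-token, 2-dimensional example (real implementations use matrix libraries and learned projection weights, which are omitted here):

```python
import math

def softmax(xs):
    """Turn a list of scores into probabilities that sum to 1."""
    exps = [math.exp(x - max(xs)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: each token's query is compared
    against every key, and the resulting weights mix the values."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)  # importance of each token, sums to 1
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# Tiny 2-token sequence with 2-dimensional vectors.
q = k = v = [[1.0, 0.0], [0.0, 1.0]]
print(attention(q, k, v))
```

Each output row is a blend of all the value vectors, weighted by how relevant each token is to the one being processed.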
Multi-head attention: Attention runs multiple times in parallel, and each parallel run is called a head. These attention computations running in parallel are collectively called multi-head attention.
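A sketch of the splitting-and-recombining idea behind the heads. Each head would run attention on its own chunk of the embedding; the attention step itself is omitted here and the numbers are made up:

```python
def split_heads(vector, num_heads):
    """Split one embedding vector into equal-sized chunks,
    one chunk per attention head."""
    head_dim = len(vector) // num_heads
    return [vector[i * head_dim:(i + 1) * head_dim] for i in range(num_heads)]

def concat_heads(heads):
    """Concatenate the per-head outputs back into one vector."""
    return [x for head in heads for x in head]

embedding = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
heads = split_heads(embedding, 2)  # two heads, four dimensions each
print(heads)
print(concat_heads(heads) == embedding)  # round-trips back to the original
```

Because each head looks at a different slice, the heads can learn to focus on different kinds of relationships between tokens.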
Encoder: Its main purpose is to take the sentence, understand its context, and produce the vector embeddings. The encoder is the combination of all the parts above.
Decoder: Its main purpose is to take the embeddings from the encoder and generate the most probable answer from them.
Softmax: This converts the model's raw scores into probabilities for the next token. A related setting, temperature, controls how creative the model can get: since the model works on prediction, a higher temperature allows it to sometimes choose lower-weighted characters or words.
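A minimal sketch of softmax with a temperature knob, using made-up scores for three candidate tokens:

```python
import math

def softmax(scores, temperature=1.0):
    """Turn raw scores into probabilities. A higher temperature
    flattens the distribution, so less likely tokens get picked
    more often; a lower temperature sharpens it."""
    scaled = [s / temperature for s in scores]
    exps = [math.exp(s - max(scaled)) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]  # made-up raw scores for three candidate tokens
print(softmax(scores, temperature=1.0))  # peaked: the top token dominates
print(softmax(scores, temperature=5.0))  # flatter: more "creative" sampling
```

Either way the probabilities sum to 1; temperature only changes how evenly they are spread across the candidates.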
Vocab size: The vocabulary size of a language model is the number of unique words or tokens that it can understand and generate.
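For illustration, a toy vocabulary and its size (real models use vocabularies on the order of tens or hundreds of thousands of tokens):

```python
# A made-up, tiny vocabulary mapping tokens to ids.
vocab = {"<pad>": 0, "what": 1, "is": 2, "ai": 3, "?": 4}

vocab_size = len(vocab)
print(vocab_size)  # 5
```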
So to conclude, this is a 10,000-foot view of AI models and the transformer. But it gets you started and gives an introduction to some basic but important terms.
Written by Shivank Mittal
