Decoding Some Basic AI Jargon With Chai

Vectors: -
When we deal with unstructured data (such as text, images, or audio), it has no descriptive attributes, i.e. we cannot define fixed properties under which to store it.
To work around this, we apply a vector transformation to the data, converting it into numbers.
A vector is generally represented in the form of a 1-D array.
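For example, here is a tiny hand-written vector (a minimal sketch; the numbers are made up for illustration, whereas real systems learn them from data):

import numpy as np

# A made-up 4-dimensional vector standing in for the word "chai".
vector = np.array([0.21, -0.53, 0.88, 0.07])
print(vector.shape)  # (4,) -- a 1-D array, as described above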
Transformers: -
Transformers are a type of deep learning architecture that consists of two main phases: an encoding phase and a decoding phase.
The architecture was introduced in the research paper titled “Attention Is All You Need”, published by Google in 2017.
Encoder: -
This unit kicks off the “Tokenisation” phase: the input query is broken into the smallest possible units of data, called tokens, which the encoder then processes.
Consider a simple query: What is your name
There are 4 words in this input query; therefore, with word-level tokenisation, we have 4 tokens here.
Keep in mind that only unique tokens are kept in the vocabulary, so a repeated word reuses the same token.
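Here is a minimal sketch of tokenisation using OpenAI's tiktoken library (assuming it is installed via pip install tiktoken); real tokenisers often split words into sub-word pieces, so counts can differ from the word count:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("What is your name")
print(tokens)       # a list of integer token IDs
print(len(tokens))  # likely 4 for this simple query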
Decoder: -
The decoder is the unit that generates output one token at a time.
It remembers the previous state (the tokens produced so far) and predicts the next token from it; during training, it learns to make these guesses better through trial and error.
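A minimal runnable sketch of this one-token-at-a-time loop, using a toy stand-in model (random scores) instead of a real neural network:

import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE = 10

def toy_model(tokens):
    # Stand-in for a real model: returns fake scores over the vocabulary.
    return rng.random(VOCAB_SIZE)

def generate(prompt_tokens, max_new_tokens):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = toy_model(tokens)           # score every candidate next token
        next_token = int(np.argmax(logits))  # pick the most likely one
        tokens.append(next_token)            # the output so far becomes the new state
    return tokens

print(generate([1, 2, 3], max_new_tokens=5))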
Embeddings: -
The tokenisation process is done so that vector embeddings can be generated from the tokens. Embeddings are how we give meaning to numbers in a way that models can understand.
Things that are similar are grouped close together, i.e. tightly coupled, and dissimilar things are placed far apart, i.e. loosely coupled.
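A minimal sketch with made-up 3-dimensional embeddings (real embeddings have hundreds of dimensions and are learned, not hand-written), using cosine similarity to measure closeness:

import numpy as np

cat = np.array([0.9, 0.8, 0.1])
dog = np.array([0.85, 0.75, 0.2])  # similar to "cat" -> placed close
car = np.array([0.1, 0.2, 0.95])   # dissimilar -> placed far

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(cat, dog))  # close to 1 (tightly coupled)
print(cosine_similarity(cat, car))  # much smaller (loosely coupled)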
Positional Encoding: -
Consider two examples: -
The River Bank
The ABC Bank
We can see that both sentences have the same number of tokens and share similar tokens, but semantically they differ: “Bank” means something different in each.
Therefore, to resolve this problem, we assign a positional encoding to each token to differentiate between their positions.
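Below is a minimal sketch of the sinusoidal positional encoding from “Attention Is All You Need”, where PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)):

import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, np.newaxis]  # token positions 0..seq_len-1
    i = np.arange(d_model)[np.newaxis, :]    # embedding dimension indices
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])    # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])    # odd dimensions use cosine
    return pe

# "Bank" at position 2 gets a different encoding than "Bank" at position 1,
# so the model can tell the two sentences apart.
print(positional_encoding(seq_len=3, d_model=4))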
Self-Attention: -
- This means that every token is given a chance to communicate with every other token in the sequence, so each token can adjust its meaning based on the others.
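A minimal sketch of scaled dot-product self-attention with random toy matrices (real models learn the query/key/value projections):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # every token scores every other token
    weights = softmax(scores)        # how much each token "listens" to the rest
    return weights @ V               # blend token information by those weights

rng = np.random.default_rng(0)
tokens, d = 4, 8  # 4 tokens, 8-dimensional vectors
Q, K, V = (rng.random((tokens, d)) for _ in range(3))
print(self_attention(Q, K, V).shape)  # (4, 8): one updated vector per token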
Multi-Head Attention: -
- It simply means “thinking about multiple possibilities in parallel”: several attention computations run side by side, each capturing a different relationship in the input.
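A minimal sketch of the idea: the input is split across several heads, each head runs the same attention computation independently, and the results are concatenated (toy random data, no learned projections):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_attention(x, num_heads):
    head_dim = x.shape[-1] // num_heads
    heads = []
    for h in range(num_heads):  # each head explores one "possibility"
        sl = slice(h * head_dim, (h + 1) * head_dim)
        heads.append(attention(x[:, sl], x[:, sl], x[:, sl]))
    return np.concatenate(heads, axis=-1)  # combine all heads' views

rng = np.random.default_rng(0)
x = rng.random((4, 8))  # 4 tokens, 8 dimensions
print(multi_head_attention(x, num_heads=2).shape)  # (4, 8)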
Temperature: -
It is the parameter associated with “randomness”. The higher the temperature value, the more randomness/creativity there will be in the LLM’s response.
If the temperature value is low, a more precise, predictable response will be given as output.
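A minimal sketch of the effect: logits are divided by the temperature before softmax, which sharpens or flattens the distribution:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5])
print(softmax(logits / 0.2))  # low temperature -> sharp, precise choices
print(softmax(logits / 1.0))  # default
print(softmax(logits / 2.0))  # high temperature -> flatter, more "creative"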
Knowledge Cutoff: -
- It refers to the point in time up to which the model was trained; the model has no knowledge of events after that date.
Vocab Size: -
- It refers to the total number of unique tokens the tokeniser can generate.
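For example, with tiktoken (the exact size depends on the chosen encoding):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
print(enc.n_vocab)  # the tokeniser's vocabulary size (roughly 100k here)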
Softmax: -
- This phase converts the model’s raw scores into probabilities, from which the result with the highest probability can be picked.
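A minimal sketch: softmax turns raw scores (logits) into probabilities that sum to 1, and the highest-probability index can then be picked:

import numpy as np

logits = np.array([2.0, 1.0, 0.1])
probs = np.exp(logits - logits.max())
probs = probs / probs.sum()
print(probs)                  # approximately [0.66, 0.24, 0.10]
print(int(np.argmax(probs)))  # 0 -- index of the most likely token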
Simple Implementation of the Tokenizer in Python: -
class Tokenizer:
    def __init__(self, string):
        self.string = string
        self.encoded_text = []
        self.all_letters = []

    def encode(self):
        # Hindi vowels (swar).
        swar = ['अ', 'आ', 'इ', 'ई', 'उ', 'ऊ', 'ऋ', 'ए', 'ऐ', 'ओ', 'औ', 'अं', 'अः']
        # Hindi consonants (vyanjan).
        vyanjan = [
            'क', 'ख', 'ग', 'घ', 'ङ',
            'च', 'छ', 'ज', 'झ', 'ञ',
            'ट', 'ठ', 'ड', 'ढ', 'ण',
            'त', 'थ', 'द', 'ध', 'न',
            'प', 'फ', 'ब', 'भ', 'म',
            'य', 'र', 'ल', 'व',
            'श', 'ष', 'स', 'ह',
            'क्ष', 'त्र', 'ज्ञ'
        ]
        # Vowel signs (matra); '्' (virama) is included so conjunct words
        # like "नमस्ते" survive the encode/decode round trip.
        matra = ['ा', 'ि', 'ी', 'ु', 'ू', 'े', 'ै', 'ो', 'ौ', 'ं', 'ः', 'ँ', '़', 'ृ', 'ॅ', 'ॉ', '्']
        # Map each character (or cluster) to its Unicode code point(s).
        swars = {k: [ord(ch) for ch in k] for k in swar}
        matras = {k: [ord(ch) for ch in k] for k in matra}
        vyanjans = {k: [ord(ch) for ch in k] for k in vyanjan}
        capital_letters = {chr(k): k for k in range(65, 91)}  # 'A'-'Z'
        small_letters = {chr(k): k for k in range(97, 123)}   # 'a'-'z'
        self.all_letters.append(swars)
        self.all_letters.append(matras)
        self.all_letters.append(vyanjans)
        self.all_letters.append(small_letters)
        self.all_letters.append(capital_letters)
        # Look each input character up in the vocabulary tables.
        for let in self.string:
            for table in self.all_letters:
                if let in table:
                    self.encoded_text.append(table[let])
                    break
        print("Encoded Text: ", self.encoded_text)

    def decode(self):
        # Reverse lookup: find the character whose value matches each token.
        decoded = []
        for let_val in self.encoded_text:
            found = False
            for category in self.all_letters:
                for char, value in category.items():
                    if value == let_val:
                        decoded.append(char)
                        found = True
                        break
                if found:
                    break
        print("Decoded Text: ", "".join(decoded))


token_obj = Tokenizer("नमस्ते")
# token_obj = Tokenizer("RAHUL")
token_obj.encode()
token_obj.decode()