Decoding Some Basic AI Jargon With Chai

Vidit Vats
3 min read

Vectors: -

  • When we deal with unstructured data, it has no descriptive attributes, i.e. we cannot define fixed properties under which to store it.

  • To work around this, we apply a vector transformation to the data.

  • A vector is generally represented in the form of a 1-D array of numbers, as the sketch below shows.
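
A minimal sketch of the idea, using toy character counts as the "vector"; a real model would learn these numbers rather than count letters:

sentence = "chai is love"
alphabet = "abcdefghijklmnopqrstuvwxyz "
vector = [sentence.count(ch) for ch in alphabet]   # one number per character

print(vector)       # a 1-D array of numbers
print(len(vector))  # fixed length, regardless of the input text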

Transformers: -

  • Transformers are a type of deep learning architecture that mainly consists of two phases, called the encoding and decoding phases.

  • The architecture was introduced in the research paper titled "Attention Is All You Need", published by Google in 2017.

Encoder: -

  • This unit is responsible for initiating the "Tokenisation" phase, i.e. breaking the input query into the smallest possible units of data, called tokens.

  • Consider a simple query: What is your name
    There are 4 words in this input query, so we have 4 tokens here.
    Keep in mind that only unique tokens are counted, as the sketch below shows.
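
A minimal sketch of word-level tokenisation; real LLMs use subword tokenisers (e.g. BPE), so actual token counts can differ:

query = "What is your name"
tokens = query.split()        # break the query into word-level tokens
unique_tokens = set(tokens)   # only unique tokens are counted

print(tokens)                                      # ['What', 'is', 'your', 'name']
print("Unique token count:", len(unique_tokens))   # 4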

Decoder: -

  • The decoder is the unit that generates the output one token (roughly one word) at a time.

  • It keeps track of the previous state, i.e. everything generated so far, and uses it to predict the next token, as sketched below.
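
A minimal sketch of one-token-at-a-time generation; next_token below is a made-up, hard-coded stand-in for what is really a neural network:

def next_token(tokens):
    # Toy continuation table standing in for the model's prediction.
    table = {
        ("my",): "name",
        ("my", "name"): "is",
        ("my", "name", "is"): "<end>",
    }
    return table.get(tuple(tokens), "<end>")

tokens = ["my"]
while tokens[-1] != "<end>":
    tokens.append(next_token(tokens))   # previous output feeds the next step

print(tokens)   # ['my', 'name', 'is', '<end>']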

Embeddings: -

  • The tokenisation process is done so that vector embeddings can be generated. Embeddings are how we give meaning to tokens as numbers, in a way that models can understand.

  • Things that are similar are grouped closely, i.e. "tightly coupled", and dissimilar things are placed far apart, i.e. "loosely coupled"; see the sketch below.
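
A minimal sketch with made-up 3-dimensional embeddings; real embeddings have hundreds of learned dimensions:

import numpy as np

cat = np.array([0.9, 0.8, 0.1])
dog = np.array([0.85, 0.75, 0.2])   # similar to "cat" -> placed close
car = np.array([0.1, 0.2, 0.95])    # dissimilar -> placed far

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(cat, dog))  # high -> "tightly coupled"
print(cosine_similarity(cat, car))  # low  -> "loosely coupled"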

Positional Encoding: -

  • Consider two examples: -

    The River Bank
    The ABC Bank

  • We can see that both phrases have the same number of tokens and share the token "Bank", but semantically they differ.

  • Therefore, to resolve this problem, we assign a positional encoding to each token to differentiate between positions, as sketched below.
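
A minimal sketch of the sinusoidal positional encoding from "Attention Is All You Need"; each position gets a unique vector of sines and cosines that is added to the token's embedding:

import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]        # positions 0..seq_len-1
    i = np.arange(0, d_model, 2)[None, :]    # even embedding dimensions
    angles = pos / np.power(10000, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dims use sine
    pe[:, 1::2] = np.cos(angles)             # odd dims use cosine
    return pe

# "Bank" at position 2 now carries a different vector than it would at
# position 1, so "The River Bank" and "The ABC Bank" stay distinguishable.
print(positional_encoding(seq_len=3, d_model=4))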

Self-Attention: -

  • This means every token is given a chance to communicate with every other token in the sequence, letting each token pick up context from the rest; a sketch follows.
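
A minimal sketch of single-head self-attention with random toy weights; real models learn the W_q, W_k, W_v projections:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d = 4, 8                    # 4 tokens, 8-dim embeddings
x = np.random.randn(seq_len, d)      # token embeddings
W_q, W_k, W_v = (np.random.randn(d, d) for _ in range(3))

Q, K, V = x @ W_q, x @ W_k, x @ W_v
scores = Q @ K.T / np.sqrt(d)   # every token scores every other token
weights = softmax(scores)       # how strongly each token attends to the rest
out = weights @ V               # context-aware token representations
print(out.shape)                # (4, 8)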

Multi-Head Attention: -

  • It simply means "thinking about multiple possibilities in parallel": several attention heads each look at the sequence independently and at the same time, as the sketch below illustrates.
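
A minimal sketch of the parallel-heads idea, splitting the embedding across two heads that each attend independently (all dimensions are toy values):

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads

x = np.random.randn(seq_len, d_model)
# Split into per-head chunks: (n_heads, seq_len, d_head)
heads = x.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

outputs = []
for h in heads:                              # each head attends in parallel
    scores = h @ h.T / np.sqrt(d_head)
    outputs.append(softmax(scores) @ h)

combined = np.concatenate(outputs, axis=-1)  # stitch the heads back together
print(combined.shape)                        # (4, 8)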

Temperature: -

  • It is the "randomness" parameter: the higher the temperature, the more randomness / creativity in the LLM's response.

  • If the temperature value is low, the response will be more precise and deterministic; the sketch below compares a few values.
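
A minimal sketch of how temperature reshapes the output distribution before sampling:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5])   # raw scores for three candidate tokens

for temp in (0.2, 1.0, 2.0):
    print(temp, softmax(logits / temp).round(3))
# Low temperature piles probability onto the top token (precise output);
# high temperature evens the probabilities out (more random / creative).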

Knowledge Cutoff: -

  • It refers to the point in time up to which the model was trained; the model has no knowledge of events after that date.

Vocab Size: -

  • It refers to the total number of unique tokens the tokeniser can generate; the toy example below illustrates this.
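
A toy illustration: for a character-level tokeniser, the vocab size is just the number of unique characters it knows about:

corpus = "what is your name"
vocab = sorted(set(corpus))      # every unique character becomes a token

print(vocab)                     # [' ', 'a', 'e', 'h', ...]
print("Vocab size:", len(vocab))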

Softmax: -

  • Softmax converts the model's raw scores into probabilities; the result with the highest probability is then picked, as sketched below.
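
A minimal sketch: softmax turns raw scores into probabilities that sum to 1, and the highest-probability index is then picked:

import numpy as np

logits = np.array([3.1, 1.2, 0.4])       # raw scores from the model
probs = np.exp(logits - logits.max())
probs /= probs.sum()                      # now a probability distribution

print(probs.round(3))                     # [0.822 0.123 0.055]
print("Picked index:", probs.argmax())    # most likely token wins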

Simple Implementation of the Tokenizer in Python: -

class Tokenizer:
    def __init__(self, string):
        self.string = string
        self.encoded_text = []
        self.all_letters = []

    def encode(self):
        swar = ['अ', 'आ', 'इ', 'ई', 'उ', 'ऊ', 'ऋ', 'ए', 'ऐ', 'ओ', 'औ', 'अं', 'अः']
        vyanjan = [
            'क', 'ख', 'ग', 'घ', 'ङ',
            'च', 'छ', 'ज', 'झ', 'ञ',
            'ट', 'ठ', 'ड', 'ढ', 'ण',
            'त', 'थ', 'द', 'ध', 'न',
            'प', 'फ', 'ब', 'भ', 'म',
            'य', 'र', 'ल', 'व',
            'श', 'ष', 'स', 'ह',
            'क्ष', 'त्र', 'ज्ञ'
        ]
        # '्' (virama/halant) is included so conjunct words like "नमस्ते" survive a round trip.
        matra = ['ा', 'ि', 'ी', 'ु', 'ू', 'े', 'ै', 'ो', 'ौ', 'ं', 'ः', 'ँ', '़', 'ृ', 'ॅ', 'ॉ', '्']

        # Map each single character to its Unicode code point. Multi-character
        # entries such as 'अं' or 'क्ष' can never match while we iterate the
        # input one character at a time, so they are skipped here.
        swars = {ch: ord(ch) for ch in swar if len(ch) == 1}
        matras = {ch: ord(ch) for ch in matra if len(ch) == 1}
        vyanjans = {ch: ord(ch) for ch in vyanjan if len(ch) == 1}
        capital_letters = {chr(k): k for k in range(65, 91)}   # 'A'..'Z'
        small_letters = {chr(k): k for k in range(97, 123)}    # 'a'..'z'

        self.all_letters = [swars, matras, vyanjans, small_letters, capital_letters]

        for let in self.string:
            for category in self.all_letters:
                if let in category:
                    self.encoded_text.append(category[let])
                    break  # stop searching once the character is found

        print("Encoded Text: ", self.encoded_text)

    def decode(self):
        decoded = []
        for let_val in self.encoded_text:
            found = False
            for category in self.all_letters:
                for char, value in category.items():
                    if value == let_val:
                        decoded.append(char)
                        found = True
                        break
                if found:
                    break
        print("Decoded Text: ", "".join(decoded))

token_obj = Tokenizer("नमस्ते")
# token_obj = Tokenizer("RAHUL")

token_obj.encode()
token_obj.decode()