Decoding AI Jargons

Hridayesh Kundu
6 min read

Tokenization

Tokenization is a key concept used across fields like AI, Web3, cyber-security, and more. In general, it is the act of replacing letters, words, special characters, or any sensitive information with numeric values. Each word or character is mapped to a number, which models and algorithms use to work with the text. In the context of AI, and LLMs specifically, tokenization is a core part of text processing. Each generative model comes with a pre-defined vocabulary of characters or words, each assigned a specific numeric ID that is looked up while processing text. For example: OpenAI’s GPT-4o has a vocab size of roughly 200K, GPT-3.5 roughly 100K, and Llama 3 roughly 128K.

Vocab size refers to the number of tokens that an LLM can uniquely identify and generate, via the numeric values associated with them.

# Short tour of tokenization using tiktoken
import tiktoken

encoder = tiktoken.encoding_for_model('gpt-4o')  # Here we set the encoder for gpt-4o
text = "Hello there, I am Hridayesh!"  # This is a sample text whose tokens we will get
tokens = encoder.encode(text)  # Tokens are returned as a list of integers
print("Tokens:", tokens)

#OUTPUT-> Tokens:  [13225, 1354, 11, 357, 939, 100933, 5002, 8382, 0]

Encoder


Encoders are algorithms used by every LLM to understand the prompts/instructions given to it, by converting each word or character into its respective token. The token for each word or character is unique to a model. The LLM then understands the meaning of the words through these tokens and moves on to further processing.

Decoder


Decoders, like encoders, are crucial algorithms that help LLMs produce meaningful text output by converting the tokens of the model’s output back into their respective words or characters.

To understand this, let’s use tiktoken:

import tiktoken
encoder = tiktoken.encoding_for_model('gpt-4o') #We set the encoder for gpt-4o model
text = "This is an example to demonstrate encoding."
tokens = encoder.encode(text) #returns a list of tokens corresponding to each word or character
print("Encoded tokens: ", tokens)

#OUTPUT -> Encoded tokens:  [2500, 382, 448, 4994, 316, 28058, 24072, 13]

#Now let's decode the same sequence. We are using the same sequence of tokens to get the same text back
decoded_text = encoder.decode(tokens) #Returns a string of words and characters based on the list of tokens
print("Decoded text: ", decoded_text)

#OUTPUT -> Decoded text:  This is an example to demonstrate encoding.
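Under the hood, an encoder/decoder pair boils down to a vocabulary lookup. Here is a minimal, hypothetical word-level sketch of that idea; real tokenizers like tiktoken use subword units and far larger vocabularies:

```python
# Toy word-level tokenizer: a vocabulary maps each known word to an integer ID.
vocab = {"this": 0, "is": 1, "an": 2, "example": 3, ".": 4}
inverse_vocab = {i: w for w, i in vocab.items()}

def encode(text):
    """Map each word to its token ID (assumes every word is in the vocab)."""
    return [vocab[word] for word in text.lower().replace(".", " .").split()]

def decode(tokens):
    """Map token IDs back to words and join them into a string."""
    return " ".join(inverse_vocab[t] for t in tokens)

tokens = encode("This is an example.")
print(tokens)          # [0, 1, 2, 3, 4]
print(decode(tokens))  # this is an example .
```

Encoding and then decoding gives the original words back, which is exactly the round trip the tiktoken example above demonstrates.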

Vector Embeddings


Vector Embeddings are n-dimensional numerical representations of inter-related tokens and words, capturing the semantic meaning of a text for the machine learning model.

Let’s understand it with an example. During summer, we all prefer cold desserts like cold coffee and ice cream. The way summer relates to [cold coffee, ice cream] helps the model predict analogous desserts for winter. [cold coffee, ice cream] is associated with summer through vectors of real numbers (integers or floating point) that position them close together in the semantic space. [hot chocolate, hot coffee] relate to winter in a similar fashion, showcasing their semantic relation.

Vector Embeddings can be thought of as a GPS for words, where words of similar nature are neighbors in the n-dimensional semantic space.

Visualization of Vector Embeddings
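Closeness in that space is usually measured with cosine similarity. The sketch below uses made-up 2-dimensional embeddings for the summer/winter example; real embeddings have hundreds or thousands of dimensions, but the geometry works the same way:

```python
import math

# Hypothetical 2-D embeddings for illustration only.
embeddings = {
    "summer":        [0.9, 0.1],
    "ice cream":     [0.85, 0.2],
    "winter":        [0.1, 0.9],
    "hot chocolate": [0.15, 0.85],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: closer to 1.0 = more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# "summer" sits much closer to "ice cream" than to "hot chocolate".
print(cosine_similarity(embeddings["summer"], embeddings["ice cream"]))
print(cosine_similarity(embeddings["summer"], embeddings["hot chocolate"]))
```

The first similarity comes out far higher than the second, which is the "neighbors in semantic space" idea in numbers.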

Positional encoding


Positional encoding is a technique used in transformers to add information about the position of each word or token in a sequence and convey it to the model. Transformers process tokens in parallel, so positional encoding is what preserves word-order information.

For example: “I want a story book.” and “I want to book a room.“ Let’s analyze these two sentences. In the first, the token “book“ means a physical object, whereas in the second, the token “book“ means the act of reserving a room. To the model, the token “book“ has only one ID, but the same token is used in different contexts. Positional encoding adds position information to the token, helping the model accurately get the meaning.
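A common scheme, introduced by the original Transformer, is sinusoidal positional encoding: each position gets a vector of sine and cosine values at different wavelengths. A minimal sketch:

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal positional encoding: even dimensions use sin, odd use cos,
    with wavelengths growing geometrically across the dimensions."""
    pe = []
    for i in range(d_model):
        angle = position / (10000 ** (2 * (i // 2) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

# The same token at different positions gets different encodings, so
# "book" at position 3 and "book" at position 4 become distinguishable.
print(positional_encoding(3, 4))
print(positional_encoding(4, 4))
```

This vector is added to the token’s embedding, so the model sees both what the token is and where it sits in the sequence.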

Semantic Meaning


Semantic Meaning is the ability of an LLM to understand the meaning and context of the text provided to it. In AI, semantics means understanding the meaning of a sequence of tokens in an array of words, and using it to process further data. It is a crucial part of Natural Language Processing (NLP).

For example: “The bank was closed due to a flood.“ Here, the model identifies that the text refers to a financial bank and not a river bank. With the help of semantic meaning, the model figures this out correctly.

Self Attention


Self-attention is a well-known mechanism, commonly used in NLP, that enriches an input sequence of tokens with additional information about its context. It allows the LLM to weigh each token and understand the relationships between them. It can be intuitively understood as tokens talking to each other and adjusting their meanings simultaneously.

For example: “During summer, I don’t like chai, because it is hot.“ Here, the model, via the self-attention mechanism, attaches “it“ to “chai“: among the surrounding tokens, “chai“ is the one that “hot“ and the dislike most plausibly refer to, so “it“ gets related to “chai“.
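The core computation can be sketched in a few lines. This toy version scores every pair of token embeddings by dot product, turns the scores into weights with softmax, and averages; real transformers first project tokens into separate query, key, and value vectors, but the weighting idea is the same:

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(embeddings):
    """Simplified self-attention: each token's output is a weighted average of
    all token embeddings, weighted by softmax of dot-product similarity."""
    outputs = []
    for query in embeddings:
        scores = [sum(q * k for q, k in zip(query, key)) for key in embeddings]
        weights = softmax(scores)
        out = [sum(w * v[i] for w, v in zip(weights, embeddings))
               for i in range(len(query))]
        outputs.append(out)
    return outputs

# Three toy token embeddings: the first two are similar, the third differs.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
for row in self_attention(tokens):
    print([round(x, 3) for x in row])
```

Tokens that are similar attend strongly to each other, so each output vector is pulled toward the tokens most relevant to it, which is how “it“ can end up aligned with “chai“.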

Multi head Attention


Multi-head attention is self-attention that the model runs several times in parallel, enabling it to capture different relationships at the same time. Instead of a single aspect of the tokens, as in single-head self-attention, each head focuses on a different aspect, providing several perspectives on the same sequence of input token embeddings.

For example: a group of three friends is looking at a picture. One admires the intensity and variety of colors, one admires the intricate structures and shapes, and the third focuses on the shadows and attention to detail. All three look at the same picture, but simultaneously pick up different aspects and perspectives of it.
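Mechanically, the model splits each token embedding into equal slices, one per head, runs attention on each slice in parallel, and concatenates the results. A sketch of just the splitting step, with a hypothetical 4-dimensional embedding:

```python
def split_into_heads(embedding, num_heads):
    """Split one token embedding into equal sub-vectors, one per head.
    Each head then runs its own attention over its slice in parallel,
    and the head outputs are concatenated back together afterwards."""
    d_model = len(embedding)
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    head_dim = d_model // num_heads
    return [embedding[i * head_dim:(i + 1) * head_dim]
            for i in range(num_heads)]

# A 4-dimensional embedding split across 2 heads of 2 dimensions each:
print(split_into_heads([0.1, 0.2, 0.3, 0.4], num_heads=2))
# -> [[0.1, 0.2], [0.3, 0.4]]
```

Because each head sees a different slice, each one is free to specialize, like the three friends noticing different things in the same picture.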

Temperature


Temperature refers to the degree of creativity an LLM can show in its output. It is effectively a measure of the randomness and unpredictability of the generated text. A lower temperature means the model will give a deterministic, predictable, and less creative output, whereas a higher temperature means the model will give a less deterministic, less predictable, and more creative output. However, as temperature increases, the tendency to produce accurate outputs decreases.
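Under the hood, temperature simply divides the model’s token scores (logits) before they go through softmax. A small sketch with made-up logits for three candidate tokens:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by temperature before softmax: low temperature sharpens
    the distribution (more deterministic), high temperature flattens it."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
print(softmax_with_temperature(logits, 0.2))  # peaked: the top token dominates
print(softmax_with_temperature(logits, 2.0))  # flatter: sampling is more random
```

At temperature 0.2 the top token takes almost all the probability mass, while at 2.0 the three tokens end up much closer together, which is exactly the deterministic-versus-creative trade-off described above.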

Knowledge Cutoff


Each LLM is trained on a particular dataset containing data available up to the date of training. The LLM has no record of events or data occurring after it has been trained. This is known as the Knowledge Cutoff: the LLM is unable to provide results based on recent real-world data and can only answer from the data on which it was trained.

For example: if I train a model X on data collected up to 31st December 2024, the model will not have any data from 1st January onwards, and will require agents or APIs to get real-world data and provide results based on it.


Written by

Hridayesh Kundu

EEE Undergrad at BITS-Pilani, K. K. Birla Goa Campus. Likes Software Development, and dealing with Electronics. Loves Deep Learning.