GenAI-intro


Generative AI (GenAI) refers to a class of artificial intelligence models capable of generating human-like content, such as text, images, audio, or code. One prominent example is GPT (Generative Pre-trained Transformer), developed by OpenAI. GPT is a large language model trained on massive amounts of text data to understand and generate natural language. It uses a transformer-based architecture to predict and produce coherent, contextually relevant text based on the input it receives. GPT powers applications like chatbots, content creation tools, and coding assistants, making it a cornerstone of modern GenAI systems.

Tokenization

Tokenization is the process of breaking down text into smaller units called tokens, which can be words, subwords, or characters. This step is essential in NLP to help models understand and process language. For example, the sentence:
"ChatGPT is smart!"
might be tokenized as:
["Chat", "GPT", "is", "smart", "!"]
or using subword tokenization:
["Chat", "G", "PT", "is", "smart", "!"]
Tokenization enables models like GPT to convert text into a numerical format for efficient processing and learning.
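If you want to see this in practice, the snippet below is a small sketch using the tiktoken library (assuming it is installed); the exact split it produces may differ from the hand-written example above.

```python
# A minimal sketch of BPE tokenization, assuming the tiktoken package is installed.
import tiktoken

enc = tiktoken.get_encoding("gpt2")          # GPT-2 byte pair encoding
token_ids = enc.encode("ChatGPT is smart!")  # text -> list of integer token IDs

# Decode each ID back to its text piece to see the actual split,
# which may differ from the hand-written example above.
pieces = [enc.decode([tid]) for tid in token_ids]
print(token_ids)
print(pieces)
```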

Vector embedding

Vector embedding is the process of converting words, sentences, or documents into numerical vectors that capture their semantic meaning. These vectors live in a high-dimensional space (often projected down to 2D or 3D for visualization), where similar meanings are placed closer together. For example, the words "king" and "queen" would have embeddings close to each other, reflecting their related meanings, while "apple" and "car" would be farther apart. This spatial representation allows AI models to understand and compare the meaning of language based on distance and direction between vectors.
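As a toy illustration, the snippet below measures closeness with cosine similarity; the 3-dimensional vectors are invented for the example, whereas real embeddings have hundreds or thousands of dimensions.

```python
# Toy example: the vectors below are invented for illustration only;
# real models produce embeddings with hundreds or thousands of dimensions.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

king  = np.array([0.8, 0.6, 0.1])
queen = np.array([0.7, 0.7, 0.2])
car   = np.array([0.1, 0.2, 0.9])

print(cosine_similarity(king, queen))  # relatively high: related meanings
print(cosine_similarity(king, car))    # lower: unrelated meanings
```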

Positional Encoding

Positional encoding is a technique used in transformer models (like GPT) to inject information about the order of words in a sequence, since transformers don't process data sequentially like RNNs. It assigns each word a unique position-based vector, which is added to its word embedding. This helps the model understand the relative and absolute position of tokens in a sentence. For example, in the sentence "I love AI", positional encoding ensures that the model knows "I" comes before "love" and "love" before "AI", preserving the sentence structure and meaning.

The most common types of positional encodings are:

  1. Sinusoidal Positional Encodings (used in the original Transformer): Fixed (non-learned) vectors built from sine and cosine functions of the token position; a small sketch follows this list

  2. Learned Positional Encodings (used in BERT and GPT): Vectors are learned during training

  3. Rotary Positional Encodings (RoPE, used in Llama models): Rotates the query and key vectors by position-dependent angles instead of adding a separate position vector

  4. Relative Positional Encodings (used in T5): Based on distances between tokens rather than absolute positions

  5. Attention with Linear Bias (ALiBi, used in models such as MPT): A bias term added to attention scores based on token distances
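Here is the small sketch of the sinusoidal scheme from item 1, following the formulas from the original Transformer paper; the sequence length and model dimension are arbitrary example values.

```python
# A minimal sketch of sinusoidal positional encoding as described in
# "Attention Is All You Need":
#   PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
#   PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    positions = np.arange(seq_len)[:, None]               # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]               # (1, d_model/2)
    angles = positions / np.power(10000, dims / d_model)   # (seq_len, d_model/2)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe

# Each token's embedding gets its row of this matrix added to it.
print(sinusoidal_positional_encoding(seq_len=4, d_model=8).shape)  # (4, 8)
```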

Self-attention mechanism

Self-attention is a core mechanism in transformer models that allows each word in a sequence to focus on other relevant words when generating its representation. It helps the model understand context by assigning weights to different words based on their importance.

For example, in the sentence "The cat sat on the mat because it was tired," self-attention helps the model understand that "it" refers to "cat", not "mat." Each word attends to others, enabling the model to capture dependencies regardless of their position in the sentence. This makes self-attention powerful for tasks like translation, summarization, and text generation.
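Here is a minimal sketch of the scaled dot-product attention behind this, using random toy matrices in place of the learned query, key and value projections.

```python
# A minimal sketch of scaled dot-product self-attention with toy numbers.
# In a real transformer, Q, K and V come from learned linear projections
# of the token embeddings.
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)    # how much each token attends to every other token
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted mix of value vectors

rng = np.random.default_rng(0)
seq_len, d_k = 5, 8                     # e.g. 5 tokens, 8-dimensional vectors
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_k))

print(self_attention(Q, K, V).shape)    # (5, 8): one context-aware vector per token
```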

Softmax

Softmax is a mathematical function used to convert a vector of raw scores (logits) into probabilities that sum to 1. It highlights the most important values while suppressing the others, making it useful in classification tasks and attention mechanisms.

For example, given scores [2.0, 1.0, 0.1], softmax transforms them into probabilities of approximately [0.66, 0.24, 0.10]. The highest score gets the highest probability. In transformers, softmax is used in the attention mechanism to determine how much focus to give to each word in the input.
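The numbers above can be reproduced in a few lines (the input scores are the ones from the example):

```python
# Softmax turns raw scores into probabilities that sum to 1.
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())   # subtract the max for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # approximately [0.66, 0.24, 0.10]
```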

Multi-head attention

Multi-head attention is an extension of the self-attention mechanism used in transformers. Instead of computing attention once, the model runs multiple attention operations (heads) in parallel. Each head learns to focus on different parts of the input, capturing various types of relationships and features.

For example, in the sentence "The cat sat on the mat," one head might focus on subject-verb relations ("cat" and "sat"), while another might focus on spatial context ("on" and "mat"). These multiple heads' outputs are then combined, giving the model a richer understanding of context and relationships in the sequence.
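Below is a rough sketch of the shape bookkeeping behind multi-head attention; the projection matrices are random stand-ins for the learned parameters, and a real implementation would also apply a final output projection after concatenation.

```python
# A rough sketch of multi-head attention shapes: the model dimension is split
# across several heads, attention runs independently in each head, and the
# results are concatenated. Weights here are random stand-ins for learned ones.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, n_heads):
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    rng = np.random.default_rng(42)
    outputs = []
    for _ in range(n_heads):
        # Each head has its own (random stand-in) projection matrices.
        Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        weights = softmax(Q @ K.T / np.sqrt(d_head))
        outputs.append(weights @ V)               # (seq_len, d_head)
    # A real implementation applies one more learned output projection here.
    return np.concatenate(outputs, axis=-1)       # (seq_len, d_model)

X = np.random.default_rng(0).normal(size=(6, 32))  # 6 tokens, d_model = 32
print(multi_head_attention(X, n_heads=4).shape)    # (6, 32)
```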

Temperature

Temperature is a parameter used in language models like GPT to control the randomness of predictions during text generation. It adjusts the softmax distribution of output probabilities:

  • A low temperature (e.g., 0.2) makes the model more confident and deterministic, favoring high-probability words.

  • A high temperature (e.g., above 1.0) makes the model more creative and random, allowing less likely words to be chosen. A temperature of exactly 1.0 leaves the softmax distribution unchanged.

For example, with high temperature, the model might say:
"The moon sings to the forest."
Whereas with low temperature, it might say:
"The moon shines at night."

It’s a useful tool to balance between accuracy and creativity.
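In terms of softmax, temperature simply divides the logits before they are turned into probabilities. The logits below are made up for illustration:

```python
# Temperature divides the logits before softmax: low T sharpens the
# distribution, high T flattens it. The logits below are made up.
import numpy as np

def softmax_with_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
    scaled = logits / temperature
    e = np.exp(scaled - scaled.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])             # raw scores for three candidate tokens
print(softmax_with_temperature(logits, 0.2))   # sharp: nearly all mass on the top token
print(softmax_with_temperature(logits, 1.0))   # the unmodified softmax distribution
print(softmax_with_temperature(logits, 2.0))   # flatter: unlikely tokens get more chance
```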

Knowledge cutoff

Knowledge cutoff refers to the most recent point in time up to which a language model, like GPT, has been trained on data. The model does not know about events, facts, or updates that occurred after this date.

For example, if a model’s knowledge cutoff is June 2024, it won’t be aware of anything that happened after that, like new technologies, political events, or product releases post-June 2024. This limitation is important to consider when asking the model for up-to-date or real-time information.

Vocab size

Vocab size refers to the total number of unique tokens (words, subwords, or characters) that a language model like GPT can recognize and process. It defines the token-to-ID mapping used during training and inference.

For example, GPT-3 has a vocabulary size of around 50,000 tokens using byte pair encoding (BPE). This includes not just common words like "cat" or "computer," but also parts of words, punctuation, and special symbols. A larger vocabulary size allows better handling of diverse languages and rare words, but also increases model complexity and memory usage.
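If the tiktoken package is installed, the size of the GPT-2/GPT-3 byte pair encoding can be checked directly:

```python
# A small check of vocabulary size, assuming the tiktoken package is installed.
import tiktoken

enc = tiktoken.get_encoding("gpt2")
print(enc.n_vocab)   # 50257 for the GPT-2/GPT-3 byte pair encoding
```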

Transformers

Transformers are a type of deep learning architecture introduced in the paper “Attention Is All You Need” (2017) by Vaswani et al. They revolutionized natural language processing by using self-attention mechanisms instead of traditional recurrent layers (like in RNNs).

Transformers process entire input sequences in parallel, allowing them to model long-range dependencies more efficiently. The architecture consists of encoders and decoders built from layers of self-attention and feedforward networks. Models like GPT, BERT, and T5 are all based on the transformer architecture, powering tasks such as text generation, translation, and summarization.
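As a minimal sketch, PyTorch ships ready-made building blocks for this architecture (assuming torch is installed); the hyperparameters below match the base configuration from the paper.

```python
# A minimal sketch of a transformer encoder stack using PyTorch's built-in
# modules (assumes torch is installed). d_model=512, 8 heads and 6 layers
# match the base configuration from "Attention Is All You Need".
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048)
encoder = nn.TransformerEncoder(layer, num_layers=6)

tokens = torch.randn(10, 1, 512)   # (sequence length, batch size, d_model)
contextual = encoder(tokens)       # each position now mixes information from all others
print(contextual.shape)            # torch.Size([10, 1, 512])
```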

Reference: Vaswani et al., “Attention Is All You Need”, NeurIPS 2017: https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
