The big picture on LLMs and generative AI


Introduction

LLMs, or large language models, are a specific type of machine learning model. They are built on the principles of deep learning and use complex neural networks to process and generate human-like text.

OpenAI’s Latest ChatGPT Model

OpenAI's newest ChatGPT models are o1 and o1 Pro mode.

System 1 vs. System 2 Thinking in AI

System 1 thinking, or "thinking fast" as psychologist Daniel Kahneman called it, refers to a model pretrained on millions of chess moves or on petabytes of internet-scale text, as LLMs are. Its job is to mimic patterns at rapid speed. The answers come from what's been fed into it, and it's not expected to develop something new.

A System 2 model, on the other hand, does not pick knee-jerk responses from a database; it runs a search or simulation across a wide range of potential future scenarios, scores them by how likely they are to lead to a good outcome, and chooses the one with the highest score.

What is Generative AI?

Generative AI is AI that generates new, original content rather than finding or classifying existing content. LLMs are a type of generative AI that can communicate using normal human language.

Evolution of GPT Models

GPT stands for Generative Pre-trained Transformer.

NanoGPT, a small GPT-style model with around 85K parameters, can be described as a "context-based token predictor." OpenAI introduced the GPT family in 2018, with notable members such as GPT-2, GPT-3, and GPT-3.5 Turbo, the last of which is the foundation of ChatGPT.

How Do LLMs Work?

ChatGPT is an advanced chatbot built on an architecture called the transformer. An LLM is an artificial neural network: basically a bunch of numbers, or parameters, connected to each other, similar to how our brain is a bunch of neurons, or brain cells, connected to each other.

Neural networks only deal with numbers. You send in some numbers, and depending on how the parameters are set, other numbers come out. Any kind of content, such as text or images, can be represented as numbers.
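
As a rough illustration of that last point, here is a tiny character-level encoding in Python. The vocabulary and mapping are made up for this sketch; real LLMs like ChatGPT use learned subword tokenizers, but the principle of turning text into numbers is the same.

    # Minimal sketch: turning text into numbers a model can consume.
    text = "hello"

    # Build a tiny character-level vocabulary from the text itself.
    vocab = sorted(set(text))
    char_to_id = {ch: i for i, ch in enumerate(vocab)}
    id_to_char = {i: ch for ch, i in char_to_id.items()}

    token_ids = [char_to_id[ch] for ch in text]
    print(token_ids)                                   # [1, 0, 2, 2, 3]
    print("".join(id_to_char[i] for i in token_ids))   # "hello"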

Key Concepts in Transformer Architecture

Importance of Normalization

Normalization is an important step in the training of deep neural networks, and it helps improve the stability of the model during training.
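
As a sketch of what that means in practice, here is layer normalization (the variant used in transformers) applied to a single vector of activations in NumPy. The input values, epsilon, and the scale and shift parameters are illustrative; in a real model, gamma and beta are learned.

    import numpy as np

    def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
        # Normalize to zero mean and unit variance, then apply a
        # learned scale (gamma) and shift (beta).
        mean = x.mean()
        var = x.var()
        x_hat = (x - mean) / np.sqrt(var + eps)
        return gamma * x_hat + beta

    x = np.array([2.0, -1.0, 0.5, 3.0])
    print(layer_norm(x))   # zero-mean, unit-variance version of x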

The Self-Attention Mechanism

The self-attention layer is perhaps the heart of the transformer and the GPT. It’s the phase where the columns in the input embedding matrix "talk" to each other.

The self-attention layer is made up of several heads. In self-attention, each column wants to find relevant information in the other columns and extract their values. It does so by comparing its query vector to the keys of those other columns, with the added restriction that it can only look into the past.
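
A minimal single-head version of this in NumPy might look like the following. The dimensions, random weights, and the softmax helper are illustrative assumptions; a real GPT uses multiple heads with learned projection matrices inside a much larger network.

    import numpy as np

    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    def causal_self_attention(X, Wq, Wk, Wv):
        # X has shape (T, d): T token columns, each a d-dimensional embedding.
        Q, K, V = X @ Wq, X @ Wk, X @ Wv         # project into queries, keys, values
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)          # how well each query matches each key
        T = X.shape[0]
        future = np.triu(np.ones((T, T), dtype=bool), k=1)
        scores[future] = -np.inf                 # causal mask: no looking at the future
        weights = softmax(scores)                # one probability distribution per column
        return weights @ V                       # weighted sum of the value vectors

    rng = np.random.default_rng(0)
    T, d = 4, 8                                  # toy sizes: 4 tokens, 8-dim embeddings
    X = rng.normal(size=(T, d))
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
    print(causal_self_attention(X, Wq, Wk, Wv).shape)   # (4, 8)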

Multi-Layer Perceptron (MLP)

In the context of Natural Language Processing (NLP), the lower layers might learn grammar, syntax, and simple word associations, while the higher layers might capture more complex semantic relationships, discourse structures, and context-dependent meaning.
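
Structurally, the MLP block that follows attention in each transformer layer is just two linear transforms with a nonlinearity in between. A rough sketch is below; the 4x expansion factor and GELU activation follow common GPT practice, and the random weights are placeholders for what would normally be learned.

    import numpy as np

    def gelu(x):
        # Tanh approximation of the GELU activation used in GPT-style MLPs.
        return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

    def mlp_block(x, W1, b1, W2, b2):
        # Position-wise feed-forward block: expand, apply a nonlinearity, project back.
        return gelu(x @ W1 + b1) @ W2 + b2

    d = 8
    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(d, 4 * d)), np.zeros(4 * d)   # expand to 4x embedding size
    W2, b2 = rng.normal(size=(4 * d, d)), np.zeros(d)        # project back down
    x = rng.normal(size=(1, d))
    print(mlp_block(x, W1, b1, W2, b2).shape)                # (1, 8)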

Softmax: The Probability Distribution Function

The Purpose of Softmax

The softmax operation is used as a part of self-attention. Its goal is to take a vector and normalize its values so that they sum up to 1.0. However, it’s not as simple as dividing by the sum. Instead, each input value is first exponentiated:

a_i = exp(x_i)

This has the effect of making all values positive. Once we have a vector of exponentiated values, we can then divide each value by the sum of all the values. This will ensure that the sum of the values is 1.0. Since all the exponentiated values are positive, the resulting values will be between 0.0 and 1.0, providing a probability distribution over the original values.
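
In code, that procedure looks like this minimal NumPy sketch (the input values are arbitrary):

    import numpy as np

    def naive_softmax(x):
        # Directly follows the description above: exponentiate, then divide by the sum.
        e = np.exp(x)
        return e / e.sum()

    x = np.array([2.0, 1.0, 0.1])
    print(naive_softmax(x))        # about [0.659, 0.242, 0.099], which sums to 1.0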

Numerical Stability in Softmax

However, there’s a slight complication. If any of the input values are quite large, then the exponentiated values will be very large. We will end up dividing a large number by a very large number, which can cause issues with floating-point arithmetic.

One useful property of the softmax operation is that if we add a constant to all the input values, the result remains the same. So, we can find the largest value in the input vector and subtract it from all the values. This ensures that the largest value is 0.0, keeping the softmax numerically stable.
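
Here is the same sketch with the max-subtraction trick applied; the large input values are chosen just to show where the naive version would overflow.

    import numpy as np

    def stable_softmax(x):
        # Subtract the maximum before exponentiating: the output is unchanged,
        # but the largest exponent becomes exp(0) = 1, so nothing overflows.
        e = np.exp(x - x.max())
        return e / e.sum()

    big = np.array([1000.0, 999.0, 998.0])
    # np.exp(1000.0) overflows to inf, so a naive softmax breaks on this input.
    print(stable_softmax(big))     # about [0.665, 0.245, 0.090]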

Why is it Called "Softmax"?

The "hard" version of this operation, called avgmax, simply finds the maximum value, sets it to 1.0, and assigns 0.0 to all other values.

In contrast, the softmax operation serves as a "softer" version of that. Due to the exponentiation involved in softmax, the largest value is emphasized and pushed towards 1.0, while still maintaining a probability distribution over all input values. This allows for a more nuanced representation that captures not only the most likely option but also the relative likelihood of other options.
