"Transformer"s made easy! - Part 1

Introduction
Transformers are the architectural backbone of large language models (LLMs) like GPT and Mistral.
A transformer is a deep learning architecture designed to process text efficiently using a technique called "self-attention", in which each part of the input works out how it relates to the rest of the text or speech.
The above image is from the paper that introduced the transformer architecture, "Attention Is All You Need" [1]. It revolutionized NLP by replacing recurrent networks (RNNs/LSTMs) with self-attention and enabling parallel processing.
Don't worry if you do not understand the building blocks yet - by the end of this series I hope to have unravelled the mystery of transformers and their relationship with LLMs.
Key Steps of how Transformers power LLMs
Self Attention
When we query or prompt an LLM, it effectively ignores stop words and focuses on the key words of the query/prompt. Self-attention is the mechanism that lets transformer-based models do this.
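Here is a minimal sketch of the idea behind scaled dot-product self-attention. This is toy NumPy code with random embeddings and no learned query/key/value projections or multiple heads like a real transformer has; it only shows the core trick: every token's vector gets mixed with every other token's vector, weighted by how related they are.

import numpy as np

def self_attention(X):
    # X: (seq_len, d) matrix of token embeddings.
    # Returns one context-aware vector per token, built by letting
    # every token "look at" every other token.
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                              # similarity of each token with every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax -> attention weights per token
    return weights @ X                                         # weighted mix of all token vectors

# Toy embeddings for the tokens of "what is the cost of apple"
tokens = ["what", "is", "the", "cost", "of", "apple"]
X = np.random.rand(len(tokens), 8)
print(self_attention(X).shape)   # (6, 8): one enriched vector per token, all computed at once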
Parallel Processing
To see why this matters, let's look at how NLP models such as Recurrent Neural Networks (RNNs) handled tasks like sentiment analysis before the transformer architecture arrived.
RNN architecture. Image taken from [2]
Let's take the query "What is the cost of apple?"
x1 = "what", x2 = "is", x3 = "the", and so on.
An RNN is a sequential architecture: it processes one word at a time and passes the resulting hidden state on to the next step. Because nothing can be computed in parallel, training is slow and GPU-hungry, and this was a key obstacle to NLP adoption at scale.
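To make the bottleneck concrete, here is a toy sketch (illustrative random weights and embeddings, not a trained model) of how an RNN has to walk through the query one token at a time, each step waiting on the previous hidden state:

import numpy as np

# A toy RNN cell: each word must wait for the hidden state produced by the previous word.
def rnn_step(h_prev, x_t, W_h, W_x):
    return np.tanh(W_h @ h_prev + W_x @ x_t)

tokens = ["what", "is", "the", "cost", "of", "apple"]
d = 8
rng = np.random.default_rng(0)
W_h, W_x = rng.normal(size=(d, d)), rng.normal(size=(d, d))
embeddings = rng.normal(size=(len(tokens), d))   # x1, x2, ... one vector per word

h = np.zeros(d)
for x_t in embeddings:           # strictly one word at a time -- no parallelism possible here
    h = rnn_step(h, x_t, W_h, W_x)
print(h.shape)                    # final hidden state after reading the whole query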
Enter the world of Transformers.
A transformer captures the context of many words simultaneously, which makes parallel processing possible and brings an enormous efficiency gain.
Multiple Layer Stacking
Stacking deep layers of the network lets the model learn the relationships between words and the multiple contexts in which they appear, much like a human brain builds up the context of each word.
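As a rough sketch of what "stacking" means in code, PyTorch lets you pile identical encoder layers on top of each other. The sizes below are arbitrary and chosen only for illustration:

import torch
import torch.nn as nn

# Stack several identical transformer encoder layers; each layer refines
# the token representations produced by the layer below it.
layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=6)   # 6 stacked layers

x = torch.rand(1, 6, 64)         # (batch, sequence of 6 tokens, embedding size 64)
out = encoder(x)
print(out.shape)                  # torch.Size([1, 6, 64]) -- same shape, richer context per token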
Pre-Training and Fine-Tuning
Pre-training has already been done for us: companies like Meta and OpenAI train models on massive, largely unlabelled corpora, which is computationally very expensive.
What we can do is bring a small labelled corpus and fine-tune these large models for a particular domain or task, such as code generation or chatbots. This is far cheaper and needs significantly less compute.
At their core, LLMs are sequence predictors: BERT (which Google integrated into Search) predicts a missing word in a sentence, while in models like DeepSeek and Gemini the predicted sequence can span multimodal context.
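A quick way to see both flavours of sequence prediction is the Hugging Face transformers library. The model names and sentences below are just illustrative choices, not tied to any particular product:

from transformers import pipeline

# BERT-style: predict a masked word inside the sentence
fill = pipeline("fill-mask", model="bert-base-uncased")
print(fill("Transformers are the [MASK] of large language models.")[0]["token_str"])

# GPT-style: predict the next tokens that continue the prompt
gen = pipeline("text-generation", model="gpt2")
print(gen("Transformers are a deep learning architecture that", max_new_tokens=15)[0]["generated_text"])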
Once a child understands and speaks a language, they go to college to become an expert in healthcare, engineering, and so on. Parents and schools spend 12-15 years making the child capable of understanding the world; a degree college then spends 4 years turning the same kid into an expert.
Similarly, LLMs are language experts - but how do they become domain experts? Fine-tuning is what turns a language expert into a domain expert.
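As a hedged sketch of what fine-tuning looks like in practice - here using Hugging Face's Trainer, with a placeholder model and the IMDB sentiment dataset purely as an example - we take a pre-trained model and train it briefly on a small labelled corpus:

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Start from a pre-trained language model and adapt it to one narrow task
# (here: binary sentiment classification on a small labelled corpus).
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

dataset = load_dataset("imdb")                      # labelled corpus used as an example
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")
tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(output_dir="finetuned-model",
                         num_train_epochs=1,
                         per_device_train_batch_size=8)
trainer = Trainer(model=model,
                  args=args,
                  train_dataset=tokenized["train"].shuffle(seed=42).select(range(1000)))
trainer.train()                                     # far cheaper than pre-training from scratch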
You must have heard that releases like LLaMA and GPT-3.5 have billions of parameters.
Parameters are the model's weights - the information about words and their relationships - learned during the training that builds the LLM.
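If you are curious what "billions of parameters" means concretely, you can count them yourself. DistilBERT is used here only as a small example model; LLaMA-class models are orders of magnitude larger:

from transformers import AutoModel

model = AutoModel.from_pretrained("distilbert-base-uncased")
total = sum(p.numel() for p in model.parameters())   # each parameter is one learned weight
print(f"{total:,} parameters")                        # roughly 66 million for DistilBERT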
Let's use an LLM in practice
import os
import requests

# Groq's OpenAI-compatible chat completions endpoint; the API key is read from the environment
GROQ_API_URL = "https://api.groq.com/openai/v1/chat/completions"
GROQ_API_KEY = os.environ["GROQ_API_KEY"]

# Define the prompt for LinkedIn post generation
post_prompt = (
    "Write an engaging LinkedIn post explaining Transformers in AI. "
    "Keep it concise, informative, and beginner-friendly. Use simple language and add a call-to-action."
)

# Create the request payload
payload = {
    "model": "llama3-8b-8192",  # You can also try "llama3-70b-8192" for a larger model
    "messages": [
        {"role": "system", "content": "You are an AI assistant that writes engaging LinkedIn posts on AI topics."},
        {"role": "user", "content": post_prompt}
    ],
    "temperature": 0.7,  # Higher value for more creative output
    "max_tokens": 300    # Limit the response length
}

# Set headers
headers = {
    "Authorization": f"Bearer {GROQ_API_KEY}",
    "Content-Type": "application/json"
}

# Make the API request
response = requests.post(GROQ_API_URL, json=payload, headers=headers)

# Extract and print the generated LinkedIn post
if response.status_code == 200:
    linkedin_post = response.json()["choices"][0]["message"]["content"]
    print("\n🔹 **Generated LinkedIn Post:**\n")
    print(linkedin_post)
else:
    print("Error:", response.json())
This is a pre-trained LLaMA model which, when prompted to generate a LinkedIn post explaining transformers, does the job without any fine-tuning.
Code generation?
import requests

# Define the prompt for code generation
code_prompt = "Write a Python function that calculates the Fibonacci sequence up to n numbers."

# Create the request payload (GROQ_API_URL and GROQ_API_KEY are the same as in the previous example)
payload = {
    "model": "llama3-8b-8192",  # You can also try "llama3-70b-8192" if supported
    "messages": [
        {"role": "system", "content": "You are an AI assistant that generates Python code."},
        {"role": "user", "content": code_prompt}
    ],
    "temperature": 0.2,  # Lower temperature for more deterministic output
    "max_tokens": 200    # Limit the response length
}

# Set headers
headers = {
    "Authorization": f"Bearer {GROQ_API_KEY}",
    "Content-Type": "application/json"
}

# Make the API request
response = requests.post(GROQ_API_URL, json=payload, headers=headers)

# Extract and print the generated code
if response.status_code == 200:
    generated_code = response.json()["choices"][0]["message"]["content"]
    print("\n🔹 **Generated Python Code:**")
    print(generated_code)
else:
    print("Error:", response.json())
LLaMA does the job again without having to fine-tune it.
In the next part of this series we will dig deeper into the self-attention mechanism of transformers.