Understanding LLMs: A Simple Guide to Large Language Models


Hello, passionate learners from around the world ✌️
In 2023, ChatGPT from OpenAI reached 100 million users faster than any other product of the Web 2.0 era.
Source: Yahoo Finance
Since then, many capable models from Anthropic, Cohere, IBM, Google, Amazon, Meta AI, DeepSeek, and HuggingFace have come out, and many startups are entering the arena. It’s an interesting time to invest in our skillset.
Platforms like HuggingFace—the GitHub of AI—serve as open hubs where an entire ecosystem of researchers and developers collaborates to share, fine-tune, and deploy AI models across the spectrum from natural language processing to computer vision. The scale is striking: 1.4 million models already hosted, with new breakthroughs arriving weekly.
In this blog post, I will try to give an overview of the key components of Large Language Models (LLMs) at a high level, focusing on basic concepts, minimal math, and visual explanations to make complex ideas easy to understand.
Why This Actually Matters
Understanding model architecture isn't just academic. Fine-tuning models, interpreting model cards, and selecting the right model for specific tasks, such as today's popular agentic architectures, can mean the difference between breakthrough performance, costly failures, and even security vulnerabilities.
These models are reshaping how we work, learn, and create—right now. Whether you're an educator designing curriculum, a researcher, or simply curious about the technology transforming your daily life, invest in these fundamentals (I also put many resources at the end of the blog).
The technology feels like magic, so let’s explore it together! 🤗
The Road to Generative AI: Key Milestones
But first, let's start with a quick history of Artificial Intelligence. AI is a discipline with a long history and many real-world applications, with inspiring research and development breakthroughs along the way. While AI encompasses many approaches, this guide focuses specifically on the architecture that's changing everything: Transformers. The true inflection point came in 2017 with the publication of a paper titled "Attention Is All You Need." The work by Vaswani and colleagues would fundamentally transform AI capabilities and set the stage for today's generative revolution.
AI Language Modeling
Language models are fundamentally about understanding deep connections between words, concepts, and context—similar to how our own brains process language.
Imagine two friends chatting:
Person 1 (speaking):
"Last night, I was in the studio*, working on a* new track*, tweaking the* melody*, and then I realized I needed to* adjust my..."
At this moment, Person 1's own thought process is already being pulled toward a specific word before they even say it. Their mind is influenced by the words they just used—"studio," "track," "melody," and "adjust"—making "keyboard 🎹" feel like the most natural next word.
Person 2 (listening):
As Person 1 speaks, Person 2 is in thinking/listening mode, but what Person 2 expects depends on both Person 1’s words and their own mental associations. Person 2’s interpretation is influenced by Person 1’s context 🎹.
Just like in LLMs, similarity helps pull related concepts together—such as how "melody" and "track" reinforce the idea of music—while attention helps focus on the most relevant words, filtering out less important information to determine meaning.
The Secret Sauce of LLMs: Similarity + Attention
This human conversation mirrors how LLMs work:
Similarity creates connections between related concepts—just as "melody" and "track" naturally point toward music-related completions.
Attention helps filter out noise and focus on what matters most—determining which earlier words are most important for predicting what comes next.
Next-Word Prediction: The Core Task
Like the example above: at its heart, a Large Language Model has one fundamental job, "next token prediction."
These sophisticated systems learn patterns from massive datasets to predict the next token in a sequence. When you type "Which move in AlphaGo was surprising?" the model:
Processes your prompt
Calculates probabilities for every possible next token
Selects the most likely continuation (or samples from high-probability options)
Repeats until it reaches a natural stopping point
The process continues word by word until the model decides to end the sequence, producing something like: "The most surprising move was 37"
This simple mechanism—predicting one token at a time based on everything that came before—is the foundation for Large Language Models that can now write essays, code, stories, and even simulate conversations.
The sequence continues until the LLM decides to emit a special token such as |EOS| ("End of Sequence"), and the answer ends with "The most surprising move was 37".
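To make this loop concrete, here is a minimal Python sketch of greedy next-token generation. The toy_next_token_probs function and its tiny probability table are invented purely for illustration; a real LLM computes these distributions with billions of learned parameters.

```python
# A toy sketch of the next-token prediction loop (greedy decoding).
# The probability table below is invented for illustration only.

TOY_PROBS = {
    "The most":                     {"surprising": 0.7, "famous": 0.3},
    "The most surprising":          {"move": 0.9, "game": 0.1},
    "The most surprising move":     {"was": 0.8, "is": 0.2},
    "The most surprising move was": {"37": 0.6, "78": 0.3, "|EOS|": 0.1},
    "The most surprising move was 37": {"|EOS|": 1.0},
}

def toy_next_token_probs(context: str) -> dict:
    """Pretend 'model': returns a probability distribution over next tokens."""
    return TOY_PROBS.get(context, {"|EOS|": 1.0})

def generate(prompt: str, max_tokens: int = 10) -> str:
    context = prompt
    for _ in range(max_tokens):
        probs = toy_next_token_probs(context)
        next_token = max(probs, key=probs.get)  # greedy: pick the most likely token
        if next_token == "|EOS|":               # special end-of-sequence token
            break
        context = f"{context} {next_token}"
    return context

print(generate("The most"))
# -> "The most surprising move was 37"
```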
The complete flow illustrated:
The Journey to an LLM artifact
We can imagine these models as a compressed ZIP file of internet data. They contain millions or billions of parameter weights (floating-point numbers), which are adjusted and learned during training.
To achieve such behavior, we require high-quality data, substantial computational power, memory, and extensive GPU clusters. Training these models is costly and time-consuming, often taking months. Not many companies can afford the millions of dollars needed to train a model from scratch.
For example, Llama 3 from Meta AI was trained on a cluster of 24,576 GPUs for months, and Meta's Llama 4 is currently being trained on a cluster exceeding 100,000 NVIDIA H100 GPUs. The DeepSeek R1 model was trained on a smaller set of GPUs but uses an advanced training approach called Reinforcement Learning, which I want to explain in future blog posts. These huge computational requirements also raise sustainability concerns, one of the most important topics in training models. A very good session about GPU power consumption is available at the CCC.
Let's take a quick journey through these training steps.
Data preparation
Large Language Models are trained on internet data at massive scale: by large scale, I mean billions to trillions of tokens. In the upcoming sections I'll explain more about tokens. At the same time, we want high diversity and high-quality documents. One popular dataset is Common Crawl. Common Crawl, a non-profit organization, has been crawling the web since 2007, and its corpus contains around 2.7 billion web pages. If you are interested in a large-scale data pipeline and a cleaned-up dataset, look at the FineWeb project from HuggingFace.
Steps taken to produce the FineWeb dataset for LLM training
I don't want to go into the details of data engineering in this post, as it is about LLM concepts, but remember: models are trained on a large diversity of high-quality data. To see the full pipeline, visit FineWeb. It's also worth mentioning that you can explore some public datasets on atlas.nomic.ai and see the diversity of topics covered across domains. HuggingFace Datasets is another good source for discovering more datasets.
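If you want to peek at the data yourself, a sketch like the one below should work, assuming you have the Hugging Face datasets library installed and that FineWeb is published under the HuggingFaceFW/fineweb repository with a small sample configuration (check the dataset card for the exact names):

```python
# Sketch: stream a few FineWeb documents without downloading the full corpus.
# Assumes `pip install datasets`; adjust the config name if the dataset card differs.
from datasets import load_dataset

fineweb = load_dataset(
    "HuggingFaceFW/fineweb",
    name="sample-10BT",      # a small sample published alongside the full dumps
    split="train",
    streaming=True,          # iterate lazily instead of downloading everything
)

for i, doc in enumerate(fineweb):
    print(doc["text"][:200])  # each record contains cleaned web page text
    if i == 2:
        break
```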
From Base Models to Chat Assistants
Next, we train a model for next token prediction. These models are also called base models, and their names typically end with "Base", like Llama-3.1-405B-Base.
However, these base models do not behave like ChatGPT or instruction-tuned models (e.g., Llama-3.1-405B-Instruct) that we experience through web interfaces.
The base models are just the foundation: they can predict tokens incredibly well but lack the refined conversational abilities of the instruction-tuned versions that power consumer-facing AI assistants.
For example, if we prompt Llama-3.1-405B-Base with:
Prompt: "Which move in AlphaGo was surprising?"
we get the following response sequence:
“Is it possible to explain it?" The following is a question I posed to the AlphaGo team, as part of an academic project: Which move in AlphaGo was surprising? Is it possible to explain it? AlphaGo's moves are often surprising to human players, as they are based on a deep understanding of the game that is difficult for humans to replicate. One example of a surprising move made by AlphaGo…..”
The base model artifacts are produced during the most costly phase: pre-training.
BUT: this is not what we want from a model. Most of the time we're summarizing papers, translating sections, or generating content based on user questions or prompts.
Instruction Tuning
In order to get an assistant model like ChatGPT, we need to align our model to our downstream tasks, which is awesome. For example, we can align it to question answering, classification, translation, different persona roles the model can take, topic modeling, clustering, and so on. The list goes on throughout the NLP journey.
Think of it as follows: suppose the task is factual explanation; we then need to create thousands of instruction pairs. Taking thousands of these pairs, we can further fine-tune our base model and get an instruction-tuned model. Andrej Karpathy's explanations of this are very good.
These instruction pairs look like this (Factual Explanation):
Instruction: Explain how photosynthesis works in simple terms.
Response labeled: Photosynthesis is how plants make their food. They take sunlight, water, and carbon dioxide from the air and convert it into oxygen and glucose (sugar). This happens in tiny structures called chloroplasts, which contain chlorophyll—the green pigment that captures sunlight energy. Plants use the glucose for energy and to build structures like stems and leaves. The oxygen is released into the air for us to breathe.
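In code, such instruction data is often just a list of prompt/response records. The field names below are a hypothetical sketch, not the schema of any specific dataset:

```python
# Hypothetical sketch of an instruction-tuning record; real datasets use
# varying field names (e.g. "instruction"/"output", or chat-style "messages").
instruction_pair = {
    "instruction": "Explain how photosynthesis works in simple terms.",
    "response": (
        "Photosynthesis is how plants make their food. They take sunlight, "
        "water, and carbon dioxide and convert them into oxygen and glucose..."
    ),
}

# Thousands of such pairs are collected and used to fine-tune the base model.
instruction_dataset = [instruction_pair]  # ... plus many more examples
```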
Beyond Instruction Tuning
This data can be created by humans or through synthetic data generation. But the story doesn't end here—we need further improvements. Reinforcement Learning, or Reinforcement Learning from Human Feedback as in OpenAI's approach, makes the alignment better.
Reinforcement Learning
Reinforcement learning is an amazing field of artificial intelligence. We've heard in the news about breakthroughs from DeepSeek's pure RL approach. Let's illustrate RLHF or so-called Reinforcement Learning from Human Feedback simply.
Initially, an instruction-tuned model is trained to follow prompts, but it undergoes further fine-tuning through reinforcement learning. During this phase, models interact with prompts, learn from trial and error, and receive human feedback to align responses with user expectations. This iterative process helps LLMs improve accuracy, relevance, and coherence, making them more effective in real-world applications.
The Reward Model in RLHF
The reward model's job is surprisingly simple: it just assigns a numerical score to any response. For example, when the LLM generates multiple answers to "Explain climate change" the reward model might give a score of 8.7 to a clear, accurate explanation and 3.2 to a confusing or inaccurate one. These scores then guide the learning process—the LLM is adjusted to maximize these reward scores, essentially learning to produce responses that humans would rate highly.
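Conceptually, the reward model is just a function from a prompt and a response to a score. The sketch below fakes that function with hand-picked numbers, purely to show how scores steer the choice between candidate responses:

```python
# Toy sketch: a "reward model" is just a scoring function over responses.
# The scores here are invented; a real reward model is itself a neural network
# trained on human preference comparisons.

def toy_reward_model(prompt: str, response: str) -> float:
    fake_scores = {
        "a clear, accurate explanation of climate change": 8.7,
        "a confusing or inaccurate explanation": 3.2,
    }
    return fake_scores.get(response, 0.0)

prompt = "Explain climate change"
candidates = [
    "a clear, accurate explanation of climate change",
    "a confusing or inaccurate explanation",
]

# During RLHF, the LLM is updated to make high-reward responses more likely;
# here we simply pick the best-scoring candidate to illustrate the signal.
best = max(candidates, key=lambda r: toy_reward_model(prompt, r))
print(best)  # -> "a clear, accurate explanation of climate change"
```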
OK, let's go further. Until now we have understood, at a very high level, what AI language modeling is, what the training task is (next-token prediction), and how different models are created. Now let's look at the revolutionary idea of Attention.
Attention is all you need
In order to decode and process language in computers, we need a notion of:
Numbers - converting language into numbers, also called the embedding space
Similarity
Attention
Tokenizer: The First Gateway to LLMs
This is the first step whenever we interact with an LLM like ChatGPT, Claude or any LLM API. Imagine this as the LLM's Vocabulary. Every time we send a model a prompt, it first gets tokenized.
Why? Because we need a mapping from text to numerical representations that computers can process, and tokenization is the first step on the way 🛣️ Almost all model providers also base their pricing on input and output tokens.
Let's say you send ChatGPT the prompt "What is tokenization why we need this". The prompt gets broken into colored tokens as shown in the image. Importantly, tokens don't always align with complete words: "token" and "ization" are separated into different tokens.
You can visually explore tokenization processes using tools like the https://tiktokenizer.vercel.app/.
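You can also reproduce this locally with OpenAI's tiktoken library, for example (a minimal sketch, assuming pip install tiktoken; the exact IDs and token boundaries depend on the encoding you pick):

```python
# Sketch: tokenize a prompt with tiktoken (pip install tiktoken).
# Token boundaries and IDs depend on the chosen encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")      # encoding used by several OpenAI models
token_ids = enc.encode("What is tokenization why we need this")

print(token_ids)                                 # a list of integers
print([enc.decode([tid]) for tid in token_ids])  # the text piece behind each ID
```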
Why use subword and not word by word?
Language is complex and diverse, with new words constantly emerging across many languages. Many languages allow the creation of new words from existing ones (e.g., sunflower), and some languages, like Japanese, don't even use spaces (e.g., 今日はサーフィンに行きます). So our language models need to be generative and capable of capturing many patterns. Building a vocabulary with millions of whole words is not effective, and not even possible.
Tokenizers are algorithms that capture statistical properties of the large text corpora on which LLMs are pre-trained. There are different techniques for tokenization, like BPE (Byte Pair Encoding), WordPiece, and SentencePiece. I won't go into the details in this post, but assume that with tokenizers we get an intelligent vocabulary of subword tokens built from our corpus of data.
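To get a feel for BPE, here is a toy sketch of a single merge step: count the most frequent adjacent symbol pair in a tiny corpus and merge it into a new subword. Real tokenizers repeat this thousands of times over huge corpora.

```python
# Toy sketch of one Byte Pair Encoding (BPE) merge step.
from collections import Counter

# Words represented as sequences of symbols (characters to start with).
corpus = [list("lower"), list("lowest"), list("newer"), list("wider")]

# 1. Count adjacent symbol pairs across the corpus.
pair_counts = Counter()
for word in corpus:
    for a, b in zip(word, word[1:]):
        pair_counts[(a, b)] += 1

best_pair = pair_counts.most_common(1)[0][0]
print("most frequent pair:", best_pair)   # ('w', 'e') in this tiny corpus

# 2. Merge that pair into a single new symbol everywhere it occurs.
merged_symbol = "".join(best_pair)
new_corpus = []
for word in corpus:
    merged, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == best_pair:
            merged.append(merged_symbol)
            i += 2
        else:
            merged.append(word[i])
            i += 1
    new_corpus.append(merged)

print(new_corpus)   # 'we' now appears as a single subword symbol
```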
First numbers: from token IDs to token embedding vectors
Remember, the tokenizer creates our vocabulary and helps us map text to numerical representations.
In general, tokens can be anything with a natural ordered sequence: words, image patches, speech segments. In the example above, "What a wonderful world." is mapped to the numbers 4827, 261, 10469, 2375, and 13, the so-called token IDs. These IDs index into the model's embedding matrix, which maps each token to a fixed token embedding vector.
But we also need to keep track of position, because language is ordered: the model tracks the order of each token (via position IDs) during later processing. Imagine machine translation, where a word can take a different position in the output sequence.
From these IDs we get fixed vectors, the so-called token embedding vectors. These embedding vectors have a huge dimension: for example, the ibm-granite/granite-3.1-8b-instruct LLM uses an embedding size of 4096.
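In code, the embedding matrix is just a big lookup table with one row of learned numbers per vocabulary entry. A minimal numpy sketch, with a made-up vocabulary size and the token IDs from the example above:

```python
# Sketch: token IDs index rows of a learned embedding matrix.
import numpy as np

vocab_size = 50_000   # made-up vocabulary size for illustration
d_model    = 16       # toy embedding dimension (granite-3.1-8b-instruct uses 4096)

rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, d_model))  # values are learned during training

token_ids = [4827, 261, 10469, 2375, 13]        # "What a wonderful world."
token_embeddings = embedding_matrix[token_ids]   # one row per token ID

print(token_embeddings.shape)                    # (5, 16): one vector per token
```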
It's all about similarity?
OK: tokenizer, token IDs, position IDs... but what are these token embedding vectors?
We need them because, with the power of linear algebra, we can apply mathematical operations to them. Let's explore these concepts in two dimensions for visualization :)
Notion of similarity
In this embedding space, we can see how words or concepts are arranged based on their meaning. The angle between vectors tells us how similar they are: smaller angles mean greater similarity. This is measured using cosine similarity, which ranges from -1 (completely opposite) to 1 (identical). For example, the apple and orange vectors have a small angle between them, indicating high similarity, while the phone and the fruit vectors have a much larger angle, showing they're less related.
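Once concepts are vectors, cosine similarity is a one-liner. The 2D values below are invented just to mirror the picture: apple and orange point in similar directions, phone points elsewhere.

```python
# Sketch: cosine similarity between toy 2D embeddings (values are invented).
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

apple  = np.array([0.9, 0.8])
orange = np.array([0.8, 0.9])
phone  = np.array([0.9, -0.7])

print(cosine_similarity(apple, orange))  # close to 1: very similar
print(cosine_similarity(apple, phone))   # much lower: less related
```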
Now we have our embeddings and calculate similarity between the embeddings, are we done?
Unfortunately not. These token embedding vectors are not perfect; they must be learned and adjusted during training, because language is all about context.
The Context Challenge: When "Apple" Isn't Just a Fruit
Imagine these situations: how should the token embedding for “apple” be calculated?
The Problem: Finding the Right Embedding
The challenge is that we cannot assign a perfect place for every token in the latent space. Raw embeddings might capture some relationships, but they are often not well-aligned with real-world structures. To fix this, we apply linear transformations, which allow us to adjust the embedding space to better reflect similarities and relationships.
Linear Transformations
So, what are linear transformations? Think of them as matrix operations applied to vectors. These operations can:
Stretch the space to emphasize certain dimensions 📏
Rotate vectors to better align with meaningful directions 🔄
Shear data to adjust relationships between points 📐
Combine all these effects to create a better-structured space
Adjusting Embeddings and Choosing the Best Embedding?
Imagine we want to discover the optimal embedding space that captures the true relationships in our data. Let's explore this with a simple example:
Ahmet is an excellent basketball player 🏀—he is great at jumping, agility, and teamwork.
Sofia is a strong swimmer 🏊♂️—she excels in endurance and breathing control.
Looking at the three embedding spaces below, we can immediately see why Embedding 3 is better. It organizes both athletes in relation to their sports while capturing their shared identity as athletes. During training, the so-called Multi-Head Attention layer decides which embedding is best, or combines them.
Transformation Magic
If we decide Embedding 3 should be used, we apply a linear transformation with a matrix. The values of the matrix are the learnable parameters. We're performing a matrix-vector multiplication, which is calculated using multiple dot products.
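A linear transformation really is just a matrix-vector multiplication. In the sketch below, a 2x2 rotation matrix stands in for the learned parameters; during training, the model adjusts the matrix values themselves.

```python
# Sketch: a linear transformation is a matrix-vector multiplication.
# The rotation matrix here stands in for learned parameters.
import numpy as np

theta = np.deg2rad(45)                     # rotate the embedding space by 45 degrees
W = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

ahmet = np.array([1.0, 0.2])               # toy 2D embedding
transformed = W @ ahmet                    # each output entry is a dot product

print(transformed)
```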
This process mirrors how our own brains might reorganize concepts—shifting from thinking about "sports equipment" to thinking about "athletes and their specialties" when the context requires it. The difference is that our AI models must learn these transformations through millions of examples rather than through lived experience.
The beauty of this approach is that as the model encounters more data, these transformation matrices continuously refine, creating increasingly nuanced understanding of the relationships between concepts.
The Magic of Attention: Why Context Changes Everything
Until now we've explored similarity (cosine, dot-product) and how linear transformations can create better embeddings. But we're missing something crucial - Attention, the breakthrough that revolutionized AI language understanding.
Let’s take an example: journalist and microphone.
In an ideal world, these two should have a balanced connection in the embedding space, but in real-world training data, that’s not the case. A journalist strongly pulls "microphone", but "microphone" does not strongly pull "journalist".
Why Does This Asymmetry Exist?
Because in real-world data, "journalist" often appears with words like interview, report, article, media, and yes, microphone. But "microphone" has a much broader range—it appears with singers, podcasters, radio hosts, studio equipment, speakers, and many other unrelated concepts. So, when we ask:
"What does journalist relate to?" → Microphone is a strong association because journalists frequently use microphones.
"What does microphone relate to?" → Journalist is a weak association because a microphone is used by many professions, not just journalists.
Why a Single Linear Transformation Doesn't Work
If we apply only one transformation, we still get a symmetric pull, meaning the model would think that:
"Microphone" should influence "journalist" just as much as "journalist" influences "microphone."
This is incorrect because a microphone is just a tool, and many people use it beyond journalists.
The Fix: Two Linear Transformations
To properly capture this, we need two different transformations. Let's introduce the Key and the Query: the Key is the token that pulls, and the Query is the token being pulled. We apply different perspectives depending on whether "journalist" or "microphone" is acting as the key or the query.
Journalist (Key) – It strongly pulls "microphone" (Query) because it's an important tool for their work.
Microphone (Key) – It weakly pulls "journalist" because its use is much broader.
The Formula
We apply two different linear transformations to obtain Keys and Queries, then look at the angle between keys and queries. After that we can calculate the similarity via the dot product (giving the attention matrix).
Journalist (Key) – Microphone (Query): we want a large cosine similarity (strong pull).
Microphone (Key) – Journalist (Query): we want a small cosine similarity (weak pull).
Every value of these matrices is adjusted during training, so we get clearer embeddings.
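Here is a toy numpy sketch of that asymmetry: the same two embeddings, projected through two different matrices (made-up stand-ins for the learned W_K and W_Q), give different scores depending on which word plays the key and which plays the query.

```python
# Toy sketch: two different projections (W_K, W_Q) make attention asymmetric.
# All numbers are invented for illustration; in a real model they are learned.
import numpy as np

journalist = np.array([1.0, 0.0])
microphone = np.array([0.0, 1.0])

W_Q = np.array([[0.5, 0.0],
                [0.0, 0.5]])   # query projection
W_K = np.array([[0.0, 0.1],
                [0.9, 0.0]])   # key projection

def score(key_vec, query_vec):
    """Attention score = dot product of projected key and projected query."""
    return float((W_K @ key_vec) @ (W_Q @ query_vec))

print(score(journalist, microphone))  # 0.45 -> journalist (key) strongly pulls microphone
print(score(microphone, journalist))  # 0.05 -> microphone (key) weakly pulls journalist
```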
Understanding the Dot Product in Attention
The dot product is the mathematical operation that powers attention. In simple terms:
What it does: Measures how aligned two vectors are with each other.
How it works: Multiplies corresponding elements of two vectors and sums the results.
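In plain Python, that is all the dot product is:

```python
# The dot product: multiply corresponding elements, then sum.
query = [0.9, 0.1, 0.4]
key   = [0.8, 0.3, 0.5]

dot = sum(q * k for q, k in zip(query, key))
print(dot)   # 0.9*0.8 + 0.1*0.3 + 0.4*0.5 ≈ 0.95
```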
The Value
Finally, there is another component called the Value. Think of it as the actual audio content captured by the microphone—it carries the real meaning the journalist wants to process. After computing the similarity between queries and keys (the dot product of Q and K), these attention scores are used to weight the Values (V). This means that:
If a key strongly matches a query, its corresponding value is given more importance.
If a key weakly matches a query, its value contributes less to the final output.
Recap: we extract the Query, Key, and Values from a token embedding using these trained matrices, and produce a more contextualized token embedding with the same dimension (a minimal code sketch follows the recap below).
Token embeddings transform words into number vectors, creating a mathematical language.
Linear transformations are the key mathematical operations that create the three different perspectives:
Each embedding is multiplied by three different matrices to create Query, Key, and Value representations of the same token
This is how one word can have multiple "views" or "roles" in the attention process
Query perspective (Q matrix transformation): "What am I looking for in other words?"
Key perspective (K matrix transformation): "What aspect of me might others find relevant?"
Value perspective (V matrix transformation): "What information should I contribute if matched?"
Same input, three views: The word "apple" starts as one embedding but is transformed into:
A Query vector (searching for relevant information)
A Key vector (advertising what it contains)
A Value vector (the actual information to be used)
Dot products between queries and keys measure relationship strength, creating the attention map.
Context-sensitive understanding: These transformations allow the model to interpret "apple" differently when it appears near "iPhone" versus "orchard."
Asymmetric relationships are naturally modeled because each token has these three distinct roles.
Multi-head attention applies multiple sets of these transformations in parallel, capturing different relationship types simultaneously.
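Putting the recap together, here is a minimal numpy sketch of single-head scaled dot-product attention, softmax(QK^T / sqrt(d)) V, with random matrices standing in for the learned weights:

```python
# Minimal sketch of single-head scaled dot-product attention.
# Random matrices stand in for the learned W_Q, W_K, W_V parameters.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 4, 8, 8          # 4 tokens, toy dimensions

X   = rng.normal(size=(seq_len, d_model))   # token embeddings (one row per token)
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V

scores = Q @ K.T / np.sqrt(d_head)          # how much each token attends to each other token
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row

contextualized = weights @ V                # weighted mix of value vectors
print(contextualized.shape)                 # (4, 8): same shape, now context-aware
```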
Multi-Head-Attention (Linear)
One more point: as we saw, we need to combine the best embeddings, and this is done via Multi-Head Attention and its Linear layers (shown below, from the original paper). Imagine it as an intelligent brain that combines the best token embeddings based on context: many brains calculating embeddings in parallel, which are then chosen, combined, and weighted based on context.
Multiple attention mechanisms in parallel: Each "head" learns to focus on different aspects of language.
The Linear transformations:
Lower Linear layers: Project input embeddings into different "perspective spaces" - one might focus on syntax, another on semantics, another on entity relationships.
Upper Linear layer: Combines these multiple perspectives into a unified representation.
Scaled Dot-Product Attention: Each head calculates its own attention pattern based on its specialized Query, Key, Value projections.
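A compact sketch of the multi-head idea: run several small attention heads in parallel on the same input, then concatenate their outputs and mix them with one final linear layer. The dimensions and random weights below are toy values, not those of any particular model.

```python
# Toy sketch of multi-head attention: several heads in parallel, then a
# final linear layer that combines their outputs. All weights are random.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads                 # 4 dimensions per head

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

X = rng.normal(size=(seq_len, d_model))

head_outputs = []
for _ in range(n_heads):
    W_Q = rng.normal(size=(d_model, d_head))
    W_K = rng.normal(size=(d_model, d_head))
    W_V = rng.normal(size=(d_model, d_head))
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    attn = softmax(Q @ K.T / np.sqrt(d_head))   # each head learns its own attention pattern
    head_outputs.append(attn @ V)

W_O = rng.normal(size=(d_model, d_model))       # the "upper" linear layer
combined = np.concatenate(head_outputs, axis=-1) @ W_O

print(combined.shape)                           # (4, 8): unified representation
```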
Are we done with predicting the next token?
Until now, we have explored the attention mechanism. To predict the next token, the contextualized token embeddings pass through a multi-layer perceptron neural network (MLP) or a feedforward neural network (FFNN).
Unlike self-attention, which connects and applies attention to tokens, this process handles each token position separately. As the information flows through this sequence, the model refines its understanding of the relationships and meanings within the text. At this layer, the model generalizes the learned concepts. This is also where most of the model’s parameters reside.
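A sketch of that feedforward step: the same two-layer MLP is applied to every token position independently (toy dimensions here; granite-3.1-8b uses 4096 → 12800 → 4096).

```python
# Sketch: the position-wise feedforward network (MLP) applied to each token.
# The same weights are reused at every position; dimensions are toy values.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_hidden = 4, 8, 32       # real models: e.g. 4096 -> 12800 -> 4096

X  = rng.normal(size=(seq_len, d_model))    # contextualized token embeddings
W1 = rng.normal(size=(d_model, d_hidden))
W2 = rng.normal(size=(d_hidden, d_model))

hidden = np.maximum(0, X @ W1)              # expand + non-linearity (ReLU here)
output = hidden @ W2                        # project back to d_model, per token

print(output.shape)                         # (4, 8): one refined vector per token
```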
Reading a model card
Some model parameters from ibm-granite/granite-3.1-8b-instruct
| Model | 8b Dense | Explanation |
| --- | --- | --- |
| Embedding size | 4096 | Dimension of each token embedding that flows through the network |
| Number of layers | 40 | 40 Transformer blocks |
| Attention head size | 128 | Each attention head has 128 dimensions (4096 = 32 × 128) |
| Number of attention heads | 32 | 32 heads in the attention layer |
| Number of KV heads | 8 | Key-Value projection pairs shared across multiple attention heads |
| MLP hidden size | 12800 | Hidden layer size of the MLP/FFN |
| Sequence length (context window) | 128k | Maximum number of tokens processed at a time |
| # Parameters | 8.1B | Total parameters |
| # Training tokens | 12T | 12 trillion training tokens |
Embedding Size of 4096 and Number of Layers 40
Number of attention heads 32 and Number of Key/Value heads 8
Feedforward Neural Network
Conclusion
We've journeyed through the inner workings of Large Language Models, uncovering the elegant concepts that enable machines to understand and generate human language. Through our exploration, we learned:
The core training objective is surprisingly simple: predict the next token
Embeddings
Attention mechanism
Multi-head attention
Transformer architecture core components
Resources
There is a lot more to cover; for a more advanced deep dive I can suggest the following resources.
https://www.youtube.com/watch?v=RFdb2rKAqFw
https://www.youtube.com/watch?v=7xTGNNLPyMI
AI Academy which provides very good insights
Hands-On Large Language Models: Language Understanding and Generation
Written by

Osman Recai Ödemis
I'm Osman Recai Ödemis, working as a consultant. I'm interested in the fields of Software Engineering, IoT, and AI. Nice to meet you :)