Chapter 2: Understanding Foundation Models

Architecture, Training, and Challenges
Foundation models, like ChatGPT, Claude, and Gemini, are powering a new generation of intelligent applications. But what makes these models tick? How are they built, trained, and refined? And why do they sometimes hallucinate or behave unpredictably?
In this article, we'll dive deep into the core concepts behind foundation models, drawing from Chapter 2 of AI Engineering—an essential guide for anyone looking to build or work with large-scale AI systems.
🧠 What Is a Foundation Model?
A foundation model is a large machine learning model trained on massive amounts of data to generalize across many downstream tasks. These models are trained using a two-step process:
Pre-training: Teaches the model language and reasoning patterns from large internet-scale datasets. At this stage the model may not yet be safe or aligned with human expectations.
Post-training: Aligns the model with human preferences using curated data and reinforcement learning techniques.
📚 It All Starts with Data
“An AI model is only as good as the data it was trained on.”
Training data plays a foundational role. Common sources include:
Common Crawl (a massive public web archive)
C4 (Google's cleaned version of Common Crawl)
Curated Reddit links (used by OpenAI to source training data for GPT models)
Three golden goals for training data:
Quantity: The more training tokens (words or subwords) the model sees, the more it has to learn from (see the short tokenization sketch after this list).
Quality: Clean, factual, non-toxic content.
Diversity: Coverage of different domains, styles, and topics.
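To make "tokens" concrete, here is a minimal sketch that counts subword tokens with the tiktoken library. The choice of tokenizer is an assumption for illustration; the chapter does not prescribe one.

```python
# A minimal tokenization sketch using tiktoken (an assumed choice;
# any subword tokenizer illustrates the same idea).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by several OpenAI models
text = "Foundation models are trained on trillions of tokens."
tokens = enc.encode(text)

print(len(tokens))         # number of subword tokens, not words
print(enc.decode(tokens))  # decoding recovers the original text
```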
🏗️ Modeling: Architecture and Scale
Before training starts, engineers must make two key decisions:
1. Architecture
The dominant design for language models today is the Transformer, introduced in 2017. It replaced RNN-based seq2seq models due to these benefits:
Parallel processing of input tokens
Attention mechanism to weigh the importance of each token
Better handling of long sequences
A typical transformer block includes:
Attention module: Computes relevance between tokens using query, key, and value matrices.
MLP module: A feedforward neural network that captures complex patterns using non-linearities (like ReLU or GELU).
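To make the block structure concrete, here is a minimal single-head sketch in PyTorch. It is an illustrative simplification, not the architecture used by production models, which add multi-head attention, causal masking, dropout, and more.

```python
# A minimal, single-head transformer block sketch (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyTransformerBlock(nn.Module):
    def __init__(self, d_model: int = 64, d_ff: int = 256):
        super().__init__()
        # Attention module: query, key, and value projections
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        # MLP module: feedforward network with a non-linearity (GELU here)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        h = self.norm1(x)
        q, k, v = self.q_proj(h), self.k_proj(h), self.v_proj(h)
        scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)  # token-to-token relevance
        attn = F.softmax(scores, dim=-1)
        x = x + self.out_proj(attn @ v)   # residual connection around attention
        x = x + self.mlp(self.norm2(x))   # residual connection around the MLP
        return x

block = TinyTransformerBlock()
tokens = torch.randn(1, 10, 64)    # a batch of 10 token embeddings
print(block(tokens).shape)         # torch.Size([1, 10, 64])
```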
2. Scale
Foundation models are often described using three numbers:
Number of parameters (e.g., LLaMA-13B = 13 billion parameters)
Number of training tokens (e.g., 60B for a 3B parameter model)
FLOPs (Floating Point Operations used for training)
The scaling law tells us how to balance model size and data:
For compute-optimal training (the Chinchilla scaling law), the number of training tokens should be roughly 20x the number of parameters.
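A quick back-of-the-envelope sketch of that rule, using the commonly cited approximation that training compute is about 6 × parameters × tokens (the numbers here are purely illustrative):

```python
# Rough sketch of the ~20x rule and a common training-FLOPs approximation.
def compute_optimal_budget(n_params: float, tokens_per_param: float = 20.0):
    n_tokens = tokens_per_param * n_params   # e.g., 3B params -> ~60B tokens
    train_flops = 6 * n_params * n_tokens    # rough estimate of training compute
    return n_tokens, train_flops

tokens, flops = compute_optimal_budget(3e9)           # a 3B-parameter model
print(f"{tokens:.2e} tokens, ~{flops:.2e} FLOPs")     # 6.00e+10 tokens, ~1.08e+21 FLOPs
```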
🔧 Post-Training: Aligning with Humans
Pre-trained models often reflect internet biases or misunderstand prompts. Post-training fixes this through:
Supervised Finetuning (SFT): Teaches the model how to respond in a conversation using high-quality example (prompt, response) pairs.
Preference Finetuning: Aligns the model to human preferences using techniques like:
RLHF (Reinforcement Learning from Human Feedback)
DPO (Direct Preference Optimization), sketched at the end of this section
RLAIF (Reinforcement Learning from AI Feedback)
Think of it like taming a wild animal:
Pre-training = untamed beast
SFT = social training
Preference finetuning = giving it a smiley face
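As a rough illustration of preference finetuning, here is a minimal NumPy sketch of the DPO objective mentioned above. The inputs are summed log-probabilities of full responses under the model being trained and under a frozen reference model; the variable names and numbers are made up for the example.

```python
# A minimal DPO loss sketch (illustrative, not a training recipe).
import numpy as np

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # How much more the policy prefers each response than the reference model does
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    return -np.log(1.0 / (1.0 + np.exp(-logits)))   # -log(sigmoid(logits))

# Toy numbers: the policy gives the preferred answer a larger log-prob margin
print(dpo_loss(-10.0, -12.0, -11.0, -11.5, beta=0.1))
```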
🎲 Sampling: How Models Generate Output
Unlike traditional software, AI models are probabilistic, not deterministic. They sample from possible outputs, making them:
Flexible and creative for open-ended tasks
Unpredictable and inconsistent for sensitive ones
Sampling Strategies:
Greedy sampling: Always pick the most probable next token.
Temperature: Controls randomness. Higher values flatten the distribution for more varied output; lower values make it more predictable.
Top-k: Sample only from the k highest-probability tokens.
Top-p (nucleus sampling): Sample from the smallest set of top tokens whose cumulative probability reaches p.
Beam search: Keep and extend the several most promising partial sequences instead of committing to just one.
To improve output quality, test-time compute techniques generate multiple responses and select the best one.
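As a rough illustration of the strategies above, here is a small NumPy sketch of how temperature, top-k, and top-p reshape a toy next-token distribution before sampling. The vocabulary and scores are made up for the example.

```python
# Toy sketch of temperature, top-k, and top-p sampling (NumPy only).
import numpy as np

rng = np.random.default_rng(0)
vocab = ["cat", "dog", "car", "tree", "run"]
logits = np.array([2.0, 1.5, 0.3, 0.1, -1.0])   # hypothetical model scores

def sample_next(logits, temperature=1.0, top_k=None, top_p=None):
    scaled = logits / temperature                    # higher temperature flattens the distribution
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                  # token indices sorted by probability
    if top_k is not None:
        order = order[:top_k]                        # keep only the k most probable tokens
    if top_p is not None:
        cumulative = np.cumsum(probs[order])
        cutoff = int(np.searchsorted(cumulative, top_p)) + 1
        order = order[:cutoff]                       # smallest prefix reaching cumulative prob p
    kept = probs[order] / probs[order].sum()         # renormalize over the kept tokens
    return vocab[rng.choice(order, p=kept)]

print(sample_next(logits, temperature=0.7, top_k=3))
print(sample_next(logits, temperature=1.2, top_p=0.9))
```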
🧾 Structured Outputs
In real-world applications, structured outputs (like JSON or SQL) are crucial.
Methods to ensure structure:
Prompting: Ask the model to reply in a format.
Post-processing: Clean up or repair outputs after generation, for example by extracting the JSON from a verbose reply (a small sketch follows this list).
Constrained sampling: Restrict generation so that only tokens keeping the output valid can be produced.
Finetuning: Train the model specifically on structured formats.
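Here is a minimal sketch of the "prompting plus post-processing" combination: ask for JSON, then validate and repair the reply. The generate() function is a placeholder standing in for whatever model call you use.

```python
# Prompt for JSON, then validate/repair the reply (illustrative sketch).
import json
import re

PROMPT = 'Extract the name and age from the text. Reply with JSON only, e.g. {"name": "...", "age": 0}.'

def generate(prompt: str) -> str:
    # Placeholder: a model reply that wraps the JSON in extra prose,
    # which is a common failure mode.
    return 'Sure! Here is the result: {"name": "Ada", "age": 36}'

def parse_structured(reply: str) -> dict | None:
    match = re.search(r"\{.*\}", reply, re.DOTALL)   # post-processing: pull out the JSON object
    if not match:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None                                  # caller can retry or fall back

print(parse_structured(generate(PROMPT)))            # {'name': 'Ada', 'age': 36}
```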
⚠️ The Twin Challenges: Inconsistency and Hallucination
Inconsistency:
The same prompt can produce different outputs due to sampling randomness. Solutions include:
Caching responses
Setting fixed generation parameters, such as temperature 0 and a fixed random seed (a small sketch follows this list)
Using memory or prompt engineering
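A minimal sketch of two of these mitigations, fixed generation parameters and caching responses by prompt. The call_model() function is a placeholder for a real API call, and the parameter names are assumptions for illustration.

```python
# Cache responses and pin generation parameters (illustrative sketch).
import hashlib

FIXED_PARAMS = {"temperature": 0.0, "seed": 42}   # deterministic-as-possible settings
_cache: dict[str, str] = {}

def call_model(prompt: str, **params) -> str:
    return f"(model answer for {prompt!r} with {params})"   # stand-in for an API call

def cached_generate(prompt: str) -> str:
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_model(prompt, **FIXED_PARAMS)
    return _cache[key]   # identical prompts always return the identical cached reply

print(cached_generate("What is a foundation model?"))
print(cached_generate("What is a foundation model?"))   # served from cache
```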
Hallucination:
A model “makes things up” because it can’t always distinguish between real and generated data. Fixes include:
Reward models that penalize incorrect outputs
Verification layers that ask models to cite sources (see the sketch after this list)
Factual finetuning with real-world data
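One very simple verification-layer idea, sketched below: ask the model to quote its supporting source, then check that the quote actually appears in the retrieved context. The names and the check itself are illustrative assumptions, not a production-grade grounding system.

```python
# Toy grounding check: accept an answer only if its supporting quote
# is found verbatim in the provided context.
CONTEXT = "The Transformer architecture was introduced in 2017."

def is_grounded(answer_quote: str, context: str) -> bool:
    return answer_quote.strip().lower() in context.lower()

model_quote = "introduced in 2017"        # quote the model claims supports its answer
print(is_grounded(model_quote, CONTEXT))  # True -> keep; False -> flag or regenerate
```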
📌 Final Thoughts
Understanding how foundation models are built, trained, and refined is crucial for any AI engineer, researcher, or developer. Whether you're working on chatbots, enterprise automation, or creative tools, your success depends on:
Choosing the right architecture
Scaling with the right data
Applying post-training for safety
Sampling smartly for quality
Guarding against hallucinations
The era of foundation models is just beginning. Knowing how they work lets you build with confidence—and innovate responsibly.