Chapter 3: Evaluation Methodology

🤖 How to Evaluate AI Systems in the Age of Foundation Models
“Evals are surprisingly often all you need.” — Greg Brockman, OpenAI
AI has evolved rapidly from narrow, task-specific systems to powerful, general-purpose foundation models. But as models become more capable, evaluating them becomes harder, and more important. Chapter 3 of AI Engineering dives deep into the methodologies we can use to evaluate modern AI systems. Here's a distilled, actionable guide to the most important takeaways.
🎯 Why AI Evaluation Matters
Without reliable evaluation, building with AI is like flying blind. You risk shipping unreliable, harmful, or low-quality systems that could erode user trust or even cause real-world harm.
Evaluation serves two main purposes:
Mitigating Risk: Helps you identify potential failure points and design safeguards.
Discovering Opportunity: Enables you to find new capabilities the model may have developed.
The strongest AI systems today are not only powerful—they’re also unpredictable. This makes understanding their weaknesses (and strengths) an engineering necessity, not a luxury.
🚧 The Challenges of Evaluating Foundation Models
1. Open-endedness
Foundation models are capable of many tasks, many of which they were never explicitly trained on. Traditional evaluation—comparing outputs to ground truth—often fails when there’s no clear "correct" answer.
2. Black-box Models
Developers often don’t have access to the internals of models, especially if they're using proprietary systems via API. This opacity limits interpretability and debuggability.
3. Benchmark Saturation
Once a model achieves perfect scores on a benchmark, that benchmark loses value. But the model might still perform poorly in the real world or on different kinds of data.
📏 Core Evaluation Metrics for Language Models
Understanding the foundational metrics helps us interpret downstream performance and detect failure modes early.
🔹 Entropy
Entropy measures how much information, on average, each token carries. Higher entropy means more unpredictability, and more bits are needed to represent each token.
🔹 Cross-Entropy
Cross-entropy tells us how difficult it is for the model to predict the next token. Lower cross-entropy indicates a better predictive model.
🔹 Perplexity
Perplexity is the exponential of cross-entropy. It measures the uncertainty the model has in predicting the next token.
More structured data → lower perplexity.
Bigger vocabularies → higher perplexity.
Longer context → lower perplexity.
Cross-entropy and perplexity are not only theoretical—they help detect issues like data contamination, where a model performs suspiciously well on benchmark data it may have seen during training.
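To make the relationship between the two concrete, here is a minimal sketch, assuming you already have the per-token probabilities a model assigned to a held-out sequence; the probabilities below are made up purely for illustration.

```python
import math

# Hypothetical per-token probabilities a model assigned to each token
# of a held-out sequence (made-up numbers for illustration).
token_probs = [0.50, 0.10, 0.80, 0.25, 0.05]

# Cross-entropy: average negative log-probability per token (in nats here;
# use log base 2 if you prefer bits per token).
cross_entropy = -sum(math.log(p) for p in token_probs) / len(token_probs)

# Perplexity is the exponential of cross-entropy: roughly, the effective
# number of equally likely next-token choices the model is hedging between.
perplexity = math.exp(cross_entropy)

print(f"cross-entropy: {cross_entropy:.3f} nats/token")
print(f"perplexity:    {perplexity:.2f}")
```

A suspiciously low perplexity on a public benchmark, relative to comparable held-out text, is one signal that the benchmark data may have leaked into training.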
Exact Evaluation Methods
✅ Functional Correctness
This is the gold standard: Does the system do what it’s supposed to do?
Used in code-generation benchmarks like HumanEval and MBPP.
Binary pass/fail scoring.
Great for tasks with well-defined outputs (a minimal pass/fail sketch follows this list).
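To illustrate the pass/fail idea, here is a minimal sketch in the spirit of HumanEval-style scoring: run model-generated code against a handful of unit tests and record a binary result. The `generated_code` string and test cases are made up for illustration, and real benchmark harnesses also sandbox execution, which is omitted here.

```python
# Minimal functional-correctness check: execute generated code, then run unit tests.
# (Hypothetical example; real harnesses sandbox untrusted code before executing it.)
generated_code = """
def add(a, b):
    return a + b
"""

test_cases = [((1, 2), 3), ((-1, 1), 0), ((0, 0), 0)]

def passes_all_tests(code: str, tests) -> bool:
    namespace = {}
    try:
        exec(code, namespace)          # load the candidate solution
        fn = namespace["add"]          # hypothetical entry point named by the task
        return all(fn(*args) == expected for args, expected in tests)
    except Exception:
        return False                   # any error counts as a failure

print("pass" if passes_all_tests(generated_code, test_cases) else "fail")
```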
✅ Similarity to Reference Data
When functional correctness isn’t possible (e.g., summarization, translation), compare output to human or AI-generated references.
Exact Match: Does the output match the reference exactly?
Lexical Similarity: How close the words and structure are.
Semantic Similarity: How close the meaning is—measured using embeddings.
Note: Bad reference data degrades evaluation quality. In translation tasks, poor-quality reference translations are a common source of misleading scores.
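As a rough sketch of the three similarity notions, the snippet below uses exact string match, token-overlap F1 as a simple stand-in for lexical metrics like BLEU or ROUGE, and cosine similarity of embeddings for semantic similarity. The `embed` argument is a placeholder: in practice you would plug in whatever text-embedding model you use.

```python
from collections import Counter

def exact_match(output: str, reference: str) -> bool:
    # Strictest check: the output must match the reference verbatim.
    return output.strip() == reference.strip()

def lexical_f1(output: str, reference: str) -> float:
    # Token-overlap F1: a simple stand-in for lexical metrics such as BLEU/ROUGE.
    out, ref = output.split(), reference.split()
    overlap = sum((Counter(out) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(out), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def cosine(u, v) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda x: sum(a * a for a in x) ** 0.5
    return dot / (norm(u) * norm(v))

def semantic_similarity(output: str, reference: str, embed) -> float:
    # `embed` is a placeholder for any text-embedding model of your choice.
    return cosine(embed(output), embed(reference))
```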
🧑‍⚖️ The Rise of AI as a Judge
As foundation models have grown more complex, so have the tools we use to evaluate them. Enter AI judges—using AI to evaluate other AI systems.
Why Use AI as a Judge?
Fast, cheap, and scalable.
Can evaluate on any criteria: correctness, bias, tone, hallucinations, etc.
Can explain their judgments—something many metrics can’t.
How It Works
You can prompt an AI judge in three main ways:
Evaluate a response alone.
Compare a response to a reference.
Compare two responses and pick the better one (used in preference modeling).
A well-designed judge prompt includes:
A clear task (e.g., evaluate helpfulness).
Evaluation criteria (e.g., factual correctness).
A scoring rubric (e.g., 1–5 or good/bad).
AI judges tend to be more reliable when using discrete categories (e.g., 1-5 stars) instead of continuous scoring (e.g., 0.0 to 1.0).
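Here is a minimal sketch of a judge prompt and call, assuming an OpenAI-compatible chat API; the model name, criteria, and rubric are placeholders you would tune for your own application.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are evaluating an AI assistant's answer.

Task: judge how helpful the answer is for the user's question.
Criteria: factual correctness, relevance, and clarity.
Rubric: respond with a single integer score from 1 (useless) to 5 (excellent),
followed by a one-sentence justification.

Question: {question}
Answer: {answer}
"""

def judge(question: str, answer: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,        # reduces (but does not eliminate) judgment variance
    )
    return response.choices[0].message.content
```

Note how the rubric asks for a discrete 1–5 score rather than a continuous value, in line with the reliability observation above.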
🧠 Limitations of AI Judges
Inconsistency: They’re still probabilistic systems.
Cost and latency: Can introduce delays and extra compute.
Prompt ambiguity: Vague criteria yield poor evaluations.
Bias: Judges may favor certain styles (e.g., longer answers).
Tip: Weaker models can still be useful judges to save cost. Many teams use AI judges as lightweight guardrails to filter outputs before showing them to users.
🛠️ Specialized Judges
Three types of purpose-built AI judges:
Reward Models: Score how "good" a response is based on a (prompt, response) pair. Crucial in RLHF pipelines (a sketch of this interface follows the list).
Reference-based Judges: Compare a response to reference answers.
Preference Models: Take two responses and pick the preferred one. Used to align models with human preferences.
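As a sketch of the reward-model interface referenced above, the snippet below uses a Hugging Face sequence-classification head to score a (prompt, response) pair. The checkpoint name is a placeholder, and it assumes a single-logit reward head; real RLHF reward models are trained specifically for this scoring task.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "your-org/your-reward-model"  # placeholder checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

def reward_score(prompt: str, response: str) -> float:
    # A reward model maps a (prompt, response) pair to a single scalar score.
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits   # shape: (1, num_labels); assumes num_labels == 1
    return logits[0, 0].item()
```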
🥇 Comparative Evaluation: Ranking Models Side by Side
Instead of scoring models in isolation, comparative evaluation directly compares model responses.
Each matchup is called a “match.”
Win rate: the frequency with which one model is preferred over another.
Rankings are derived from these matches using rating algorithms such as Elo-style ratings (a minimal sketch follows below).
Comparative evaluation is often more useful for subjective tasks, where it’s easier to say “this one is better” than to give a precise score.
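As a minimal sketch of turning pairwise matches into rankings, here is an Elo-style update over a list of (winner, loser) results; the K-factor and starting rating are conventional defaults chosen for illustration, not values from the book.

```python
from collections import defaultdict

K = 32            # conventional Elo K-factor
START = 1000.0    # starting rating for every model

def elo_rankings(matches):
    """matches: list of (winner, loser) model-name pairs."""
    ratings = defaultdict(lambda: START)
    for winner, loser in matches:
        # Expected probability that the winner would win, given current ratings.
        expected_win = 1 / (1 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
        ratings[winner] += K * (1 - expected_win)
        ratings[loser] -= K * (1 - expected_win)
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)

matches = [("model-a", "model-b"), ("model-a", "model-c"), ("model-b", "model-c")]
for name, rating in elo_rankings(matches):
    print(f"{name}: {rating:.0f}")
```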
🔢 Challenges
Scalability: The number of possible pairings grows quadratically with the number of models.
Lack of standardization: What’s considered "better" varies by user.
Evaluation noise: Too many trivial prompts or low-effort votes can pollute the data.
🔮 Final Thoughts: The Future of Evaluation
As models continue to improve, exact evaluation becomes harder, not easier. And with increasing model complexity, evaluation itself must evolve.
Takeaways:
Human-in-the-loop remains critical for high-stakes applications.
Exact methods (functional correctness, reference similarity) are still important.
AI as a judge is an increasingly viable tool—but must be used wisely.
Comparative evaluation helps us align models with human preferences—especially for open-ended, subjective tasks.
Evaluation is no longer just a finishing step—it’s part of the AI development lifecycle.
As the field matures, we’ll need better tools, clearer metrics, and thoughtful combinations of human and AI oversight. Getting evaluation right is key to making AI not just powerful, but trustworthy.