The importance of LLM Evals

Edward Tian
5 min read

We all know that LLMs thrive in unstructured environments: they consume large amounts of unstructured text and produce more unstructured text in response to the prompts you provide. Of course, I have written articles in the past about some of the guard rails you can put around LLMs so that their outputs are more consistent and more structured.

However, when you upgrade, fine-tune, or otherwise change an LLM, you need a small, repeatable harness that can tell you, quickly and deterministically, whether the new weights behave at least as well as the previous ones, especially if the LLM is used in production. In the data science community these harnesses are called evals, short for evaluation suites. An evaluation suite is nothing more than a set of ground-truth examples (both input and expected output), a driver that feeds prompts to the model, and a metric that converts model output into a score. Below is the distilled shape of such a harness, illustrated with excerpts simplified from some of my own code.

Every eval starts with samples: ground-truth examples that contain both the inputs used to formulate a prompt and the expected output. For an LLM task that extracts structured data from unstructured text, a sample would be a JSON document with an input field containing the unstructured text and an output field holding a nested JSON of the expected extracted fields and values.
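For instance, a single sample for such an extraction task might look like the following (the field names and content are purely illustrative):

{
  "input": "The park covers roughly ten square miles and first opened to visitors in 1972.",
  "output": {
    "area": "10 sq mi",
    "year_opened": "1972"
  }
}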

Evals also need a metric. The metric owns the ground-truth data, the model outputs, and the logic that turns the two into a number:

from abc import ABC, abstractmethod

class BaseMetric(ABC):
    """Holds the ground truth, the model outputs, and the scoring logic."""

    def __init__(self, truth: dict, outputs: dict, prompt: str | None = None):
        self.truth = truth              # expected output for this sample
        self.outputs = outputs          # model outputs, keyed by run name
        self.prompt = prompt or ""      # the prompt the model was given
        self.prediction: dict = {}      # the run currently being scored

    @abstractmethod
    async def calculate(self) -> dict:
        """Turn self.truth and self.prediction into a score."""
        ...

    async def evaluate(self) -> dict:
        # Score each run's output independently and collect the results.
        results = {}
        for run, prediction in self.outputs.items():
            if not isinstance(prediction, dict):
                raise ValueError(f"Output format is incorrect for run {run!r}")
            self.prediction = prediction
            results[run] = await self.calculate()
        return results

The calculate method can do a simple exact match, call conventional text metrics, or even invoke a second LLM that acts as a judge. Because the interface is asynchronous, you can batch or parallelise at the call site without changing the metric’s public surface. Ultimately, calculate needs to return some kind of result for your metric (typically a number that represents performance). Different metrics have different uses. For example, you can devise a simple metric that checks that every extracted field is actually derived from the unstructured text; this tests whether the model is hallucinating, essentially making it a unit test. On the other hand, you can write a metric that walks each JSON field and asks an LLM to judge whether the extracted value is conceptually similar to the expected one (perhaps the expected field is “10 sq mi” but the extracted field is “10 square miles”), providing a rubric for the LLM to grade these differences according to your use case. This latter example is closer to a traditional metric that can give insight into the LLM’s behaviour.
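As a concrete illustration, here is a minimal sketch of the first kind of metric, a hallucination check. It assumes the predicted values are simple strings (or can be stringified) and that self.prompt carries the raw unstructured text; the class name and scoring scheme are illustrative rather than prescriptive:

class GroundednessMetric(BaseMetric):
    """Passes only the fields whose extracted value literally appears in the source text."""

    async def calculate(self) -> dict:
        source = self.prompt.lower()
        grounded = {
            field: str(value).lower() in source
            for field, value in self.prediction.items()
        }
        return {
            "per_field": grounded,
            "score": sum(grounded.values()) / len(grounded) if grounded else 0.0,
        }

A check like this is deliberately strict: “ten square miles” in the source would not match an extracted “10 sq mi”, which is exactly the gap the LLM-as-judge variant is meant to cover.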

Then, you’ll need a driver that takes every sample you have, generates prompts from the inputs, connects to whatever model you’re evaluating, collects its outputs, and then runs them through whatever metrics you’ve developed:

import asyncio, json
from pathlib import Path

async def evaluate_sample(sample_path: Path, metric_cls):
    # Load one sample file and score its model outputs with the given metric.
    sample = json.loads(sample_path.read_text())
    truth, outputs, prompt = sample["truth"], sample["outputs"], sample["prompt"]
    metric = metric_cls(truth=truth, outputs=outputs, prompt=prompt)
    return await metric.evaluate()

async def _evaluate_all(coros):
    # Run every sample's evaluation concurrently inside one event loop.
    return await asyncio.gather(*coros)

def run_eval(data_dir: Path, metric_cls):
    sample_paths = [p for p in sorted(data_dir.glob("*.json")) if p.name != "results.json"]
    coros = {p.name: evaluate_sample(p, metric_cls) for p in sample_paths}
    results = asyncio.run(_evaluate_all(coros.values()))
    (data_dir / "results.json").write_text(json.dumps(dict(zip(coros.keys(), results)), indent=2))

With roughly fifty lines of code, we have a complete eval loop: read sample, prompt model, grade output, aggregate scores, persist results. Because each component is decoupled, you can swap the metric for a new one, add more samples, or change the model endpoint without touching the rest of the system.
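Putting it together, a hypothetical invocation might be as simple as this (the directory name and metric are the illustrative ones from above):

from pathlib import Path

run_eval(Path("eval_data"), GroundednessMetric)

After the run, eval_data/results.json holds one entry per sample, keyed by file name, with a result per model run.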

The larger your cohort of samples, the more trust you can build with the end users of your LLM products. A bigger suite also makes life easier when new LLMs are released, because you get a clearer view of how well those models perform on the same tasks. Persistent, self-contained evals make model development sustainable. They catch silent regressions before a model reaches production, quantify gains from fine-tuning runs, and establish crisp acceptance criteria between research and engineering teams. When wired into continuous integration, the suite becomes an executable specification of expected behaviour: if you upgrade a dependency or tweak the prompting strategy, the pull request will fail until the model is at least as good as yesterday’s.
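As a rough sketch of what that CI gate could look like, assuming the metric reports a score field as in the example above and that the acceptance threshold is something you pick for your own task:

import json
from pathlib import Path

def test_model_has_not_regressed():
    # Illustrative threshold; in practice you might pin it to the previous model's scores.
    baseline = 0.95
    results = json.loads(Path("eval_data/results.json").read_text())
    for sample_name, runs in results.items():
        for run, result in runs.items():
            assert result["score"] >= baseline, f"{sample_name}/{run} scored below baseline"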

This eval pattern is simple yet powerful. Define a metric interface, write a handful of helper functions, build a thin driver, and commit the whole thing to version control. The result is a living benchmark that travels with your codebase and scales with your ambitions. However, in my experience, the toughest part of the whole process is not the code that needs to be written but the samples that need to be collected and the ground truth that needs to be established. That may require you to reach out to domain experts or your end users to determine what exactly counts as ground truth in your scenario.
