Why Evals are Important in AI Development

Jon Dixon
8 min read

Introduction

In AI development, it is critical to evaluate an LLM's performance against test cases that reflect what the model is supposed to do. These evaluations, commonly called "Evals", help you assess whether your model behaves the way it should. In this post, I’ll explain why Evals are essential, show how they apply in Oracle APEX environments, and provide examples you can adapt to your projects.

My Aha Moment

I recently came across a Y Combinator video on YouTube that discussed Meta-Prompting. While Meta-Prompting is an interesting topic in its own right, it was something else they said that made me stop and think. They were showcasing a Prompt from an AI Startup that focuses on Customer Service and highlighted the fact that the company does not consider their AI Prompts to be their primary Intellectual Property (IP). This surprised me because you would have thought that prompts would be the most important asset for a business that has essentially built a wrapper on ChatGPT.

Instead, this startup considers the thousands of Evals (Test Cases) it has developed, based on its deep domain knowledge of Customer Service, to be its primary IP.

Why Create Evals?

Measure Model Quality and Progress

Evals give objective metrics to track:

  • Accuracy, fluency, coherence, or truthfulness

  • Improvements across model iterations

  • New feature emergence (e.g., tool use, reasoning)

💡
Without evals, you can’t tell if the latest model is better than the last one.

Align Model Behavior with Product or User Goals

Well-designed evals ensure the model performs as expected for:

  • Specific tasks (e.g., summarization, classification, document recognition)

  • Business KPIs (e.g., ticket deflection, content moderation accuracy)

  • User satisfaction and trust

💡
Evals bridge the gap between AI’s potential and actual product value.

Identify Weaknesses, Gaps, and Safety Issues

Evals uncover:

  • Hallucinations, bias, toxicity, and overconfidence

  • Weak performance on edge cases or minority groups

  • Failure modes in real-world or adversarial conditions

💡
If you don’t measure, you don’t know if there's anything to fix.

Compare Across Models, Versions, and Configurations

Evals allow rigorous A/B testing of:

  • Different AI platforms and the different models they offer

  • Prompt templates, temperature settings, or tool-using agents

💡
If you can prove that you get just as good an answer using gpt-4.1-nano as you do using gpt-4.1-mini, then you can save a lot of money, and your responses will be faster.
💡
If I tweak my prompt a certain way, can I gain a 2x improvement in accuracy? Without Evals, I have no way to prove that.

Trustworthy, Responsible Deployment

For enterprise use cases, evals provide:

  • Documentation of model performance and limitations

  • Assurance of compliance (e.g., fairness, explainability)

  • Evidence for audits or governance boards

💡
Evaluation is Responsible AI.

Types of Evals

There are two primary types of Evals. I will use examples from my APEX Developer Blogs Website to illustrate them.

Code Driven

The best way to evaluate the result of an interaction with an LLM is to use code to check its work. To do this, however, the result of the AI interaction must be in a form that code can verify objectively. It’s probably easiest to describe this using an example.

Example - Blog Classifier

Before allowing blogs onto APEX Developer Blogs, I first check them for relevance to five areas of interest to APEX Developers (APEX, ORDS, OCI, SQL, and PL/SQL). I pass a prompt and the content of the post to OpenAI and ask it to check the post for relevance, specifying that I want a JSON object in response that looks like this:

{"APEX":2,"ORDS":0,"OCI":1,"SQL":0,"PLSQL":2}

For each category, I am looking for a score from zero to five. If a post receives a score of two or more across all categories, I allow it. Otherwise, I add it to a queue of rejected posts, which I review manually every so often.

This response is something I can easily test programmatically (see the sketch after this list):

  • Verify it is valid JSON using JSON_OBJECT_T.parse. If an ORA-40441 error is raised, the JSON is not valid.

  • Verify that all five categories are represented. Use the has method to check for each field.

  • Verify that the scores are between zero and five. Use get_Number to verify each score is within range.

  • Store the scores so that I can test the same blog posts with different parameters (new models, changes to the temperature, etc) and see how the scores change. Log all calls to AI and store the request and response (more on this later).
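To make this concrete, here is a minimal PL/SQL sketch of those checks. The function name and signature are my own for illustration; in practice, the response would come from the log table described later in this post.

-- Sketch only: validates the classifier's JSON response.
CREATE OR REPLACE FUNCTION blog_scores_are_valid (
  p_response IN CLOB
) RETURN BOOLEAN IS
  l_obj        json_object_t;
  l_score      NUMBER;
  l_categories apex_t_varchar2 := apex_t_varchar2('APEX','ORDS','OCI','SQL','PLSQL');
BEGIN
  -- 1. Verify the response is valid JSON (ORA-40441 is raised if it is not).
  l_obj := json_object_t.parse(p_response);

  FOR i IN 1 .. l_categories.COUNT LOOP
    -- 2. Verify all five categories are present.
    IF NOT l_obj.has(l_categories(i)) THEN
      RETURN FALSE;
    END IF;
    -- 3. Verify each score is between zero and five.
    l_score := l_obj.get_number(l_categories(i));
    IF l_score IS NULL OR l_score < 0 OR l_score > 5 THEN
      RETURN FALSE;
    END IF;
  END LOOP;

  RETURN TRUE;
EXCEPTION
  WHEN OTHERS THEN
    -- Invalid JSON (or any other parse failure) counts as a failed eval.
    RETURN FALSE;
END blog_scores_are_valid;
/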

LLM Driven

Of course, if code could test every response from an LLM, we would not need the LLM in the first place. The LLM-driven approach involves making a second call to the LLM to check that the output is correct, or at least meets the standards you are aiming to achieve.

Example - Blog Summarizer

The second step when ingesting blogs in APEX Developer Blogs is to create a simple Gist of the post. This allows people to get the gist of the post before reading it in full. Much as I would like to, there is no way I could read every post and write a summary myself.

So, how can we test if the summary that OpenAI generates is any good? The answer is to pass the post and the summary back to the LLM, asking it to rate the summary. We could use a prompt like the one below to do this:

# BACKGROUND
- I write summaries/gists of blog posts so that users can see a preview of the post before reading it.
- I want to make sure my summary is high quality. 
- What follows is the summary ##SUMMARY## followed by the blog post ##BLOG POST##. 
# TASK
- Carefully compare the summary to the blog post and assess the quality of the summary based on conciseness and readability. 
- The scores should be between 0-5 with 5 being excellent and 0 being extremely poor. 
# OUTPUT
- Return the scores in a JSON object that looks like this: {"READABILITY":4,"CONCISENESS":2}
##SUMMARY##
<<Summary Goes Here>>
##BLOG POST##
<<Blog Post Goes Here>>

This returns a JSON score, which we can evaluate using PL/SQL.
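As a rough sketch of how that might look: ask_llm below is a hypothetical wrapper around your LLM REST call (for example, built on apex_web_service.make_rest_request), and the threshold of 3 is purely illustrative.

-- Sketch only: grade a summary with a second LLM call, then check the scores in code.
DECLARE
  l_post     CLOB;            -- the original blog post
  l_summary  CLOB;            -- the summary produced by the first LLM call
  l_template CLOB;            -- the grading prompt template shown above
  l_prompt   CLOB;
  l_response CLOB;
  l_scores   json_object_t;
BEGIN
  -- In practice these come from your tables; hard-coding keeps the sketch short.
  l_post     := 'Full text of the blog post';
  l_summary  := 'Summary generated by the first LLM call';
  l_template := 'The BACKGROUND / TASK / OUTPUT prompt shown above';

  -- Substitute the summary and the post into the template.
  l_prompt := replace(
                replace(l_template, '<<Summary Goes Here>>', l_summary),
                '<<Blog Post Goes Here>>', l_post);

  -- Second call to the LLM, this time acting as the judge.
  l_response := ask_llm(p_prompt => l_prompt);   -- hypothetical wrapper function

  -- The judge returns {"READABILITY":n,"CONCISENESS":n}, which code can check.
  l_scores := json_object_t.parse(l_response);

  IF l_scores.get_number('READABILITY') < 3
     OR l_scores.get_number('CONCISENESS') < 3
  THEN
    dbms_output.put_line('Summary flagged for manual review.');
  END IF;
END;
/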

How can we do Evals in APEX?

Using Historical Data

The easiest way to start with evals is to run them against LLM calls that have already been made. To do this reliably, you need to be able to determine the parameters and payload that went into each LLM call, as well as the response. I suggest creating two tables: one to store AI Configurations and another to log all your LLM calls.

⚙️ Configs

Create a table to store your AI Configs, which includes:

  • Config Name

  • AI Provider, Model, & REST EndPoint

  • Max Input & Output Tokens

  • Temperature

  • APEX Web Credential

  • Instructions / System Prompt

Storing this information in a table allows you to easily make changes to the parameters and keep track of which configuration generated which output.
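As an illustration only (the table and column names below are mine, not a standard), such a config table might look like this:

-- Illustrative DDL; adjust names and data types to your own standards.
CREATE TABLE ai_configs (
  config_id          NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  config_name        VARCHAR2(100) NOT NULL,
  ai_provider        VARCHAR2(50)  NOT NULL,   -- e.g. OpenAI, OCI Generative AI
  ai_model           VARCHAR2(100) NOT NULL,   -- e.g. gpt-4.1-mini
  rest_endpoint      VARCHAR2(500) NOT NULL,
  max_input_tokens   NUMBER,
  max_output_tokens  NUMBER,
  temperature        NUMBER(3,2),
  web_credential     VARCHAR2(100),            -- APEX Web Credential static ID
  system_prompt      CLOB
);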

🪵 Log Table

You should be logging every call you make to the LLM. This provides the foundational data for running your Evaluations. Your log table should include the following:

  • Foreign Key to your Config table, so you know which config generated which log

  • Response Time (to measure the performance of LLM API calls using different models)

  • Outbound JSON Payload to the LLM REST API

  • Response JSON returned from the LLM REST API

With a combination of the Config and the Log Tables, you have everything you need to run evals against historical LLM calls.
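A matching log table (again, a sketch with hypothetical names) could look like the following, and a simple join then lets you compare configurations side by side:

-- Illustrative DDL only.
CREATE TABLE ai_call_log (
  log_id            NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  config_id         NUMBER NOT NULL REFERENCES ai_configs,
  called_on         TIMESTAMP DEFAULT systimestamp,
  response_time_ms  NUMBER,                                  -- duration of the LLM API call
  request_payload   CLOB CHECK (request_payload IS JSON),    -- outbound JSON sent to the LLM
  response_payload  CLOB CHECK (response_payload IS JSON)    -- JSON returned by the LLM
);

-- Example: compare call volume and average response time across configs.
SELECT c.config_name,
       c.ai_model,
       COUNT(*)                        AS calls,
       ROUND(AVG(l.response_time_ms))  AS avg_response_ms
FROM   ai_call_log l
JOIN   ai_configs  c ON c.config_id = l.config_id
GROUP  BY c.config_name, c.ai_model
ORDER  BY avg_response_ms;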

📚 Eval Library

[Diagram showing the Eval Library approach]

Being proactive about Evals takes more effort. You need to build a library of tests (and expected results), as well as code that can run those tests against your prompts and data. This is where your domain knowledge comes into play. You need to think through all the scenarios that could arise and design tests that cover them (not forgetting the edge cases, of course). These evals (or tests) represent knowledge you have on the subject that perhaps no one else has.

You should continually add to and adjust the library over time as new use cases and user behaviors emerge.

Given that we run APEX on a database, APEX is the obvious tool for maintaining such a library. You can also use APEX to run your evaluations, track results, and report on them.
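A minimal sketch of what that could look like, assuming the tables above and a hypothetical run_single_eval procedure that calls the LLM, logs the call, and compares the result to the expected value:

-- Illustrative eval library table.
CREATE TABLE eval_library (
  eval_id          NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  eval_name        VARCHAR2(200) NOT NULL,
  input_content    CLOB NOT NULL,    -- e.g. a representative or edge-case blog post
  expected_result  CLOB,             -- e.g. the category scores you expect back
  active_yn        VARCHAR2(1) DEFAULT 'Y' NOT NULL
);

-- Run every active eval against a chosen config.
BEGIN
  FOR t IN (SELECT eval_id FROM eval_library WHERE active_yn = 'Y') LOOP
    run_single_eval(p_eval_id   => t.eval_id,   -- hypothetical procedure
                    p_config_id => 1);          -- the ai_configs row to test
  END LOOP;
END;
/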

Red Teaming

While traditional Evals measure whether an AI model meets your quality and performance standards, red teaming focuses on discovering where it fails. It involves intentionally crafting adversarial inputs to elicit undesired behavior, such as biased, toxic, or misleading outputs.

Red teaming helps answer questions like:

  • Can the model be jailbroken to ignore safety instructions?

  • Does it respond differently to subtly biased prompts?

  • Will it hallucinate plausible-sounding but false information?

  • Does it degrade disproportionately on edge cases (e.g., ambiguous phrasing)?

In enterprise use, this kind of testing is critical for:

  • Hardening your application against misuse

  • Meeting compliance or governance obligations

  • Understanding risk before deployment at scale

💡
Even if your day-to-day use cases seem benign (like summarizing or classifying), red teaming ensures you’re not blindly trusting the LLM’s output. It complements evals by simulating how your model could go wrong, intentionally and unpredictably.

🔗 This post from Anthropic provides an overview of Red Teaming and how they do it.

Conclusion

For APEX developers, test-driven development is second nature, but when it comes to AI, it’s easy to overlook evaluation in the excitement of prompt engineering. Evals give you a structured, repeatable, and objective way to track progress, detect regressions, and justify AI decisions in enterprise environments. Whether you’re classifying content or summarizing documents, start small by logging your LLM calls. Then build your Eval Library over time.

🔈
It’s not just good practice, it’s responsible AI.