⭐️ Decoding OpenAI Evals
Learn how to use the eval framework to evaluate models & prompts to optimise LLM systems for the best outputs.
Rohit Agarwal
May 10, 2023 · 9 min
There's been a lot of buzz around model evaluations since OpenAI open-sourced their eval framework and Anthropic released their datasets.
While both offer superb documentation for their evals, understanding it well enough for a production implementation is hard. My goal was to use these evals in my own LLM apps, so let's break down the concepts from the libraries and use them in real-life systems.
Ready? Let's start with the openai/evals library.
It contains 2 distinct parts:
A framework to evaluate an LLM or a system built on top of an LLM.
A registry of challenging evals.
We'll only focus on the framework in this blog.
👇🏼 The aim of this blog is to use the eval framework to evaluate models & prompts to optimise LLM systems for the best outputs.
The goal of the blog is not to learn how to submit an eval to OpenAI :)
What is an Eval?
An eval is a task used to measure the quality of output of an LLM or LLM system.
Given an input prompt, an output is generated. We evaluate this output against a set of ideal_answers and find the quality of the LLM system. If we do this a bunch of times, we can find the accuracy.
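As a rough sketch, computing accuracy over a small evaluation set might look like this (the exact-match comparison here is an illustrative stand-in; the actual comparison methods are covered later in this post):

```python
def accuracy(samples):
    """samples: list of (output, ideal_answers) pairs."""
    correct = sum(1 for output, ideals in samples if output in ideals)
    return correct / len(samples)

samples = [
    ("Australia", ["Australia"]),           # exact match
    ("India, China", ["India and China"]),  # no exact match
]
print(accuracy(samples))  # 0.5
```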
Why use Evals?
While we use evals to measure the accuracy of any LLM system, there are 3 key ways they become extremely useful for any AI app in production.
As part of the CI/CD Pipeline
Given a dataset, we can make evals a part of our CI/CD pipeline to make sure we achieve the desired accuracy before we deploy. This is especially helpful if we've changed models or parameters, by mistake or intentionally. We could set the CI/CD block to fail in case the accuracy does not meet our standards on the provided dataset.
Finding blind spots of a model in real time
In real time, we could keep judging the output of models based on real user input and find areas or use cases where the model may not be performing well.
To compare fine-tunes to foundation models
We can also use evals to find if the accuracy of the model improves as we fine-tune it with examples. However, it becomes important to separate out the test & train data so that we don't introduce a bias in our evaluations.
Type of OpenAI Eval Templates
OpenAI has defined 2 types of eval templates that can be used out of the box:
Basic Eval Templates
These contain deterministic functions to compare the output to the ideal_answers.
Model-Graded Templates
These contain functions where an LLM compares the output to the ideal_answers and attempts to judge the accuracy.
Let's look at the various functions for these 2 templates.
1. Basic Eval Templates
These are most helpful when the outputs
we're evaluating have very little variation in content & structure.
For example:
When the output is boolean (true/false),
is one of many choices (options in a multiple-choice question),
or is very straightforward (a fact-based answer)
There are 3 methods you can use for comparison:
match
Checks if any of the ideal_answers start with the output.
includes
Checks if the output is contained within any of the ideal_answers.
fuzzy_match
Checks if the output is contained within any of the ideal_answers, OR any ideal_answer is contained within the output.
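The three comparisons described above can be sketched in Python. These are simplified illustrations of the descriptions in this post; the actual library implementations also normalize case, punctuation, and whitespace:

```python
# Simplified sketches of the three basic comparison methods.
def match(output, ideal_answers):
    return any(ideal.startswith(output) for ideal in ideal_answers)

def includes(output, ideal_answers):
    return any(output in ideal for ideal in ideal_answers)

def fuzzy_match(output, ideal_answers):
    return any(output in ideal or ideal in output
               for ideal in ideal_answers)
```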
The workflow for a basic eval looks like this:
Given an input prompt and ideal_answers, we generate the output from the LLM system and then compare the output to the ideal_answers set using one of the matching algorithms. This feels very natural to how we do QA on deterministic systems today, with the exception that we may not get exact matches, but can rely on the output being contained in the ideal_answers or vice-versa.
2. Model-Graded Eval Templates
These are useful when the outputs
being generated have significant variations and might even have different structures.
For example:
Answering an open-ended question
Summarising a large piece of text
Searching through a set of text
For these use cases, it has been found that we can use a model to grade itself. This is especially interesting as we're now exploiting the reasoning capabilities of an LLM. I'd imagine GPT-4 would be especially good at complex comparisons, while GPT-3.5 (faster, cheaper) would work for simpler comparisons.
We could use the same model being used for the generation OR a different model. (In a webinar, @kamilelukosiute mentioned that it might be prudent to use a different one to reduce bias)
There's a generic classification method for model-graded eval templates. It accepts:
the input prompt being used for the generation,
the output generation for the prompt,
and optionally a reference ideal_answer.
It then prompts an LLM with these 3 parts and expects it to classify if the output is good or not. There are 3 classification methods specified:
cot_classify - The model is asked to define a chain of thought and then arrive at an answer (reason, then answer). This is the recommended classification method.
classify_cot - The model is asked to provide an answer and then explain the reasoning behind it.
classify - The model is expected to output only the final answer.
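To make the three inputs concrete, here's a hypothetical sketch of how a reason-then-answer grading prompt could be assembled. The wording and the function name are illustrative assumptions, not the library's actual template:

```python
def build_grading_prompt(input_prompt, output, ideal_answer,
                         choices=("Y", "N")):
    """Assemble a chain-of-thought grading prompt (hypothetical wording)."""
    return (
        "You are comparing a submitted answer to a reference answer.\n"
        f"[Question]: {input_prompt}\n"
        f"[Submission]: {output}\n"
        f"[Reference]: {ideal_answer}\n"
        "Reason step by step about whether the submission is consistent "
        f"with the reference, then answer with one of: {', '.join(choices)}."
    )
```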
Essentially, this is the super simplified workflow:
Let's look at a few examples of this in the real world.
Eg. 1: Fact-checking (fact.yaml)
Given an input question, the generated output, and a reference ideal_answer, the model outputs one of 5 options:
"A" if a ⊂ b, i.e., the output is a subset of the ideal_answer and is fully consistent with it.
"B" if a ⊃ b, i.e., the output is a superset of the ideal_answer and is fully consistent with it.
"C" if a = b, i.e., the output contains all the same details as the ideal_answer.
"D" if a ≠ b, i.e., there is a disagreement between the output and the ideal_answer.
"E" if a ≈ b, i.e., the output and ideal_answer differ, but these differences don't matter factually.
Eg. 2: Closed Domain Q&A (closedqa.yaml)
💡 What is closed domain Q&A?
Closed domain Q&A is a way to use an LLM system to answer a question, given all the context needed to answer it. This is the most common type of Q&A implemented today, where you
1. pull context about a user query (mostly from a vector database),
2. feed the question and the context to an LLM system, and
3. expect the system to synthesize the correct answer.
Here's a cookbook by OpenAI detailing how you could do the same.
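The three steps above can be sketched end to end. The retrieval and the LLM call here are stubbed out with canned values purely for illustration; a real system would query a vector database and call a model:

```python
def retrieve_context(query):
    # Step 1: in a real system, this would query a vector database.
    return "Paris has been the capital of France since 508 AD."

def call_llm(prompt):
    # Steps 2-3: a real system would send the prompt to an LLM.
    return "Paris"

def closed_domain_qa(question):
    context = retrieve_context(question)
    prompt = f"Context: {context}\n\nQuestion: {question}\nAnswer:"
    return call_llm(prompt)

print(closed_domain_qa("What is the capital of France?"))  # Paris
```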
Given an input_prompt, the context or criteria used to answer the question, and the generated output, the model generates reasoning and then classifies the eval as a Y or N representing the accuracy of the output.
For all search and Q&A use cases, this would be a good way to evaluate the completion of an LLM.
Eg. 3: Naughty Strings Eval (security.yaml)
Given an output, we try to evaluate whether the output is malicious or not. The model returns one of "Yes", "No" or "Unsure", which can be used to grade our LLM system.
More Examples
There are more examples of eval templates mentioned here. The idea is to use these as starting points to build eval templates of our own and judge the accuracy of our responses.
Implementing OpenAI Evals in Your Project
Armed with the basics of how evals work (both basic and model-graded), we can use the evals
library to evaluate models based on our requirements. We'll create a custom eval for our use case, try running it with a set of prompts and generations, and then also see how we could run evals in real time.
To start creating an eval, we need
The test dataset in the JSONL format.
The eval template to be used
Let's create an eval to test an LLM system's capability to extract countries from a passage of text.
Creating the dataset
It's recommended to use ChatCompletions for evals, and thus the dataset should be in the chat message format and contain
the input - the prompt to be fed to the completion system,
and the ideal output - optional, denotes the ideal answer.
{"input": [{"role": "system", "content": "<input prompt>", "name": "example-user"}], "ideal": "correct answer"}
JSONL dataset format
💡 Note from OpenAI
The library does support interoperability between regular text prompts and chat prompts, so chat datasets can be run on non-chat models (text-davinci-003) or any other completion function. It's just clearer how to downcast from chat to text rather than the other way around, since we would make some decisions around where to put the text in system/user/assistant prompts. But both are supported!
We'll use the following prompt template to extract country information from a text passage.
List all the countries you can find in the following text.
Text: {{passage_text}}
Countries:
We can create our input dataset by filling in passages in the prompt template.
{"input": [{"role": "user", "content": "List all the countries you can find in the following text.\n\nText: Australia is a continent country surrounded by the Indian and Pacific Oceans, known for its beautiful beaches, diverse wildlife, and vast outback. \n\nCountries:"}], "ideal": "Australia"}
...
Save the file as inputs.jsonl
to be used later in the eval.
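The dataset above can also be generated programmatically. This sketch fills the prompt template with passages and writes the JSONL file (the passage list is just the example from this post):

```python
import json

TEMPLATE = ("List all the countries you can find in the following text."
            "\n\nText: {passage}\n\nCountries:")

samples = [
    ("Australia is a continent country surrounded by the Indian and "
     "Pacific Oceans, known for its beautiful beaches, diverse wildlife, "
     "and vast outback.", "Australia"),
]

with open("inputs.jsonl", "w") as f:
    for passage, ideal in samples:
        row = {
            "input": [{"role": "user",
                       "content": TEMPLATE.format(passage=passage)}],
            "ideal": ideal,
        }
        f.write(json.dumps(row) + "\n")
```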
Creating a custom eval
We extend the evals.Eval base class to create our custom eval. We need to override two methods in the class:
eval_sample: The main method that evaluates a single sample from our dataset. This is where we create a prompt, get a completion (using the default completion function) from our chosen model, and evaluate if the answer is satisfactory or not.
run: This method is called by the oaieval CLI to run the eval. The eval_all_samples function in turn will call the eval_sample function iteratively.
import random

import evals
import evals.metrics
import openai

openai.api_key = "<YOUR OPENAI KEY>"


class ExtractCountries(evals.Eval):
    def __init__(self, test_jsonl, **kwargs):
        super().__init__(**kwargs)
        self.test_jsonl = test_jsonl

    def run(self, recorder):
        test_samples = evals.get_jsonl(self.test_jsonl)
        self.eval_all_samples(recorder, test_samples)
        # Record overall metrics
        return {
            "accuracy": evals.metrics.get_accuracy(recorder.get_events("match")),
        }

    def eval_sample(self, test_sample, rng: random.Random):
        prompt = test_sample["input"]
        result = self.completion_fn(
            prompt=prompt,
            max_tokens=25,
        )
        sampled = result.get_completions()[0]
        evals.record_and_check_match(
            prompt,
            sampled,
            expected=test_sample["ideal"],
        )
evals/elsuite/extract_countries.py
To run the ExtractCountries eval, we need to register it so the CLI can find it. Let's create a file called extract_countries.yaml under the evals/registry/evals folder and add an entry for our eval.
# Define a base eval
extract_countries:
  # id specifies the eval that this eval is an alias for
  # When you run `oaieval gpt-3.5-turbo extract_countries`, you are actually running `oaieval gpt-3.5-turbo extract_countries.dev.match-v1`
  id: extract_countries.dev.match-v1
  # The metrics that this eval records
  # The first metric will be considered to be the primary metric
  metrics: [accuracy]
  description: Evaluate if all the countries were extracted from the passage

# Define the eval
extract_countries.dev.match-v1:
  # Specify the class name as a dotted path to the module and class
  class: evals.elsuite.extract_countries:ExtractCountries
  # Specify the arguments as a dictionary of JSONL URIs
  # These arguments can be anything that you want to pass to the class constructor
  args:
    test_jsonl: /tmp/inputs.jsonl
Running the Eval
Now, we can run this eval using the oaieval CLI like this:
pip install .
oaieval gpt-3.5-turbo extract_countries
This will run our evaluation in parallel on multiple threads and produce an accuracy.
In our case, we got an accuracy of 76% which is not great for an operation like this. The following could be the reasons for the current accuracy:
We might not be using the right evaluation spec.
The test data isn't very clean.
The model doesn't work as expected.
Going through the eval logs can be helpful here.
Going through Eval Logs
The eval logs are located at /tmp/evallogs
and different log files are created for each evaluation run.
We could go through the text file in an editor, or there are open-source projects like Zeno that help us visualize these logs and analyze them better.
Using custom completion functions
"Completion Functions" are the completion methods of any model (LLM or otherwise), and a completion is the text output that would be the LLM system's answer to the prompt input.
In the example above, we chose to use the default completion function (text-davinci-003) of the library for our eval.
We could also write our own completion functions as explained here. These completion functions could be
any LLM model of choice,
a chain of prompts (as popularised by Langchain)
or even AutoGPT!
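Roughly, a completion function is a callable that takes a prompt and returns an object exposing get_completions(). A minimal sketch follows; the interface mirrors how completion_fn is used in the ExtractCountries eval above, but treat the exact protocol as an assumption:

```python
class StaticResult:
    """Result object exposing get_completions(), which returns a list
    of completion strings."""
    def __init__(self, text):
        self.text = text

    def get_completions(self):
        return [self.text]


class StaticCompletionFn:
    """Toy completion function that returns a canned answer instead of
    calling a model. A real one could wrap any LLM, a prompt chain, or
    an agent."""
    def __call__(self, prompt, **kwargs):
        return StaticResult("Australia")


fn = StaticCompletionFn()
print(fn("any prompt").get_completions()[0])  # Australia
```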
Useful tips as you run your evals
If you notice evals has cached your data and you need to clear that cache, you can do so with rm -rf /tmp/filecache.
Wherever Basic Templates can work, avoid Model-Graded Templates, as they will have lower reliability.
Further Reading
[ARXIV] Discovering Language Model Behaviors with Model-Written Evaluations and Anthropic's Model-Written Evaluation Datasets
Leaderboard to track, rank and evaluate LLMs and chatbots by HuggingFace
🙏🏻 Big thanks to Andrew Kondrich (creator of OpenAI/evals), Shyamal from OpenAI, Ravi from Glance (contributor at LlamaIndex) and Dev (Gooey) for their invaluable inputs on content, structure, examples and flows.