Different Types of Benchmarks for LLMs

Evaluating the capabilities of large language models (LLMs) involves a variety of benchmarks that test different aspects of their knowledge, reasoning, and problem-solving skills. These benchmarks span a wide range of domains, from common sense and social reasoning to complex mathematical problem-solving and programming. The sections below summarize the key benchmarks used to assess LLMs:

MMLU

The MMLU (Massive Multitask Language Understanding) benchmark measures the breadth of knowledge and problem-solving ability that a large language model acquires during pretraining, using multiple-choice questions drawn from 57 subjects spanning STEM, the humanities, and the social sciences.
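Most of the knowledge benchmarks in this list (MMLU, HellaSwag, PIQA, SIQA, Winogrande, CQA, OBQA, ARC) are scored as multiple-choice accuracy. As a rough sketch of how that works, the snippet below scores each candidate answer with a model and counts a question as correct when the top-scoring choice matches the key. The names here are hypothetical: `score_option` stands in for whatever model call a real harness makes (for example, the log-likelihood of the choice given the question).

```python
# Minimal sketch of multiple-choice accuracy scoring (MMLU-style).
# `score_option` is a hypothetical stand-in for a real model call that
# returns a score (e.g. a log-likelihood) for "question + candidate answer".

from typing import Callable, Sequence


def multiple_choice_accuracy(
    questions: Sequence[dict],
    score_option: Callable[[str, str], float],
) -> float:
    """Each question dict holds 'prompt', 'choices' (list of str), and 'answer' (index)."""
    correct = 0
    for q in questions:
        scores = [score_option(q["prompt"], choice) for choice in q["choices"]]
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == q["answer"])
    return correct / len(questions)


if __name__ == "__main__":
    # Toy example with a fake scorer that happens to prefer the right answer.
    demo = [{"prompt": "2 + 2 = ?", "choices": ["3", "4", "5", "22"], "answer": 1}]
    print(multiple_choice_accuracy(demo, lambda prompt, choice: 1.0 if choice == "4" else 0.0))
```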

HellaSwag

The HellaSwag benchmark challenges a language model's common sense reasoning by asking it to select the most plausible continuation of a described situation from several candidate endings.

PIQA

The PIQA (Physical Interaction: Question Answering) benchmark tests a language model's ability to apply physical commonsense knowledge by answering questions about everyday physical interactions.

SIQA

The SIQA (Social Interaction QA) benchmark evaluates a language model's grasp of social interactions and social common sense through questions about people's actions and their social implications.

BoolQ

The BoolQ benchmark tests a language model's ability to answer naturally occurring yes/no questions, gathered in unprompted and unconstrained settings, probing how well the model handles real-world natural language inference.

Winogrande

The Winogrande benchmark tests a language model's ability to resolve ambiguous fill-in-the-blank tasks with binary options, requiring generalized commonsense reasoning.

CQA

The CQA (CommonsenseQA) benchmark assesses a language model's performance on multiple-choice question answering that requires different types of commonsense knowledge; it is typically evaluated in a 7-shot setting.

OBQA

The OBQA (OpenBookQA) benchmark evaluates a language model's ability to perform advanced question answering that combines multi-step reasoning, commonsense knowledge, and rich text comprehension, modeled after open-book exams.

ARC-e

The ARC-e (AI2 Reasoning Challenge, Easy set) benchmark tests a language model's question-answering skills with genuine grade-school-level, multiple-choice science questions.

ARC-c

The ARC-c (AI2 Reasoning Challenge, Challenge set) benchmark is the harder split of the ARC dataset, containing only the questions that both a retrieval-based algorithm and a word co-occurrence algorithm answer incorrectly.

TriviaQA

The TriviaQA benchmark tests reading-comprehension skills using question-answer-evidence triples; it is typically evaluated in a 5-shot setting.

HumanEval

The HumanEval benchmark tests a language model's code-generation ability by checking whether its solutions pass functional unit tests for hand-written programming problems; results are typically reported as pass@1, the probability that a single generated solution passes the tests.
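The pass@k metric reported for HumanEval comes from the benchmark's paper (Chen et al., 2021): generate n samples per problem, count the c samples that pass the unit tests, and estimate the chance that at least one of k randomly drawn samples is correct. A minimal sketch of that estimator (function name and example numbers are illustrative):

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    samples drawn from n generations (c of which pass the unit tests)
    is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


# Example: 200 samples per problem, 23 of which pass the unit tests.
print(pass_at_k(n=200, c=23, k=1))   # 0.115 -- equals c/n when k=1
print(pass_at_k(n=200, c=23, k=10))  # much higher: any of 10 tries may pass
```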

MBPP

The MBPP benchmark tests a language model's ability to solve basic Python programming problems, covering fundamental programming concepts and standard-library usage; correctness is judged by assert-based test cases.
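Each MBPP problem pairs a short task description with a few assert-style tests, and a generated solution counts only if every assert passes when the code runs. The sketch below illustrates that check in the simplest possible way; a real harness would sandbox execution and enforce timeouts, and the candidate solution here is hand-written rather than model-generated.

```python
# Rough illustration of MBPP-style checking: execute the candidate code,
# then run its assert-based tests. A real harness sandboxes execution and
# enforces timeouts; this sketch does neither.


def passes_tests(candidate_code: str, test_cases: list) -> bool:
    namespace = {}
    try:
        exec(candidate_code, namespace)   # defines the candidate function
        for test in test_cases:
            exec(test, namespace)         # each test is an assert statement
    except Exception:
        return False
    return True


# Hypothetical problem: "Write a function to return the maximum of two numbers."
candidate = "def max_of_two(a, b):\n    return a if a > b else b\n"
tests = [
    "assert max_of_two(1, 2) == 2",
    "assert max_of_two(-5, -9) == -5",
]
print(passes_tests(candidate, tests))  # True
```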

GSM8K

The GSM8K benchmark tests a language model's ability to solve grade-school math word problems that typically require multiple steps of arithmetic reasoning.
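GSM8K reference solutions end with a line of the form #### <answer>, so a common (though not the only) way to score a model is to pull the last number out of its generated reasoning and compare it with that reference value. The regex, normalization, and function names below are illustrative, not the official evaluation script.

```python
import re


def extract_final_number(text: str):
    """Return the last integer or decimal in the text, with commas stripped."""
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    return matches[-1].replace(",", "") if matches else None


def is_correct(model_output: str, reference_solution: str) -> bool:
    # GSM8K reference solutions put the final answer after '####'.
    gold = reference_solution.split("####")[-1].strip().replace(",", "")
    pred = extract_final_number(model_output)
    return pred is not None and pred == gold


reference = "She has 3 + 4 = 7 apples.\n#### 7"
print(is_correct("Adding them gives 7 apples, so the answer is 7.", reference))  # True
```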

MATH

The MATH benchmark evaluates a language model's ability to solve competition-level mathematics problems, requiring multi-step reasoning and an understanding of mathematical concepts; it is typically evaluated in a 4-shot setting.

AGIEval

The AGIEval benchmark tests a language model's general intelligence by using questions derived from real-world exams designed to assess human intellectual abilities (college entrance exams, law exams, etc.).

BBH

The BBH (BIG-Bench Hard) benchmark collects the BIG-Bench tasks on which earlier language models fell short of average human performance, testing the limits of multi-step reasoning across diverse domains.

These diverse benchmarks provide a comprehensive evaluation of large language models, helping researchers understand their strengths and areas for improvement across various knowledge and reasoning domains.

Written by Monika Prajapati