Paper Review: OpenAI's SimpleQA

Gerard Sans

OpenAI's SimpleQA benchmark, positioned as a framework for evaluating language model "factuality," represents what I consider a concerning step backward in LLM evaluation. After careful analysis, I find its fundamental premises deeply problematic, bordering on the methodologically unsound.

The Knowledge Representation Problem

The benchmark's core assumption—that LLMs possess knowledge in a manner analogous to human cognition—reflects a fundamental misunderstanding of these systems. Having worked extensively with language models, I can definitively state that they are, at their core, sophisticated statistical engines. They excel at pattern recognition and probabilistic text generation, but they do not "know" things in any meaningful sense. SimpleQA's attempt to probe this non-existent knowledge base is, in my assessment, fundamentally misguided.
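To make the distinction concrete, here is a minimal sketch (with invented numbers, not real model output) of what a language model actually computes: a probability distribution over candidate next tokens given the preceding text. A fluent but false continuation can carry plenty of probability mass, and nothing in the mechanism separates "known" from "merely likely".

```python
import random

# Toy next-token distribution a model *might* assign after the prompt below.
# The numbers are invented for illustration; a real model produces a
# distribution over its entire vocabulary at every step.
prompt = "The capital of Australia is"
next_token_probs = {
    "Canberra": 0.55,    # the factually correct continuation
    "Sydney": 0.30,      # fluent, plausible, and wrong
    "Melbourne": 0.10,
    "a": 0.05,
}

# Generation is sampling from this distribution, not looking up a fact.
tokens, weights = zip(*next_token_probs.items())
print(prompt, random.choices(tokens, weights=weights, k=1)[0])
```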

Methodological Shortcomings

The treatment of prompts as direct instructions rather than input data particularly irks me. This represents a basic misunderstanding of how LLMs process information. The model isn't "following instructions"—it's performing statistical conditioning based on input sequences. This may seem like semantic nitpicking, but it has profound implications for evaluation design.

The benchmark's approach to "hallucinations" is equally problematic. Grading every response as correct, incorrect, or not attempted, and treating any deviation from the pre-defined answer as simply incorrect, oversimplifies a complex phenomenon. In my experience, the responses that end up marked incorrect often stem from training data biases, architectural limitations, or prompt ambiguities. Conflating these distinct issues under one label does more harm than good.
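To illustrate why this worries me, here is a deliberately simplified stand-in for that kind of three-way grading (the real benchmark uses a model-based grader rather than string matching, so treat this purely as a sketch). Very different failure modes end up under the same label:

```python
# Simplified stand-in for SimpleQA-style grading; the actual benchmark
# uses a model-based grader, not string matching.
def grade(response: str, gold_answer: str) -> str:
    """Collapse a free-form response into one of three labels."""
    if not response.strip() or "i don't know" in response.lower():
        return "not_attempted"
    if gold_answer.lower() in response.lower():
        return "correct"
    return "incorrect"

gold = "1969"
for response in [
    "1968",                          # off-by-one echo of noisy training data
    "The moon landing was faked",    # artefact of contaminated data
    "It happened in July 1970",      # plausible-sounding confabulation
]:
    print(grade(response, gold), "<-", response)  # all three collapse to "incorrect"
```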

Anthropomorphic Fallacies

Perhaps most frustrating is the persistent anthropomorphization throughout the methodology. The notion that models "attempt" answers or "decide" to abstain fundamentally misrepresents their probabilistic nature. These aren't conscious choices—they're statistical outputs based on learned patterns.

Technical Oversights

The benchmark's disregard for established principles of LLM function is, quite frankly, baffling. It ignores:

  • The inherently stochastic nature of text generation (see the sketch below)

  • Training data distribution effects

  • Latent space dynamics

  • The limitations of instruction-following paradigms

These aren't minor oversights—they're fundamental aspects of how these systems operate.
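The first of these is trivial to demonstrate. The same prompt, sampled at a non-zero temperature, produces different continuations on different runs, which alone complicates scoring against a single gold answer. The sketch below uses invented logits rather than a real model, but the sampling mechanism is the one deployed LLMs actually use:

```python
import math
import random

def sample_with_temperature(logits: dict, temperature: float) -> str:
    """Sample one token from a softmax over toy logits."""
    scaled = {tok: value / temperature for tok, value in logits.items()}
    z = sum(math.exp(v) for v in scaled.values())
    probs = {tok: math.exp(v) / z for tok, v in scaled.items()}
    return random.choices(list(probs), weights=list(probs.values()), k=1)[0]

# Invented logits for the same toy question as before: the "answer"
# can change from run to run purely because generation is stochastic.
logits = {"Canberra": 2.0, "Sydney": 1.6, "Melbourne": 0.8}
print([sample_with_temperature(logits, temperature=0.9) for _ in range(5)])
```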

The Propaganda Element

I find the narrative of improving performance particularly disingenuous. The reported improvements likely reflect increased memorization or overfitting rather than genuine advances in capability. This serves OpenAI's marketing narrative but undermines serious technical evaluation.
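One reasonable way to probe this from the outside is to re-ask each benchmark question in paraphrased form and compare accuracies: a large gap suggests the model has learned the surface form of the question rather than the underlying fact. In the sketch below, ask_model() and is_correct() are hypothetical placeholders for whatever inference API and grading logic you use; they are not part of any real library.

```python
# ask_model() and is_correct() are hypothetical placeholders supplied by
# the caller; they stand in for your own inference API and grading logic.
def memorization_gap(questions, paraphrases, gold_answers, ask_model, is_correct):
    """Compare accuracy on original vs paraphrased wordings of the same questions."""
    original = sum(is_correct(ask_model(q), a) for q, a in zip(questions, gold_answers))
    reworded = sum(is_correct(ask_model(p), a) for p, a in zip(paraphrases, gold_answers))
    n = len(questions)
    # A large drop from original to reworded accuracy hints at memorised surface forms.
    return original / n, reworded / n
```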

Moving Forward

Instead of SimpleQA, I strongly advocate for evaluation methods that align with the statistical nature of LLMs:

  1. Context-embedded data analysis should replace simplistic QA testing (see the sketch after this list)

  2. Metrics should focus on coverage, generalisation, and robustness

  3. Complete transparency about training data and evaluation methods is essential

  4. Anthropomorphic interpretations must be abandoned
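As a rough sketch of the first two points, a context-embedded evaluation supplies the relevant evidence inside the prompt and measures whether the model can use it, instead of probing closed-book recall. Again, ask_model() and is_correct() are hypothetical placeholders for whatever system is being evaluated:

```python
# ask_model() and is_correct() are again hypothetical placeholders.
def context_embedded_eval(items, ask_model, is_correct):
    """Score how well a model uses evidence given in-context, rather than
    how much it happens to have memorised."""
    scores = []
    for item in items:  # each item: {"context": ..., "question": ..., "answer": ...}
        prompt = (
            f"Context: {item['context']}\n"
            f"Question: {item['question']}\n"
            "Answer using only the context above."
        )
        scores.append(is_correct(ask_model(prompt), item["answer"]))
    return sum(scores) / len(scores)
```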

Conclusion

In my assessment, SimpleQA fails not just as a benchmark but as a scientific contribution to the field. It promotes misunderstandings about LLM capabilities and perpetuates flawed evaluation paradigms. While OpenAI's influence in the field is significant, this benchmark represents a step in the wrong direction.

The AI research community deserves—and must demand—more rigorous evaluation frameworks grounded in sound scientific principles rather than simplified metrics that make for good press releases but poor science.
