OpenAI's SimpleQA benchmark, positioned as a framework for evaluating language model "factuality," is in my view a concerning step backward in LLM evaluation methodology. After careful analysis, I find the benchmark's fundamental premise...