OpenAI’s o3 Model Underperforms in Independent Math Benchmark, Raising Transparency Concerns

Tech Infinity

A new round of independent testing is casting doubt on OpenAI’s claims about the math-solving prowess of its o3 AI model. According to results released by the research group Epoch AI, the public version of o3 scored significantly lower on a key benchmark than the company initially suggested.

When OpenAI introduced o3 in December, it boasted impressive results on FrontierMath, a rigorous dataset of advanced math problems. At the time, OpenAI claimed its model could correctly solve over 25% of the problems — a staggering leap compared to rival models, which hovered below the 2% mark.

But Epoch AI’s latest evaluation paints a different picture. Testing the version of o3 that was publicly released last week, Epoch reported a score of around 10%, well below OpenAI’s best-case internal figure. While that result is far from trivial, it suggests the December demo relied on a more powerful, less accessible configuration of the model.

The discrepancy has sparked questions about the transparency of AI benchmarking and how closely demo models reflect production releases. According to Epoch, several factors may account for the performance gap — including differences in computational resources, evaluation subsets, and the model variant used.

“The difference between our results and OpenAI’s might be due to OpenAI evaluating with a more powerful internal scaffold,” Epoch said in its release, also noting that its test used a newer version of the FrontierMath dataset.

Adding weight to the findings, the ARC Prize Foundation — which had early access to o3 — confirmed that the version tested in the demo was different from the version now available to users. “All released o3 compute tiers are smaller than the version we benchmarked,” the organization posted on X.

OpenAI has acknowledged the distinction. During a livestream last week, Wenda Zhou, a member of the technical staff, explained that the production release of o3 is optimized for speed and usability rather than peak performance on benchmarks. “We’ve done [optimizations] to make the model more cost efficient and more useful in general,” Zhou said. “You won’t have to wait as long for an answer.”

To OpenAI’s credit, newer models in its lineup — including o3-mini-high and the recently announced o4-mini — have already surpassed the original o3 in FrontierMath results. A more powerful variant, o3-pro, is also expected to launch soon.

Still, the incident serves as a reminder to treat AI benchmark scores — particularly those shared by vendors — with caution. As companies race to position their models as state-of-the-art, discrepancies between internal and public-facing performance are becoming increasingly common.

OpenAI isn’t alone in this. Earlier this year, Elon Musk’s xAI was called out for publishing overly favorable benchmark comparisons for its Grok 3 model. Meta also faced criticism for promoting scores from a model variant that was never made publicly available.

As the AI arms race continues, independent evaluation may become one of the few ways to hold developers accountable for what their models can — and can’t — do.
