Why OpenAI’s Claims of an AI Mathematical “Proof” Miss the Point


A recent announcement from an OpenAI employee caught my attention: “gpt-5-pro can prove new interesting mathematics.” The evidence? Someone fed it a convex optimization paper with an open problem, the model generated a better bound than what was published, and when checked, the proof was correct.
This sounds impressive. It is impressive. But additional context from mathematician Ernest Ryu, a professor at UCLA specializing in optimization and computational mathematics, reveals why this achievement is being dramatically oversold.
Ryu’s analysis cuts to the heart of the matter: this was essentially a 6-dimensional search problem where the key ingredient inequality (Nesterov Theorem 2.1.5) was already known to experts in the field. The creative challenge—curating the correct base inequalities—was already solved. What remained was primarily calculation, which Ryu notes “an experienced PhD student could work out in a few hours.”
In other words, GPT-5 didn’t make a mathematical breakthrough—it performed a sophisticated but routine computational task in an area where the conceptual heavy lifting was already done. This context perfectly illustrates how we systematically misunderstand what large language models actually do—and why that misunderstanding is becoming dangerous.
What LLMs Actually Do (Hint: It’s Not Thinking)
At their core, large language models are sophisticated pattern completion systems. They learn statistical relationships between tokens (words, symbols, mathematical notation) from vast amounts of text, then generate new token sequences that statistically resemble their training data.
This process has nothing to do with understanding mathematics, logic, or the real-world relationships these symbols represent. When an LLM generates “2 + 2 = 4,” it’s not performing addition—it’s completing a pattern it has seen millions of times. The symbols “2,” “+,” and “=” have no grounding in actual mathematical concepts for the model; they’re just tokens that frequently co-occur in specific arrangements.
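To make the pattern-completion point concrete, here is a deliberately tiny sketch in Python: a bigram counter over an invented five-line corpus. It is nothing like a real transformer, but it shows how a system can emit “4” after “2 + 2 =” purely from token co-occurrence, with no arithmetic anywhere.

```python
from collections import Counter, defaultdict

# Toy illustration, not a real LLM: a bigram counter that "completes"
# sequences purely from co-occurrence statistics. There is no arithmetic
# anywhere, yet it still emits "4" after "2 + 2 =".
corpus = ["2 + 2 = 4", "2 + 2 = 4", "2 + 2 = 4", "3 + 1 = 4", "2 + 3 = 5"]

counts = defaultdict(Counter)
for line in corpus:
    tokens = line.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1  # how often nxt follows prev

def complete(prompt: str) -> str:
    last = prompt.split()[-1]
    # Return the statistically most frequent continuation of the last token.
    return counts[last].most_common(1)[0][0]

print(complete("2 + 2 ="))  # -> "4" (frequency, not addition)
```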
The Conflicting Training Objectives Problem
Modern LLMs undergo multiple training phases that often work against each other:
Pre-training optimizes for predicting the next token based on statistical co-occurrence in text. If mathematical papers frequently contain certain proof structures, the model learns to reproduce those patterns.
Fine-tuning and RLHF optimize for human preference signals—what evaluators rate as “helpful” or “correct.” This can push models toward confident-sounding outputs even when the underlying pattern-matching is uncertain.
Specialized post-training (like mathematical fine-tuning) trains specifically on human-designed tests and problems, teaching the model to produce outputs that score well on evaluations originally designed for human cognition.
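A schematic sketch of that tension, with entirely made-up numbers: the pretraining_loss below rewards matching the training text token by token, while the preference_reward caricatures a rater signal that partly favours confident tone. Neither function comes from any real training pipeline, and the 0.7/0.3 weighting is purely illustrative.

```python
import math

# Two schematic objectives that can pull a model in different directions.
# The functions and the 0.7/0.3 weights are invented for illustration;
# they do not come from any real training pipeline.

def pretraining_loss(token_probs):
    """Next-token objective: assign high probability to the tokens
    that actually appeared in the training text."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def preference_reward(confidence, is_correct):
    """Caricature of a preference signal: raters partly favour a
    confident tone, which only loosely tracks correctness."""
    return 0.7 * confidence + 0.3 * (1.0 if is_correct else 0.0)

print(pretraining_loss([0.9, 0.8, 0.95]))                    # ~0.13
print(preference_reward(confidence=0.95, is_correct=False))  # 0.665
print(preference_reward(confidence=0.40, is_correct=True))   # 0.58
```

Under these toy weights, a fluent but wrong answer can outscore a hedged but correct one, which is the kind of pull-in-different-directions effect the layered training setup can produce.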
These layers create what I call “statistical patchwork”—outputs that mimic the surface features of mathematical reasoning while being fundamentally disconnected from actual logical processes.
The Anthropomorphism Trap
Here’s where things get dangerous. When we see an LLM produce a correct mathematical result, our natural tendency is to anthropomorphize: “The AI proved the theorem,” “It solved the problem,” “It discovered a new bound.”
This language isn’t just imprecise—it’s misleading. It suggests the model engaged in the same cognitive processes humans use: understanding the problem, reasoning about relationships, deliberately constructing logical arguments. None of this actually happened.
What actually happened is more mundane but still remarkable: the model generated a candidate output that happened to align with mathematical validity. This success comes from probabilistic exploration across a vast combinatorial space of possible token sequences, not from understanding.
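That framing can be written down almost literally. In the sketch below, model_propose stands in for the LLM sampling a candidate (a bound, a proof step) and verify stands in for the external check (a human, a computer algebra system, or a proof assistant); both are hypothetical placeholders, not real APIs.

```python
import random

# Sketch of the "candidate generator + external verifier" framing.
# model_propose stands in for an LLM sampling a candidate; here it just
# draws random numbers. verify stands in for the external check, which
# is the only thing that confers correctness.
random.seed(0)

def model_propose() -> float:
    return random.uniform(0.0, 2.0)   # hypothetical candidate

def verify(candidate: float) -> bool:
    return candidate < 0.25           # placeholder external check

result = None
for attempt in range(1, 101):
    candidate = model_propose()
    if verify(candidate):
        result = (attempt, candidate)
        break

print(result)  # (attempt_number, candidate), or None if nothing passed
```

One hit in a hundred attempts tells you the search space was tractable and the verifier was good; it does not tell you the generator understood anything.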
A More Accurate Framework
Instead of “GPT-5-pro proved a new theorem,” we should say: “The model generated a candidate that happened to be correct.” This reframing captures several crucial points:
No understanding required: Success came from pattern matching, not comprehension
Probabilistic, not systematic: The model explored possible outputs statistically
Verification remains essential: Correctness could only be determined externally
One success ≠ systematic competence: A single correct output doesn’t imply reliable mathematical reasoning
Why This Matters
The anthropomorphism trap creates a cascade of problems:
Overgeneralization: We assume one correct result implies general mathematical competence
Overreliance: We reduce monitoring and verification, trusting the “AI mathematician”
Missed mistakes: We overlook errors because we believe the system “understands” what it’s doing
Conceptual confusion: We mistake sophisticated pattern recognition for intelligence
Resource misallocation: We invest in the wrong kinds of AI development and safety measures
The Real Achievement
Ernest Ryu’s analysis helps us calibrate what actually happened. GPT-5 performed what amounts to a computational search across a 6-dimensional problem space where the hard conceptual work—identifying the right base inequalities—was already known to domain experts. The remaining task was “mostly calculations,” which the model executed efficiently.
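To see why a low-dimensional search is routine once the ingredients are fixed, consider this hypothetical sketch: six multipliers for already-chosen base inequalities, scored by a placeholder objective standing in for the resulting bound, explored by a crude random search. The objective function is invented for illustration and has nothing to do with the actual paper.

```python
import random

# Hypothetical illustration of "search in a low-dimensional space once
# the ingredients are fixed": six multipliers for already-chosen base
# inequalities, scored by a placeholder objective standing in for the
# resulting bound. The objective is invented and unrelated to the paper.
random.seed(1)

def bound_from_multipliers(m):
    # Made-up smooth function of six multipliers; smaller is "tighter".
    return 1.0 + sum((x - 0.3 * i) ** 2 for i, x in enumerate(m))

best_value, best_m = float("inf"), None
for _ in range(10_000):
    m = [random.uniform(0.0, 2.0) for _ in range(6)]  # the 6-dimensional search
    value = bound_from_multipliers(m)
    if value < best_value:
        best_value, best_m = value, m

print(round(best_value, 3))  # the tightest "bound" this crude search found
```

The point is not the specific numbers but the shape of the task: once the objective is written down, exploring it is mechanical.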
This is genuinely useful. Having a system that can rapidly perform the computational heavy lifting that would take a PhD student hours is valuable for researchers. But it’s not a mathematical breakthrough—it’s sophisticated calculation assistance.
The achievement belongs to the system’s ability to navigate possibility spaces efficiently, not to any form of mathematical insight. It’s remarkable in the same way a powerful computer algebra system is remarkable: not for understanding mathematics, but for executing mathematical procedures at scale.
Moving Forward
We need better language and frameworks for discussing AI capabilities. Instead of asking “Can AI prove theorems?” we should ask “How reliably can current systems generate mathematically valid candidates?” Instead of celebrating AI “discoveries,” we should focus on improving our verification and monitoring systems.
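Measuring that reliability is straightforward in principle, as this sketch with hypothetical stand-ins suggests: sample many candidates, run each through an external verifier, and report the pass rate rather than celebrating a single hit.

```python
import random

# Sketch of the reframed question: what fraction of generated candidates
# survive external verification? generate_candidate and verify are
# hypothetical stand-ins, not real model or prover APIs.
random.seed(2)

def generate_candidate() -> float:
    return random.uniform(0.0, 1.0)

def verify(candidate: float) -> bool:
    return candidate > 0.9   # placeholder external check

trials = 1_000
passes = sum(verify(generate_candidate()) for _ in range(trials))
print(f"validity rate: {passes / trials:.1%}")  # a rate, not an "insight"
```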
Most importantly, we need to resist the seductive anthropomorphic framing that makes AI capabilities seem more human-like than they are. Models don’t solve; they generate candidates. Validity comes only from monitoring and verification. Until we internalize this distinction, we’ll continue mistaking pattern recognition for intelligence—and making dangerous assumptions about what these systems can and cannot do.
The future of AI-assisted mathematics is bright, but only if we understand what we’re actually working with: incredibly sophisticated autocomplete systems that sometimes stumble upon truth, not digital mathematicians who understand what they’re doing.