When “Smart” Isn’t Smart Enough: How LLMs Faked Their Way Into Math and Code (and gave us Agents)


Predicting Words, Not Solving Problems
Large Language Models (LLMs) are statistical models that predict word sequences based on patterns learned from large-scale training data. When transformer-based models trained on vast corpora were first tested, they seemed to exhibit unexpected emergent capabilities—like writing code or solving math problems.
This was surprising and exciting. But over time, it became clear that predicting word sequences isn’t the same as actually being able to code or calculate. LLMs could produce correct answers to simple arithmetic problems like 5+2=7 or even 200×197=39,400. These correct outputs often stemmed from memorized examples or learned token-level patterns, not true mathematical reasoning. And as you’d expect from a statistical pattern-matcher, those patterns break down with unfamiliar or complex inputs.
As a result, models would hallucinate answers to arithmetic problems they hadn’t memorized, and to this day they often invent non-existent package names, APIs, or function calls when generating code.
The Confidence Problem
There was buzz around the idea that LLMs could “understand” math or programming. At the same time, early users noted something unsettling: LLMs would deliver completely wrong answers with the same confidence as they delivered correct ones.
This is not just a UX problem—it’s a trust problem. A programmer or mathematician who gives both accurate and wildly inaccurate results with equal confidence is not someone you’d want on your team.
Tool Use as a Patch, Not a Fix
Instead of throwing out the baby with the bathwater, researchers and practitioners realized there was a workaround. LLMs do have two valuable abilities:
They can grasp the “shape” of how problems are solved in a domain (math, code, etc.)
They can follow structured instructions and examples
This opened the door to a new approach: Tool-Augmented Reasoning.
Imagine telling an LLM:
“Here’s a tool that performs a specific calculation.”
“Here’s what it does, what inputs it expects, and what its output looks like.”
“If you’re asked a question that requires this tool, don’t guess—just describe how to call the tool.”
With the right prompting, the LLM can then generate structured responses that describe the necessary tool invocations—sometimes in multiple steps—without directly invoking anything itself. These responses are then interpreted and executed by an external orchestrator or agent, which returns the result for the LLM to incorporate into the final answer.
This is not intelligence, but it is extremely useful. It’s the LLM acting like a collaborative planner or describer—not a doer.
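To make this concrete, here is a minimal sketch of what such a tool contract and the model's structured response might look like. The tool name, prompt wording, and JSON shape are illustrative assumptions, not a standard format:

```python
import json

# Hypothetical tool contract handed to the LLM in its system prompt.
TOOL_SPEC = {
    "name": "multiply",
    "description": "Multiply two numbers exactly.",
    "parameters": {"a": "number", "b": "number"},
}

SYSTEM_PROMPT = f"""
You have access to this tool:
{json.dumps(TOOL_SPEC, indent=2)}

If a question requires this calculation, do not compute it yourself.
Respond only with JSON of the form:
{{"tool": "multiply", "arguments": {{"a": <number>, "b": <number>}}}}
"""

# What a well-prompted model might emit for "What is 200 x 197?":
llm_output = '{"tool": "multiply", "arguments": {"a": 200, "b": 197}}'

# The model never runs anything; it only *describes* the call.
call = json.loads(llm_output)
print(call["tool"], call["arguments"])   # multiply {'a': 200, 'b': 197}
```

The key point is that the model's output is just data describing a call; something else decides whether and how to run it.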
Behind the Scenes: The Illusion of Improvement
As LLM-based chatbots (like ChatGPT) became widely used around late 2022, this technique was quietly put to work—especially to improve math performance. Internally, chatbot providers would orchestrate the LLM so that when calculations were needed, they weren’t attempted by the text generator itself. Instead, the LLM would output an intermediate response that an orchestrator would convert into a tool invocation.
The result? The chatbot gave a correct answer that seemed to come from the LLM. From the user’s perspective, it felt like the model had become better at math. And in a sense, it had—but not because its predictive engine had improved. The improvement came from wrapping the model in structured reasoning and delegation patterns.
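A stripped-down version of that orchestration loop might look like the following sketch. The JSON tool-call format is carried over from the earlier sketch, and `fake_llm` is a stand-in for whatever chat-completion API a real system would call:

```python
import json

def multiply(a: float, b: float) -> float:
    """Deterministic tool the orchestrator trusts for arithmetic."""
    return a * b

TOOLS = {"multiply": multiply}

def fake_llm(messages):
    """Stand-in for a real chat-completion call (assumed to return plain text)."""
    if any("Tool result" in m["content"] for m in messages):
        return "200 x 197 = 39,400."                 # final, tool-grounded answer
    return '{"tool": "multiply", "arguments": {"a": 200, "b": 197}}'

def answer(question: str, llm=fake_llm) -> str:
    messages = [{"role": "user", "content": question}]
    reply = llm(messages)                            # 1. model describes a tool call (or answers directly)
    try:
        call = json.loads(reply)                     # 2. structured intermediate response
    except json.JSONDecodeError:
        return reply                                 # no tool needed
    result = TOOLS[call["tool"]](**call["arguments"])  # 3. orchestrator runs the real tool
    messages += [
        {"role": "assistant", "content": reply},
        {"role": "user", "content": f"Tool result: {result}. Use it to answer."},
    ]
    return llm(messages)                             # 4. model folds the result into prose

print(answer("What is 200 x 197?"))                  # 200 x 197 = 39,400.
```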
Many LLMs are now deployed with external tools for executing complex math operations, especially in production-grade or agent-based systems. This is often referred to as tool augmentation: the LLM is paired with a math engine such as SymPy, Wolfram Alpha, or custom Python code. When math is detected, the query is routed to the math tool and the result is integrated into the response.
For example, if an LLM sees the prompt "What is the derivative of x² + 2x + 1?", it recognizes a symbolic math problem, matches it to an appropriate tool such as SymPy, and uses the tool to return the right answer: "2x + 2". If the prompt is "Plot sin(x) from -2π to 2π", the LLM can write the appropriate matplotlib Python code, have it executed in a sandbox, and return the plot image.
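The delegated work for those two prompts boils down to a few lines of ordinary Python once the orchestrator runs it. This is an illustrative sketch (assuming SymPy, NumPy, and Matplotlib are installed), not the internals of any particular chatbot:

```python
import sympy as sp
import numpy as np
import matplotlib
matplotlib.use("Agg")             # headless backend, as a sandbox would use
import matplotlib.pyplot as plt

# "What is the derivative of x^2 + 2x + 1?"  ->  SymPy, not token prediction
x = sp.symbols("x")
print(sp.diff(x**2 + 2*x + 1, x))             # 2*x + 2

# "Plot sin(x) from -2*pi to 2*pi"  ->  generated matplotlib code run in a sandbox
xs = np.linspace(-2 * np.pi, 2 * np.pi, 400)
plt.plot(xs, np.sin(xs))
plt.title("sin(x) from -2π to 2π")
plt.savefig("sin_plot.png")                   # the image file is returned to the user
```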
From Hidden Hack to Modern Pattern
What began as a behind-the-scenes patch is now an established design pattern. The same technique is core to today’s agentic frameworks using MCP (Model Context Protocol) servers, where LLMs coordinate with external tools to solve complex tasks. Whether it’s calling APIs, querying databases, or executing Python code, the idea is the same:
The LLM generates the plan, and the system executes it.
And crucially, this setup allows us to build reliable systems out of unreliable components—as long as we understand what each part is good at, and what it’s not.
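As one concrete instance of this pattern, here is a minimal sketch of a calculation tool exposed over MCP, assuming the FastMCP helper from the official MCP Python SDK; the server name and tool are illustrative:

```python
from mcp.server.fastmcp import FastMCP

# An MCP server advertising one deterministic tool that agents can call.
mcp = FastMCP("calculator")

@mcp.tool()
def multiply(a: float, b: float) -> float:
    """Multiply two numbers exactly."""
    return a * b

if __name__ == "__main__":
    mcp.run()   # serves the tool over MCP (stdio transport by default)
```

The LLM never imports or executes this code; it only learns from the server's advertised schema that a `multiply` tool exists and describes when to call it.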
TL;DR
LLMs predict words, not truths. But by giving them instructions, tools, and structured ways to delegate hard tasks like math and code execution, we can make them look a lot smarter than they really are. And that’s not a trick—it’s just good engineering.
Written by Bruce McCorkendale
SPM@Pangea | Entrepreneur | Cybersecurity Advisor