Vending-Bench: The Simulation Exposing LLMs' Long-Term Focus Problem

We're all pretty familiar with Large Language Models (LLMs) like the ones that power chatbots and content generators. They can write code, answer complex questions, and even create poetry. But usually, these interactions are short. You ask something, you get a response, maybe a few back-and-forth exchanges, and that's it.
But what about making an LLM handle a continuous, long-term task? Like running a small business day after day, making decisions, and remembering what happened last week? That's a much tougher challenge, and it's exactly what a new research paper explores with a benchmark called Vending-Bench.
What is Vending-Bench, and why test this?
Think of Vending-Bench as a simulation environment, kind of like a video game for AI agents. In this game, the AI's job is to operate a vending machine business.
Why a vending machine? Because it involves a series of relatively simple tasks that need to be repeated and managed consistently over a long period:
Ordering products (via simulated email!)
Managing inventory (knowing what you have in storage vs. in the machine)
Setting prices to attract customers
Collecting cash and paying daily operating fees
Individually, these tasks aren't super complex for an LLM. But doing all of them, day after day, adapting to sales, remembering outstanding orders, and staying financially healthy? That requires long-term coherence: the agent has to stay focused on the overall goal and keep making sensible decisions as time goes on. This is the key piece that seems to be missing from many LLMs right now.
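To make that concrete, here's a minimal Python sketch of what one simulated day in a Vending-Bench-style loop might look like. Everything here (the class, the function names, the fees, lead times, and demand model) is my own illustrative assumption, not the paper's actual interface:

```python
from dataclasses import dataclass, field

# All numbers below (starting cash, fees, costs, lead times, demand)
# are illustrative assumptions, not figures from the paper.
DAILY_FEE = 2.0

@dataclass
class Machine:
    cash: float = 500.0
    price: float = 2.50
    stock: int = 0
    pending: list = field(default_factory=list)  # (arrival_day, units)

    def receive_deliveries(self, day: int) -> None:
        # Orders placed earlier arrive once their lead time has elapsed.
        self.stock += sum(u for d, u in self.pending if d <= day)
        self.pending = [(d, u) for d, u in self.pending if d > day]

    def simulate_sales(self) -> None:
        # Toy demand model: lower prices sell more units per day.
        demand = max(0, int(10 - 2 * self.price))
        sold = min(self.stock, demand)
        self.stock -= sold
        self.cash += sold * self.price

def run(agent, days: int = 30) -> float:
    """Run the loop with an `agent` callable mapping state -> actions."""
    m = Machine()
    for day in range(days):
        m.receive_deliveries(day)
        actions = agent({"cash": m.cash, "stock": m.stock, "day": day})
        if "price" in actions:
            m.price = actions["price"]
        if "order" in actions and m.cash >= actions["order"] * 1.0:
            m.cash -= actions["order"] * 1.0               # wholesale unit cost
            m.pending.append((day + 3, actions["order"]))  # 3-day lead time
        m.simulate_sales()
        m.cash -= DAILY_FEE   # daily operating fee
        if m.cash <= 0:       # bankruptcy ends the run
            break
    return m.cash

# Even a trivial fixed policy survives here:
print(run(lambda state: {"order": 20, "price": 2.0}))
```

And that's the catch: a hard-coded policy like the one above stays coherent for free, but an LLM agent makes every one of these decisions through free-form text, day after day, which is exactly where the drift creeps in.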
What did the test reveal?
The researchers tested several different LLMs in the Vending-Bench simulation over many simulated "days" (which translates to millions of tokens and hours of real-world simulation time).
The results showed a few key things:
High Variance: Even the best-performing models, like Claude 3.5 Sonnet, had a wide range of outcomes. Some runs were successful and turned a profit, while others failed completely.
Failure Modes: All models had runs that "derailed". This often happened when the AI misinterpreted the situation, like assuming a product delivery had arrived before it actually had. Instead of recovering, they would get stuck or spiral into strange tangents ("meltdown" loops).
Humans Were More Reliable (in some ways): While the top models could achieve higher average profits, the human baseline run showed much lower variance. Humans were consistently reliable and didn't suffer the dramatic failures the models did.
The unusual finding: It's not just about forgetting
Here's one of the most interesting takeaways from the paper: the models' failures and performance drops didn't seem to happen simply because their "memory" (the context window) got full.
LLMs have a limited context window, basically how much information they can actively "remember" at any given time in a conversation or task. You might think that in a long simulation, they'd just forget earlier events once the context window fills up.
However, the study found that models often started to struggle or stopped making sales well after their context window was already full and no longer growing. The data showed no clear correlation between the memory limit being reached and performance degrading.
This suggests the problem isn't just a simple memory overflow issue. It's something deeper about maintaining a consistent strategy, understanding the state of the world over time, and recovering from minor errors without spiraling into unproductive loops.
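For intuition, here's a minimal sketch of the mechanism: a fixed-size context window that keeps only the most recent history. The token budget and the whitespace-based token count are simplifications I'm assuming for illustration; real models use their own tokenizers and limits:

```python
# Minimal sketch of a fixed-size context window. The budget and the
# whitespace-based token estimate are illustrative assumptions.

def trim_context(messages: list[str], max_tokens: int = 30_000) -> list[str]:
    """Keep only the most recent messages that fit in the token budget."""
    kept: list[str] = []
    used = 0
    for msg in reversed(messages):   # walk from newest to oldest
        cost = len(msg.split())      # crude stand-in for a real tokenizer
        if used + cost > max_tokens:
            break                    # everything older falls out of view
        kept.append(msg)
        used += cost
    return kept[::-1]                # restore chronological order
```

Once the history exceeds the budget, every new day pushes an old one out, so the amount of context the model actually sees stays constant from that point on. If degradation were just memory overflow, you'd expect failures to cluster around the moment the window first fills; instead, the paper found them scattered long after it.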
Some of the failure modes were quite striking, like one AI that got stuck and started drafting emails about "TOTAL NUCLEAR LEGAL INTERVENTION" over a missed delivery.
Why does this matter?
To build AI agents that can truly assist us with complicated, continuous tasks, like acting as our digital co-workers, it's crucial to know where they fall short in staying consistent over time.
Vending-Bench provides a valuable way to measure this specific capability and highlights that simply increasing an LLM's memory might not solve the core problem of keeping them on track over long time horizons. Future research will need to focus on how to make AI agents more robust, reliable, and capable of maintaining coherence in the face of accumulating information and potential setbacks.
References
Backlund, A., & Petersson, L. (2025). Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents. arXiv:2502.15840.