Tools - An Evaluations Agent by Braintrust


Your AI Assistant's Secret Weapon Just Got an Upgrade!
Ever wonder how the super-smart AI tools you use, like the ones that write emails for you or even predict what you want to buy, get so good? It's not magic (mostly!), it's a lot of hard work, especially something called "evals."
Think of evals (short for evaluations) as report cards for AI. They help developers figure out if their AI is actually doing what it's supposed to do, and more importantly, how to make it even better. For a long time, giving these report cards was a super manual, sometimes even boring, job. Imagine having to check thousands of homework assignments by hand every day – that's kind of what it felt like for AI developers!
The Old Way: A Lot of Staring at Dashboards
Companies like Braintrust work with some of the coolest AI builders out there. And guess what? These builders are running tons of evals: an average of 13 a day, with some super-advanced folks running over 3,000 evals daily! Plus, some developers spend over two hours every single day just looking at these evaluation results.
The problem? Even though these companies are building mind-blowingly automated AI products, the process of checking if they work well has been, ironically, very manual. It basically involved looking at a dashboard (like a giant progress report) and then trying to figure out what changes to make to their code or the instructions they give the AI (called "prompts"). It's a bit like trying to tune a super-fast race car by looking at a printed graph – you get the info, but the actual tuning still takes a lot of elbow grease.
Enter Loop: The AI That Helps Build Better AI!
But hold onto your hats, because this is all about to change! The folks at Braintrust have been cooking up something pretty special called Loop. And get this: Loop is an AI agent that runs inside the Braintrust product, and it's designed to take much of the manual work out of those AI report cards.
Loop is only possible thanks to some incredible recent breakthroughs in "frontier models," the most advanced AI models out there. One in particular, Claude 4, was a "real breakthrough moment." According to Braintrust, it's almost six times better than previous models at improving the crucial ingredients of an eval: prompts (the instructions you give the AI), data sets (the test cases you check the AI against), and scorers (how you judge if the AI did well).
So, What Does Loop Actually Do?
Imagine having a super-smart assistant who not only tells you what's wrong with your AI's homework but also suggests exactly how to fix it! That's Loop.
• It can automatically optimize your prompts, even for really complex AI systems.
• It helps you build better data sets (the test cases your AI is evaluated on).
• And it helps you create better scorers, so you know precisely how well your AI is performing.
These three things – prompts, data sets, and scorers – are like the secret sauce for great AI evaluations, and Loop helps perfect all of them.
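To make the three ingredients concrete, here is a minimal sketch of an eval in plain Python. It does not use Braintrust's SDK or show how Loop works internally; every name in it (`DATASET`, `PROMPT`, `fake_model`, `exact_match`, `run_eval`) is made up for illustration, and `fake_model` is a stand-in for a real model call.

```python
from statistics import mean

# Data set: test cases, each with an input and the output we expect.
DATASET = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "10 - 3", "expected": "7"},
    {"input": "6 * 7", "expected": "42"},
]

# Prompt: the instructions sent to the model for every test case.
PROMPT = "Answer with just the number: {question}"


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call: solves 'a op b'."""
    question = prompt.split(": ", 1)[1]
    left, op, right = question.split()
    a, b = int(left), int(right)
    return str({"+": a + b, "-": a - b, "*": a * b}[op])


def exact_match(output: str, expected: str) -> float:
    """Scorer: 1.0 if the output matches the expected answer, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def run_eval(prompt_template, dataset, model, scorer) -> float:
    """Run every test case through the model and average the scores."""
    scores = []
    for case in dataset:
        output = model(prompt_template.format(question=case["input"]))
        scores.append(scorer(output, case["expected"]))
    return mean(scores)


if __name__ == "__main__":
    print(run_eval(PROMPT, DATASET, fake_model, exact_match))  # prints 1.0
```

Improving an eval means changing one of these pieces: rewording the prompt, adding harder test cases to the data set, or swapping in a scorer that judges more than an exact match.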
And here's a neat part: when Loop suggests a change, you don't just get a mystery box. You can actually see the suggested edits to your data, your prompts, or your scorers right next to what you already have. This is super important because developers still need to see and understand what's happening before they accept a change. For the brave souls out there, there's even a toggle that basically says, "Just go for it!" and Loop will optimize away, which apparently works really well!
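As a rough picture of what "seeing the suggested edits next to what you already have" can look like, here is a small sketch that uses Python's standard `difflib` module to show a suggested prompt change as a diff. This is only an illustration, not Loop's actual interface; the two prompts and the `show_suggestion` helper are invented for the example.

```python
import difflib

# Hypothetical prompts: what you have now and what an assistant suggests.
current_prompt = """You are a helpful assistant.
Answer the question."""

suggested_prompt = """You are a helpful assistant.
Answer the question in one short sentence.
If you are unsure, say so."""


def show_suggestion(current: str, suggested: str) -> str:
    """Return a unified diff so a reviewer can see exactly what would change."""
    diff = difflib.unified_diff(
        current.splitlines(),
        suggested.splitlines(),
        fromfile="current",
        tofile="suggested",
        lineterm="",
    )
    return "\n".join(diff)


if __name__ == "__main__":
    print(show_suggestion(current_prompt, suggested_prompt))
```

Lines starting with `-` would be removed and lines starting with `+` would be added, so you can review a suggestion before accepting it.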
A Little Thought on the Future
This shift from manual evaluation to automated optimization is huge. It's like moving from drawing maps by hand to having a GPS that not only tells you where to go but also reroutes you automatically for traffic. This means AI developers can spend less time on tedious checking and more time on the truly creative and challenging parts of building incredible AI. Imagine the new frontiers of AI innovation we'll see when the best minds are freed from the mundane and can focus on the magical!
Ready to Jump In?
The Braintrust team believes that over the next year, evals are going to be "completely revolutionized" by these new frontier models, and they're already building that future into their product. If you're building AI, or just curious, they encourage you to try out Loop and give feedback. They're even hiring if you're interested in working on these cutting-edge problems!
Ultimately, this is about making it easier for brilliant people to build even more brilliant AI. The future of AI just got a whole lot more efficient, and that's exciting for everyone!
My personal suggestion? Keep an eye on these kinds of "meta-AI" tools – the AI that helps build other AI. They're often the unsung heroes accelerating innovation behind the scenes, and they'll be key to unlocking even more mind-blowing AI capabilities in the years to come.
Written by Raj Tripathi
