Notes on evals in LLM-based applications
A big lesson I've learned while building LLM-based applications: "Unless you have a closed feedback loop, nothing you do will be good enough."
Allow me to explain.
When building an LLM app, I constantly found myself chasing hundreds of different optimizations. In reality, only a handful of them have a real impact on the final product.
The best way (IMO) to approach this is to start your MVP with a solid evaluation dataset. Build a small feedback loop around it, and let the feedback on real (production) data guide your decisions from there!
Take, for example, chunk size in the retrieval step of a RAG pipeline. Don't fiddle with it too much in the beginning trying to see what gives the best result. Just because someone posted on r/LocalLLaMA that 350 tokens worked for them doesn't mean it will work for you.
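To make the chunk size point concrete, here is a rough sketch of what "let the dataset decide" could look like. Everything in it is an assumption for illustration: tokens are approximated by whitespace-separated words, the retriever is a toy word-overlap scorer rather than a real embedding search, and the names chunk_words, retrieve and hit_rate are made up. The only point is that each chunk size gets scored against the same eval set.

```python
def chunk_words(text: str, chunk_size: int) -> list[str]:
    """Split text into chunks of at most chunk_size whitespace-separated words.

    Assumption: words stand in for tokens to keep the sketch dependency-free.
    """
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def retrieve(question: str, chunks: list[str]) -> str:
    """Toy retriever: return the chunk sharing the most words with the question."""
    q_words = set(question.lower().split())
    return max(chunks, key=lambda c: len(q_words & set(c.lower().split())))


def hit_rate(corpus: str, eval_set: list[dict], chunk_size: int) -> float:
    """Fraction of eval examples whose expected text appears in the retrieved chunk."""
    chunks = chunk_words(corpus, chunk_size)
    hits = sum(
        ex["expected"].lower() in retrieve(ex["question"], chunks).lower()
        for ex in eval_set
    )
    return hits / len(eval_set)


if __name__ == "__main__":
    # Hypothetical corpus and eval set, purely for illustration.
    corpus = (
        "Refunds are accepted within 30 days of purchase. "
        "Shipping to Europe takes 5 business days. "
        "Support is available by email on weekdays."
    )
    eval_set = [
        {"question": "How long do refunds take to expire?", "expected": "30 days"},
        {"question": "How long does shipping to Europe take?", "expected": "5 business days"},
    ]
    for size in (4, 8, 16):
        print(size, hit_rate(corpus, eval_set, size))
```

In a real pipeline you would swap the toy retriever for your actual retrieval step and keep the scoring loop as it is.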
Build a dataset, and keep adding curated examples from production or varied synthetic data to it.
The aim of building a dataset is to add as many different kinds of examples as you can. If your product works terribly on a specific use case, that is the best kind of data to have in your dataset.
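A minimal sketch of what such a dataset could look like in code, assuming a simple question/expected-answer format. The EvalExample and EvalDataset names, the source and tags fields, and the "known_failure" tag are all hypothetical. The idea is just to record where each example came from, tag the use cases it covers, and avoid adding the same question twice.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass
class EvalExample:
    question: str
    expected: str
    source: str  # assumption: "production" or "synthetic"
    tags: list[str] = field(default_factory=list)

    @property
    def key(self) -> str:
        """Stable id based on the question, ignoring case and extra whitespace."""
        normalized = " ".join(self.question.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class EvalDataset:
    def __init__(self) -> None:
        self._examples: dict[str, EvalExample] = {}

    def add(self, example: EvalExample) -> bool:
        """Add an example; return False if an equivalent question already exists."""
        if example.key in self._examples:
            return False
        self._examples[example.key] = example
        return True

    def coverage(self) -> dict[str, int]:
        """Count examples per tag, to spot use cases that are thinly covered."""
        counts: dict[str, int] = {}
        for ex in self._examples.values():
            for tag in ex.tags:
                counts[tag] = counts.get(tag, 0) + 1
        return counts

    def to_jsonl(self) -> str:
        """Serialize the dataset as one JSON object per line."""
        return "\n".join(json.dumps(asdict(ex)) for ex in self._examples.values())
```

Tagging the cases where the product fails badly (for example with "known_failure") makes it easy to pull them out later and check whether a change actually fixed them.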
Create some kind of automation around it, because doing this manually is a huge time sink. That means adding to and updating datasets, running some form of eval, and getting an actionable item out of it. Do this ASAP.
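As a rough idea of the "run an eval and get an actionable item" part, here is a small sketch. It assumes each example is a dict with question, expected and optional tags, that generate is whatever function calls your app (a hypothetical stand-in here), and that a case-insensitive substring match is a good enough scorer, which it often won't be. The "actionable item" is simply the tag with the lowest pass rate plus the list of failing questions.

```python
from collections import defaultdict
from typing import Callable


def run_eval(examples: list[dict], generate: Callable[[str], str]) -> dict:
    """Run every example through `generate` and summarize pass rates per tag."""
    per_tag = defaultdict(lambda: [0, 0])  # tag -> [passed, total]
    failures = []
    for ex in examples:
        answer = generate(ex["question"])
        # Assumption: substring match as the scorer; swap in your own metric.
        passed = ex["expected"].lower() in answer.lower()
        if not passed:
            failures.append(ex["question"])
        for tag in ex.get("tags") or ["untagged"]:
            per_tag[tag][0] += int(passed)
            per_tag[tag][1] += 1

    by_tag = {tag: passed / total for tag, (passed, total) in per_tag.items()}
    worst_tag = min(by_tag, key=by_tag.get) if by_tag else None
    pass_rate = (len(examples) - len(failures)) / len(examples) if examples else 0.0
    return {
        "pass_rate": pass_rate,
        "by_tag": by_tag,
        "worst_tag": worst_tag,  # the place to look first
        "failures": failures,
    }
```

Run something like this on every change (a script, a CI job, whatever is cheapest for you) so that a prompt or parameter tweak always comes with a before/after number.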
Also, the worst part is that there are no "ready-made" tools for this yet. No framework is mature enough to handle your use case, so there is a good chance you'll have to build it in-house.
Chasing optimizations in a pipeline is a kind of dream for many developers (at least for me). Building this feedback loop isn't straightforward, mostly because of the manual work involved in the first few steps (creating datasets, ground truths...), so it's very easy to slip into "let me just change the temperature from 0.2 to 0.6 and see if my results get better" mode.
Stop tweaking based on vibes and systemize your vibes, quickly!
The image used is from https://dlite.cc/2023/10/04/2023-eval-rag-apps.html
Written by Dhaval Singh