Improving LLM Workflow Evaluation: How to Sidestep Common Mistakes

GuruGen

To make your RAG pipeline or agent accurate and avoid hallucinations, we need metrics to verify the output of the agent. LLM-as-a-judge is one such method. It can be used with multiple strategies, and the most common one is to ask for an integer score on the generated content (e.g. rate the response on a scale of 1-10 for the given criteria).
The LLM does its job and gives an output, but ranking the output on such numerical scores will give you very uneven results. It is not accurate, and it can hamper your pipeline by feeding back inaccurate evaluations.
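For reference, here is a minimal sketch of that naive numeric-score setup, assuming a hypothetical `call_llm(prompt)` helper that wraps whatever model client you use:

```python
# A minimal sketch of the naive numeric-score judge described above.
# `call_llm` is a hypothetical helper that sends a prompt to your model
# and returns its text completion; swap in your own client.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def naive_numeric_judge(transcript: str, summary: str) -> int:
    prompt = (
        "You are a judge. Rate the following summary of a call transcript "
        "on a scale of 1-10 for accuracy and completeness. "
        "Reply with a single integer.\n\n"
        f"Transcript:\n{transcript}\n\nSummary:\n{summary}"
    )
    # The score depends entirely on the prompt and the model's reasoning;
    # repeated runs on the same input can return noticeably different numbers.
    return int(call_llm(prompt).strip())
```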

Why is it problematic?

The issue is that the broader the scope we leave to the LLM, and the more token options we give it, the more likely it is to hallucinate and fail. Here we are telling the LLM to rank the output on a scale of 1-10 (suppose we are scoring the summary of a transcript). It will do some reasoning and produce a score based on your prompt, but there is no solid methodology backing that number. The score will differ from run to run, and that alone isn't the issue; the issue is that the spread between these results is so wide.
This makes it even more important to have solid validation of your AI responses and generated content.
We can’t rely on numbers like this. In some cases a simple 0 or 1 (binary validation) might work where the task is simple, but RAG pipelines and AI workflows (agents) aren’t simple these days. They are customized to the business need and to what the user wants, and accordingly they face different problems that need dedicated solutions for the particular scenario, taking its context into account.
Unlike established engineering principles, we still don’t have universal patterns for AI workflows; we need to tailor them to the needs of the business. Still, there are certain learnings from the community that we can use to improve these workflows.
So if this isn’t the way to validate, what should we use? Let’s discuss that in detail in the next section.

Custom validations and rubrics

As I said above, AI workflows need to be very specific to the business's needs, and they have specific requirements for which we might not have universal patterns or solutions.
In such cases we need to implement custom validations based on the requirement, and instead of asking for a numerical score we can use a rubric-based approach.

Let’s consider the transcript-summarization example from above. Initially we were just summarizing the transcript and passing it to an LLM-as-a-judge to rank the response on a scale of 1-10. Now let’s think of a different, more effective approach.
We will first generate a quiz of long-answer questions based on the transcript of the call.
Quiz - Question : Answer
The quiz is a set of key-value pairs: questions for the judge to answer when scoring the summary, each paired with a reference answer drawn from the transcript.
Once the summary is generated, our LLM-as-a-judge node tries to solve this quiz using only the generated summary. This gives us a much more solid and accurate picture of how well the summarization worked.
We can then score the summary by how many questions the judge was able to answer correctly.
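Here is a minimal sketch of this quiz-based judge, again assuming a hypothetical `call_llm(prompt)` helper; the prompts, the JSON quiz format, and the scoring rule are illustrative, not a fixed recipe:

```python
import json

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def generate_quiz(transcript: str, n_questions: int = 5) -> dict[str, str]:
    """Build question -> reference-answer pairs from the full transcript."""
    prompt = (
        f"Read this call transcript and write {n_questions} long-answer "
        "questions that test understanding of its key points. Return a JSON "
        "object mapping each question to its reference answer, and nothing "
        "else.\n\n" + transcript
    )
    # Assumes the model returns valid JSON; add retries/parsing guards in practice.
    return json.loads(call_llm(prompt))

def grade_summary(summary: str, quiz: dict[str, str]) -> float:
    """Have the judge answer each question using ONLY the summary, then
    check against the reference. Score = fraction answered correctly."""
    correct = 0
    for question, reference in quiz.items():
        answer = call_llm(
            "Answer using only the summary below. If the summary does not "
            "contain the answer, say 'NOT IN SUMMARY'.\n\n"
            f"Summary:\n{summary}\n\nQuestion: {question}"
        )
        verdict = call_llm(
            "Does the candidate answer convey the same facts as the "
            "reference? Reply YES or NO.\n\n"
            f"Reference: {reference}\nCandidate: {answer}"
        )
        if verdict.strip().upper().startswith("YES"):
            correct += 1
    return correct / len(quiz)
```

The score is now anchored to concrete, checkable questions: a summary that drops a key point fails the corresponding quiz item, rather than just nudging an arbitrary 1-10 number down.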

What’s the difference?

If you look at the second approach, the score comes with solid evidence that justifies it, whereas the first approach relies only on the system prompt and the reasoning of the LLM. This is just one example, tailored to transcript summarization; different scenarios need different solutions tailored to them.

Let’s take one more example. Suppose you are working on an extraction pipeline which takes multiple PDFs and docs, embeds them, and later extracts information about employees. Your pipeline may extract fields like:
Name, Department, Joining date, Team, Email, etc.
In such scenarios, one effective solution is to first understand the set of constants that you have.
Here, Department is a constant set of items: an employee will always belong to one of the predefined items in the [Department] list, which contains the specific departments within the organization, such as Human Resources, Marketing, Sales, Engineering, and Finance. With a fixed set of departments, it becomes easier to categorize and organize employee data accurately, and the pipeline can quickly identify which department an employee belongs to by matching against this list. For this matching we don’t rely on the LLM at all; we can use deterministic code to take the structured output the LLM generated and simply match it against the available list of departments, as sketched below. This ensures consistency and reduces errors in data processing, making the extraction more efficient and reliable.
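Here is a minimal sketch of that deterministic check, assuming the extracted record arrives as a plain dict; the department list and field names are illustrative:

```python
# Deterministic post-check on LLM-extracted fields: no second LLM call,
# just plain code matching the output against a known, closed set.

DEPARTMENTS = {"Human Resources", "Marketing", "Sales", "Engineering", "Finance"}

def validate_department(record: dict) -> bool:
    """Return True only if the extracted department is one of the
    predefined values (after trimming whitespace and ignoring case)."""
    value = record.get("department", "").strip()
    # Case-insensitive match against the canonical list.
    matches = {d for d in DEPARTMENTS if d.lower() == value.lower()}
    if matches:
        record["department"] = matches.pop()  # normalize to the canonical form
        return True
    return False

record = {"name": "Jane Doe", "department": "engineering "}
assert validate_department(record)
assert record["department"] == "Engineering"
```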

Don’t always rely on LLMs for validations

Use deterministic code validations

Just as we saw in the example above, we can use simple techniques like these to better validate LLM responses. We can leverage our classic deterministic code pipelines for fine-grained validations like these across multiple areas of the LLM-generated output, as sketched below. This really turns your workflow into a solid, fault-tolerant system.
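The same idea extends to the other extracted fields. Below is a sketch of a few such checks; the field names, the ISO date assumption, and the email regex are all illustrative, not a prescribed schema:

```python
import re
from datetime import datetime

# Rough email shape check; tighten or swap for a proper validator as needed.
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")

def validate_record(record: dict) -> list[str]:
    """Run cheap deterministic checks over one extracted employee record
    and return a list of human-readable problems (empty list = passed)."""
    problems = []
    if not record.get("name", "").strip():
        problems.append("name is missing")
    if not EMAIL_RE.match(record.get("email", "")):
        problems.append(f"email looks malformed: {record.get('email')!r}")
    try:
        # Assuming dates are expected in ISO format (YYYY-MM-DD).
        datetime.strptime(record.get("joining_date", ""), "%Y-%m-%d")
    except ValueError:
        problems.append(f"joining_date is not ISO formatted: {record.get('joining_date')!r}")
    return problems

print(validate_record({"name": "Jane Doe", "email": "jane@example.com",
                       "joining_date": "2023-04-01"}))  # -> []
```

Checks like these cost microseconds, never hallucinate, and give you precise error messages to route back into the pipeline.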

Above I mentioned rubric-based validation; I think I’ll cover it in a dedicated post.

Summary

The article discusses the inefficiencies of using a Large Language Model (LLM) as a judge in AI workflows, particularly in ranking outputs with numerical scores. It highlights the issues with relying on LLMs for validation due to their potential for hallucination and inconsistent results. The article suggests using custom validations and rubric-based approaches tailored to specific business needs. It emphasizes the importance of deterministic code for validation to ensure accuracy and reliability in AI-generated outputs.

Key Takeaways

  1. LLM Limitations: Using LLMs to rank outputs with numerical scores can lead to inconsistent and inaccurate results due to their potential for hallucination.

  2. Custom Validations: AI workflows should be tailored to specific business needs, using custom validations instead of relying solely on LLMs.

  3. Rubric-Based Approach: Implementing a rubric-based approach can provide more accurate validation by assessing specific criteria rather than relying on numerical scores.

  4. Deterministic Code: Leveraging deterministic code for validation can improve the accuracy and reliability of AI-generated outputs, reducing errors in data processing.

  5. Tailored Solutions: Different scenarios require tailored solutions, emphasizing the need for specific strategies to address unique challenges in AI workflows.

Thank you for reading.


Written by

GuruGen

Hi, I'm Vikrant, a passionate software developer with a strong belief in the power of teamwork, empathy, and getting things done. With a background in building scalable and efficient backend systems, I've had the privilege of working with a range of technologies that excite me - from Express.js, Flask, and Django to React, PostGres, and MongoDB Atlas. My experience with Azure has given me a solid understanding of cloud infrastructure, and I've had a blast building and deploying applications that make a real impact. But what really gets me going is exploring the frontiers of AI and machine learning. I've had the opportunity to work on some amazing projects, including building advanced RAG applications, fine-tuning models like Phi2 on custom data, and even dabbling in web3 and Ethereum. For me, it's not just about writing code - it's about understanding the people and problems I'm trying to solve. I believe that empathy is the unsung hero of software development, and I strive to bring a human touch to everything I do. Whether it's collaborating with colleagues, communicating with clients, or simply trying to make sense of complex technical concepts, I'm always looking for ways to make technology more accessible and more meaningful. If you're looking for a team player who is passionate about building innovative solutions, let's connect! I'm always up for a chat about the latest tech trends, or just about life in general.