The Four Horsemen of the Judging Apocalypse


Unveiling the Judging Spectrum: From Perfection to Bias
There are various types of Judgements, and recently I've been contemplating the emergence of a new type with the advent of AI Judges. I want to share the mental model I've been using to understand these different Judgements and how you might apply them. The diagram above illustrates a continuum: on the far left sits the theoretically "perfect" assessment of whether a document matches a query, and on the far right sits the Judgement that most closely reflects our users' unconscious biases, making it both the most useful and potentially the most dangerous Horseman.
The First Horseman: Mastering Truth with Subject Matter Experts
Our first horseman is the Subject Matter Expert Judgement. An SME is a person with deep knowledge of the corpus we are evaluating. They typically understand the specifics of the language being used and can be considered to provide a "pure" truth of an answer. They are the ones who can give the accurate answer to "Is a tomato a vegetable?": no, it is a fruit. If your goal in evaluating query/document combinations is to establish "is this a fact" or "a truth" in isolation from any specific use case, then the SME is the powerful weapon you can reach for.
Riding with the Second Horseman: The Insightful Expert User
The second horseman, and the one I ride with the most, is the Expert User. The Expert User understands both the corpus we are evaluating AND who our users are and what their goals are. They are the ones who can answer, on behalf of gardeners everywhere, that for the query "What are some good late spring vegetables I can grow?" tomatoes are a good choice. For the query "What are some good late spring fruits to grow?", apricots, raspberries, and blackberries are relevant, and tomatoes are, well, a confusing recommendation at a minimum! I'll grab twenty queries from the logs and sit down with this person, and at the end of an hour have a MUCH stronger understanding of the data and what the users want to do. The Expert User understands the overall biases of our users, even if they can't always articulate them.
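To make that session concrete, here is a minimal sketch of how I might pull the sample and capture the Expert User's grades into a judgement list. The file names and the 0-3 grading scale are my own illustrative choices, not a standard:

```python
import csv
import random

# Hypothetical paths -- substitute your own query log and output file.
QUERY_LOG = "queries.log"          # one raw query per line
JUDGEMENT_FILE = "judgements.csv"  # the judgement list we build together

# Pull twenty random queries from the logs to review with the Expert User.
with open(QUERY_LOG) as f:
    queries = [line.strip() for line in f if line.strip()]
sample = random.sample(queries, k=min(20, len(queries)))

# During the session, record a graded judgement (0=irrelevant .. 3=perfect)
# for each query/document pair the Expert User looks at.
with open(JUDGEMENT_FILE, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["query", "doc_id", "grade"])
    for query in sample:
        doc_id = input(f"Doc id shown for '{query}': ")
        grade = input("Expert User's grade (0-3): ")
        writer.writerow([query, doc_id, grade])
```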
The AI-Powered Horseman: A New Era of Judgement
Now, this next horseman is the newcomer to the scene, and it is getting lots of attention: the AI-powered horseman, sometimes called the LLM-as-a-Judge. There are as many papers arguing that AI Judges don't work as there are arguing that they do. Having waged battle multiple times to get access to SMEs and Expert Users, and typically lost, I am very interested in the potential for AI to mimic the Judgements of SMEs and Expert Users, assuming you have the right prompts and a domain that is close to what the models were initially trained on.
One area where an LLM Judge can go beyond an Expert User, however, is in keeping multiple competing interests in mind at the same time and moving beyond a pure relevance judgement to one that includes specific intentional biases. For example, maybe I am evaluating results where, beyond a pure relevance match, recency also matters for some types of queries. With a human, we have them ONLY evaluate relevance, leaving recency for a separate evaluation process, because it's too hard to explain what "some types of queries" means and have them apply it consistently. However, with the ability to prompt an LLM Judge with very long context, we can provide the information to evaluate both recency and relevance in the same score! It's as if the judge re-reads the judgement guide before every single evaluation. I'm looking forward to hearing more about going beyond just relevance, and I think this will streamline our search quality improvement processes by giving us a measuring stick that includes the specific contexts we care about.
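Here's a rough sketch of what that combined prompt might look like. The judging guide text, the 0-3 scales, the blend weights, and the call_llm hook are all illustrative assumptions on my part, not a recipe from any particular paper or library:

```python
import json
import textwrap

# An illustrative judging guide that rides along with EVERY evaluation,
# which is what keeps the judge consistent about "some types of queries".
JUDGING_GUIDE = textwrap.dedent("""\
    You are judging search results for a gardening site.
    Score RELEVANCE from 0 (irrelevant) to 3 (perfect match).
    Score RECENCY from 0 to 3, but ONLY for queries where freshness
    matters (e.g. "this season", "new varieties"); otherwise use 3.
    Respond with JSON: {"relevance": <0-3>, "recency": <0-3>}.
    """)

def build_judge_prompt(query: str, doc_text: str, doc_date: str) -> str:
    # Assemble the full context: guide, query, and document in one prompt.
    return (
        f"{JUDGING_GUIDE}\n"
        f"Query: {query}\n"
        f"Document (published {doc_date}):\n{doc_text}\n"
    )

def judge(query: str, doc_text: str, doc_date: str, call_llm) -> float:
    # call_llm is a stand-in for whatever LLM client you actually use;
    # it's assumed to return the model's raw JSON reply as a string.
    raw = call_llm(build_judge_prompt(query, doc_text, doc_date))
    scores = json.loads(raw)
    # One illustrative way to blend the two criteria into a single score.
    return 0.7 * scores["relevance"] + 0.3 * scores["recency"]
```

The design point is that the human equivalent, re-reading the full guide before every single grade, is impractical, while the LLM pays that cost happily on every call.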
The Final Horseman: Harnessing Implicit Judgements
The last horseman is the one that models our users' unconscious biases the most closely: the use of Implicit Judgements. Since we are using actual user clicks to decide how to score our query/document pairs, we are capturing the true desires of our users. While red, green, and yellow tomatoes are all valid choices for a search about tomatoes, I bet if you looked at user click traffic you'd find that we are biased towards those big juicy red ones, so red/green/yellow is probably the right order for surfacing some tomatoes ;-). So are Implicit Judgements the best choice? Well, they require a lot of data, they require a lot of users, and they treat as positive signals biases that may actually be harmful to our users meeting their goals!
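For the curious, here is a toy sketch of the basic mechanic, using raw click-through rate as the implicit grade. The record format is made up for the example, and a real pipeline needs much more: position-bias correction, click models, and sensible traffic thresholds:

```python
from collections import defaultdict

def implicit_judgements(log_records, min_impressions=10):
    """Turn (query, doc_id, clicked) click-log records into CTR grades."""
    impressions = defaultdict(int)
    clicks = defaultdict(int)
    for query, doc_id, clicked in log_records:
        impressions[(query, doc_id)] += 1
        if clicked:
            clicks[(query, doc_id)] += 1
    # Click-through rate as a (very biased!) stand-in for a relevance
    # grade; require enough traffic before trusting the signal at all.
    return {
        pair: clicks[pair] / n
        for pair, n in impressions.items()
        if n >= min_impressions
    }

# Example: red tomatoes out-click green and yellow for the query "tomatoes".
logs = (
    [("tomatoes", "red_beefsteak", True)] * 8
    + [("tomatoes", "red_beefsteak", False)] * 2
    + [("tomatoes", "green_zebra", True)] * 3
    + [("tomatoes", "green_zebra", False)] * 7
    + [("tomatoes", "yellow_pear", True)] * 2
    + [("tomatoes", "yellow_pear", False)] * 8
)
print(implicit_judgements(logs))
# {('tomatoes', 'red_beefsteak'): 0.8, ('tomatoes', 'green_zebra'): 0.3,
#  ('tomatoes', 'yellow_pear'): 0.2}
```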
Balancing Accuracy and Relevance: The Four Horsemen Framework
Understanding the Four Horsemen of the Judging Apocalypse (Subject Matter Expert Judgements, Expert User Judgements, AI-powered Judgements, and Implicit Judgements) provides a framework for balancing factual accuracy with user-centric relevance. Leveraging these perspectives enhances search quality and aligns results with user needs, streamlining your improvement processes. I hope that listing out the Four Horsemen of the Judging Apocalypse in order, from most purely accurate to most reflective of user desires, helps you think about how to measure your search problem!