Building a RAG Evaluation MCP Server in One Week: Condensed Learning and Practical AgentOps

Author: Sathishkumar Thirumalai
Project: RAG Evaluation MCP Server
Event: Hugging Face Agents-MCP Hackathon (Track 1)
Duration: 7 Days


🧩 Context

As part of the Hugging Face MCP Hackathon, I built a modular and extensible RAG Evaluation MCP Server — a suite of structured tools for evaluating Retrieval-Augmented Generation (RAG) pipelines.

Rather than focus on a demo or proof of concept, I approached this as a production-aligned tool, capable of supporting agentic workflows, LLM clients, and structured tool dispatch using the Model Context Protocol (MCP).


🛠️ What I Built: Tool Categories

The MCP server exposes tools across three core categories:


🔍 Retriever Evaluation Tools

  • BM25 Relevance Scorer – Lexical similarity score using the BM25 algorithm (see the sketch after this list)

  • Embedding-based Relevance Scorer – Semantic similarity via sentence transformers

  • Redundancy Checker – Detects duplicate or overlapping retrieved chunks

  • Query Coverage Analyzer – Checks if all major query concepts are addressed in the retrieved documents
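
To make the retriever tools concrete, here is a minimal sketch of a BM25-style relevance scorer. It uses the `rank_bm25` package, which is an assumption on my part; the server's actual tool may tokenize and normalize differently.

```python
# Minimal sketch of a BM25 relevance scorer (illustrative, not the project's exact code).
# Assumes the rank_bm25 package: pip install rank-bm25
from rank_bm25 import BM25Okapi


def bm25_relevance(query: str, chunks: list[str]) -> list[float]:
    """Score each retrieved chunk against the query with BM25."""
    tokenized_chunks = [chunk.lower().split() for chunk in chunks]
    bm25 = BM25Okapi(tokenized_chunks)
    return list(bm25.get_scores(query.lower().split()))


if __name__ == "__main__":
    chunks = [
        "BM25 is a lexical ranking function used in search engines.",
        "Sentence transformers produce dense embeddings for semantic search.",
    ]
    print(bm25_relevance("what is BM25 ranking", chunks))
```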


✍️ Generator Evaluation Tools

  • Exact Match Checker – Checks for literal inclusion of query facts in the output

  • Faithfulness Checker – Flags hallucinations or unsupported claims using retrieval-grounded checks (a heuristic version is sketched after this list)

  • Factual Consistency Scorer – Measures consistency between the retrieved content and generated answer
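
As a rough illustration of the faithfulness-style checks, the sketch below flags answer sentences whose tokens are poorly covered by the retrieved context. It is a simple lexical heuristic under my own assumptions (the function name and threshold are illustrative), not the project's actual scoring logic.

```python
# Heuristic faithfulness-style check (illustrative only, not the project's actual logic).
import re


def unsupported_claims(answer: str, retrieved: list[str], threshold: float = 0.5) -> list[str]:
    """Return answer sentences whose tokens are poorly covered by the retrieved context."""
    context_tokens = set(re.findall(r"\w+", " ".join(retrieved).lower()))
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        tokens = set(re.findall(r"\w+", sentence.lower()))
        if not tokens:
            continue
        coverage = len(tokens & context_tokens) / len(tokens)
        if coverage < threshold:
            flagged.append(sentence)
    return flagged


if __name__ == "__main__":
    docs = ["The Eiffel Tower is 330 metres tall and located in Paris."]
    answer = "The Eiffel Tower is 330 metres tall. It was painted gold in 2020."
    print(unsupported_claims(answer, docs))  # flags the second, unsupported sentence
```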


🧪 System Evaluation Tools

  • End-to-End Relevance Report – Combines retriever + generator signals into a unified diagnostic

  • Failure Case Annotator – Tags known failure modes (e.g., partial match, unsupported claim) in outputs

  • Confidence Weighted Report – Assigns confidence scores to each tool’s output for downstream aggregation (sketched after this list)
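
Aggregation itself can stay simple: the sketch below combines per-tool scores into one diagnostic using a confidence-weighted average. The field names `score` and `confidence` are illustrative assumptions, not the server's actual schema.

```python
# Confidence-weighted aggregation across tool outputs (schema is an illustrative assumption).
def weighted_report(tool_outputs: dict[str, dict[str, float]]) -> float:
    """Combine per-tool scores into a single diagnostic, weighted by each tool's confidence."""
    total_weight = sum(o["confidence"] for o in tool_outputs.values())
    if total_weight == 0:
        return 0.0
    return sum(o["score"] * o["confidence"] for o in tool_outputs.values()) / total_weight


if __name__ == "__main__":
    outputs = {
        "bm25_relevance": {"score": 0.72, "confidence": 0.9},
        "faithfulness": {"score": 0.55, "confidence": 0.6},
    }
    print(round(weighted_report(outputs), 3))
```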


🔁 Why This Project Matters

Evaluating RAG systems goes far beyond asking “does the answer look correct?”

This server formalizes that process by offering structured, explainable evaluation, enabling:

  • Deep inspection of retrieval quality

  • Breakdown of generative hallucinations

  • Modular testing of each stage in the pipeline

  • Integration into LLM agents via MCP clients


📦 Tech Stack

  • Interface: Gradio (tabbed layout for tool categories)

  • Back-end: Python-based tools using BM25, SentenceTransformers, and heuristic logic

  • Integration: Exposed as an MCP server endpoint (see the launch sketch after this list)

  • Client Compatibility: Works with Cline, Cursor, Claude workflows, or any MCP-compliant agent
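
For the integration piece, recent Gradio releases can expose an app's functions over MCP directly. The sketch below shows the general shape, assuming `pip install "gradio[mcp]"` and using a placeholder function in place of the real evaluation tools.

```python
# Sketch of exposing a tool as an MCP endpoint via Gradio (placeholder logic).
import gradio as gr


def bm25_relevance_scorer(query: str, documents: str) -> str:
    """Score newline-separated documents against a query (stand-in for the real tool)."""
    # In the real server this would call the BM25 scorer sketched earlier.
    return f"Scored {len(documents.splitlines())} documents for query: {query!r}"


demo = gr.Interface(
    fn=bm25_relevance_scorer,
    inputs=[gr.Textbox(label="Query"), gr.Textbox(label="Documents (one per line)", lines=5)],
    outputs=gr.Textbox(label="Report"),
)

if __name__ == "__main__":
    # mcp_server=True serves the function as an MCP tool alongside the web UI.
    demo.launch(mcp_server=True)
```

Any MCP-compliant client can then point at the server's MCP endpoint; in the actual Space, a tabbed layout groups the tools by the three categories above.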


🧠 Condensed but Deep Learning

Despite the short 7-day window, this was an intensely compressed learning experience involving:

  • Internalizing MCP as a protocol and design pattern

  • Engineering tools with clean APIs, schema consistency, and fallback handling

  • Designing a unified interface for multi-tool agent workflows

  • Grounding evaluations in real RAG failure patterns — not just synthetic examples

This project sharpened my thinking in system design, LLM evaluation tooling, and infrastructure for agentic workflows.


🧭 What’s Next

I'm continuing to improve this toolset and integrate it into larger multi-agent pipelines. Planned extensions include:

  • Citation tracking per passage

  • Per-token relevance visualizations

  • Live streaming feedback to guide generation


If you're building in RAG, agentic evaluation, or LLM infra — let’s connect. I’d love to exchange ideas, tools, and feedback.

https://huggingface.co/spaces/Agents-MCP-Hackathon/RAGEval-Studio


#MCPHackathon #RetrievalAugmentedGeneration #AgentOps #LLMTools #MLOps #EvaluationFramework #HuggingFace #Gradio #Gemini #Python #NLP
