Building a RAG Evaluation MCP Server in One Week: Condensed Learning and Practical AgentOps

Author: Sathishkumar Thirumalai
Project: RAG Evaluation MCP Server
Event: Hugging Face Agents-MCP Hackathon (Track 1)
Duration: 7 Days
🧩 Context
As part of the Hugging Face MCP Hackathon, I built a modular and extensible RAG Evaluation MCP Server — a suite of structured tools for evaluating Retrieval-Augmented Generation (RAG) pipelines.
Rather than focus on a demo or proof of concept, I approached this as a production-aligned tool, capable of supporting agentic workflows, LLM clients, and structured tool dispatch using the Model Context Protocol (MCP).
🛠️ What I Built: Tool Categories
The MCP server exposes tools across three core categories:
🔍 Retriever Evaluation Tools
BM25 Relevance Scorer – Lexical similarity score using BM25 algorithm
Embedding-based Relevance Scorer – Semantic similarity via sentence transformers
Redundancy Checker – Detects duplicate or overlapping retrieved chunks
Query Coverage Analyzer – Checks if all major query concepts are addressed in the retrieved documents
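To make the retriever tools above more concrete, here is a minimal sketch of what the BM25 Relevance Scorer could look like. It assumes the rank_bm25 package; the function name, whitespace tokenizer, and return format are illustrative choices, not the server's actual API.
```python
# Minimal sketch of a BM25-style relevance scorer (retriever side).
# Assumes the rank_bm25 package; names and tokenization are illustrative,
# not the server's actual implementation.
from rank_bm25 import BM25Okapi

def bm25_relevance_scores(query: str, retrieved_chunks: list[str]) -> list[float]:
    """Score each retrieved chunk against the query with BM25."""
    tokenized_chunks = [chunk.lower().split() for chunk in retrieved_chunks]
    bm25 = BM25Okapi(tokenized_chunks)
    return [float(s) for s in bm25.get_scores(query.lower().split())]

if __name__ == "__main__":
    chunks = [
        "BM25 is a bag-of-words ranking function used in information retrieval.",
        "Sentence transformers map text to dense vector embeddings.",
    ]
    print(bm25_relevance_scores("what is BM25 ranking", chunks))
```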
✍️ Generator Evaluation Tools
Exact Match Checker – Checks for literal inclusion of query facts in the output
Faithfulness Checker – Flags hallucinations or unsupported claims using retrieval-grounded checks
Factual Consistency Scorer – Measures consistency between the retrieved content and generated answer
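As a rough illustration of the generator-side checks, the sketch below measures how well a generated answer is grounded in the retrieved context using sentence embeddings. The model name, threshold, and function names are assumptions for illustration; the project's actual faithfulness and consistency logic may differ.
```python
# Illustrative sketch of an embedding-based grounding check (generator side):
# compare the generated answer against each retrieved chunk and flag answers
# that are weakly supported. Model name, threshold, and function names are
# assumptions, not the project's settings.
from sentence_transformers import SentenceTransformer, util

_model = SentenceTransformer("all-MiniLM-L6-v2")

def consistency_score(answer: str, retrieved_chunks: list[str]) -> float:
    """Max cosine similarity between the answer and any retrieved chunk."""
    answer_emb = _model.encode(answer, convert_to_tensor=True)
    chunk_embs = _model.encode(retrieved_chunks, convert_to_tensor=True)
    return float(util.cos_sim(answer_emb, chunk_embs).max())

def flag_unsupported(answer: str, retrieved_chunks: list[str], threshold: float = 0.5) -> bool:
    """Heuristic faithfulness flag: True when the answer is weakly grounded."""
    return consistency_score(answer, retrieved_chunks) < threshold
```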
🧪 System Evaluation Tools
End-to-End Relevance Report – Combines retriever + generator signals into a unified diagnostic
Failure Case Annotator – Tags known failure modes (e.g., partial match, unsupported claim) in outputs
Confidence Weighted Report – Assigns confidence scores to each tool’s output for downstream aggregation
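To show how the system-level tools could aggregate signals, here is a hypothetical sketch of a confidence-weighted report: each tool contributes a score and a self-assessed confidence, and the report combines them into one diagnostic. The dataclass fields and output keys are placeholders, not the server's real schema.
```python
# Hypothetical sketch of a confidence-weighted system report: each tool emits a
# score plus a confidence, and the report aggregates them.
# Fields and output keys are placeholders, not the real schema.
from dataclasses import dataclass

@dataclass
class ToolResult:
    name: str
    score: float       # 0..1 quality signal reported by the tool
    confidence: float  # 0..1 confidence the tool assigns to its own score

def confidence_weighted_report(results: list[ToolResult]) -> dict:
    total_conf = sum(r.confidence for r in results) or 1.0
    overall = sum(r.score * r.confidence for r in results) / total_conf
    return {
        "overall_score": round(overall, 3),
        "per_tool": {r.name: {"score": r.score, "confidence": r.confidence} for r in results},
    }

print(confidence_weighted_report([
    ToolResult("bm25_relevance", 0.82, 0.9),
    ToolResult("faithfulness", 0.64, 0.7),
]))
```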
🔁 Why This Project Matters
Evaluating RAG systems goes far beyond asking “does the answer look correct?”
This server formalizes that broader evaluation process through structured, explainable checks, enabling:
Deep inspection of retrieval quality
Breakdown of generative hallucinations
Modular testing of each stage in the pipeline
Integration into LLM agents via MCP clients
📦 Tech Stack
Interface: Gradio (tabbed layout for tool categories)
Back-end: Python-based tools using BM25, SentenceTransformers, and heuristic logic
Integration: Exposed as an MCP Server endpoint
Client Compatibility: Works with CLINE, Cursor, Claude workflows, or any MCP-compliant agent
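For a sense of how these pieces fit together, here is a rough sketch of exposing a tabbed Gradio app as an MCP server endpoint. It assumes a Gradio release with MCP support; the tool function and tab names below are placeholders, not the project's code.
```python
# Rough sketch of exposing a tabbed Gradio app as an MCP server endpoint.
# Assumes a Gradio version with MCP support (launch(mcp_server=True));
# the tool function and tab names are placeholders.
import gradio as gr

def bm25_relevance(query: str, chunks: str) -> str:
    """Placeholder tool body; the real tool returns structured scores."""
    return f"Scoring {len(chunks.splitlines())} chunks against: {query}"

retriever_tab = gr.Interface(fn=bm25_relevance, inputs=["text", "text"], outputs="text")
demo = gr.TabbedInterface([retriever_tab], ["Retriever Evaluation"])

if __name__ == "__main__":
    demo.launch(mcp_server=True)  # serves the UI and exposes the functions as MCP tools
```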
🧠 Condensed but Deep Learning
Despite the short 7-day window, this was an intensely compressed learning experience involving:
Internalizing MCP as a protocol and design pattern
Engineering tools with clean APIs, schema consistency, and fallback handling
Designing a unified interface for multi-tool agent workflows
Grounding evaluations in real RAG failure patterns — not just synthetic examples
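As a concrete illustration of the schema-consistency and fallback point above, here is a hedged sketch of a uniform tool envelope where failures degrade into structured errors instead of raised exceptions; the field names are hypothetical.
```python
# Hedged sketch of the "consistent schema + fallback handling" pattern: every
# tool returns the same envelope, and failures become structured errors rather
# than raised exceptions. Field names are hypothetical.
from typing import Any, Callable

def run_tool(name: str, fn: Callable[..., Any], **kwargs) -> dict:
    try:
        return {"tool": name, "status": "ok", "result": fn(**kwargs), "error": None}
    except Exception as exc:  # fallback: keep the agent's tool-call loop alive
        return {"tool": name, "status": "error", "result": None, "error": str(exc)}
```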
This project sharpened my thinking in system design, LLM evaluation tooling, and infrastructure for agentic workflows.
🧭 What’s Next
I'm continuing to improve this toolset and integrate it into larger multi-agent pipelines. Planned extensions include:
Citation tracking per passage
Per-token relevance visualizations
Live streaming feedback to guide generation
If you're building in RAG, agentic evaluation, or LLM infra — let’s connect. I’d love to exchange ideas, tools, and feedback.
https://huggingface.co/spaces/Agents-MCP-Hackathon/RAGEval-Studio
#MCPHackathon #RetrievalAugmentedGeneration #AgentOps #LLMTools #MLOps #EvaluationFramework #HuggingFace #Gradio #Gemini #Python #NLP