Building a RAG Evaluation MCP Server in One Week: Condensed Learning and Practical AgentOps

Author: Sathishkumar Thirumalai
Project: RAG Evaluation MCP Server
Event: Hugging Face Agents-MCP Hackathon (Track 1)
Duration: 7 Days
🧩 Context
As part of the Hugging Face MCP Hackathon, I built a modular and extensible RAG Evaluation MCP Server — a suite of structured tools for evaluating Retrieval-Augmented Generation (RAG) pipelines.
Rather than focus on a demo or proof of concept, I approached this as a production-aligned tool, capable of supporting agentic workflows, LLM clients, and structured tool dispatch using the Model Context Protocol (MCP).
🛠️ What I Built: Tool Categories
The MCP server exposes tools across three core categories:
🔍 Retriever Evaluation Tools
BM25 Relevance Scorer – Lexical similarity score using BM25 algorithm
Embedding-based Relevance Scorer – Semantic similarity via sentence transformers
Redundancy Checker – Detects duplicate or overlapping retrieved chunks
Query Coverage Analyzer – Checks if all major query concepts are addressed in the retrieved documents
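To make the retriever tools above more concrete, here is a minimal sketch of what the BM25 Relevance Scorer could look like. It assumes the rank_bm25 package; the function name, whitespace tokenizer, and return format are illustrative choices, not the server's actual API.
```python
# Minimal sketch of a BM25-style relevance scorer (retriever side).
# Assumes the rank_bm25 package; names and tokenization are illustrative,
# not the server's actual implementation.
from rank_bm25 import BM25Okapi

def bm25_relevance_scores(query: str, retrieved_chunks: list[str]) -> list[float]:
    """Score each retrieved chunk against the query with BM25."""
    tokenized_chunks = [chunk.lower().split() for chunk in retrieved_chunks]
    bm25 = BM25Okapi(tokenized_chunks)
    return [float(s) for s in bm25.get_scores(query.lower().split())]

if __name__ == "__main__":
    chunks = [
        "BM25 is a bag-of-words ranking function used in information retrieval.",
        "Sentence transformers map text to dense vector embeddings.",
    ]
    print(bm25_relevance_scores("what is BM25 ranking", chunks))
```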
✍️ Generator Evaluation Tools
Exact Match Checker – Checks for literal inclusion of query facts in the output
Faithfulness Checker – Flags hallucinations or unsupported claims using retrieval-grounded checks
Factual Consistency Scorer – Measures consistency between the retrieved content and generated answer
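As a rough illustration of the generator-side checks, the sketch below measures how well a generated answer is grounded in the retrieved context using sentence embeddings. The model name, threshold, and function names are assumptions for illustration; the project's actual faithfulness and consistency logic may differ.
```python
# Illustrative sketch of an embedding-based grounding check (generator side):
# compare the generated answer against each retrieved chunk and flag answers
# that are weakly supported. Model name, threshold, and function names are
# assumptions, not the project's settings.
from sentence_transformers import SentenceTransformer, util

_model = SentenceTransformer("all-MiniLM-L6-v2")

def consistency_score(answer: str, retrieved_chunks: list[str]) -> float:
    """Max cosine similarity between the answer and any retrieved chunk."""
    answer_emb = _model.encode(answer, convert_to_tensor=True)
    chunk_embs = _model.encode(retrieved_chunks, convert_to_tensor=True)
    return float(util.cos_sim(answer_emb, chunk_embs).max())

def flag_unsupported(answer: str, retrieved_chunks: list[str], threshold: float = 0.5) -> bool:
    """Heuristic faithfulness flag: True when the answer is weakly grounded."""
    return consistency_score(answer, retrieved_chunks) < threshold
```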
🧪 System Evaluation Tools
End-to-End Relevance Report – Combines retriever + generator signals into a unified diagnostic
Failure Case Annotator – Tags known failure modes (e.g., partial match, unsupported claim) in outputs
Confidence Weighted Report – Assigns confidence scores to each tool’s output for downstream aggregation
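To show how the system-level tools could aggregate signals, here is a hypothetical sketch of a confidence-weighted report: each tool contributes a score and a self-assessed confidence, and the report combines them into one diagnostic. The dataclass fields and output keys are placeholders, not the server's real schema.
```python
# Hypothetical sketch of a confidence-weighted system report: each tool emits a
# score plus a confidence, and the report aggregates them.
# Fields and output keys are placeholders, not the real schema.
from dataclasses import dataclass

@dataclass
class ToolResult:
    name: str
    score: float       # 0..1 quality signal reported by the tool
    confidence: float  # 0..1 confidence the tool assigns to its own score

def confidence_weighted_report(results: list[ToolResult]) -> dict:
    total_conf = sum(r.confidence for r in results) or 1.0
    overall = sum(r.score * r.confidence for r in results) / total_conf
    return {
        "overall_score": round(overall, 3),
        "per_tool": {r.name: {"score": r.score, "confidence": r.confidence} for r in results},
    }

print(confidence_weighted_report([
    ToolResult("bm25_relevance", 0.82, 0.9),
    ToolResult("faithfulness", 0.64, 0.7),
]))
```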
🔁 Why This Project Matters
Evaluating RAG systems goes far beyond asking “does the answer look correct?”
This server formalizes that broader evaluation process through structured, explainable checks, enabling:
Deep inspection of retrieval quality
Breakdown of generative hallucinations
Modular testing of each stage in the pipeline
Integration into LLM agents via MCP clients
📦 Tech Stack
Interface: Gradio (tabbed layout for tool categories)
Back-end: Python-based tools using BM25, SentenceTransformers, and heuristic logic
Integration: Exposed as an MCP Server endpoint
Client Compatibility: Works with CLINE, Cursor, Claude workflows, or any MCP-compliant agent
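For a sense of how these pieces fit together, here is a rough sketch of exposing a tabbed Gradio app as an MCP server endpoint. It assumes a Gradio release with MCP support; the tool function and tab names below are placeholders, not the project's code.
```python
# Rough sketch of exposing a tabbed Gradio app as an MCP server endpoint.
# Assumes a Gradio version with MCP support (launch(mcp_server=True));
# the tool function and tab names are placeholders.
import gradio as gr

def bm25_relevance(query: str, chunks: str) -> str:
    """Placeholder tool body; the real tool returns structured scores."""
    return f"Scoring {len(chunks.splitlines())} chunks against: {query}"

retriever_tab = gr.Interface(fn=bm25_relevance, inputs=["text", "text"], outputs="text")
demo = gr.TabbedInterface([retriever_tab], ["Retriever Evaluation"])

if __name__ == "__main__":
    demo.launch(mcp_server=True)  # serves the UI and exposes the functions as MCP tools
```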
🧠 Condensed but Deep Learning
Despite the short 7-day window, this was an intensely compressed learning experience involving:
Internalizing MCP as a protocol and design pattern
Engineering tools with clean APIs, schema consistency, and fallback handling
Designing a unified interface for multi-tool agent workflows
Grounding evaluations in real RAG failure patterns — not just synthetic examples
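As a concrete illustration of the schema-consistency and fallback point above, here is a hedged sketch of a uniform tool envelope where failures degrade into structured errors instead of raised exceptions; the field names are hypothetical.
```python
# Hedged sketch of the "consistent schema + fallback handling" pattern: every
# tool returns the same envelope, and failures become structured errors rather
# than raised exceptions. Field names are hypothetical.
from typing import Any, Callable

def run_tool(name: str, fn: Callable[..., Any], **kwargs) -> dict:
    try:
        return {"tool": name, "status": "ok", "result": fn(**kwargs), "error": None}
    except Exception as exc:  # fallback: keep the agent's tool-call loop alive
        return {"tool": name, "status": "error", "result": None, "error": str(exc)}
```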
This project sharpened my thinking in system design, LLM evaluation tooling, and infrastructure for agentic workflows.
🧭 What’s Next
I'm continuing to improve this toolset and integrate it into larger multi-agent pipelines. Planned extensions include:
Citation tracking per passage
Per-token relevance visualizations
Live streaming feedback to guide generation
If you're building in RAG, agentic evaluation, or LLM infra — let’s connect. I’d love to exchange ideas, tools, and feedback.
https://huggingface.co/spaces/Agents-MCP-Hackathon/RAGEval-Studio
#MCPHackathon #RetrievalAugmentedGeneration #AgentOps #LLMTools #MLOps #EvaluationFramework #HuggingFace #Gradio #Gemini #Python #NLP