Evaluating RAG Systems with Ragas: Complete Guide with Examples

Nishant Singh

🧠 What is Ragas?

Ragas (Retrieval-Augmented Generation Assessment) is an open-source Python framework to automatically evaluate the performance of RAG pipelines.

RAG systems retrieve documents from a knowledge base and use them to generate answers. While LLMs are good at producing fluent answers, verifying whether those answers are:

  • Faithful to the retrieved documents

  • Relevant to the user query

  • Supported by accurate context

…is non-trivial. That’s where Ragas helps.


📏 Ragas Evaluation Metrics

Ragas provides a suite of automatic metrics to evaluate various components of a RAG pipeline:

Metric | Description
Faithfulness | How factually grounded the answer is in the retrieved context
Answer Relevancy | How relevant the answer is to the user's question
Context Precision | How relevant the retrieved chunks are to the question, and whether the relevant ones are ranked at the top
Context Recall | Whether all the information required to answer was retrieved
Context Relevancy | How relevant the retrieved contexts are to the user's question overall
Answer Similarity | How similar the generated answer is to a reference (ground truth) answer

All metrics are scored between 0 and 1, where 1 is perfect.


🧰 Preparing the Dataset

Ragas expects a HuggingFace Dataset with the following columns:

Column | Description
question | The user's query
answer | The answer generated by the RAG system
contexts | The list of retrieved documents (strings)
ground_truth (optional) | The reference answer used for comparison

🔧 Installation

pip install ragas datasets langchain
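
The examples below follow the metric names and APIs of the Ragas release that was current at the time of writing, so it helps to confirm which version you actually installed (most releases expose a __version__ attribute):

python -c "import ragas; print(ragas.__version__)"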

📘 Example 1: Evaluate a Dataset with All Metrics

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
    answer_similarity,
    context_relevancy
)

# Create sample data
examples = [
    {
        "question": "What is the capital of France?",
        "answer": "Paris is the capital of France.",
        "contexts": ["Paris is the capital and most populous city of France."],
        "ground_truth": "Paris"
    },
    {
        "question": "Who wrote Hamlet?",
        "answer": "William Shakespeare wrote Hamlet.",
        "contexts": ["Hamlet is a famous tragedy written by Shakespeare."],
        "ground_truth": "William Shakespeare"
    }
]

# Convert to HuggingFace Dataset
dataset = Dataset.from_list(examples)

# Evaluate using all supported metrics
result = evaluate(
    dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
        answer_similarity,
        context_relevancy
    ]
)

print("=== RAG Evaluation ===")
print(result.to_pandas())
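
Because to_pandas() returns a regular pandas DataFrame, you can also persist the per-row scores for later analysis, for example as a CSV file (the file name here is arbitrary):

# Save per-question scores for deeper analysis
result.to_pandas().to_csv("ragas_scores.csv", index=False)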

📘 Example 2: Using Individual Metrics

You can also compute one metric at a time, which is useful when debugging specific failures.

from ragas.metrics import faithfulness

faithfulness_scores = faithfulness.score(dataset)
print("Faithfulness Scores:", faithfulness_scores)

Or compute only context-related metrics:

from ragas import evaluate
from ragas.metrics import context_precision, context_recall

result = evaluate(dataset, metrics=[context_precision, context_recall])
print(result.to_pandas())

🧪 Example 3: Grouped Evaluation by Category

If your dataset includes metadata (e.g., domain, difficulty), you can group by it.

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

examples = [
    {
        "question": "What is the capital of France?",
        "answer": "Paris is the capital of France.",
        "contexts": ["Paris is the capital of France."],
        "ground_truth": "Paris",
        "domain": "Geography"
    },
    {
        "question": "Who wrote Hamlet?",
        "answer": "Shakespeare wrote Hamlet.",
        "contexts": ["Hamlet was written by Shakespeare."],
        "ground_truth": "William Shakespeare",
        "domain": "Literature"
    }
]

dataset = Dataset.from_list(examples)

# Group by 'domain'
for domain in set(dataset['domain']):
    print(f"\n--- Domain: {domain} ---")
    subset = dataset.filter(lambda x: x['domain'] == domain)
    result = evaluate(subset, metrics=[faithfulness, answer_relevancy])
    print(result.to_pandas())
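
If you would rather have one combined table instead of separate printouts per domain, you can tag each group's results and concatenate them with pandas. This is a small sketch built on the loop above; it assumes the score columns in the DataFrame are named after the metrics, which is how Ragas usually labels them:

import pandas as pd

per_domain = []
for domain in set(dataset['domain']):
    subset = dataset.filter(lambda x: x['domain'] == domain)
    result = evaluate(subset, metrics=[faithfulness, answer_relevancy])
    df = result.to_pandas()
    df["domain"] = domain  # tag each row with its domain
    per_domain.append(df)

# One combined table with mean scores per domain
combined = pd.concat(per_domain, ignore_index=True)
print(combined.groupby("domain")[["faithfulness", "answer_relevancy"]].mean())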

🧪 Example 4: Generating Synthetic Test Cases Using Ragas

Ragas supports generating synthetic datasets to test and benchmark your RAG pipeline.

from ragas.testset import TestsetGenerator

# A sample corpus of text documents
documents = [
    "Albert Einstein developed the theory of relativity in the early 20th century.",
    "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris.",
    "Python is a widely used high-level programming language created by Guido van Rossum."
]

# Initialize testset generator
generator = TestsetGenerator.from_default()

# Generate N test cases
testset = generator.generate(documents, num_questions=5)

# View synthetic test cases
print(testset.to_pandas())

You can use this generated dataset for benchmarking or fine-tuning your retriever.
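
The column names in the generated test set vary between Ragas versions, but the general workflow is to run the synthetic questions through your own pipeline and assemble an evaluation dataset from the results. Here is a rough sketch, where run_my_rag_pipeline is a hypothetical stand-in for your retriever plus generator, and the "question"/"ground_truth" column names are assumptions:

from datasets import Dataset

def run_my_rag_pipeline(question: str):
    # Placeholder: replace with your actual retriever + generator
    return "generated answer", ["retrieved context"]

eval_rows = []
for _, row in testset.to_pandas().iterrows():
    question = row["question"]  # assumed column name
    answer, contexts = run_my_rag_pipeline(question)
    eval_rows.append({
        "question": question,
        "answer": answer,
        "contexts": contexts,
        "ground_truth": row.get("ground_truth", ""),  # assumed column name
    })

eval_dataset = Dataset.from_list(eval_rows)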


✅ Best Practices

  • Use answer similarity only if you have reference answers

  • Always include faithfulness to detect hallucinations

  • Try context precision/recall to debug retrieval issues

  • Use synthetic test sets to simulate edge cases or scale testing

  • Export results to .csv for deeper analysis


🔄 When and How Ragas Uses an LLM

✅ LLM is used for:

Metric | Requires LLM? | Notes
faithfulness | Yes | Compares the answer to the retrieved context semantically
answer_relevancy | Yes | Evaluates whether the answer addresses the question
context_precision | Yes | Checks whether the retrieved context is actually needed for the answer
context_recall | Yes | Checks whether all the needed information was retrieved
answer_similarity | Yes | Semantic similarity to the ground-truth answer
context_relevancy | Yes | Checks whether the retrieved context matches the question

✅ How to Enable LLMs in Ragas

You need to configure Ragas to use an LLM backend.

Option 1: OpenAI (default if API key is set)

export OPENAI_API_KEY=your_key_here

Option 2: Configure LLM Manually via LangChain

from ragas.llms import langchain_llm
from langchain.chat_models import ChatOpenAI

# Set a specific model like GPT-4 or GPT-3.5
langchain_llm.set(ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0))

You can also use local LLMs like Hugging Face Transformers or Ollama via LangChain.
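
For example, here is a rough sketch of pointing the same pattern at a local Ollama model through LangChain. The ChatOllama import path and the exact Ragas wiring depend on the langchain and ragas versions you have installed, so treat this as an assumption-laden template rather than copy-paste-ready code:

from langchain_community.chat_models import ChatOllama
from ragas.llms import langchain_llm

# Assumes an Ollama server is running locally with the llama3 model pulled
local_llm = ChatOllama(model="llama3", temperature=0)

# Reuse the same set() pattern shown above; the API may differ across Ragas versions
langchain_llm.set(local_llm)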


🧠 Ragas Evaluation + LLM in Action (Full Example)

Here’s a complete example with LLM configuration and evaluation:

import os
from datasets import Dataset
from ragas import evaluate
from ragas.llms import langchain_llm
from langchain.chat_models import ChatOpenAI
from ragas.metrics import faithfulness, answer_relevancy

# Step 1: Set OpenAI Key
os.environ["OPENAI_API_KEY"] = "your-openai-key"

# Step 2: Set LLM via LangChain
langchain_llm.set(ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0))

# Step 3: Sample data
examples = [
    {
        "question": "Who discovered gravity?",
        "answer": "Isaac Newton discovered gravity when he saw an apple fall from a tree.",
        "contexts": ["Isaac Newton formulated the law of gravitation in the 17th century."],
    },
    {
        "question": "When was Python created?",
        "answer": "Python was created in the 1980s.",
        "contexts": ["Python was created by Guido van Rossum and first released in 1991."],
    }
]

dataset = Dataset.from_list(examples)

# Step 4: Run evaluation with LLM-powered metrics
results = evaluate(dataset, metrics=[faithfulness, answer_relevancy])
print(results.to_pandas())
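
Because evaluate() gives you a DataFrame via to_pandas(), a quick way to triage the output is to filter on the per-row scores. A minimal sketch, assuming the score columns are named after the metrics (which is how Ragas typically labels them):

df = results.to_pandas()

# Flag answers that are poorly grounded in the retrieved context
# (0.7 is an arbitrary threshold; tune it for your use case)
low_faithfulness = df[df["faithfulness"] < 0.7]
print(low_faithfulness[["question", "answer", "faithfulness"]])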

🧪 How You Know LLM Is Being Used

  • Evaluation takes a few seconds per row because each metric is calling the LLM.

  • If you use a local model, your CPU/GPU usage will spike.

  • If you enable logging, you may see the LLM prompts being generated (see the sketch after this list).

  • A missing API key or a network failure will raise an error.
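
A simple way to surface those prompts is to raise Python's logging level before calling evaluate(). Whether Ragas and LangChain actually log prompt contents at DEBUG level depends on the versions you have installed, so this is only a sketch:

import logging

# Show debug-level logs (including any prompt logging ragas/langchain emit)
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("ragas").setLevel(logging.DEBUG)  # logger name assumed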


🧱 Summary

Action | LLM Needed? | How to Do It
Static scoring with string match | No | No LLM needed
Semantic evaluation (RAG metrics) | Yes | Configure OpenAI or a LangChain LLM
Test case generation | Yes | Uses an LLM to create question-answer-context triples
