Evaluating RAG Systems with Ragas: Complete Guide with Examples

Nishant Singh

🧠 What is Ragas?

Ragas (Retrieval-Augmented Generation Assessment) is an open-source Python framework to automatically evaluate the performance of RAG pipelines.

RAG systems retrieve documents from a knowledge base and use them to generate answers. While LLMs are good at producing fluent answers, verifying whether those answers are:

  • Faithful to the retrieved documents

  • Relevant to the user query

  • Supported by accurate context

…is non-trivial. That’s where Ragas helps.


📏 Ragas Evaluation Metrics

Ragas provides a suite of automatic metrics to evaluate various components of a RAG pipeline:

Metric | Description
Faithfulness | How factually grounded the answer is in the retrieved context
Answer Relevancy | How relevant the answer is to the user's question
Context Precision | How relevant the retrieved chunks are to the question, and whether the relevant ones are ranked at the top
Context Recall | Whether all the information required to answer was retrieved
Context Relevancy | How relevant the retrieved contexts are to the user's question overall
Answer Similarity | How similar the generated answer is to a reference (ground truth) answer

All metrics are scored between 0 and 1, where 1 is perfect.


🧰 Preparing the Dataset

Ragas expects a HuggingFace Dataset with the following columns:

Column | Description
question | The user's query
answer | The answer generated by the RAG system
contexts | The list of retrieved documents (strings)
ground_truth (optional) | The reference answer used for comparison

🔧 Installation

pip install ragas datasets langchain
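
The examples below follow the metric names and APIs of the Ragas release that was current at the time of writing, so it helps to confirm which version you actually installed (most releases expose a __version__ attribute):

python -c "import ragas; print(ragas.__version__)"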

📘 Example 1: Evaluate a Dataset with All Metrics

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
    answer_similarity,
    context_relevancy
)

# Create sample data
examples = [
    {
        "question": "What is the capital of France?",
        "answer": "Paris is the capital of France.",
        "contexts": ["Paris is the capital and most populous city of France."],
        "ground_truth": "Paris"
    },
    {
        "question": "Who wrote Hamlet?",
        "answer": "William Shakespeare wrote Hamlet.",
        "contexts": ["Hamlet is a famous tragedy written by Shakespeare."],
        "ground_truth": "William Shakespeare"
    }
]

# Convert to HuggingFace Dataset
dataset = Dataset.from_list(examples)

# Evaluate using all supported metrics
result = evaluate(
    dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
        answer_similarity,
        context_relevancy
    ]
)

print("=== RAG Evaluation ===")
print(result.to_pandas())
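
Because to_pandas() returns a regular pandas DataFrame, you can also persist the per-row scores for later analysis, for example as a CSV file (the file name here is arbitrary):

# Save per-question scores for deeper analysis
result.to_pandas().to_csv("ragas_scores.csv", index=False)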

📘 Example 2: Using Individual Metrics

You can also compute one metric at a time, which is useful when debugging specific failures.

from ragas.metrics import faithfulness

faithfulness_scores = faithfulness.score(dataset)
print("Faithfulness Scores:", faithfulness_scores)

Or compute only context-related metrics:

from ragas import evaluate
from ragas.metrics import context_precision, context_recall

result = evaluate(dataset, metrics=[context_precision, context_recall])
print(result.to_pandas())

🧪 Example 3: Grouped Evaluation by Category

If your dataset includes metadata (e.g., domain, difficulty), you can group by it.

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

examples = [
    {
        "question": "What is the capital of France?",
        "answer": "Paris is the capital of France.",
        "contexts": ["Paris is the capital of France."],
        "ground_truth": "Paris",
        "domain": "Geography"
    },
    {
        "question": "Who wrote Hamlet?",
        "answer": "Shakespeare wrote Hamlet.",
        "contexts": ["Hamlet was written by Shakespeare."],
        "ground_truth": "William Shakespeare",
        "domain": "Literature"
    }
]

dataset = Dataset.from_list(examples)

# Group by 'domain'
for domain in set(dataset['domain']):
    print(f"\n--- Domain: {domain} ---")
    subset = dataset.filter(lambda x: x['domain'] == domain)
    result = evaluate(subset, metrics=[faithfulness, answer_relevancy])
    print(result.to_pandas())
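
If you would rather have one combined table instead of separate printouts per domain, you can tag each group's results and concatenate them with pandas. This is a small sketch built on the loop above; it assumes the score columns in the DataFrame are named after the metrics, which is how Ragas usually labels them:

import pandas as pd

per_domain = []
for domain in set(dataset['domain']):
    subset = dataset.filter(lambda x: x['domain'] == domain)
    result = evaluate(subset, metrics=[faithfulness, answer_relevancy])
    df = result.to_pandas()
    df["domain"] = domain  # tag each row with its domain
    per_domain.append(df)

# One combined table with mean scores per domain
combined = pd.concat(per_domain, ignore_index=True)
print(combined.groupby("domain")[["faithfulness", "answer_relevancy"]].mean())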

🧪 Example 4: Generating Synthetic Test Cases Using Ragas

Ragas supports generating synthetic datasets to test and benchmark your RAG pipeline.

from ragas.testset import TestsetGenerator

# A sample corpus of text documents
documents = [
    "Albert Einstein developed the theory of relativity in the early 20th century.",
    "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris.",
    "Python is a widely used high-level programming language created by Guido van Rossum."
]

# Initialize testset generator
generator = TestsetGenerator.from_default()

# Generate N test cases
testset = generator.generate(documents, num_questions=5)

# View synthetic test cases
print(testset.to_pandas())

You can use this generated dataset for benchmarking or fine-tuning your retriever.
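
The column names in the generated test set vary between Ragas versions, but the general workflow is to run the synthetic questions through your own pipeline and assemble an evaluation dataset from the results. Here is a rough sketch, where run_my_rag_pipeline is a hypothetical stand-in for your retriever plus generator, and the "question"/"ground_truth" column names are assumptions:

from datasets import Dataset

def run_my_rag_pipeline(question: str):
    # Placeholder: replace with your actual retriever + generator
    return "generated answer", ["retrieved context"]

eval_rows = []
for _, row in testset.to_pandas().iterrows():
    question = row["question"]  # assumed column name
    answer, contexts = run_my_rag_pipeline(question)
    eval_rows.append({
        "question": question,
        "answer": answer,
        "contexts": contexts,
        "ground_truth": row.get("ground_truth", ""),  # assumed column name
    })

eval_dataset = Dataset.from_list(eval_rows)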


✅ Best Practices

  • Use answer similarity only if you have reference answers

  • Always include faithfulness to detect hallucinations

  • Try context precision/recall to debug retrieval issues

  • Use synthetic test sets to simulate edge cases or scale testing

  • Export results to .csv for deeper analysis


🔄 When and How Ragas Uses an LLM

✅ LLM is used for:

Metric | Requires LLM? | Notes
faithfulness | Yes | Compares the answer to the retrieved context semantically
answer_relevancy | Yes | Evaluates whether the answer addresses the question
context_precision | Yes | Checks whether the retrieved context is actually needed for the answer
context_recall | Yes | Checks whether all the needed information was retrieved
answer_similarity | Yes | Semantic similarity to the ground-truth answer
context_relevancy | Yes | Checks whether the retrieved context matches the question

✅ How to Enable LLMs in Ragas

You need to configure Ragas to use an LLM backend.

Option 1: OpenAI (default if API key is set)

export OPENAI_API_KEY=your_key_here

Option 2: Configure LLM Manually via LangChain

from ragas.llms import langchain_llm
from langchain.chat_models import ChatOpenAI

# Set a specific model like GPT-4 or GPT-3.5
langchain_llm.set(ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0))

You can also use local LLMs like Hugging Face Transformers or Ollama via LangChain.
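
For example, here is a rough sketch of pointing the same pattern at a local Ollama model through LangChain. The ChatOllama import path and the exact Ragas wiring depend on the langchain and ragas versions you have installed, so treat this as an assumption-laden template rather than copy-paste-ready code:

from langchain_community.chat_models import ChatOllama
from ragas.llms import langchain_llm

# Assumes an Ollama server is running locally with the llama3 model pulled
local_llm = ChatOllama(model="llama3", temperature=0)

# Reuse the same set() pattern shown above; the API may differ across Ragas versions
langchain_llm.set(local_llm)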


🧠 Ragas Evaluation + LLM in Action (Full Example)

Here’s a complete example with LLM configuration and evaluation:

import os
from datasets import Dataset
from ragas import evaluate
from ragas.llms import langchain_llm
from langchain.chat_models import ChatOpenAI
from ragas.metrics import faithfulness, answer_relevancy

# Step 1: Set OpenAI Key
os.environ["OPENAI_API_KEY"] = "your-openai-key"

# Step 2: Set LLM via LangChain
langchain_llm.set(ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0))

# Step 3: Sample data
examples = [
    {
        "question": "Who discovered gravity?",
        "answer": "Isaac Newton discovered gravity when he saw an apple fall from a tree.",
        "contexts": ["Isaac Newton formulated the law of gravitation in the 17th century."],
    },
    {
        "question": "When was Python created?",
        "answer": "Python was created in the 1980s.",
        "contexts": ["Python was created by Guido van Rossum and first released in 1991."],
    }
]

dataset = Dataset.from_list(examples)

# Step 4: Run evaluation with LLM-powered metrics
results = evaluate(dataset, metrics=[faithfulness, answer_relevancy])
print(results.to_pandas())
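
Because evaluate() gives you a DataFrame via to_pandas(), a quick way to triage the output is to filter on the per-row scores. A minimal sketch, assuming the score columns are named after the metrics (which is how Ragas typically labels them):

df = results.to_pandas()

# Flag answers that are poorly grounded in the retrieved context
# (0.7 is an arbitrary threshold; tune it for your use case)
low_faithfulness = df[df["faithfulness"] < 0.7]
print(low_faithfulness[["question", "answer", "faithfulness"]])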

🧪 How You Know LLM Is Being Used

  • Evaluation takes a few seconds per row because each metric is calling the LLM.

  • If you use a local model, your CPU/GPU usage will spike.

  • If you enable logging, you may see the LLM prompts being generated (see the sketch after this list).

  • A missing API key or a network failure will raise an error.
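
A simple way to surface those prompts is to raise Python's logging level before calling evaluate(). Whether Ragas and LangChain actually log prompt contents at DEBUG level depends on the versions you have installed, so this is only a sketch:

import logging

# Show debug-level logs (including any prompt logging ragas/langchain emit)
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("ragas").setLevel(logging.DEBUG)  # logger name assumed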


🧱 Summary

Action | LLM Needed? | How to Do It
Static scoring with string match | No | No LLM needed
Semantic evaluation (RAG metrics) | Yes | Configure OpenAI or a LangChain LLM
Test case generation | Yes | Uses an LLM to create question-answer-context triples
