Evaluating RAG Systems with Ragas: Complete Guide with Examples


🧠 What is Ragas?
Ragas (Retrieval-Augmented Generation Assessment) is an open-source Python framework to automatically evaluate the performance of RAG pipelines.
RAG systems retrieve documents from a knowledge base and use them to generate answers. While LLMs are good at answering questions, verifying if those answers are:

- Faithful to the retrieved documents
- Relevant to the user query
- Supported by accurate context

…is non-trivial. That’s where Ragas helps.
📏 Ragas Evaluation Metrics
Ragas provides a suite of automatic metrics to evaluate various components of a RAG pipeline:
| Metric | Description |
| --- | --- |
| Faithfulness | How factually grounded the answer is in the retrieved context |
| Answer Relevancy | How relevant the answer is to the user's question |
| Context Precision | How relevant the retrieved context documents are to the question |
| Context Recall | Whether all the information required to answer was retrieved |
| Context Relevancy | Overall relevance of the retrieved contexts to the question |
| Answer Similarity | How similar the generated answer is to a reference (ground truth) answer |
All metrics are scored between 0 and 1, where 1 is perfect.
🧰 Preparing the Dataset
Ragas expects a HuggingFace Dataset with the following columns:
| Column | Description |
| --- | --- |
| question | The user's query |
| answer | The answer generated by the RAG system |
| contexts | List of retrieved documents (as strings) |
| ground_truth (optional) | Reference answer for comparison |
🔧 Installation
```bash
pip install ragas datasets langchain
```
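For reference, here is a minimal one-row dataset matching the schema above, built with Dataset.from_dict (a columnar alternative to the Dataset.from_list constructor used in the examples below):

```python
from datasets import Dataset

# One row per question; 'contexts' holds a list of retrieved passages per row
data = {
    "question": ["What is the capital of France?"],
    "answer": ["Paris is the capital of France."],
    "contexts": [["Paris is the capital and most populous city of France."]],
    "ground_truth": ["Paris"],  # optional; needed for reference-based metrics
}
dataset = Dataset.from_dict(data)
print(dataset)
```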
📘 Example 1: Evaluate a Dataset with All Metrics
```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
    answer_similarity,
    context_relevancy,  # Note: deprecated in newer Ragas releases; drop it if the import fails
)

# Create sample data
examples = [
    {
        "question": "What is the capital of France?",
        "answer": "Paris is the capital of France.",
        "contexts": ["Paris is the capital and most populous city of France."],
        "ground_truth": "Paris",
    },
    {
        "question": "Who wrote Hamlet?",
        "answer": "William Shakespeare wrote Hamlet.",
        "contexts": ["Hamlet is a famous tragedy written by Shakespeare."],
        "ground_truth": "William Shakespeare",
    },
]

# Convert to a HuggingFace Dataset
dataset = Dataset.from_list(examples)

# Evaluate using all supported metrics
result = evaluate(
    dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
        answer_similarity,
        context_relevancy,
    ],
)

print("=== RAG Evaluation ===")
print(result.to_pandas())
```
📘 Example 2: Using Individual Metrics
You can also compute one metric at a time, useful when debugging specific failures.
```python
from ragas import evaluate
from ragas.metrics import faithfulness

# Compute a single metric
result = evaluate(dataset, metrics=[faithfulness])
print("Faithfulness scores:", result)
```
Or compute only context-related metrics:
```python
from ragas import evaluate
from ragas.metrics import context_precision, context_recall

result = evaluate(dataset, metrics=[context_precision, context_recall])
print(result.to_pandas())
```
🧪 Example 3: Grouped Evaluation by Category
If your dataset includes metadata (e.g., domain, difficulty), you can group by it.
```python
examples = [
    {
        "question": "What is the capital of France?",
        "answer": "Paris is the capital of France.",
        "contexts": ["Paris is the capital of France."],
        "ground_truth": "Paris",
        "domain": "Geography",
    },
    {
        "question": "Who wrote Hamlet?",
        "answer": "Shakespeare wrote Hamlet.",
        "contexts": ["Hamlet was written by Shakespeare."],
        "ground_truth": "William Shakespeare",
        "domain": "Literature",
    },
]

dataset = Dataset.from_list(examples)

# Group by 'domain' and evaluate each subset separately
for domain in set(dataset["domain"]):
    print(f"\n--- Domain: {domain} ---")
    subset = dataset.filter(lambda x: x["domain"] == domain)
    result = evaluate(subset, metrics=[faithfulness, answer_relevancy])
    print(result.to_pandas())
```
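To compare domains at a glance, you can aggregate the per-row scores into a single summary table. This sketch only uses pandas operations on the DataFrame returned by result.to_pandas(), assuming it contains one row per example with a numeric column per metric:

```python
import pandas as pd

summaries = []
for domain in set(dataset["domain"]):
    subset = dataset.filter(lambda x: x["domain"] == domain)
    result = evaluate(subset, metrics=[faithfulness, answer_relevancy])
    # Average each metric column for this domain
    scores = result.to_pandas().mean(numeric_only=True)
    scores["domain"] = domain
    summaries.append(scores)

# One row per domain, one column per metric
print(pd.DataFrame(summaries).set_index("domain"))
```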
🧪 Example 4: Generating Synthetic Test Cases Using Ragas
Ragas supports generating synthetic datasets to test and benchmark your RAG pipeline.
```python
from ragas.testset import TestsetGenerator

# A sample corpus of text documents
documents = [
    "Albert Einstein developed the theory of relativity in the early 20th century.",
    "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris.",
    "Python is a widely used high-level programming language created by Guido van Rossum.",
]

# Initialize the testset generator
# Note: the TestsetGenerator constructor and generate() signature vary across
# Ragas releases; check the docs for the version you have installed.
generator = TestsetGenerator.from_default()

# Generate N synthetic test cases
testset = generator.generate(documents, num_questions=5)

# View the synthetic test cases
print(testset.to_pandas())
```
You can use this generated dataset for benchmarking or fine-tuning your retriever.
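One common pattern is to run the synthetic questions through your own RAG pipeline and assemble the results into the evaluation dataset format shown earlier. The sketch below assumes a hypothetical my_rag_pipeline() helper that returns an answer plus the retrieved passages, and that the generated testset exposes question and ground_truth columns (column names differ between Ragas versions):

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

testset_df = testset.to_pandas()

rows = []
for _, row in testset_df.iterrows():
    # my_rag_pipeline() is a hypothetical helper standing in for your own system:
    # it should return the generated answer and the retrieved context strings.
    answer, retrieved_contexts = my_rag_pipeline(row["question"])
    rows.append(
        {
            "question": row["question"],
            "answer": answer,
            "contexts": retrieved_contexts,
            "ground_truth": row["ground_truth"],  # column name may vary by version
        }
    )

eval_dataset = Dataset.from_list(rows)
result = evaluate(eval_dataset, metrics=[faithfulness, answer_relevancy])
print(result.to_pandas())
```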
✅ Best Practices
- Use answer similarity only if you have reference answers
- Always include faithfulness to detect hallucinations
- Try context precision/recall to debug retrieval issues
- Use synthetic test sets to simulate edge cases or scale testing
- Export results to .csv for deeper analysis (see the sketch below)
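For example, the per-row scores can be written to CSV with plain pandas, since result.to_pandas() returns a standard DataFrame:

```python
# Persist per-question scores for later analysis
df = result.to_pandas()
df.to_csv("ragas_results.csv", index=False)
```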
🔄 When and How Ragas Uses an LLM
✅ LLM is used for:
| Metric | Requires LLM? | Notes |
| --- | --- | --- |
| faithfulness | ✅ | Compares the answer to the retrieved context semantically |
| answer_relevancy | ✅ | Evaluates whether the answer addresses the question |
| context_precision | ✅ | Checks whether the retrieved context is relevant to the question |
| context_recall | ✅ | Checks whether all the needed information was retrieved |
| answer_similarity | ✅ | Semantic similarity to the ground-truth answer |
| context_relevancy | ✅ | Checks how well the context matches the question |
✅ How to Enable LLMs in Ragas
You need to configure Ragas to use an LLM backend.
Option 1: OpenAI (default if API key is set)
```bash
export OPENAI_API_KEY=your_key_here
```
Option 2: Configure LLM Manually via LangChain
```python
from ragas.llms import langchain_llm
from langchain.chat_models import ChatOpenAI

# Set a specific model like GPT-4 or GPT-3.5
# Note: the exact hook varies across Ragas versions; newer releases instead
# accept an llm= argument on evaluate() or on individual metrics.
langchain_llm.set(ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0))
```
You can also use local LLMs like Hugging Face Transformers or Ollama via LangChain.
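For instance, a local Ollama model can be plugged in through LangChain. This is a minimal sketch assuming a Ragas version (0.1+) whose evaluate() accepts llm= and embeddings= arguments, an Ollama server running locally, and the langchain-community package installed; the model names are examples:

```python
from langchain_community.chat_models import ChatOllama
from langchain_community.embeddings import OllamaEmbeddings
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# Local models served by Ollama (pull them first with `ollama pull ...`)
local_llm = ChatOllama(model="llama3", temperature=0)
local_embeddings = OllamaEmbeddings(model="nomic-embed-text")

# Pass the local LLM and embeddings directly to evaluate()
result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy],
    llm=local_llm,
    embeddings=local_embeddings,
)
print(result.to_pandas())
```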
🧠 Ragas Evaluation + LLM in Action (Full Example)
Here’s a complete example with LLM configuration and evaluation:
```python
import os

from datasets import Dataset
from ragas import evaluate
from ragas.llms import langchain_llm
from langchain.chat_models import ChatOpenAI
from ragas.metrics import faithfulness, answer_relevancy

# Step 1: Set OpenAI key
os.environ["OPENAI_API_KEY"] = "your-openai-key"

# Step 2: Set LLM via LangChain
langchain_llm.set(ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0))

# Step 3: Sample data
examples = [
    {
        "question": "Who discovered gravity?",
        "answer": "Isaac Newton discovered gravity when he saw an apple fall from a tree.",
        "contexts": ["Isaac Newton formulated the law of gravitation in the 17th century."],
    },
    {
        "question": "When was Python created?",
        "answer": "Python was created in the 1980s.",
        "contexts": ["Python was created by Guido van Rossum and first released in 1991."],
    },
]

dataset = Dataset.from_list(examples)

# Step 4: Run evaluation with LLM-powered metrics
results = evaluate(dataset, metrics=[faithfulness, answer_relevancy])
print(results.to_pandas())
```
🧪 How You Know LLM Is Being Used
- Evaluation will take a few seconds per row — it's calling the LLM.
- If you use a local model, your CPU/GPU usage will spike.
- If you print logs, you may see LLM prompts being generated.
- A network or API key failure will raise an error.
🧱 Summary
| Action | LLM Needed? | How to Do It |
| --- | --- | --- |
| Static scoring with string match | ❌ | No LLM needed |
| Semantic evaluation (RAG metrics) | ✅ | Configure OpenAI or a LangChain LLM |
| Test case generation | ✅ | Uses an LLM to create Q-A-context examples |